Missed check for too-many-children in bgworker spawning


Tom Lane
Over in [1] we have a report of a postmaster shutdown that seems to
have occurred because some client logic was overaggressively spawning
connection requests, causing the postmaster's child-process arrays to
be temporarily full, and then some parallel query tried to launch a
new bgworker process.  The postmaster's bgworker-spawning logic lacks
any check for the arrays being full, so when AssignPostmasterChildSlot
failed to find a free slot, kaboom!

The attached proposed patch fixes this by making bgworker spawning
include a canAcceptConnections() test.  That's perhaps overkill, since
we really just need to check the CountChildren() total; but I judged
that going through that function and having it decide what to test or
not test was a better design than duplicating the CountChildren() test
elsewhere.

I'd first imagined also replacing the one-size-fits-all check

    if (CountChildren(BACKEND_TYPE_ALL) >= MaxLivePostmasterChildren())
        result = CAC_TOOMANY;

with something like

    switch (backend_type)
    {
        case BACKEND_TYPE_NORMAL:
            if (CountChildren(backend_type) >= 2 * MaxConnections)
                result = CAC_TOOMANY;
            break;
        case BACKEND_TYPE_AUTOVAC:
            if (CountChildren(backend_type) >= 2 * autovacuum_max_workers)
                result = CAC_TOOMANY;
            break;
        ...
    }

so as to subdivide the pool of child-process slots and prevent client
requests from consuming slots meant for background processes.  But on
closer examination that's not really worth the trouble, because this
pool is already considerably bigger than MaxBackends; so even if we
prevented a failure here we could still have bgworker startup failure
later on when it tries to acquire a PGPROC.

Barring objections, I'll apply and back-patch this soon.

                        regards, tom lane

[1] https://www.postgresql.org/message-id/flat/CADCf-WNZk_9680Q0YjfBzuiR0Oe8LzvDs2Ts3_tq6Tv1e8raQQ%40mail.gmail.com


diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index eb9e022..7947c96 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -409,7 +409,7 @@ static void SendNegotiateProtocolVersion(List *unrecognized_protocol_options);
 static void processCancelRequest(Port *port, void *pkt);
 static int	initMasks(fd_set *rmask);
 static void report_fork_failure_to_client(Port *port, int errnum);
-static CAC_state canAcceptConnections(void);
+static CAC_state canAcceptConnections(int backend_type);
 static bool RandomCancelKey(int32 *cancel_key);
 static void signal_child(pid_t pid, int signal);
 static bool SignalSomeChildren(int signal, int targets);
@@ -2401,13 +2401,15 @@ processCancelRequest(Port *port, void *pkt)
  * canAcceptConnections --- check to see if database state allows connections.
  */
 static CAC_state
-canAcceptConnections(void)
+canAcceptConnections(int backend_type)
 {
 	CAC_state	result = CAC_OK;
 
 	/*
 	 * Can't start backends when in startup/shutdown/inconsistent recovery
-	 * state.
+	 * state.  We treat autovac workers the same as user backends for this
+	 * purpose.  However, bgworkers are excluded from this test; we expect
+	 * bgworker_should_start_now() decided whether the DB state allows them.
 	 *
 	 * In state PM_WAIT_BACKUP only superusers can connect (this must be
 	 * allowed so that a superuser can end online backup mode); we return
@@ -2415,7 +2417,8 @@ canAcceptConnections(void)
 	 * that neither CAC_OK nor CAC_WAITBACKUP can safely be returned until we
 	 * have checked for too many children.
 	 */
-	if (pmState != PM_RUN)
+	if (pmState != PM_RUN &&
+		backend_type != BACKEND_TYPE_BGWORKER)
 	{
 		if (pmState == PM_WAIT_BACKUP)
 			result = CAC_WAITBACKUP;	/* allow superusers only */
@@ -2435,9 +2438,9 @@ canAcceptConnections(void)
 	/*
 	 * Don't start too many children.
 	 *
-	 * We allow more connections than we can have backends here because some
+	 * We allow more connections here than we can have backends because some
 	 * might still be authenticating; they might fail auth, or some existing
-	 * backend might exit before the auth cycle is completed. The exact
+	 * backend might exit before the auth cycle is completed.  The exact
 	 * MaxBackends limit is enforced when a new backend tries to join the
 	 * shared-inval backend array.
 	 *
@@ -4114,7 +4117,7 @@ BackendStartup(Port *port)
 	bn->cancel_key = MyCancelKey;
 
 	/* Pass down canAcceptConnections state */
-	port->canAcceptConnections = canAcceptConnections();
+	port->canAcceptConnections = canAcceptConnections(BACKEND_TYPE_NORMAL);
 	bn->dead_end = (port->canAcceptConnections != CAC_OK &&
 					port->canAcceptConnections != CAC_WAITBACKUP);
 
@@ -5486,7 +5489,7 @@ StartAutovacuumWorker(void)
 	 * we have to check to avoid race-condition problems during DB state
 	 * changes.
 	 */
-	if (canAcceptConnections() == CAC_OK)
+	if (canAcceptConnections(BACKEND_TYPE_AUTOVAC) == CAC_OK)
 	{
 		/*
 		 * Compute the cancel key that will be assigned to this session. We
@@ -5863,6 +5866,19 @@ assign_backendlist_entry(RegisteredBgWorker *rw)
 	Backend    *bn;
 
 	/*
+	 * Check that database state allows another connection.  Currently the
+	 * only possible failure is CAC_TOOMANY, so we just log an error message
+	 * based on that rather than checking the error code precisely.
+	 */
+	if (canAcceptConnections(BACKEND_TYPE_BGWORKER) != CAC_OK)
+	{
+		ereport(LOG,
+				(errcode(ERRCODE_CONFIGURATION_LIMIT_EXCEEDED),
+				 errmsg("no slot available for new worker process")));
+		return false;
+	}
+
+	/*
 	 * Compute the cancel key that will be assigned to this session. We
 	 * probably don't need cancel keys for background workers, but we'd better
 	 * have something random in the field to prevent unfriendly people from

Re: Missed check for too-many-children in bgworker spawning

Robert Haas
On Sun, Oct 6, 2019 at 1:17 PM Tom Lane <[hidden email]> wrote:

> Over in [1] we have a report of a postmaster shutdown that seems to
> have occurred because some client logic was overaggressively spawning
> connection requests, causing the postmaster's child-process arrays to
> be temporarily full, and then some parallel query tried to launch a
> new bgworker process.  The postmaster's bgworker-spawning logic lacks
> any check for the arrays being full, so when AssignPostmasterChildSlot
> failed to find a free slot, kaboom!
>
> The attached proposed patch fixes this by making bgworker spawning
> include a canAcceptConnections() test.  That's perhaps overkill, since
> we really just need to check the CountChildren() total; but I judged
> that going through that function and having it decide what to test or
> not test was a better design than duplicating the CountChildren() test
> elsewhere.
>
> I'd first imagined also replacing the one-size-fits-all check
>
>     if (CountChildren(BACKEND_TYPE_ALL) >= MaxLivePostmasterChildren())
>         result = CAC_TOOMANY;
>
> with something like
>
>     switch (backend_type)
>     {
>         case BACKEND_TYPE_NORMAL:
>             if (CountChildren(backend_type) >= 2 * MaxConnections)
>                 result = CAC_TOOMANY;
>             break;
>         case BACKEND_TYPE_AUTOVAC:
>             if (CountChildren(backend_type) >= 2 * autovacuum_max_workers)
>                 result = CAC_TOOMANY;
>             break;
>         ...
>     }
>
> so as to subdivide the pool of child-process slots and prevent client
> requests from consuming slots meant for background processes.  But on
> closer examination that's not really worth the trouble, because this
> pool is already considerably bigger than MaxBackends; so even if we
> prevented a failure here we could still have bgworker startup failure
> later on when it tries to acquire a PGPROC.
>
> Barring objections, I'll apply and back-patch this soon.

I think it used to work this way -- not sure if it was ever committed
this way, but it at least did during development -- and we ripped it
out because somebody (Magnus?) pointed out that if you got close to
the connection limit, you could see parallel queries start failing,
and that would suck. Falling back to non-parallel seems more OK in
that situation than actually failing.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Missed check for too-many-children in bgworker spawning

Tom Lane
Robert Haas <[hidden email]> writes:
> On Sun, Oct 6, 2019 at 1:17 PM Tom Lane <[hidden email]> wrote:
>> The attached proposed patch fixes this by making bgworker spawning
>> include a canAcceptConnections() test.

> I think it used to work this way -- not sure if it was ever committed
> this way, but it at least did during development -- and we ripped it
> out because somebody (Magnus?) pointed out that if you got close to
> the connection limit, you could see parallel queries start failing,
> and that would suck. Falling back to non-parallel seems more OK in
> that situation than actually failing.

I'm not following your point?  Whatever you might think the appropriate
response is, I'm pretty sure "elog(FATAL) out of the postmaster" is not
it.  Moreover, we have to --- and already do, I trust --- deal with
other resource-exhaustion errors in exactly the same code path, notably
fork(2) failure which we simply can't predict or prevent.  Doesn't the
parallel query logic already deal sanely with failure to obtain as many
workers as it wanted?

                        regards, tom lane



Re: Missed check for too-many-children in bgworker spawning

Robert Haas
On Mon, Oct 7, 2019 at 4:03 PM Tom Lane <[hidden email]> wrote:
> I'm not following your point?  Whatever you might think the appropriate
> response is, I'm pretty sure "elog(FATAL) out of the postmaster" is not
> it.  Moreover, we have to --- and already do, I trust --- deal with
> other resource-exhaustion errors in exactly the same code path, notably
> fork(2) failure which we simply can't predict or prevent.  Doesn't the
> parallel query logic already deal sanely with failure to obtain as many
> workers as it wanted?

If we fail to obtain workers because there are not adequate worker
slots available, parallel query deals with that smoothly.  But, once
we have a slot, any further failure will trigger the parallel query to
ERROR out.  For the case where we get a slot but can't start the
worker process, see WaitForParallelWorkersToFinish and/or
WaitForParallelWorkersToAttach and comments therein. Once we're
attached, any error messages thrown by the worker are propagated back
to the master; see HandleParallelMessages and pq_redirect_to_shm_mq.

Now you could argue that the master ought to selectively ignore
certain kinds of errors and just continue on, while rethrowing others,
say based on the errcode(). Such design ideas have been roundly panned
in other contexts, though, so I'm not sure it would be a great idea to
do it here either. But in any case, it's not how the current system
behaves, or was designed to behave.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Missed check for too-many-children in bgworker spawning

Tom Lane
Robert Haas <[hidden email]> writes:
> On Mon, Oct 7, 2019 at 4:03 PM Tom Lane <[hidden email]> wrote:
>> ... Moreover, we have to --- and already do, I trust --- deal with
>> other resource-exhaustion errors in exactly the same code path, notably
>> fork(2) failure which we simply can't predict or prevent.  Doesn't the
>> parallel query logic already deal sanely with failure to obtain as many
>> workers as it wanted?

> If we fail to obtain workers because there are not adequate worker
> slots available, parallel query deals with that smoothly.  But, once
> we have a slot, any further failure will trigger the parallel query to
> ERROR out.

Well, that means we have a not-very-stable system then.

We could improve on matters so far as the postmaster's child-process
arrays are concerned, by defining separate slot "pools" for the different
types of child processes.  But I don't see much point if the code is
not prepared to recover from a fork() failure --- and if it is, that
would a fortiori deal with out-of-child-slots as well.

                        regards, tom lane



Re: Missed check for too-many-children in bgworker spawning

Robert Haas
On Wed, Oct 9, 2019 at 10:21 AM Tom Lane <[hidden email]> wrote:
> Well, that means we have a not-very-stable system then.
>
> We could improve on matters so far as the postmaster's child-process
> arrays are concerned, by defining separate slot "pools" for the different
> types of child processes.  But I don't see much point if the code is
> not prepared to recover from a fork() failure --- and if it is, that
> would a fortiori deal with out-of-child-slots as well.

I would say rather that if fork() is failing on your system, you have
a not very stable system. The fact that parallel query is going to
fail is sad, but not as sad as the fact that connecting to the
database is also going to fail, and that logging into the system to
try to fix the problem may well fail as well. Code that tries to make
parallel query cope with this situation without an error wouldn't
often be tested, so it might be buggy, and it wouldn't necessarily be
a benefit if it did work. I expect many people would rather have the
query fail and free up slots in the system process table than consume
precisely all of them and then try to execute the query at a
slower-than-expected rate.

Anyway, here's some previous discussion on this topic for your consideration:

https://www.postgresql.org/message-id/flat/CAKJS1f_6H2Gh3QyORyRP%2BG3YB3gZiNms_8QdtO5gvitfY5N9ig%40mail.gmail.com

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Missed check for too-many-children in bgworker spawning

Tom Lane
Robert Haas <[hidden email]> writes:
> On Wed, Oct 9, 2019 at 10:21 AM Tom Lane <[hidden email]> wrote:
>> We could improve on matters so far as the postmaster's child-process
>> arrays are concerned, by defining separate slot "pools" for the different
>> types of child processes.  But I don't see much point if the code is
>> not prepared to recover from a fork() failure --- and if it is, that
>> would a fortiori deal with out-of-child-slots as well.

> I would say rather that if fork() is failing on your system, you have
> a not very stable system. The fact that parallel query is going to
> fail is sad, but not as sad as the fact that connecting to the
> database is also going to fail, and that logging into the system to
> try to fix the problem may well fail as well.

True, it's not a situation you especially want to be in.  However,
I've lost count of the number of times that I've heard someone talk
about how their system was overstressed to the point that everything
else was failing, but Postgres kept chugging along.  That's a good
reputation to have and we shouldn't just walk away from it.

> Code that tries to make
> parallel query cope with this situation without an error wouldn't
> often be tested, so it might be buggy, and it wouldn't necessarily be
> a benefit if it did work. I expect many people would rather have the
> query fail and free up slots in the system process table than consume
> precisely all of them and then try to execute the query at a
> slower-than-expected rate.

I find that argument to be utter bunkum.  The parallel query code is
*already* designed to silently degrade performance when its primary
resource limit (shared bgworker slots) is exhausted.  How can it be
all right to do that but not all right to cope with fork failure
similarly?  If we think running up against the kernel limits is a
case that we can roll over and die on, why don't we rip out the
virtual-FD stuff in fd.c?

As for "might be buggy", if we ripped out every part of Postgres
that's under-tested, I'm afraid there might not be much left.
In any case, a sane design for this would make as much as possible
of the code handle "out of shared bgworker slots" just the same as
resource failures later on, so that there wouldn't be that big a gap
in coverage.

Having said all that, I made a patch that causes the postmaster
to reserve separate child-process-array slots for autovac workers
and bgworkers, as per attached, so that excessive connection
requests can't DOS those subsystems.  But I'm not sure that it's
worth the complication; it wouldn't be necessary if the parallel
query launch code were more robust.

                        regards, tom lane


diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 85f15a5..8c66578 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -2441,17 +2441,43 @@ canAcceptConnections(int backend_type)
 	/*
 	 * Don't start too many children.
 	 *
-	 * We allow more connections here than we can have backends because some
-	 * might still be authenticating; they might fail auth, or some existing
-	 * backend might exit before the auth cycle is completed.  The exact
-	 * MaxBackends limit is enforced when a new backend tries to join the
-	 * shared-inval backend array.
+	 * We allow more connections (including walsenders) here than we can have
+	 * backends because some might still be authenticating; they might fail
+	 * auth, or some existing backend might exit before the auth cycle is
+	 * completed.  The exact MaxBackends limit is enforced when a new backend
+	 * tries to join the shared-inval backend array.
 	 *
-	 * The limit here must match the sizes of the per-child-process arrays;
-	 * see comments for MaxLivePostmasterChildren().
+	 * On the other hand, there's no need to allow extra autovac workers or
+	 * bgworkers, because the requesting logic for those should never request
+	 * more processes than the corresponding limit.
+	 *
+	 * This logic must agree with MaxLivePostmasterChildren(), which sets the
+	 * total size of the per-child-process arrays.
+	 *
+	 * Note: currently, we can't see backend_type == BACKEND_TYPE_WALSND here,
+	 * but it seems less weird to include it in the switch than not.  We must
+	 * treat normal backends and walsenders alike for this purpose because one
+	 * type can metamorphose into the other later.
 	 */
-	if (CountChildren(BACKEND_TYPE_ALL) >= MaxLivePostmasterChildren())
-		result = CAC_TOOMANY;
+	switch (backend_type)
+	{
+		case BACKEND_TYPE_NORMAL:
+		case BACKEND_TYPE_WALSND:
+			if (CountChildren(BACKEND_TYPE_NORMAL | BACKEND_TYPE_WALSND) >=
+				2 * (MaxConnections + max_wal_senders))
+				result = CAC_TOOMANY;
+			break;
+		case BACKEND_TYPE_AUTOVAC:
+			if (CountChildren(BACKEND_TYPE_AUTOVAC) >= autovacuum_max_workers)
+				result = CAC_TOOMANY;
+			break;
+		case BACKEND_TYPE_BGWORKER:
+			if (CountChildren(BACKEND_TYPE_BGWORKER) >= max_worker_processes)
+				result = CAC_TOOMANY;
+			break;
+		default:
+			elog(ERROR, "unexpected backend_type: %d", backend_type);
+	}
 
 	return result;
 }
@@ -5334,8 +5360,18 @@ static int
 CountChildren(int target)
 {
 	dlist_iter	iter;
+	bool		check_walsender;
 	int			cnt = 0;
 
+	/*
+	 * In common cases we don't need to distinguish normal from walsender
+	 * children, so we can save a few cycles and shared-memory touches by not
+	 * checking for walsenders if not.
+	 */
+	check_walsender =
+		(target & (BACKEND_TYPE_NORMAL | BACKEND_TYPE_WALSND)) != 0 &&
+		(target & (BACKEND_TYPE_NORMAL | BACKEND_TYPE_WALSND)) != (BACKEND_TYPE_NORMAL | BACKEND_TYPE_WALSND);
+
 	dlist_foreach(iter, &BackendList)
 	{
 		Backend    *bp = dlist_container(Backend, elem, iter.cur);
@@ -5343,11 +5379,7 @@ CountChildren(int target)
 		if (bp->dead_end)
 			continue;
 
-		/*
-		 * Since target == BACKEND_TYPE_ALL is the most common case, we test
-		 * it first and avoid touching shared memory for every child.
-		 */
-		if (target != BACKEND_TYPE_ALL)
+		if (check_walsender)
 		{
 			/*
 			 * Assign bkend_type for any recently announced WAL Sender
 			 */
			if (bp->bkend_type == BACKEND_TYPE_NORMAL &&
				IsPostmasterChildWalSender(bp->child_slot))
				bp->bkend_type = BACKEND_TYPE_WALSND;
-
-			if (!(target & bp->bkend_type))
-				continue;
 		}
 
+		if (!(target & bp->bkend_type))
+			continue;
+
 		cnt++;
 	}
 	return cnt;
@@ -5626,15 +5658,31 @@ CreateOptsFile(int argc, char *argv[], char *fullprogname)
  * (the PMChildFlags array, and if EXEC_BACKEND the ShmemBackendArray).
  * These arrays include regular backends, autovac workers, walsenders
  * and background workers, but not special children nor dead_end children.
- * This allows the arrays to have a fixed maximum size, to wit the same
- * too-many-children limit enforced by canAcceptConnections().  The exact value
- * isn't too critical as long as it's more than MaxBackends.
+ * This allows the arrays to have a fixed maximum size.  The total amount
+ * of space we reserve here mustn't be less than the number of child processes
+ * that canAcceptConnections() will allow.
 */
 int
 MaxLivePostmasterChildren(void)
 {
-	return 2 * (MaxConnections + autovacuum_max_workers + 1 +
-				max_wal_senders + max_worker_processes);
+	int			nchildren = 0;
+
+	/*
+	 * For children that are launched from asynchronous connection requests,
+	 * allow some extra space in the arrays so that we don't deny a connection
+	 * that would have succeeded by the time it is ready to join the shared-
+	 * memory data structures.  The 2X multiplier is arbitrary, and probably
+	 * overkill, but the amount of space wasted is minimal.
+	 */
+	nchildren += 2 * (MaxConnections + max_wal_senders);
+
+	/* Allow for the autovac launcher and its workers */
+	nchildren += 1 + autovacuum_max_workers;
+
+	/* Allow for bgworkers */
+	nchildren += max_worker_processes;
+
+	return nchildren;
 }
 
 /*