checkpointer and other server processes crashing

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
7 messages Options
Reply | Threaded
Open this post in threaded view
|

checkpointer and other server processes crashing

Joe Abbate
Hello,

We've been experiencing PG server process crashes about every other week
on a mostly read only website (except for a single insert/update on page
access).  Typical log entries look like

LOG:  checkpointer process (PID 11200) was terminated by signal 9: Killed
LOG:  terminating any other active server processes

Other than the checkpointer, the server process that was terminated was
either doing a "BEGIN READ WRITE", a "COMMIT" or executing a specific
SELECT.

The database is always recovered within a second and everything else
appears to resume normally.  We're not certain about what triggers this,
but in several instances the web logs show an external bot issuing
multiple HEAD requests on what is logically a single page.  The web
server logs show "broken pipe" and EOF errors, and PG logs sometimes
shows a number of "incomplete startup packet" messages before the
termination message.

This started roughly when the site was migrated to Go, whose web
"processes" run as "goroutines", scheduled by Go's runtime (previously
the site used Python and Gunicorn to serve the pages, which probably
isolated the PG processes from a barrage of nearly simultaneous requests).

As I understand it, the PG server processes doing a SELECT are spawned
as children of the Go process, so presumably if a "goroutine" dies, the
associated PG process would die too, but I'm not sure I grasp why that
would cause a recovery/restart.  I also don't understand where the
checkpointer process fits in the picture (and what would cause it to die).

For the record, this is on PG 11.9 running on Debian.

TIA,

Joe


Reply | Threaded
Open this post in threaded view
|

Re: checkpointer and other server processes crashing

Adrian Klaver-4
On 2/15/21 1:15 PM, Joe Abbate wrote:
> Hello,
>
> We've been experiencing PG server process crashes about every other week
> on a mostly read only website (except for a single insert/update on page
> access).  Typical log entries look like
>
> LOG:  checkpointer process (PID 11200) was terminated by signal 9: Killed
> LOG:  terminating any other active server processes

Have you looked at the system logs to see if the OOM killer is involved?

>
> Other than the checkpointer, the server process that was terminated was
> either doing a "BEGIN READ WRITE", a "COMMIT" or executing a specific
> SELECT.
>
> The database is always recovered within a second and everything else
> appears to resume normally.  We're not certain about what triggers this,
> but in several instances the web logs show an external bot issuing
> multiple HEAD requests on what is logically a single page.  The web
> server logs show "broken pipe" and EOF errors, and PG logs sometimes
> shows a number of "incomplete startup packet" messages before the
> termination message.
>
> This started roughly when the site was migrated to Go, whose web
> "processes" run as "goroutines", scheduled by Go's runtime (previously
> the site used Python and Gunicorn to serve the pages, which probably
> isolated the PG processes from a barrage of nearly simultaneous requests).
>
> As I understand it, the PG server processes doing a SELECT are spawned
> as children of the Go process, so presumably if a "goroutine" dies, the
> associated PG process would die too, but I'm not sure I grasp why that
> would cause a recovery/restart.  I also don't understand where the
> checkpointer process fits in the picture (and what would cause it to die).
>
> For the record, this is on PG 11.9 running on Debian.
>
> TIA,
>
> Joe
>
>


--
Adrian Klaver
[hidden email]


Reply | Threaded
Open this post in threaded view
|

Re: checkpointer and other server processes crashing

Tim Cross
In reply to this post by Joe Abbate

Joe Abbate <[hidden email]> writes:

> Hello,
>
> We've been experiencing PG server process crashes about every other week
> on a mostly read only website (except for a single insert/update on page
> access).  Typical log entries look like
>
> LOG:  checkpointer process (PID 11200) was terminated by signal 9: Killed
> LOG:  terminating any other active server processes
>
> Other than the checkpointer, the server process that was terminated was
> either doing a "BEGIN READ WRITE", a "COMMIT" or executing a specific
> SELECT.
>
> The database is always recovered within a second and everything else
> appears to resume normally.  We're not certain about what triggers this,
> but in several instances the web logs show an external bot issuing
> multiple HEAD requests on what is logically a single page.  The web
> server logs show "broken pipe" and EOF errors, and PG logs sometimes
> shows a number of "incomplete startup packet" messages before the
> termination message.
>
> This started roughly when the site was migrated to Go, whose web
> "processes" run as "goroutines", scheduled by Go's runtime (previously
> the site used Python and Gunicorn to serve the pages, which probably
> isolated the PG processes from a barrage of nearly simultaneous requests).
>
> As I understand it, the PG server processes doing a SELECT are spawned
> as children of the Go process, so presumably if a "goroutine" dies, the
> associated PG process would die too, but I'm not sure I grasp why that
> would cause a recovery/restart.  I also don't understand where the
> checkpointer process fits in the picture (and what would cause it to die).
>

A signal 9 typically means something is explicitly killing processes. I
would check your system logs in case something is killing processes due
to running out of some resource (like memory). If it is a fairly recent
Debian system, journalctl might be useful for checking.

--
Tim Cross


Reply | Threaded
Open this post in threaded view
|

Re: checkpointer and other server processes crashing

Joe Abbate
In reply to this post by Adrian Klaver-4
On 15/2/21 16:29, Adrian Klaver wrote:
> On 2/15/21 1:15 PM, Joe Abbate wrote:
>> We've been experiencing PG server process crashes about every other
>> week on a mostly read only website (except for a single insert/update
>> on page access).  Typical log entries look like
>>
>> LOG:  checkpointer process (PID 11200) was terminated by signal 9: Killed
>> LOG:  terminating any other active server processes
>
> Have you looked at the system logs to see if the OOM killer is involved?

No, I hadn't, but now I see that of the past three crashes, four
postgres processes were OOM victims.  I guess I have to read the OOM
articles I've been skipping (read one a long time ago).  If you have any
specific suggestions, let me know.

Thanks Adrian,

Joe


Reply | Threaded
Open this post in threaded view
|

Re: checkpointer and other server processes crashing

Adrian Klaver-4
On 2/15/21 1:50 PM, Joe Abbate wrote:

> On 15/2/21 16:29, Adrian Klaver wrote:
>> On 2/15/21 1:15 PM, Joe Abbate wrote:
>>> We've been experiencing PG server process crashes about every other
>>> week on a mostly read only website (except for a single insert/update
>>> on page access).  Typical log entries look like
>>>
>>> LOG:  checkpointer process (PID 11200) was terminated by signal 9:
>>> Killed
>>> LOG:  terminating any other active server processes
>>
>> Have you looked at the system logs to see if the OOM killer is involved?
>
> No, I hadn't, but now I see that of the past three crashes, four
> postgres processes were OOM victims.  I guess I have to read the OOM
> articles I've been skipping (read one a long time ago).  If you have any
> specific suggestions, let me know.

There are suggestions here:

https://www.postgresql.org/docs/12/kernel-resources.html#LINUX-MEMORY-OVERCOMMIT

It probably would not hurt to figure why this seemed to happen with the
Python -> Go switch. Even if you can get the OOM not to kick in, the
fact that it was kicking in would indicate you now have memory hungry
processes that did not exist before. The questions being is this
strictly due to the language/framework change, natural growth of data
set being worked with, or something else?

>
> Thanks Adrian,
>
> Joe


--
Adrian Klaver
[hidden email]


Reply | Threaded
Open this post in threaded view
|

Re: checkpointer and other server processes crashing

Peter J. Holzer
On 2021-02-15 15:34:30 -0800, Adrian Klaver wrote:
> It probably would not hurt to figure why this seemed to happen with the
> Python -> Go switch. Even if you can get the OOM not to kick in, the fact
> that it was kicking in would indicate you now have memory hungry processes
> that did not exist before.

Or just more of them. I could imagine that switching from
Python/Gunicorn to Go increased the number of queries that could be
in-flight at the same time.

        hp

--
   _  | Peter J. Holzer    | Story must make more sense than reality.
|_|_) |                    |
| |   | [hidden email]         |    -- Charles Stross, "Creative writing
__/   | http://www.hjp.at/ |       challenge!"

signature.asc (849 bytes) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: checkpointer and other server processes crashing

Peter J. Holzer
In reply to this post by Joe Abbate
On 2021-02-15 16:15:44 -0500, Joe Abbate wrote:
> As I understand it, the PG server processes doing a SELECT are spawned as
> children of the Go process,

There's a misconception here: The server processes are children of the
postmaster process, not of the client program (whether it's written in
Go or any other language). Think about it: The database server and the
client may not even be on the same machine, so there cannot be a
parent/child relationship between client and server processes.

> so presumably if a "goroutine" dies, the associated PG process would
> die too,

A server process will terminate when the connection is closed, but it
may not notice that immediately. Moreover, since Goroutines are handled
by the Go runtime, not the Linux kernel, the connection may not be
closed just because a goroutine terminates (actually, I think the
standard API pretty much guarantuees the existence of a connection
pool).

        hp

--
   _  | Peter J. Holzer    | Story must make more sense than reality.
|_|_) |                    |
| |   | [hidden email]         |    -- Charles Stross, "Creative writing
__/   | http://www.hjp.at/ |       challenge!"

signature.asc (849 bytes) Download Attachment