PMChildFlags array


PMChildFlags array

bhargav kamineni
Hi,

I observed the errors below in the log file:

2019-09-20 02:00:24.504 UTC,,,99779,,5d73303a.185c3,73,,2019-09-07 04:21:14 UTC,,0,FATAL,XX000,"no free slots in PMChildFlags array",,,,,,,,,""
2019-09-20 02:00:24.505 UTC,,,109949,,5d8432b8.1ad7d,1,,2019-09-20 02:00:24 UTC,,0,ERROR,58P01,"could not open shared memory segment ""/PostgreSQL.2520932"": No such file or directory",,,,,,,,,""
2019-09-20 02:00:24.505 UTC,,,109950,,5d8432b8.1ad7e,1,,2019-09-20 02:00:24 UTC,,0,ERROR,58P01,"could not open shared memory segment ""/PostgreSQL.2520932"": No such file or directory",,,,,,,,,""
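
These lines are in PostgreSQL's csvlog format, so they are easier to read once split into named fields. Below is a minimal Python sketch that does this with the standard csv module. The column list is the 23-field csvlog layout as documented for PostgreSQL 10; the helper name parse_csvlog is made up for illustration.

```python
import csv
import io

# csvlog column order for PostgreSQL 10 (23 fields), per the logging docs.
CSVLOG_COLUMNS = [
    "log_time", "user_name", "database_name", "process_id",
    "connection_from", "session_id", "session_line_num", "command_tag",
    "session_start_time", "virtual_transaction_id", "transaction_id",
    "error_severity", "sql_state_code", "message", "detail", "hint",
    "internal_query", "internal_query_pos", "context", "query",
    "query_pos", "location", "application_name",
]


def parse_csvlog(text):
    """Yield one dict per csvlog record (hypothetical helper).

    The csv module's default dialect already turns the doubled quotes
    ("") inside a quoted field back into a single quote character.
    """
    for row in csv.reader(io.StringIO(text)):
        if row:
            yield dict(zip(CSVLOG_COLUMNS, row))


# Two of the lines from the log excerpt above, copied verbatim.
SAMPLE = (
    '2019-09-20 02:00:24.504 UTC,,,99779,,5d73303a.185c3,73,,2019-09-07 04:21:14 UTC,,0,FATAL,XX000,"no free slots in PMChildFlags array",,,,,,,,,""\n'
    '2019-09-20 02:00:24.505 UTC,,,109949,,5d8432b8.1ad7d,1,,2019-09-20 02:00:24 UTC,,0,ERROR,58P01,"could not open shared memory segment ""/PostgreSQL.2520932"": No such file or directory",,,,,,,,,""\n'
)

records = list(parse_csvlog(SAMPLE))
for rec in records:
    print(rec["process_id"], rec["error_severity"],
          rec["sql_state_code"], rec["message"])
```

Read this way, the FATAL line comes from process 99779, whose session started on 2019-09-07, while the two ERROR lines come from processes that started in the same second as the failure.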

 
What could be the possible reasons for this, and is there any chance of database corruption after this event?


Regards,
Bhargav   


Re: PMChildFlags array

bhargav kamineni
Any suggestions on this?



Re: PMChildFlags array

Adrian Klaver-4
In reply to this post by bhargav kamineni
On 10/3/19 3:57 AM, bhargav kamineni wrote:

> Hi,
>
> Observed below errors  in logfile
>
> 2019-09-20 02:00:24.504 UTC,,,99779,,5d73303a.185c3,73,,2019-09-07
> 04:21:14 UTC,,0,FATAL,XX000,"no free slots in PMChildFlags array",,,,,,,,,""
> 2019-09-20 02:00:24.505 UTC,,,109949,,5d8432b8.1ad7d,1,,2019-09-20
> 02:00:24 UTC,,0,ERROR,58P01,"could not open shared memory segment
> ""/PostgreSQL.2520932"": No such file or directory",,,,,,,,,""
> 2019-09-20 02:00:24.505 UTC,,,109950,,5d8432b8.1ad7e,1,,2019-09-20
> 02:00:24 UTC,,0,ERROR,58P01,"could not open shared memory segment
> ""/PostgreSQL.2520932"": No such file or directory",,,,,,,,,""
>

Postgres version?

OS and version?

What was the database doing just before the FATAL line?

> what could be the possible reasons for this to occur and is there any
> chance of database corruption after this event ?

The source (backend/storage/ipc/pmsignal.c) says:

    /* Out of slots ... should never happen, else postmaster.c messed up */
    elog(FATAL, "no free slots in PMChildFlags array");

Someone else will need to comment on what 'messed up' could be.

>
>
> Regards,
> Bhargav
>


--
Adrian Klaver
[hidden email]



Re: PMChildFlags array

bhargav kamineni
> Hi,
>
> Observed below errors  in logfile
>
> 2019-09-20 02:00:24.504 UTC,,,99779,,5d73303a.185c3,73,,2019-09-07
> 04:21:14 UTC,,0,FATAL,XX000,"no free slots in PMChildFlags array",,,,,,,,,""
> 2019-09-20 02:00:24.505 UTC,,,109949,,5d8432b8.1ad7d,1,,2019-09-20
> 02:00:24 UTC,,0,ERROR,58P01,"could not open shared memory segment
> ""/PostgreSQL.2520932"": No such file or directory",,,,,,,,,""
> 2019-09-20 02:00:24.505 UTC,,,109950,,5d8432b8.1ad7e,1,,2019-09-20
> 02:00:24 UTC,,0,ERROR,58P01,"could not open shared memory segment
> ""/PostgreSQL.2520932"": No such file or directory",,,,,,,,,""
>

>Postgres version?

PostgreSQL 10.8

>OS and version?

NAME="Ubuntu"
VERSION="18.04.1 LTS (Bionic Beaver)"

> What was the database doing just before the FATAL line?

Postgres was rejecting a bunch of connections from a user that has a connection limit set. That was the FATAL error that I could see in the log file:
 FATAL,53300,"too many connections for role ""user_app"""

db=\du user_app
                          List of roles
 Role name |          Attributes           |     Member of
-----------+-------------------------------+--------------------
 user_app  | No inheritance               +| {application_role}
           | 100 connections              +|
           | Password valid until infinity |

> what could be the possible reasons for this to occur and is there any
> chance of database corruption after this event ?

> The source (backend/storage/ipc/pmsignal.c) says:
>
>     /* Out of slots ... should never happen, else postmaster.c messed up */
>     elog(FATAL, "no free slots in PMChildFlags array");
>
> Someone else will need to comment on what 'messed up' could be.


Re: PMChildFlags array

Tom Lane-2
bhargav kamineni <[hidden email]> writes:
> Postgres was rejecting a bunch of connections from a user who is having a
> connection limit set. that was the the FATAL error that i could see in log
> file.
>  FATAL,53300,"too many connections for role ""user_app"""

> db=\du user_app
>                            List of roles
>   Role name   |          Attributes           |     Member of
> --------------+-------------------------------+--------------------
>  user_app | No inheritance               +| {application_role}
>               | 100 connections              +|
>               | Password valid until infinity |

Hm, what's the overall max_connections limit?  (I'm wondering
in particular if it's more or less than 100.)

                        regards, tom lane



Re: PMChildFlags array

bhargav kamineni
bhargav kamineni <[hidden email]> writes:
> Postgres was rejecting a bunch of connections from a user who is having a
> connection limit set. that was the the FATAL error that i could see in log
> file.
>  FATAL,53300,"too many connections for role ""user_app"""

> db=\du user_app
>                            List of roles
>   Role name   |          Attributes           |     Member of
> --------------+-------------------------------+--------------------
>  user_app | No inheritance               +| {application_role}
>               | 100 connections              +|
>               | Password valid until infinity |

> Hm, what's the overall max_connections limit?  (I'm wondering
> in particular if it's more or less than 100.)

It's set to 500:
show max_connections ;
 max_connections
-----------------
 500



Re: PMChildFlags array

Tom Lane-2
In reply to this post by bhargav kamineni
bhargav kamineni <[hidden email]> writes:
>> What was the database doing just before the FATAL line?

> Postgres was rejecting a bunch of connections from a user who is having a
> connection limit set. that was the the FATAL error that i could see in log
> file.
>  FATAL,53300,"too many connections for role ""user_app"""

So ... how many is "a bunch"?

Looking at the code, it seems like it'd be possible for a sufficiently
aggressive spawner of incoming connections to reach the
MaxLivePostmasterChildren limit.  While the postmaster would correctly
reject additional connection attempts after that, what it would not do
is ensure that any child slots are left for new parallel worker processes.
So we could hypothesize that the error you're seeing in the log is from
failure to spawn a parallel worker process, due to being out of child
slots.

However, given that max_connections = 500, MaxLivePostmasterChildren()
would be 1000-plus.  This would mean that reaching this condition would
require *at least* 500 concurrent connection-attempts-that-haven't-yet-
been-rejected, maybe well more than that if you didn't have close to
500 legitimately open sessions.  That seems like a lot, enough to suggest
that you've got some pretty serious bug in your client-side logic.
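
To put rough numbers on that, here is a back-of-the-envelope sketch in Python. The formula is how MaxLivePostmasterChildren() reads in postmaster.c for this release series, as far as I can tell, so check it against your own source tree. The values for autovacuum_max_workers (3) and max_worker_processes (8) are the stock defaults and are assumptions, since they were not reported in this thread.

```python
def max_live_postmaster_children(max_connections,
                                 autovacuum_max_workers,
                                 max_worker_processes):
    """Child-slot limit, mirroring the formula believed to be in postmaster.c.

    The factor of 2 leaves room for children that are still exiting
    while their replacements are already being started.
    """
    return 2 * (max_connections + autovacuum_max_workers + 1
                + max_worker_processes)


# max_connections = 500 is from the thread; 3 and 8 are assumed defaults.
limit = max_live_postmaster_children(500, 3, 8)

# Connection attempts beyond max_connections that must be alive at the
# same moment (accepted by the postmaster, not yet rejected by the
# backend) before every child slot is in use.
excess_needed = limit - 500

print("child slot limit:", limit)
print("excess attempts needed with 500 open sessions:", excess_needed)
```

With fewer than 500 sessions legitimately open, the number of simultaneous not-yet-rejected attempts needed is correspondingly larger.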

Anyway, I think it's clearly a bug that canAcceptConnections() thinks the
number of acceptable connections is identical to the number of allowed
child processes; it needs to be less, by the number of background
processes we want to support.  But it seems like a darn hard-to-hit bug,
so I'm not quite sure that that explains your observation.
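
The accounting problem can be illustrated with a toy model. This is a Python sketch and not the postmaster's actual code: the ChildSlots class, its method names, and the reserved_for_background parameter are all invented for illustration, and the reserve of 12 is an arbitrary example value.

```python
class ChildSlots:
    """Toy stand-in for the PMChildFlags array: a fixed pool of slots."""

    def __init__(self, total, reserved_for_background=0):
        self.total = total
        self.reserved = reserved_for_background
        self.used = 0

    def can_accept_connection(self):
        # reserved == 0 models the behaviour described above: incoming
        # connections are allowed to consume every last child slot.
        return self.used < self.total - self.reserved

    def assign(self):
        if self.used >= self.total:
            raise RuntimeError("no free slots in PMChildFlags array")
        self.used += 1


def flood_then_spawn_worker(slots, attempts):
    """Throw `attempts` connections at the pool, then start one worker."""
    rejected = 0
    for _ in range(attempts):
        if slots.can_accept_connection():
            slots.assign()
        else:
            rejected += 1
    try:
        slots.assign()  # the parallel/background worker needs a slot too
        return "worker started", rejected
    except RuntimeError as exc:
        return str(exc), rejected


# No reserve: the flood takes all 1024 slots and the worker cannot start.
print(flood_then_spawn_worker(ChildSlots(1024), 2000))

# Holding some slots back from connections lets the worker start.
print(flood_then_spawn_worker(ChildSlots(1024, reserved_for_background=12), 2000))
```

In the first call the worker fails with the same message seen in the log; in the second, connections are cut off earlier and the worker still finds a free slot.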

                        regards, tom lane



Re: PMChildFlags array

Alvaro Herrera-9
In reply to this post by bhargav kamineni
On 2019-Oct-03, bhargav kamineni wrote:

> bhargav kamineni <[hidden email]> writes:
> > Postgres was rejecting a bunch of connections from a user who is having a
> > connection limit set. that was the the FATAL error that i could see in log
> > file.
> >  FATAL,53300,"too many connections for role ""user_app"""
>
> > db=\du user_app
> >                            List of roles
> >   Role name   |          Attributes           |     Member of
> > --------------+-------------------------------+--------------------
> >  user_app | No inheritance               +| {application_role}
> >               | 100 connections              +|
> >               | Password valid until infinity |

Was the machine overloaded at the time the problem occurred?

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: PMChildFlags array

bhargav kamineni
In reply to this post by Tom Lane-2

Thanks Tom Lane for detailing the issue.
>So ... how many is "a bunch"?
more than 85

>Looking at the code, it seems like it'd be possible for a sufficiently
>aggressive spawner of incoming connections to reach the
>MaxLivePostmasterChildren limit.  While the postmaster would correctly
>reject additional connection attempts after that, what it would not do
>is ensure that any child slots are left for new parallel worker processes.
>So we could hypothesize that the error you're seeing in the log is from
>failure to spawn a parallel worker process, due to being out of child
>slots.

We enabled max_parallel_workers_per_gather = 4 twenty days before we ran into this issue.


>However, given that max_connections = 500, MaxLivePostmasterChildren()
>would be 1000-plus.  This would mean that reaching this condition would
>require *at least* 500 concurrent connection-attempts-that-haven't-yet-
>been-rejected, maybe well more than that if you didn't have close to
>500 legitimately open sessions.  That seems like a lot, enough to suggest
>that you've got some pretty serious bug in your client-side logic.

The errors below were observed in the Postgres log file after the crash:

ERROR:  xlog flush request  is not satisfied
  Seen for a couple of tables. We ran VACUUM FULL on those tables and the error went away after that.
ERROR:  right sibling's left-link doesn't match: block 273660 links to 273500 instead of expected 273661 in index
  Seen while running VACUUM FREEZE on the database. We dropped this index and created a new one.

Observations:

A VACUUM FREEZE ANALYZE job, started from a cron job, is getting stuck on the database side. pg_cancel_backend() and pg_terminate_backend() are not able to terminate the stuck processes; only restarting the database clears them. I think this is happening due to corruption. If that is true, how can I detect it? With pg_dump? Is there any way to overcome this problem?

Would migrating the database to a new instance (pg_basebackup and switching over to the new instance) solve this issue?

>Anyway, I think it's clearly a bug that canAcceptConnections() thinks the
>number of acceptable connections is identical to the number of allowed
>child processes; it needs to be less, by the number of background
>processes we want to support.  But it seems like a darn hard-to-hit bug,
>so I'm not quite sure that that explains your observation.


Re: PMChildFlags array

Tom Lane-2
bhargav kamineni <[hidden email]> writes:
>> So ... how many is "a bunch"?
> more than 85

Hm.  That doesn't seem like it'd be enough to trigger the problem;
you'd need about max_connections excess connections (that are shortly
going to be rejected) to run into this problem, and you said you
had max_connections = 500.  Maybe several different clients were all
doing this at once?

But anyway, AFAICS there is only one code path that could lead to the
reported error message, so one way or another you got there.  I've
pushed a fix for this, which will be in next month's releases.

> below errors observed after crash in postgres logfile :

> ERROR:  xlog flush request  is not satisfied for couple of tables , we have
> initiated the vacuum full on those tables and the error went off after that.
> ERROR:  right sibling's left-link doesn't match: block 273660 links to
> 273500 instead of expected 273661 in index -- observed this error while
> doing vacuum freeze on databsase , we have dropped this index and created a
> new one

That seems unrelated.  A postmaster crash shouldn't have any
data-corruption consequences, since it never touches any
relation files directly.

                        regards, tom lane