[Bus error] huge_pages default value (try) not fall back


Fan Liu

Hi,

 

We have seen a bus error when running PostgreSQL in a container (on K8s). According to current findings, there is a bug in K8s, and they are working on it.

But we would also like to know why the default huge_pages value (try) did not fall back.

 

K8s BUG https://github.com/kubernetes/kubernetes/issues/71233

 

Quick summary of the problem:

When huge pages are not working, initdb produces a bus error.

 

Logs:

2020-02-17 06:33:21,606 INFO: trying to bootstrap a new cluster

2020-02-17 06:33:21,610 INFO: pg_ctl args: ('-o', '--auth-host=md5 --auth-local=trust --encoding=UTF8 --locale=en_US.UTF-8 --data-checksums --username=postgres --pwfile=/tmp/tmpcdHEH3'), {}

The files belonging to this database system will be owned by user "postgres".

This user must also own the server process.

 

The database cluster will be initialized with locale "en_US.UTF-8".

The default text search configuration will be set to "english".

 

Data page checksums are enabled.

 

fixing permissions on existing directory /var/lib/postgresql/data/pgdata ... ok

creating subdirectories ... ok

sh: line 1:   100 Bus error               (core dumped) "/usr/lib/postgresql10/bin/postgres" --boot -x0 -F -c max_connections=100 -c shared_buffers=1000 -c dynamic_shared_memory_type=none < "/dev/null" > "/dev/null" 2>&1

sh: line 1:   102 Bus error               (core dumped) "/usr/lib/postgresql10/bin/postgres" --boot -x0 -F -c max_connections=50 -c shared_buffers=500 -c dynamic_shared_memory_type=none < "/dev/null" > "/dev/null" 2>&1

sh: line 1:   104 Bus error               (core dumped) "/usr/lib/postgresql10/bin/postgres" --boot -x0 -F -c max_connections=40 -c shared_buffers=400 -c dynamic_shared_memory_type=none < "/dev/null" > "/dev/null" 2>&1

sh: line 1:   106 Bus error               (core dumped) "/usr/lib/postgresql10/bin/postgres" --boot -x0 -F -c max_connections=30 -c shared_buffers=300 -c dynamic_shared_memory_type=none < "/dev/null" > "/dev/null" 2>&1

sh: line 1:   108 Bus error               (core dumped) "/usr/lib/postgresql10/bin/postgres" --boot -x0 -F -c max_connections=20 -c shared_buffers=200 -c dynamic_shared_memory_type=none < "/dev/null" > "/dev/null" 2>&1

selecting default max_connections ... 20

sh: line 1:   110 Bus error               (core dumped) "/usr/lib/postgresql10/bin/postgres" --boot -x0 -F -c max_connections=20 -c shared_buffers=16384 -c dynamic_shared_memory_type=none < "/dev/null" > "/dev/null" 2>&1

sh: line 1:   112 Bus error               (core dumped) "/usr/lib/postgresql10/bin/postgres" --boot -x0 -F -c max_connections=20 -c shared_buffers=8192 -c dynamic_shared_memory_type=none < "/dev/null" > "/dev/null" 2>&1

sh: line 1:   114 Bus error               (core dumped) "/usr/lib/postgresql10/bin/postgres" --boot -x0 -F -c max_connections=20 -c shared_buffers=4096 -c dynamic_shared_memory_type=none < "/dev/null" > "/dev/null" 2>&1

sh: line 1:   116 Bus error               (core dumped) "/usr/lib/postgresql10/bin/postgres" --boot -x0 -F -c max_connections=20 -c shared_buffers=3584 -c dynamic_shared_memory_type=none < "/dev/null" > "/dev/null" 2>&1

sh: line 1:   118 Bus error               (core dumped) "/usr/lib/postgresql10/bin/postgres" --boot -x0 -F -c max_connections=20 -c shared_buffers=3072 -c dynamic_shared_memory_type=none < "/dev/null" > "/dev/null" 2>&1

sh: line 1:   120 Bus error               (core dumped) "/usr/lib/postgresql10/bin/postgres" --boot -x0 -F -c max_connections=20 -c shared_buffers=2560 -c dynamic_shared_memory_type=none < "/dev/null" > "/dev/null" 2>&1

sh: line 1:   122 Bus error               (core dumped) "/usr/lib/postgresql10/bin/postgres" --boot -x0 -F -c max_connections=20 -c shared_buffers=2048 -c dynamic_shared_memory_type=none < "/dev/null" > "/dev/null" 2>&1

sh: line 1:   124 Bus error               (core dumped) "/usr/lib/postgresql10/bin/postgres" --boot -x0 -F -c max_connections=20 -c shared_buffers=1536 -c dynamic_shared_memory_type=none < "/dev/null" > "/dev/null" 2>&1

sh: line 1:   126 Bus error               (core dumped) "/usr/lib/postgresql10/bin/postgres" --boot -x0 -F -c max_connections=20 -c shared_buffers=1000 -c dynamic_shared_memory_type=none < "/dev/null" > "/dev/null" 2>&1

sh: line 1:   128 Bus error               (core dumped) "/usr/lib/postgresql10/bin/postgres" --boot -x0 -F -c max_connections=20 -c shared_buffers=900 -c dynamic_shared_memory_type=none < "/dev/null" > "/dev/null" 2>&1

sh: line 1:   130 Bus error               (core dumped) "/usr/lib/postgresql10/bin/postgres" --boot -x0 -F -c max_connections=20 -c shared_buffers=800 -c dynamic_shared_memory_type=none < "/dev/null" > "/dev/null" 2>&1

sh: line 1:   132 Bus error               (core dumped) "/usr/lib/postgresql10/bin/postgres" --boot -x0 -F -c max_connections=20 -c shared_buffers=700 -c dynamic_shared_memory_type=none < "/dev/null" > "/dev/null" 2>&1

sh: line 1:   134 Bus error               (core dumped) "/usr/lib/postgresql10/bin/postgres" --boot -x0 -F -c max_connections=20 -c shared_buffers=600 -c dynamic_shared_memory_type=none < "/dev/null" > "/dev/null" 2>&1

sh: line 1:   136 Bus error               (core dumped) "/usr/lib/postgresql10/bin/postgres" --boot -x0 -F -c max_connections=20 -c shared_buffers=500 -c dynamic_shared_memory_type=none < "/dev/null" > "/dev/null" 2>&1

sh: line 1:   138 Bus error               (core dumped) "/usr/lib/postgresql10/bin/postgres" --boot -x0 -F -c max_connections=20 -c shared_buffers=400 -c dynamic_shared_memory_type=none < "/dev/null" > "/dev/null" 2>&1

sh: line 1:   140 Bus error               (core dumped) "/usr/lib/postgresql10/bin/postgres" --boot -x0 -F -c max_connections=20 -c shared_buffers=300 -c dynamic_shared_memory_type=none < "/dev/null" > "/dev/null" 2>&1

sh: line 1:   142 Bus error               (core dumped) "/usr/lib/postgresql10/bin/postgres" --boot -x0 -F -c max_connections=20 -c shared_buffers=200 -c dynamic_shared_memory_type=none < "/dev/null" > "/dev/null" 2>&1

sh: line 1:   144 Bus error               (core dumped) "/usr/lib/postgresql10/bin/postgres" --boot -x0 -F -c max_connections=20 -c shared_buffers=100 -c dynamic_shared_memory_type=none < "/dev/null" > "/dev/null" 2>&1

sh: line 1:   146 Bus error               (core dumped) "/usr/lib/postgresql10/bin/postgres" --boot -x0 -F -c max_connections=20 -c shared_buffers=50 -c dynamic_shared_memory_type=none < "/dev/null" > "/dev/null" 2>&1

selecting default shared_buffers ... 400kB

selecting default timezone ... UTC

selecting dynamic shared memory implementation ... posix

creating configuration files ... ok

child process was terminated by signal 7: Bus error

initdb: removing contents of data directory "/var/lib/postgresql/data/pgdata"

pg_ctl: database system initialization failed

running bootstrap script ... 2020-02-17 06:33:22,254 INFO: removing initialize key after failed attempt to bootstrap the cluster

------------ end of log -------------

 

Hugepage:

Output from "kubectl describe node"
========================
Capacity:
cpu: 56
ephemeral-storage: 365912640Ki
hugepages-1Gi: 16Gi
hugepages-2Mi: 0
memory: 131922340Ki
pods: 110
Allocatable:
cpu: 55900m
ephemeral-storage: 337225088466
hugepages-1Gi: 16Gi
hugepages-2Mi: 0
memory: 114792724Ki
pods: 110

========================
Grub command line:
GRUB_CMDLINE_LINUX_DEFAULT="console=tty0 console=ttyS0,115200 no_timer_check nofb nomodeset vga=normal default_hugepagesz=1G hugepagesz=1G hugepages=16 hugepagesz=2M hugepages=0"

========================
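
As a cross-check of the figures above, here is a minimal Python sketch that parses the hugepagesz=/hugepages= pairs from the kernel command line and computes the reserved pool sizes. The names parse_hugepage_args and pool_bytes are hypothetical, and a hugepages= argument that appears before any hugepagesz= (which would apply to the default size) is simply ignored in this sketch.

```python
_UNITS = {"K": 1024, "M": 1024**2, "G": 1024**3}


def parse_hugepage_args(cmdline):
    """Map each hugepagesz= value to the hugepages= count that follows it."""
    pools = {}
    current_size = None  # size named by the most recent hugepagesz= argument
    for arg in cmdline.split():
        key, _, value = arg.partition("=")
        if key == "hugepagesz":
            current_size = value
            pools.setdefault(current_size, 0)
        elif key == "hugepages" and current_size is not None:
            pools[current_size] = int(value)
    return pools


def pool_bytes(size, count):
    """Convert a size such as '1G' or '2M' plus a page count into bytes."""
    return int(size[:-1]) * _UNITS[size[-1].upper()] * count


cmdline = "default_hugepagesz=1G hugepagesz=1G hugepages=16 hugepagesz=2M hugepages=0"
pools = parse_hugepage_args(cmdline)
print(pools)  # {'1G': 16, '2M': 0}
```

For the command line shown, the 1G pool works out to 16 GiB and the 2M pool to 0, which agrees with the hugepages-1Gi: 16Gi and hugepages-2Mi: 0 values reported by kubectl describe node.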

 

 

BRs,

Fan Liu

ADP Document Database PG

 


Re: [Bus error] huge_pages default value (try) not fall back

Dmitry Dolgov
> On Tue, Feb 18, 2020 at 07:52:50AM +0000, Fan Liu wrote:
> Hi,
>
> We have seen a bus error when running postgresql in container (where on K8s). According current finding, there is bug on K8s, they are working on it.
> But we also want to know why huge_pages default value(try) didn't fall back.
>
> K8s BUG https://github.com/kubernetes/kubernetes/issues/71233
>
> Problem quick summary:
> When hugepage not working, initdb produce bus error.

Thanks for reporting!

This one is fun. If I understand everything correctly, Postgres will
fall back to regular pages if it fails to allocate huge ones. But in this
case the kernel actually allocates everything without problems (there are
some huge pages available on the node, after all) and delivers SIGBUS only
when the first page fault happens within this cgroup; see the docs [1]:

    The HugeTLB controller allows to limit the HugeTLB usage per control
    group and enforces the controller limit during page fault. Since
    HugeTLB doesn't support page reclaim, enforcing the limit at page
    fault time implies that, the application will get SIGBUS signal if
    it tries to access HugeTLB pages beyond its limit. This requires the
    application to know beforehand how much HugeTLB pages it would
    require for its use.
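
The difference between allocation time and page-fault time can be illustrated with a small Python sketch of a "try, then fall back" policy. This is not PostgreSQL's actual code; the function name map_shared_memory is hypothetical, and MAP_HUGETLB is defined by hand with the Linux value used on x86-64 and arm64 (an assumption for other platforms).

```python
import mmap

# Linux value of MAP_HUGETLB on x86-64 and arm64 (assumption elsewhere);
# defined here because not every Python version exports it from mmap.
MAP_HUGETLB = 0x40000


def map_shared_memory(size, try_huge=True):
    """Return (mapping, used_huge_pages) using a 'try, then fall back' policy."""
    base_flags = mmap.MAP_SHARED | mmap.MAP_ANONYMOUS
    if try_huge:
        try:
            # Success here only means the kernel accepted the mapping.
            # No page has been touched yet, so a cgroup HugeTLB limit
            # enforced at page-fault time has not been checked.
            return mmap.mmap(-1, size, flags=base_flags | MAP_HUGETLB), True
        except OSError:
            # An mmap failure (e.g. ENOMEM) is the only point where this
            # policy can notice a problem and fall back.
            pass
    return mmap.mmap(-1, size, flags=base_flags), False
```

If the huge-page mapping succeeds, the first write into it (for example region[0] = 1) is what triggers the page fault. Inside a cgroup whose HugeTLB limit is exceeded, that write is where SIGBUS would arrive, long after the fallback decision has been made.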

Unfortunately I'm not sure what would be the best solution in this
situation.

[1]: https://www.kernel.org/doc/html/latest/_sources/admin-guide/cgroup-v1/hugetlb.rst.txt
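
One conceivable pre-check, offered only as an untested sketch and not as something PostgreSQL does: before relying on huge pages, compare the cgroup's HugeTLB limit with the amount needed. The control file names hugetlb.<hugepagesize>.limit_in_bytes and hugetlb.<hugepagesize>.usage_in_bytes come from the cgroup v1 documentation [1]; the mount point in the usage note and the helper names are assumptions.

```python
import os


def hugetlb_headroom(cgroup_dir, page_size_label="1GB"):
    """Return limit minus usage in bytes for one huge page size, or None if unknown."""

    def read_int(name):
        path = os.path.join(cgroup_dir, f"hugetlb.{page_size_label}.{name}")
        try:
            with open(path) as f:
                return int(f.read().strip())
        except (OSError, ValueError):
            return None  # file missing, unreadable, or not a number

    limit = read_int("limit_in_bytes")
    usage = read_int("usage_in_bytes")
    if limit is None or usage is None:
        return None
    return max(limit - usage, 0)


def huge_pages_fit(cgroup_dir, needed_bytes, page_size_label="1GB"):
    """True only if the cgroup is known to allow needed_bytes of huge pages."""
    headroom = hugetlb_headroom(cgroup_dir, page_size_label)
    # Unknown headroom is treated as "do not rely on huge pages" in this sketch.
    return headroom is not None and needed_bytes <= headroom
```

Inside a container this might be called as huge_pages_fit("/sys/fs/cgroup/hugetlb", needed_bytes), assuming the cgroup v1 hugetlb hierarchy is mounted at that path.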



RE: [Bus error] huge_pages default value (try) not fall back

Fan Liu

-----Original Message-----
From: Dmitry Dolgov <[hidden email]>
Sent: February 18, 2020 17:33
To: Fan Liu <[hidden email]>
Cc: [hidden email]
Subject: Re: [Bus error] huge_pages default value (try) not fall back

> On Tue, Feb 18, 2020 at 07:52:50AM +0000, Fan Liu wrote:
> Hi,
>
> We have seen a bus error when running postgresql in container (where on K8s). According current finding, there is bug on K8s, they are working on it.
> But we also want to know why huge_pages default value(try) didn't fall back.
>
> K8s BUG https://github.com/kubernetes/kubernetes/issues/71233
>
> Problem quick summary:
> When hugepage not working, initdb produce bus error.

Thanks for reporting!

This one is fun. If I understand everything correctly, Postgres will fall back to non huge pages if it fails to allocate some. But in this case kernel actually allocates everything without problems (there are some available huge pages on a node after all), and return SIGBUS only when a first page fault within this cgroup happened, see the docs [1]:

    The HugeTLB controller allows to limit the HugeTLB usage per control
    group and enforces the controller limit during page fault. Since
    HugeTLB doesn't support page reclaim, enforcing the limit at page
    fault time implies that, the application will get SIGBUS signal if
    it tries to access HugeTLB pages beyond its limit. This requires the
    application to know beforehand how much HugeTLB pages it would
    require for its use.

Unfortunately I'm not sure what would be the best solution in this situation.

[1]: https://www.kernel.org/doc/html/latest/_sources/admin-guide/cgroup-v1/hugetlb.rst.txt


---------------------------------------------------------

Hi Dmitry,
Thank you for the explanation.

In the K8s bug (https://github.com/kubernetes/kubernetes/issues/71233), someone proposed a workaround:

"Modify the docker image to be able to set huge_pages = off in /usr/share/postgresql/postgresql.conf.sample before initdb was ran (this is what I did)."

I am working on this workaround but have not really tested it yet. Do you think it could avoid this issue? Do you see any side effects of this workaround?

BRs,
Fan Liu
ADP Document Database PG



Re: [Bus error] huge_pages default value (try) not fall back

Dmitry Dolgov
> On Tue, Feb 18, 2020 at 12:31:51PM +0000, Fan Liu wrote:
>
> "Modify the docker image to be able to set huge_pages = off in /usr/share/postgresql/postgresql.conf.sample before initdb was ran (this is what I did)."
>
> I am working on this workaround , but has not really tested yet.  So, do you think this could avoid this issue?  Or do you see any side impact for this workaround?

If you don't necessarily need huge pages, then yes, I guess it
should work. If initdb tries to read the config from some other
location, you can always point it to whatever you need via the -L option.
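
The edit described in the quoted workaround could be scripted roughly as follows. This is a minimal sketch, assuming the sample file contains a (possibly commented-out) huge_pages line; the helper name disable_huge_pages is hypothetical, and the path in the usage note is the one given in the workaround.

```python
import re

# Matches a huge_pages line in a postgresql.conf-style file, commented out or not.
_HUGE_PAGES_LINE = re.compile(r"^[ \t]*#?[ \t]*huge_pages[ \t]*=.*$", re.MULTILINE)


def disable_huge_pages(conf_text):
    """Return conf_text with huge_pages forced to off (hypothetical helper)."""
    new_text, count = _HUGE_PAGES_LINE.subn("huge_pages = off", conf_text)
    if count == 0:
        # No existing line found: append the setting at the end.
        if new_text and not new_text.endswith("\n"):
            new_text += "\n"
        new_text += "huge_pages = off\n"
    return new_text
```

In an image build step, one would read /usr/share/postgresql/postgresql.conf.sample, pass its contents through disable_huge_pages, and write the result back before initdb runs. If the modified file lives somewhere else, its directory can be given to initdb with -L.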



RE: [Bus error] huge_pages default value (try) not fall back

Fan Liu

-----Original Message-----
From: Fan Liu
Sent: February 21, 2020 10:15
To: Dmitry Dolgov <[hidden email]>; [hidden email]
Subject: RE: [Bus error] huge_pages default value (try) not fall back


-----Original Message-----

>From: Dmitry Dolgov <[hidden email]>
>Sent: February 19, 2020 17:36
>To: Fan Liu <[hidden email]>
>Cc: [hidden email]
>Subject: Re: [Bus error] huge_pages default value (try) not fall back
>
>> On Tue, Feb 18, 2020 at 12:31:51PM +0000, Fan Liu wrote:
>>
>> "Modify the docker image to be able to set huge_pages = off in /usr/share/postgresql/postgresql.conf.sample before initdb was ran (this is what I did)."
>>
>> I am working on this workaround , but has not really tested yet.  So, do you think this could avoid this issue?  Or do you see any side impact for this workaround?
>
>If you don't necessarily need to use huge pages, then yes, I guess it should work. In case if initdb tries to read config from some other location, you can always point it to whatever you need via -L option.
-----------------------------------

Hi Dmitry,

I have tried the workaround. The result is that there is still a bus error, but PostgreSQL did come up.

I don't quite understand why this happens.

I have attached a core dump file; could you take a look?

$ file core
core: ELF 64-bit LSB core file x86-64, version 1 (SYSV), SVR4-style, from '/usr/lib/postgresql10/bin/postgres --boot -x0 -F -c max_connections=20 -c share', real uid: 26, effective uid: 26, real gid: 26, effective gid: 26, execfn: '/usr/lib/postgresql10/bin/postgres', platform: 'x86_64'

BRs,
Fan Liu

Re: [Bus error] huge_pages default value (try) not fall back

Dmitry Dolgov
> On Fri, Feb 21, 2020, 3:19 AM Fan Liu <[hidden email]> wrote: 
>
> Attached core dump file, could you take a look?

I can take a look on Monday. In the meantime, if you have issues at the initdb stage, you can try running it with the -d option and check the debugging output, which should be helpful.

RE: [Bus error] huge_pages default value (try) not fall back

Fan Liu

>From: Dmitry Dolgov <[hidden email]>
>Sent: February 22, 2020 2:58
>To: Fan Liu <[hidden email]>
>Cc: PostgreSQL mailing lists <[hidden email]>
>Subject: Re: [Bus error] huge_pages default value (try) not fall back
>
>> On Fri, Feb 21, 2020, 3:19 AM Fan Liu <[hidden email]> wrote:
>>
>> Attached core dump file, could you take a look?
>
>I can take a look on Monday, but at the same time if you have issues on initdb stage you try to run it with -d option and check out debugging output, should be helpful.

 

 

Hi Dmitry,

 

I appreciate your support.

I will work on a new package and ask my customer to validate it and collect logs.

 

BRs,

Fan Liu

 


Re: [Bus error] huge_pages default value (try) not fall back

Dmitry Dolgov
>> On Fri, Feb 21, 2020, 3:19 AM Fan Liu <[hidden email]> wrote:
>>
>> Attached core dump file, could you take a look?
>
> I can take a look on Monday, but at the same time if you have issues
> on initdb stage you try to run it with -d option and check out
> debugging output, should be helpful.

Unfortunately, I wasn't able to get a meaningful stack trace from this
dump, most likely due to a version mismatch (I hoped that the latest
pgdg package of version 10 for bionic would fit). But you can also try to
post a stack trace yourself, following these instructions [1].

[1]: https://wiki.postgresql.org/wiki/Generating_a_stack_trace_of_a_PostgreSQL_backend
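
For reference, a backtrace can usually be extracted from a core file non-interactively with gdb. The sketch below only builds the command line; the binary path is the one shown in the `file core` output earlier in the thread, and the helper name gdb_backtrace_command is hypothetical. The resulting backtrace is only meaningful when the binary and its debug symbols match the build that produced the core.

```python
import shlex


def gdb_backtrace_command(postgres_binary, core_file):
    """Build a non-interactive gdb invocation that prints a backtrace from a core file."""
    return ["gdb", "--batch", "-ex", "bt", postgres_binary, core_file]


cmd = gdb_backtrace_command("/usr/lib/postgresql10/bin/postgres", "core")
print(shlex.join(cmd))
```

The list form can be passed directly to subprocess.run on the machine where the core was produced, so that the same postgres binary is used.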



RE: [Bus error] huge_pages default value (try) not fall back

Fan Liu
I created a new package for my customer with -d for initdb, but they said they can accept the current workaround.
As I don't have a node with huge pages enabled, I will not be able to collect the logs.

I'd like to thank you again for the support and troubleshooting. I think we can close this ticket.

BRs,
Fan Liu
ADP Document Database PG

-----Original Message-----
From: Dmitry Dolgov <[hidden email]>
Sent: February 24, 2020 18:39
To: Fan Liu <[hidden email]>
Cc: PostgreSQL mailing lists <[hidden email]>
Subject: Re: [Bus error] huge_pages default value (try) not fall back

>> On Fri, Feb 21, 2020, 3:19 AM Fan Liu <[hidden email]> wrote:
>>
>> Attached core dump file, could you take a look?
>
> I can take a look on Monday, but at the same time if you have issues
> on initdb stage you try to run it with -d option and check out
> debugging output, should be helpful.

Unfortunately, I wasn't able to get a meaningful stack trace from this dump, most likely due to different versions (I hoped that the latest pgdg package with 10 for bionic would fit). But you can also try to post it following this instructions [1].

[1]: https://wiki.postgresql.org/wiki/Generating_a_stack_trace_of_a_PostgreSQL_backend