Why is infinite_recurse test suddenly failing?


Tom Lane
In the past week, four different buildfarm members have shown
non-reproducing segfaults in the "select infinite_recurse()"
test case, rather than the expected detection of stack overrun
before we get to the point of a segfault.

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=bonito&dt=2019-05-01%2023%3A05%3A36
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=takin&dt=2019-05-01%2008%3A16%3A48
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=buri&dt=2019-04-27%2023%3A54%3A46
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=demoiselle&dt=2019-04-27%2014%3A55%3A52

They're all on HEAD, and they all look like

2019-05-01 23:11:00.145 UTC [13933:65] LOG:  server process (PID 17161) was terminated by signal 11: Segmentation fault
2019-05-01 23:11:00.145 UTC [13933:66] DETAIL:  Failed process was running: select infinite_recurse();

I scraped the buildfarm database and verified that there are no similar
failures for at least three months back; nor, offhand, can I remember ever
having seen this test fail in many years.  So it seems we broke something
recently.  No idea what though.

(Another possibility, seeing that these are all members of Mark's PPC64
flotilla, is that there's some common misconfiguration --- but it's hard
to credit that such a problem would only affect HEAD not the back
branches.)

Thoughts?

                        regards, tom lane



Re: Why is infinite_recurse test suddenly failing?

Andres Freund
Hi,

On 2019-05-02 11:02:03 -0400, Tom Lane wrote:
> In the past week, four different buildfarm members have shown
> non-reproducing segfaults in the "select infinite_recurse()"
> test case, rather than the expected detection of stack overrun
> before we get to the point of a segfault.

I was just staring at bonito's failure in confusion.


> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=bonito&dt=2019-05-01%2023%3A05%3A36
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=takin&dt=2019-05-01%2008%3A16%3A48
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=buri&dt=2019-04-27%2023%3A54%3A46
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=demoiselle&dt=2019-04-27%2014%3A55%3A52
>
> They're all on HEAD, and they all look like
>
> 2019-05-01 23:11:00.145 UTC [13933:65] LOG:  server process (PID 17161) was terminated by signal 11: Segmentation fault
> 2019-05-01 23:11:00.145 UTC [13933:66] DETAIL:  Failed process was running: select infinite_recurse();
>
> I scraped the buildfarm database and verified that there are no similar
> failures for at least three months back; nor, offhand, can I remember ever
> having seen this test fail in many years.  So it seems we broke something
> recently.  No idea what though.

I can't recall any recent changes to the relevant areas of code.


> (Another possibility, seeing that these are all members of Mark's PPC64
> flotilla, is that there's some common misconfiguration --- but it's hard
> to credit that such a problem would only affect HEAD not the back
> branches.)

Hm, I just noticed:
                                       'HEAD' => [
                                                   'force_parallel_mode = regress'
                                                 ]

on all those animals. So it's not necessarily the case that HEAD and
backbranch runs are behaving all that identical.  Note that isn't a
recent config change, so it's not an explanation as to why they started
to fail only recently.

Greetings,

Andres Freund



Re: Why is infinite_recurse test suddenly failing?

Tom Lane
Andres Freund <[hidden email]> writes:
> Hm, I just noticed:
>                                        'HEAD' => [
>                                                    'force_parallel_mode = regress'
>                                                  ]

Oooh, I didn't see that.

> on all those animals. So it's not necessarily the case that HEAD and
> backbranch runs are behaving all that identical.  Note that isn't a
> recent config change, so it's not an explanation as to why they started
> to fail only recently.

No, but it does point at another area of the code in which a relevant
change could've occurred.

While we're looking at this --- Mark, if you could install gdb
on your buildfarm hosts, that would be really handy.  I think that's
the only extra thing the buildfarm script needs to extract stack
traces from core dumps.  We'd likely already know where the problem
is if we had a stack trace ...

                        regards, tom lane



Re: Why is infinite_recurse test suddenly failing?

Mark Wong
On Thu, May 02, 2019 at 11:45:34AM -0400, Tom Lane wrote:

> Andres Freund <[hidden email]> writes:
> > Hm, I just noticed:
> >                                        'HEAD' => [
> >                                                    'force_parallel_mode = regress'
> >                                                  ]
>
> Oooh, I didn't see that.
>
> > on all those animals. So it's not necessarily the case that HEAD and
> > backbranch runs are behaving all that identical.  Note that isn't a
> > recent config change, so it's not an explanation as to why they started
> > to fail only recently.
>
> No, but it does point at another area of the code in which a relevant
> change could've occurred.
>
> While we're looking at this --- Mark, if you could install gdb
> on your buildfarm hosts, that would be really handy.  I think that's
> the only extra thing the buildfarm script needs to extract stack
> traces from core dumps.  We'd likely already know where the problem
> is if we had a stack trace ...

Ok, I think I have gdb installed now...

Regards,
Mark

--
Mark Wong
2ndQuadrant - PostgreSQL Solutions for the Enterprise
https://www.2ndQuadrant.com/



Re: Why is infinite_recurse test suddenly failing?

Tom Lane
Mark Wong <[hidden email]> writes:
> On Thu, May 02, 2019 at 11:45:34AM -0400, Tom Lane wrote:
>> While we're looking at this --- Mark, if you could install gdb
>> on your buildfarm hosts, that would be really handy.  I think that's
>> the only extra thing the buildfarm script needs to extract stack
>> traces from core dumps.  We'd likely already know where the problem
>> is if we had a stack trace ...

> Ok, I think I have gdb installed now...

Thanks!

                        regards, tom lane



Re: Why is infinite_recurse test suddenly failing?

Andres Freund
In reply to this post by Mark Wong
Hi,

On 2019-05-03 10:08:59 -0700, Mark Wong wrote:
> Ok, I think I have gdb installed now...

Thanks! Any chance you could turn on force_parallel_mode for the other
branches it applies to too? Makes it easier to figure out whether
breakage is related to that, or independent.

Greetings,

Andres Freund



Re: Why is infinite_recurse test suddenly failing?

Tom Lane
In reply to this post by Tom Lane
We just got another one of these, on yet another member of Mark's
ppc64 armada:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=shoveler&dt=2019-05-10%2014%3A04%3A34

Now we have a stack trace (thanks Mark!), but it is pretty unsurprising:

Core was generated by `postgres: debian regression [local] SELECT                                     '.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  sysmalloc (nb=8208, av=0x3fff916e0d28 <main_arena>) at malloc.c:2748
2748 malloc.c: No such file or directory.
#0  sysmalloc (nb=8208, av=0x3fff916e0d28 <main_arena>) at malloc.c:2748
#1  0x00003fff915bedc8 in _int_malloc (av=0x3fff916e0d28 <main_arena>, bytes=8192) at malloc.c:3865
#2  0x00003fff915c1064 in __GI___libc_malloc (bytes=8192) at malloc.c:2928
#3  0x00000000106acfd8 in AllocSetContextCreateInternal (parent=0x1000babdad0, name=0x1085508c "inline_function", minContextSize=<optimized out>, initBlockSize=<optimized out>, maxBlockSize=8388608) at aset.c:477
#4  0x00000000103d5e00 in inline_function (funcid=20170, result_type=<optimized out>, result_collid=<optimized out>, input_collid=<optimized out>, funcvariadic=<optimized out>, func_tuple=<optimized out>, context=0x3fffe3da15d0, args=<optimized out>) at clauses.c:4459
#5  simplify_function (funcid=<optimized out>, result_type=<optimized out>, result_typmod=<optimized out>, result_collid=<optimized out>, input_collid=<optimized out>, args_p=<optimized out>, funcvariadic=<optimized out>, process_args=<optimized out>, allow_non_const=<optimized out>, context=<optimized out>) at clauses.c:4040
#6  0x00000000103d2e74 in eval_const_expressions_mutator (node=0x1000babe968, context=0x3fffe3da15d0) at clauses.c:2474
#7  0x00000000103511bc in expression_tree_mutator (node=<optimized out>, mutator=0x103d2b10 <eval_const_expressions_mutator>, context=0x3fffe3da15d0) at nodeFuncs.c:2893
#8  0x00000000103d2cbc in eval_const_expressions_mutator (node=0x1000babe9c0, context=0x3fffe3da15d0) at clauses.c:3606
#9  0x00000000103510c8 in expression_tree_mutator (node=<optimized out>, mutator=<optimized out>, context=<optimized out>) at nodeFuncs.c:2942
#10 0x00000000103d2cbc in eval_const_expressions_mutator (node=0x1000babea40, context=0x3fffe3da15d0) at clauses.c:3606
#11 0x00000000103d2ae8 in eval_const_expressions (root=<optimized out>, node=<optimized out>) at clauses.c:2266
#12 0x00000000103b6264 in preprocess_expression (root=0x1000babee18, expr=<optimized out>, kind=1) at planner.c:1087
#13 0x00000000103b496c in subquery_planner (glob=<optimized out>, parse=<optimized out>, parent_root=<optimized out>, hasRecursion=<optimized out>, tuple_fraction=<optimized out>) at planner.c:769
#14 0x00000000103b3c58 in standard_planner (parse=<optimized out>, cursorOptions=<optimized out>, boundParams=<optimized out>) at planner.c:406
#15 0x00000000103b3a68 in planner (parse=<optimized out>, cursorOptions=<optimized out>, boundParams=<optimized out>) at planner.c:275
#16 0x00000000104cc2cc in pg_plan_query (querytree=0x1000babe7f8, cursorOptions=256, boundParams=0x0) at postgres.c:878
#17 0x00000000102ef850 in init_execution_state (lazyEvalOK=<optimized out>, queryTree_list=<optimized out>, fcache=<optimized out>) at functions.c:507
#18 init_sql_fcache (finfo=<optimized out>, collation=<optimized out>, lazyEvalOK=<optimized out>) at functions.c:770
#19 fmgr_sql (fcinfo=<optimized out>) at functions.c:1053
#20 0x00000000102cef24 in ExecInterpExpr (state=<optimized out>, econtext=<optimized out>, isnull=<optimized out>) at execExprInterp.c:625
#21 0x00000000102cddb8 in ExecInterpExprStillValid (state=0x1000bab41e8, econtext=0x1000bab3ed8, isNull=<optimized out>) at execExprInterp.c:1769
#22 0x0000000010314f10 in ExecEvalExprSwitchContext (state=0x1000bab41e8, econtext=0x1000bab3ed8, isNull=<optimized out>) at ../../../src/include/executor/executor.h:307
#23 ExecProject (projInfo=0x1000bab41e0) at ../../../src/include/executor/executor.h:341
#24 ExecResult (pstate=<optimized out>) at nodeResult.c:136
#25 0x00000000102e319c in ExecProcNodeFirst (node=0x1000bab3dc0) at execProcnode.c:445
#26 0x00000000102d9c94 in ExecProcNode (node=<optimized out>) at ../../../src/include/executor/executor.h:239
#27 ExecutePlan (estate=<optimized out>, planstate=<optimized out>, use_parallel_mode=false, operation=<optimized out>, numberTuples=<optimized out>, direction=<optimized out>, dest=<optimized out>, execute_once=<optimized out>, sendTuples=<optimized out>) at execMain.c:1648
#28 standard_ExecutorRun (queryDesc=<optimized out>, direction=<optimized out>, count=<optimized out>, execute_once=<optimized out>) at execMain.c:365
#29 0x00000000102d9ac8 in ExecutorRun (queryDesc=<optimized out>, direction=<optimized out>, count=<optimized out>, execute_once=<optimized out>) at execMain.c:309
#30 0x00000000102efe84 in postquel_getnext (es=<optimized out>, fcache=<optimized out>) at functions.c:867
#31 fmgr_sql (fcinfo=<optimized out>) at functions.c:1153
#32 0x00000000102cef24 in ExecInterpExpr (state=<optimized out>, econtext=<optimized out>, isnull=<optimized out>) at execExprInterp.c:625
#33 0x00000000102cddb8 in ExecInterpExprStillValid (state=0x1000baa8158, econtext=0x1000baa7e48, isNull=<optimized out>) at execExprInterp.c:1769
#34 0x0000000010314f10 in ExecEvalExprSwitchContext (state=0x1000baa8158, econtext=0x1000baa7e48, isNull=<optimized out>) at ../../../src/include/executor/executor.h:307
#35 ExecProject (projInfo=0x1000baa8150) at ../../../src/include/executor/executor.h:341
#36 ExecResult (pstate=<optimized out>) at nodeResult.c:136
#37 0x00000000102e319c in ExecProcNodeFirst (node=0x1000baa7d30) at execProcnode.c:445

... lots and lots of repetitions ...

#11809 0x00000000102e319c in ExecProcNodeFirst (node=0x10008c01e90) at execProcnode.c:445
#11810 0x00000000102d9c94 in ExecProcNode (node=<optimized out>) at ../../../src/include/executor/executor.h:239
#11811 ExecutePlan (estate=<optimized out>, planstate=<optimized out>, use_parallel_mode=false, operation=<optimized out>, numberTuples=<optimized out>, direction=<optimized out>, dest=<optimized out>, execute_once=<optimized out>, sendTuples=<optimized out>) at execMain.c:1648
#11812 standard_ExecutorRun (queryDesc=<optimized out>, direction=<optimized out>, count=<optimized out>, execute_once=<optimized out>) at execMain.c:365
#11813 0x00000000102d9ac8 in ExecutorRun (queryDesc=<optimized out>, direction=<optimized out>, count=<optimized out>, execute_once=<optimized out>) at execMain.c:309
#11814 0x00000000104d39ec in PortalRunSelect (portal=0x10008be9de8, forward=<optimized out>, count=0, dest=<optimized out>) at pquery.c:929
#11815 0x00000000104d34c0 in PortalRun (portal=0x10008be9de8, count=<optimized out>, isTopLevel=<optimized out>, run_once=<optimized out>, dest=<optimized out>, altdest=<optimized out>, completionTag=0x3fffe3ecd6c0 "") at pquery.c:770
#11816 0x00000000104d1bc4 in exec_simple_query (query_string=<optimized out>) at postgres.c:1215
#11817 0x00000000104ced50 in PostgresMain (argc=<optimized out>, argv=<optimized out>, dbname=<optimized out>, username=<optimized out>) at postgres.c:4249
#11818 0x00000000104110fc in BackendRun (port=<optimized out>) at postmaster.c:4430
#11819 BackendStartup (port=<optimized out>) at postmaster.c:4121
#11820 ServerLoop () at postmaster.c:1704
#11821 PostmasterMain (argc=<optimized out>, argv=<optimized out>) at postmaster.c:1377
#11822 0x000000001034def4 in main (argc=8, argv=0x10008b7efb0) at main.c:228


So that lets out any theory that somehow we're getting into a weird
control path that misses calling check_stack_depth;
expression_tree_mutator does so for one, and it was called just nine
stack frames down from the crash.
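
(For reference, the guard boils down to roughly this -- a simplified
sketch of what lives in src/backend/tcop/postgres.c, not the exact
source; stack_base_ptr is captured near process start and
max_stack_depth_bytes is derived from the max_stack_depth GUC:)

    /* Simplified sketch of the backend's stack-depth guard. */
    static bool
    stack_is_too_deep_sketch(void)
    {
        char    stack_top_loc;      /* address of a local ~= current depth */
        long    stack_depth;

        stack_depth = (long) (stack_base_ptr - &stack_top_loc);
        if (stack_depth < 0)
            stack_depth = -stack_depth;     /* stack may grow either way */

        return (stack_depth > max_stack_depth_bytes &&
                stack_base_ptr != NULL);
    }

    void
    check_stack_depth_sketch(void)
    {
        if (stack_is_too_deep_sketch())
            ereport(ERROR,
                    (errcode(ERRCODE_STATEMENT_TOO_COMPLEX),
                     errmsg("stack depth limit exceeded")));
    }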

Casting about for possible explanations, I notice that the failure
seems to have occurred at a nesting depth of 982 SQL-function calls
((11809 - 25)/12).  I'd previously scraped the buildfarm database
to find out what nesting depths we normally get to before erroring
out, by counting the number of "infinite_recurse" context messages
in the postmaster log.  Here's the entry for shoveler:

 shoveler      | 2019-05-03 14:19:19 |  1674

So this failed at substantially less stack depth than it's successfully
consumed in other runs, and *very* substantially less stack than ought
to be there, considering we pay attention to getrlimit in setting
max_stack_depth and add quite a generous amount of slop too.
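
(That consultation amounts to roughly the following -- a simplified
sketch of get_stack_depth_rlimit(), ignoring its caching and the
no-getrlimit fallback; the slop reserved beyond max_stack_depth is on
the order of half a megabyte:)

    #include <limits.h>
    #include <sys/resource.h>

    /* Simplified sketch: the soft RLIMIT_STACK is what bounds how high
     * max_stack_depth may be set (minus a generous slop). */
    static long
    stack_rlimit_sketch(void)
    {
        struct rlimit rlim;

        if (getrlimit(RLIMIT_STACK, &rlim) < 0)
            return -1;                      /* unknown */
        if (rlim.rlim_cur == RLIM_INFINITY)
            return LONG_MAX;                /* no kernel-imposed limit */
        return (long) rlim.rlim_cur;        /* bytes the process may use */
    }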

I am wondering if, somehow, the stack depth limit seen by the postmaster
sometimes doesn't apply to its children.  That would be pretty wacko
kernel behavior, especially if it's only intermittently true.
But we're running out of other explanations.

                        regards, tom lane



Re: Why is infinite_recurse test suddenly failing?

Andres Freund
Hi,

On 2019-05-10 11:38:57 -0400, Tom Lane wrote:

> Core was generated by `postgres: debian regression [local] SELECT                                     '.
> Program terminated with signal SIGSEGV, Segmentation fault.
> #0  sysmalloc (nb=8208, av=0x3fff916e0d28 <main_arena>) at malloc.c:2748
> 2748 malloc.c: No such file or directory.
> #0  sysmalloc (nb=8208, av=0x3fff916e0d28 <main_arena>) at malloc.c:2748
> #1  0x00003fff915bedc8 in _int_malloc (av=0x3fff916e0d28 <main_arena>, bytes=8192) at malloc.c:3865
> #2  0x00003fff915c1064 in __GI___libc_malloc (bytes=8192) at malloc.c:2928
> #3  0x00000000106acfd8 in AllocSetContextCreateInternal (parent=0x1000babdad0, name=0x1085508c "inline_function", minContextSize=<optimized out>, initBlockSize=<optimized out>, maxBlockSize=8388608) at aset.c:477
> #4  0x00000000103d5e00 in inline_function (funcid=20170, result_type=<optimized out>, result_collid=<optimized out>, input_collid=<optimized out>, funcvariadic=<optimized out>, func_tuple=<optimized out>, context=0x3fffe3da15d0, args=<optimized out>) at clauses.c:4459
> #5  simplify_function (funcid=<optimized out>, result_type=<optimized out>, result_typmod=<optimized out>, result_collid=<optimized out>, input_collid=<optimized out>, args_p=<optimized out>, funcvariadic=<optimized out>, process_args=<optimized out>, allow_non_const=<optimized out>, context=<optimized out>) at clauses.c:4040
> #6  0x00000000103d2e74 in eval_const_expressions_mutator (node=0x1000babe968, context=0x3fffe3da15d0) at clauses.c:2474
> #7  0x00000000103511bc in expression_tree_mutator (node=<optimized out>, mutator=0x103d2b10 <eval_const_expressions_mutator>, context=0x3fffe3da15d0) at nodeFuncs.c:2893


> So that lets out any theory that somehow we're getting into a weird
> control path that misses calling check_stack_depth;
> expression_tree_mutator does so for one, and it was called just nine
> stack frames down from the crash.

Right. There's plenty places checking it...


> I am wondering if, somehow, the stack depth limit seen by the postmaster
> sometimes doesn't apply to its children.  That would be pretty wacko
> kernel behavior, especially if it's only intermittently true.
> But we're running out of other explanations.

I wonder if this is a SIGSEGV that actually signals an OOM
situation. Linux, if it can't actually extend the stack on-demand due to
OOM, sends a SIGSEGV.  The signal has that information, but
unfortunately the buildfarm code doesn't print it.  p $_siginfo would
show us some of that...
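
(Outside gdb, the same si_addr/si_code details are what a handler
installed with SA_SIGINFO gets handed.  Just to illustrate -- a minimal
standalone sketch, not something postgres itself installs; SEGV_MAPERR
is the code I'd expect if the kernel declined to extend the stack:)

    #include <signal.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    /* Minimal sketch: print the faulting address and si_code for SIGSEGV.
     * (fprintf isn't async-signal-safe; fine for a demo, not real use.) */
    static void
    segv_handler(int signo, siginfo_t *info, void *ucontext)
    {
        const char *code =
            info->si_code == SEGV_MAPERR ? "SEGV_MAPERR (address not mapped)" :
            info->si_code == SEGV_ACCERR ? "SEGV_ACCERR (protection violation)" :
            "other";

        (void) signo;
        (void) ucontext;
        fprintf(stderr, "SIGSEGV at %p, si_code = %s\n", info->si_addr, code);
        _exit(1);
    }

    int
    main(void)
    {
        struct sigaction sa;

        memset(&sa, 0, sizeof(sa));
        sa.sa_sigaction = segv_handler;
        sa.sa_flags = SA_SIGINFO;
        sigemptyset(&sa.sa_mask);
        sigaction(SIGSEGV, &sa, NULL);

        raise(SIGSEGV);             /* provoke the handler for demonstration */
        return 0;
    }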

Mark, how tight is the memory on that machine? Does dmesg have any other
information (often segfaults are logged by the kernel with the code
IIRC).

Greetings,

Andres Freund



Re: Why is infinite_recurse test suddenly failing?

Tom Lane
Andres Freund <[hidden email]> writes:
> On 2019-05-10 11:38:57 -0400, Tom Lane wrote:
>> I am wondering if, somehow, the stack depth limit seen by the postmaster
>> sometimes doesn't apply to its children.  That would be pretty wacko
>> kernel behavior, especially if it's only intermittently true.
>> But we're running out of other explanations.

> I wonder if this is a SIGSEGV that actually signals an OOM
> situation. Linux, if it can't actually extend the stack on-demand due to
> OOM, sends a SIGSEGV.  The signal has that information, but
> unfortunately the buildfarm code doesn't print it.  p $_siginfo would
> show us some of that...

> Mark, how tight is the memory on that machine? Does dmesg have any other
> information (often segfaults are logged by the kernel with the code
> IIRC).

It does sort of smell like a resource exhaustion problem, especially
if all these buildfarm animals are VMs running on the same underlying
platform.  But why would that manifest as "you can't have a measly two
megabytes of stack" and not as any other sort of OOM symptom?

Mark, if you don't mind modding your local copies of the buildfarm
script, I think what Andres is asking for is a pretty trivial addition
in PGBuild/Utils.pm's sub get_stack_trace:

        my $cmdfile = "./gdbcmd";
        my $handle;
        open($handle, '>', $cmdfile) || die "opening $cmdfile: $!";
        print $handle "bt\n";
+ print $handle "p $_siginfo\n";
        close($handle);

                        regards, tom lane



Re: Why is infinite_recurse test suddenly failing?

Andrew Dunstan

On 5/10/19 3:35 PM, Tom Lane wrote:

> Andres Freund <[hidden email]> writes:
>> On 2019-05-10 11:38:57 -0400, Tom Lane wrote:
>>> I am wondering if, somehow, the stack depth limit seen by the postmaster
>>> sometimes doesn't apply to its children.  That would be pretty wacko
>>> kernel behavior, especially if it's only intermittently true.
>>> But we're running out of other explanations.
>> I wonder if this is a SIGSEGV that actually signals an OOM
>> situation. Linux, if it can't actually extend the stack on-demand due to
>> OOM, sends a SIGSEGV.  The signal has that information, but
>> unfortunately the buildfarm code doesn't print it.  p $_siginfo would
>> show us some of that...
>> Mark, how tight is the memory on that machine? Does dmesg have any other
>> information (often segfaults are logged by the kernel with the code
>> IIRC).
> It does sort of smell like a resource exhaustion problem, especially
> if all these buildfarm animals are VMs running on the same underlying
> platform.  But why would that manifest as "you can't have a measly two
> megabytes of stack" and not as any other sort of OOM symptom?
>
> Mark, if you don't mind modding your local copies of the buildfarm
> script, I think what Andres is asking for is a pretty trivial addition
> in PGBuild/Utils.pm's sub get_stack_trace:
>
> my $cmdfile = "./gdbcmd";
> my $handle;
> open($handle, '>', $cmdfile) || die "opening $cmdfile: $!";
> print $handle "bt\n";
> + print $handle "p $_siginfo\n";
> close($handle);
>
>


I think we'll need to write that as:


    print $handle 'p $_siginfo',"\n";


As you have it written perl will try to interpolate a variable called
$_siginfo.


cheers


andrew







Re: Why is infinite_recurse test suddenly failing?

Michael Paquier
On Fri, May 10, 2019 at 05:26:43PM -0400, Andrew Dunstan wrote:
> I think we'll need to write that as:
>
>     print $handle 'p $_siginfo',"\n";
>
> As you have it written perl will try to interpolate a variable called
> $_siginfo.

Anything in double quotes with a dollar sign would be interpreted as a
variable, and single quotes make that safe.
--
Michael


Re: Why is infinite_recurse test suddenly failing?

Andrew Dunstan

On 5/12/19 4:57 AM, Michael Paquier wrote:
> On Fri, May 10, 2019 at 05:26:43PM -0400, Andrew Dunstan wrote:
>> I think we'll need to write that as:
>>
>>     print $handle 'p $_siginfo',"\n";
>>
>> As you have it written perl will try to interpolate a variable called
>> $_siginfo.
> Anything in double quotes with a dollar sign would be interpreted as a
> variable, and single quotes make that safe.


Yes, that's why I did it that way.


cheers


andrew




Re: Why is infinite_recurse test suddenly failing?

Mark Wong
In reply to this post by Andres Freund
On Fri, May 03, 2019 at 11:45:33AM -0700, Andres Freund wrote:
> Hi,
>
> On 2019-05-03 10:08:59 -0700, Mark Wong wrote:
> > Ok, I think I have gdb installed now...
>
> Thanks! Any chance you could turn on force_parallel_mode for the other
> branches it applies to too? Makes it easier to figure out whether
> breakage is related to that, or independent.

Slowly catching up on my collection of ppc64le animals...

I still need to upgrade the buildfarm client (v8) on:
* dhole
* vulpes
* wobbegong
* cuon
* batfish
* devario
* cardinalfish

The following I've enabled force_parallel_mode for HEAD, 11, 10, and
9.6:

* buri
* urocryon
* ayu
* shoveler
* chimaera
* bonito
* takin
* bufflehead
* elasmobranch
* demoiselle
* cavefish

Regards,
Mark



Re: Why is infinite_recurse test suddenly failing?

Tom Lane
Mark Wong <[hidden email]> writes:
> The following I've enabled force_parallel_mode for HEAD, 11, 10, and
> 9.6:

Thanks Mark!

In theory, the stack trace we now have from shoveler proves that parallel
mode has nothing to do with this, because the crash happens during
planning (specifically, inlining SQL functions).  I wonder though if
it's possible that previous force_parallel_mode queries have had some
undesirable effect on the process's stack allocation.  Just grasping
at straws, because it's sure not clear what's going wrong.

                        regards, tom lane



Re: Why is infinite_recurse test suddenly failing?

Tom Lane
In reply to this post by Mark Wong
Mark Wong <[hidden email]> writes:
> Slowly catching up on my collection of ppc64le animals...

Oh, btw ... you didn't answer my question from before: are these animals
all running on a common platform (and if so, what is that), or are they
really different hardware?  If they're all VMs on one machine then it
might be that there's some common-mode effect from the underlying system.

(Another thing I notice, now, is that these are all Linux variants;
I'd been thinking you had some BSDen in there too, but now I see
that none of those are ppc64.  Hm.)

                        regards, tom lane



Re: Why is infinite_recurse test suddenly failing?

Mark Wong
In reply to this post by Andres Freund
On Fri, May 10, 2019 at 11:27:07AM -0700, Andres Freund wrote:

> Hi,
>
> On 2019-05-10 11:38:57 -0400, Tom Lane wrote:
> > Core was generated by `postgres: debian regression [local] SELECT                                     '.
> > Program terminated with signal SIGSEGV, Segmentation fault.
> > #0  sysmalloc (nb=8208, av=0x3fff916e0d28 <main_arena>) at malloc.c:2748
> > 2748 malloc.c: No such file or directory.
> > #0  sysmalloc (nb=8208, av=0x3fff916e0d28 <main_arena>) at malloc.c:2748
> > #1  0x00003fff915bedc8 in _int_malloc (av=0x3fff916e0d28 <main_arena>, bytes=8192) at malloc.c:3865
> > #2  0x00003fff915c1064 in __GI___libc_malloc (bytes=8192) at malloc.c:2928
> > #3  0x00000000106acfd8 in AllocSetContextCreateInternal (parent=0x1000babdad0, name=0x1085508c "inline_function", minContextSize=<optimized out>, initBlockSize=<optimized out>, maxBlockSize=8388608) at aset.c:477
> > #4  0x00000000103d5e00 in inline_function (funcid=20170, result_type=<optimized out>, result_collid=<optimized out>, input_collid=<optimized out>, funcvariadic=<optimized out>, func_tuple=<optimized out>, context=0x3fffe3da15d0, args=<optimized out>) at clauses.c:4459
> > #5  simplify_function (funcid=<optimized out>, result_type=<optimized out>, result_typmod=<optimized out>, result_collid=<optimized out>, input_collid=<optimized out>, args_p=<optimized out>, funcvariadic=<optimized out>, process_args=<optimized out>, allow_non_const=<optimized out>, context=<optimized out>) at clauses.c:4040
> > #6  0x00000000103d2e74 in eval_const_expressions_mutator (node=0x1000babe968, context=0x3fffe3da15d0) at clauses.c:2474
> > #7  0x00000000103511bc in expression_tree_mutator (node=<optimized out>, mutator=0x103d2b10 <eval_const_expressions_mutator>, context=0x3fffe3da15d0) at nodeFuncs.c:2893
>
>
> > So that lets out any theory that somehow we're getting into a weird
> > control path that misses calling check_stack_depth;
> > expression_tree_mutator does so for one, and it was called just nine
> > stack frames down from the crash.
>
> Right. There's plenty places checking it...
>
>
> > I am wondering if, somehow, the stack depth limit seen by the postmaster
> > sometimes doesn't apply to its children.  That would be pretty wacko
> > kernel behavior, especially if it's only intermittently true.
> > But we're running out of other explanations.
>
> I wonder if this is a SIGSEGV that actually signals an OOM
> situation. Linux, if it can't actually extend the stack on-demand due to
> OOM, sends a SIGSEGV.  The signal has that information, but
> unfortunately the buildfarm code doesn't print it.  p $_siginfo would
> show us some of that...
>
> Mark, how tight is the memory on that machine?

There's about 2GB allocated:

debian@postgresql-debian:~$ cat /proc/meminfo
MemTotal:        2080704 kB
MemFree:         1344768 kB
MemAvailable:    1824192 kB


At the moment it looks like plenty. :)  Maybe I should set something up
to monitor these things.

> Does dmesg have any other
> information (often segfaults are logged by the kernel with the code
> IIRC).

It's been up for about 49 days:

debian@postgresql-debian:~$ uptime
 14:54:30 up 49 days, 14:59,  3 users,  load average: 0.00, 0.34, 1.04


I see one line from dmesg that is related to postgres:

[3939350.616849] postgres[17057]: bad frame in setup_rt_frame: 00003fffe3d9fe00 nip 00003fff915bdba0 lr 00003fff915bde9c


But only that one time in 49 days of uptime.  Otherwise I see a half dozen
hung_task_timeout_secs messages around jbd2 and dhclient.

Regards,
Mark

--
Mark Wong
2ndQuadrant - PostgreSQL Solutions for the Enterprise
https://www.2ndQuadrant.com/



Re: Why is infinite_recurse test suddenly failing?

Mark Wong
In reply to this post by Tom Lane
On Tue, May 14, 2019 at 10:52:07AM -0400, Tom Lane wrote:
> Mark Wong <[hidden email]> writes:
> > Slowly catching up on my collection of ppc64le animals...
>
> Oh, btw ... you didn't answer my question from before: are these animals
> all running on a common platform (and if so, what is that), or are they
> really different hardware?  If they're all VMs on one machine then it
> might be that there's some common-mode effect from the underlying system.

Sorry, I was almost there. :)

These systems are provisioned with OpenStack.  Additionally, a couple
more (cardinalfish, devario) are using docker under that.

> (Another thing I notice, now, is that these are all Linux variants;
> I'd been thinking you had some BSDen in there too, but now I see
> that none of those are ppc64.  Hm.)

Right, the BSDen I have are on different hardware.

Regards,
Mark

--
Mark Wong
2ndQuadrant - PostgreSQL Solutions for the Enterprise
https://www.2ndQuadrant.com/



Re: Why is infinite_recurse test suddenly failing?

Mark Wong
In reply to this post by Andrew Dunstan
On Fri, May 10, 2019 at 05:26:43PM -0400, Andrew Dunstan wrote:

>
> On 5/10/19 3:35 PM, Tom Lane wrote:
> > Andres Freund <[hidden email]> writes:
> >> On 2019-05-10 11:38:57 -0400, Tom Lane wrote:
> >>> I am wondering if, somehow, the stack depth limit seen by the postmaster
> >>> sometimes doesn't apply to its children.  That would be pretty wacko
> >>> kernel behavior, especially if it's only intermittently true.
> >>> But we're running out of other explanations.
> >> I wonder if this is a SIGSEGV that actually signals an OOM
> >> situation. Linux, if it can't actually extend the stack on-demand due to
> >> OOM, sends a SIGSEGV.  The signal has that information, but
> >> unfortunately the buildfarm code doesn't print it.  p $_siginfo would
> >> show us some of that...
> >> Mark, how tight is the memory on that machine? Does dmesg have any other
> >> information (often segfaults are logged by the kernel with the code
> >> IIRC).
> > It does sort of smell like a resource exhaustion problem, especially
> > if all these buildfarm animals are VMs running on the same underlying
> > platform.  But why would that manifest as "you can't have a measly two
> > megabytes of stack" and not as any other sort of OOM symptom?
> >
> > Mark, if you don't mind modding your local copies of the buildfarm
> > script, I think what Andres is asking for is a pretty trivial addition
> > in PGBuild/Utils.pm's sub get_stack_trace:
> >
> > my $cmdfile = "./gdbcmd";
> > my $handle;
> > open($handle, '>', $cmdfile) || die "opening $cmdfile: $!";
> > print $handle "bt\n";
> > + print $handle "p $_siginfo\n";
> > close($handle);
> >
> >
>
>
> I think we'll need to write that as:
>
>
>     print $handle 'p $_siginfo',"\n";

Ok, I have this added to everyone now.

I think I also have caught up on this thread, but let me know if I
missed anything.

Regards,
Mark

--
Mark Wong
2ndQuadrant - PostgreSQL Solutions for the Enterprise
https://www.2ndQuadrant.com/



Re: Why is infinite_recurse test suddenly failing?

Andres Freund
Hi,

On 2019-05-14 08:31:37 -0700, Mark Wong wrote:
> Ok, I have this added to everyone now.
>
> I think I also have caught up on this thread, but let me know if I
> missed anything.

I notice https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=demoiselle&dt=2019-05-19%2014%3A22%3A23
failed recently, but unfortunately does not appear to have gdb
installed? Or the buildfarm version is too old? Or ulimits are set
strictly on a system-wide basis?

Greetings,

Andres Freund



Re: Why is infinite_recurse test suddenly failing?

Mark Wong
On Sun, May 19, 2019 at 02:38:26PM -0700, Andres Freund wrote:

> Hi,
>
> On 2019-05-14 08:31:37 -0700, Mark Wong wrote:
> > Ok, I have this added to everyone now.
> >
> > I think I also have caught up on this thread, but let me know if I
> > missed anything.
>
> I notice https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=demoiselle&dt=2019-05-19%2014%3A22%3A23
> failed recently, but unfortunately does not appear to have gdb
> installed? Or the buildfarm version is too old? Or ulimits are set
> strictly on a system wide basis?

It looks like I did have gdb on there:

opensuse@postgresql-opensuse-p9:~> gdb --version
GNU gdb (GDB; openSUSE Leap 15.0) 8.1


I'm on v9 of the buildfarm client here (I have it on my todo list to get
everything up to v10).


I hope nothing is overriding my core size ulimit:

opensuse@postgresql-opensuse-p9:~> ulimit -c
unlimited


This animal is using clang.  I wonder if gdb is disagreeing with the
clang binaries?

Regards,
Mark

--
Mark Wong
2ndQuadrant - PostgreSQL Solutions for the Enterprise
https://www.2ndQuadrant.com/

