BUG #15804: Assertion failure when using logging_collector with EXEC_BACKEND

classic Classic list List threaded Threaded
33 messages Options
12
Reply | Threaded
Open this post in threaded view
|

BUG #15804: Assertion failure when using logging_collector with EXEC_BACKEND

PG Bug reporting form
The following bug has been logged on the website:

Bug reference:      15804
Logged by:          Yulian Khodorkovskiy
Email address:      [hidden email]
PostgreSQL version: Unsupported/Unknown
Operating system:   Centos 7.4
Description:        

The following assertion fails when compiling postgres 12 on Linux (centos
7.4) with EXEC_BACKEND and logging_collector enabled:

    `Assert(UsedShmemSegAddr != NULL);` in `PGSharedMemoryNoReAttach()`

Commit 57431a911d3a650451d198846ad3194900666152 appears to have introduced
this regression by moving SysLogger_Start() before reset_shared() is called
and shared memory is initialized.

For what it's worth, Windows 10/jacana (and maybe other windows builds) on
the build farm does not use logging_collector, which is perhaps why this
regression was not caught.

Yuli

Reply | Threaded
Open this post in threaded view
|

Re: BUG #15804: Assertion failure when using logging_collector with EXEC_BACKEND

Yuli Khodorkovskiy


> On May 14, 2019, at 2:28 PM, PG Bug reporting form <[hidden email]> wrote:
>
> The following bug has been logged on the website:
>
> Bug reference:      15804
> Logged by:          Yulian Khodorkovskiy
> Email address:      [hidden email]
> PostgreSQL version: Unsupported/Unknown
> Operating system:   Centos 7.4
> Description:        
>
> The following assertion fails when compiling postgres 12 on Linux (centos
> 7.4) with EXEC_BACKEND and logging_collector enabled:
>
>    `Assert(UsedShmemSegAddr != NULL);` in `PGSharedMemoryNoReAttach()`
>
> Commit 57431a911d3a650451d198846ad3194900666152 appears to have introduced
> this regression by moving SysLogger_Start() before reset_shared() is called
> and shared memory is initialized.
>
> For what it's worth, Windows 10/jacana (and maybe other windows builds) on
> the build farm does not use logging_collector, which is perhaps why this
> regression was not caught.
>
> Yuli
>
Attached is a patch that fixes the issue in the bug report.

0001-Initialize-shared-memory-before-calling-SysLogger_St.patch (2K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: BUG #15804: Assertion failure when using logging_collector with EXEC_BACKEND

Tom Lane-2
Yuli Khodorkovskiy <[hidden email]> writes:
> Attached is a patch that fixes the issue in the bug report.

I do not think this is a very good way to fix it (even assuming
that it works, which I'm unsure of --- in particular, doing this
would alter the order of changes to postmaster.pid, possibly
breaking pg_ctl or other tools that look at that file).  The whole
point of commit 57431a911 was to switch to log-collector logging
before we've done anything very interesting, and that would surely
include shmem setup.  I'm a bit surprised actually that Peter
didn't move the SysLogger_Start call even further up; in principle
it ought to be safe to do it as soon as we have the data directory
lock file.

It might be better to give up the assertion in PGSharedMemoryNoReAttach,
and just make it work more like PGSharedMemoryDetach, ie "detach if
UsedShmemSegAddr is set, else do nothing".  I don't remember for sure,
but if we do that, there might be no functional difference anymore
between those two functions, in which case we might as well merge 'em.

                        regards, tom lane


Reply | Threaded
Open this post in threaded view
|

Re: BUG #15804: Assertion failure when using logging_collector with EXEC_BACKEND

Tom Lane-2
I wrote:
> It might be better to give up the assertion in PGSharedMemoryNoReAttach,
> and just make it work more like PGSharedMemoryDetach, ie "detach if
> UsedShmemSegAddr is set, else do nothing".  I don't remember for sure,
> but if we do that, there might be no functional difference anymore
> between those two functions, in which case we might as well merge 'em.

Apropos of this: I notice that commit 57431a911 is already relying
on PGSharedMemoryDetach to do the "right thing" here, because that
gets called in the non-EXEC_BACKEND code path, cf SysLogger_Start:

            /* Drop our connection to postmaster's shared memory, as well */
            dsm_detach_all();
            PGSharedMemoryDetach();

I'm also not too happy that this comment doesn't mention the fact that
we'd (now) only be attached to shmem in the case of a postmaster restart.

BTW, it looks to me like dsm_detach_all doesn't get called anywhere in
the EXEC_BACKEND case, which seems like a separate bug.

                        regards, tom lane


Reply | Threaded
Open this post in threaded view
|

Re: BUG #15804: Assertion failure when using logging_collector with EXEC_BACKEND

Michael Paquier-2
In reply to this post by Tom Lane-2
On Tue, May 14, 2019 at 03:52:04PM -0400, Tom Lane wrote:
> It might be better to give up the assertion in PGSharedMemoryNoReAttach,
> and just make it work more like PGSharedMemoryDetach, ie "detach if
> UsedShmemSegAddr is set, else do nothing".  I don't remember for sure,
> but if we do that, there might be no functional difference anymore
> between those two functions, in which case we might as well merge 'em.

It would be nice to fix that before beta1 is out, or we are going to
get complaints :(

So what about dropping PGSharedMemoryNoReAttach() completely and using
(or abusing of?) the detach routine so as we rely on its no-op
behavior when a subprocess is not attached yet to a shared memory
segment.  On Windows, the sub-processes calling NoReAttach don't map
to a segment per the comments in the code, and I can indeed see some
"could not map" LOG messages coming from these in the logs if
enforcing the use of the detach routine in postmaster.c.  It could
actually be an advantage to unmap using UnmapViewOfFile() so as a
subsequent call of ReAttach does not finish by mapping an extra time.
We may consider lowering the LOG messages to DEBUG1, or extend the
detach routine so as we control the level of logs with an elevel to
avoid sparse LOG messages when unmapping from shared segments where we
know it would fail.

I have been playing with the attached using EXEC_BACKEND on Linux and
some Windows builds FWIW.  Thoughts?
--
Michael

shmem-noattach.patch (4K) Download Attachment
signature.asc (849 bytes) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: BUG #15804: Assertion failure when using logging_collector with EXEC_BACKEND

Tom Lane-2
Michael Paquier <[hidden email]> writes:
> On Tue, May 14, 2019 at 03:52:04PM -0400, Tom Lane wrote:
>> It might be better to give up the assertion in PGSharedMemoryNoReAttach,
>> and just make it work more like PGSharedMemoryDetach, ie "detach if
>> UsedShmemSegAddr is set, else do nothing".  I don't remember for sure,
>> but if we do that, there might be no functional difference anymore
>> between those two functions, in which case we might as well merge 'em.

> It would be nice to fix that before beta1 is out, or we are going to
> get complaints :(

Agreed, that's why I put it on the open items list ;-)

> So what about dropping PGSharedMemoryNoReAttach() completely and using
> (or abusing of?) the detach routine so as we rely on its no-op
> behavior when a subprocess is not attached yet to a shared memory
> segment.  On Windows, the sub-processes calling NoReAttach don't map
> to a segment per the comments in the code, and I can indeed see some
> "could not map" LOG messages coming from these in the logs if
> enforcing the use of the detach routine in postmaster.c.  It could
> actually be an advantage to unmap using UnmapViewOfFile() so as a
> subsequent call of ReAttach does not finish by mapping an extra time.
> We may consider lowering the LOG messages to DEBUG1, or extend the
> detach routine so as we control the level of logs with an elevel to
> avoid sparse LOG messages when unmapping from shared segments where we
> know it would fail.

Hm.  I don't much like ignoring errors, which is what that would
amount to.  It seems like in win32_shmem.c, PGSharedMemoryNoReAttach
is already more or less right: resetting UsedShmemSegAddr to NULL and
then calling PGSharedMemoryDetach is exactly what to do.  It just needs
to lose the no-longer-correct assertions.  Maybe sysv_shmem.c's version
should look the same?

The real point of PGSharedMemoryNoReAttach, now that I think about it,
is that it's dealing with a case where we passed down UsedShmemSegAddr
via the BackendParameters mechanism, but we know that the segment is
not really attached (because of exec()).  This is different from the
API expectations of PGSharedMemoryDetach, which thinks that non-nullness
of UsedShmemSegAddr corresponds to whether the segment is attached or
not.  So we can't unify those functions without breaking things, or
at least losing the ability to distinguish errors from not-errors.

BTW, another thing I think is worth looking into is exactly why the
buildfarm failed to detect this problem.  According to the code
coverage report,

https://coverage.postgresql.org/src/backend/postmaster/syslogger.c.gcov.html

we *are* exercising the syslogger at some point during a standard
test run.  Perhaps the Windows animals don't run whatever test
case that is?

                        regards, tom lane


Reply | Threaded
Open this post in threaded view
|

Re: BUG #15804: Assertion failure when using logging_collector with EXEC_BACKEND

Michael Paquier-2
On Wed, May 15, 2019 at 10:39:49AM -0400, Tom Lane wrote:

> Hm.  I don't much like ignoring errors, which is what that would
> amount to.  It seems like in win32_shmem.c, PGSharedMemoryNoReAttach
> is already more or less right: resetting UsedShmemSegAddr to NULL and
> then calling PGSharedMemoryDetach is exactly what to do.  It just needs
> to lose the no-longer-correct assertions.  Maybe sysv_shmem.c's version
> should look the same?
>
> The real point of PGSharedMemoryNoReAttach, now that I think about it,
> is that it's dealing with a case where we passed down UsedShmemSegAddr
> via the BackendParameters mechanism, but we know that the segment is
> not really attached (because of exec()).  This is different from the
> API expectations of PGSharedMemoryDetach, which thinks that non-nullness
> of UsedShmemSegAddr corresponds to whether the segment is attached or
> not.  So we can't unify those functions without breaking things, or
> at least losing the ability to distinguish errors from not-errors.
My thoughts were to switch the LOG to DEBUG1 for the code paths where
we do not expect a failure, which is the code path from the
postmaster, so that actually comes closer to your upthread suggestion
of merging both functions.  The only thing which changes is that we
would call UnmapViewOfFile(), and fail silently.  We don't generate an
ERROR now for the existing code, just a LOG.  So there is no actual
hard failure now?  Anyway, you are right that it is no time to
redesign this interface with the current timing.

Using vcregress check with logging_collector and assertions enabled
is enough to see the failures on Windows, but the assertions actually
triggered are not the same ones as the EXEC_BACKEND case with sysv
because it's actually the three assertions in
pgwin32_ReserveSharedMemoryRegion() which blow up, not the ones in
PGSharedMemoryNoReAttach() which is able to handle its own.  Removing
some assertions are per the attached avoids problems.  I am able to
pass check with the logging collector enabled on Windows.

> BTW, another thing I think is worth looking into is exactly why the
> buildfarm failed to detect this problem.  According to the code
> coverage report,
>
> https://coverage.postgresql.org/src/backend/postmaster/syslogger.c.gcov.html
>
> we *are* exercising the syslogger at some point during a standard
> test run.  Perhaps the Windows animals don't run whatever test
> case that is?

Now from what I can see from the related buildfarm members:
- culicidae uses EXEC_BACKEND on Linux and has assertions, but not
logging_collector.
- dory, hamerkop, woodloose and bowerbird test on Windows with MSVC,
have assertions enabled, but not logging_collector.
- jacana tests on Windows with mingw, has assertions, but it does not
enforce logging_collector.
The Windows animals do not run bincheck.

So my bet is that the coverage is from pg_ctl/t/004_logrotate.pl,
which is a test skipped on Windows.  culicidae runs that, so the fact
that the failure got undetected is actually a bit of a mystery because
this uses sysv_shmem.c.

Except culicidae, I am seeing no members using EXEC_BACKEND which
enable the syslogger and use TAP tests :(
--
Michael

shmem-remove-asserts.patch (915 bytes) Download Attachment
signature.asc (849 bytes) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: BUG #15804: Assertion failure when using logging_collector with EXEC_BACKEND

Michael Paquier-2
On Thu, May 16, 2019 at 10:38:08AM +0900, Michael Paquier wrote:
> So my bet is that the coverage is from pg_ctl/t/004_logrotate.pl,
> which is a test skipped on Windows.  culicidae runs that, so the fact
> that the failure got undetected is actually a bit of a mystery because
> this uses sysv_shmem.c.
>
> Except culicidae, I am seeing no members using EXEC_BACKEND which
> enable the syslogger and use TAP tests :(

Oh, actually culicidae catches the failure, but that's burried in the
logs, and I am not able to catch up that on my machine in the TAP
tests:
TRAP: FailedAssertion("!(UsedShmemSegAddr != ((void *)0))", File:
"pg_shmem.c", Line: 848)
https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=culicidae&dt=2019-05-16%2000%3A30%3A04&stg=pg_ctl-check

We definitely need to improve that.

Also, I am noticing another consequence as the handling of backend
variable files also suffers consequences:
could not open backend variables file
"pgsql_tmp/pgsql_tmp.backend_var.21912.1": No such file or directory

So it seems that reverting 57431a91 is an option on the table?
Opinions?
--
Michael

signature.asc (849 bytes) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: BUG #15804: Assertion failure when using logging_collector with EXEC_BACKEND

Andres Freund
Hi,

On 2019-05-16 11:19:03 +0900, Michael Paquier wrote:
> So it seems that reverting 57431a91 is an option on the table?
> Opinions?

Seems a bit early to go that way. Given that this bug was reported
yesterday, and that there's reasonable sketches how to fix this, I'm not
clear why reverting would already be on the table?

Greetings,

Andres Freund


Reply | Threaded
Open this post in threaded view
|

Re: BUG #15804: Assertion failure when using logging_collector with EXEC_BACKEND

Tom Lane-2
In reply to this post by Michael Paquier-2
Michael Paquier <[hidden email]> writes:
> Also, I am noticing another consequence as the handling of backend
> variable files also suffers consequences:
> could not open backend variables file
> "pgsql_tmp/pgsql_tmp.backend_var.21912.1": No such file or directory

Um ... where/how are you seeing that?

                        regards, tom lane


Reply | Threaded
Open this post in threaded view
|

Re: BUG #15804: Assertion failure when using logging_collector with EXEC_BACKEND

Michael Paquier-2
On Wed, May 15, 2019 at 11:32:38PM -0400, Tom Lane wrote:
> Michael Paquier <[hidden email]> writes:
> > Also, I am noticing another consequence as the handling of backend
> > variable files also suffers consequences:
> > could not open backend variables file
> > "pgsql_tmp/pgsql_tmp.backend_var.21912.1": No such file or directory
>
> Um ... where/how are you seeing that?

On HEAD with EXEC_BACKEND, just like that:
$ cd src/bin/pg_ctl && PROVE_TESTS=t/004_logrotate.pl make check
$ cat tmp_check/log/004_logrotate_primary.log
2019-05-16 12:44:03.427 JST [26433] LOG:  redirecting log output to
logging collector process
2019-05-16 12:44:03.427 JST [26433] HINT:  Future log output will
appear in directory "log".
could not open backend variables file
"pgsql_tmp/pgsql_tmp.backend_var.26433.1": No such file or directory

The test is able to pass, but we have a race condition between the
moment the backend file gets saved and the moment we allow it to be
read.   I have not spent much time checking the stack between
InitializeMaxBackends() and RemovePgTempFiles() in postmaster.c, but
57431a9 triggers the failure.
--
Michael

signature.asc (849 bytes) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: BUG #15804: Assertion failure when using logging_collector with EXEC_BACKEND

Michael Paquier-2
On Thu, May 16, 2019 at 12:56:17PM +0900, Michael Paquier wrote:
> The test is able to pass, but we have a race condition between the
> moment the backend file gets saved and the moment we allow it to be
> read.   I have not spent much time checking the stack between
> InitializeMaxBackends() and RemovePgTempFiles() in postmaster.c, but
> 57431a9 triggers the failure.

Oh, I think I got it.  The issue is that we call RemovePgTempFiles()
after starting the syslogger.  This cannot be run with other processes
running in parallel, and with EXEC_BACKEND there is the extra case
where we have a pgsql_tmp/ at the root of the data folder, so the
syslogger complains on that.  By making RemovePgTempFiles() happen
before starting the syslogger, then the test complains again about the
assertion without my previous patch applied of course.  With the patch
applied, I get no complains.
--
Michael

signature.asc (849 bytes) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: BUG #15804: Assertion failure when using logging_collector with EXEC_BACKEND

Tom Lane-2
Michael Paquier <[hidden email]> writes:
> On Thu, May 16, 2019 at 12:56:17PM +0900, Michael Paquier wrote:
>> The test is able to pass, but we have a race condition between the
>> moment the backend file gets saved and the moment we allow it to be
>> read.   I have not spent much time checking the stack between
>> InitializeMaxBackends() and RemovePgTempFiles() in postmaster.c, but
>> 57431a9 triggers the failure.

> Oh, I think I got it.  The issue is that we call RemovePgTempFiles()
> after starting the syslogger.  This cannot be run with other processes
> running in parallel, and with EXEC_BACKEND there is the extra case
> where we have a pgsql_tmp/ at the root of the data folder, so the
> syslogger complains on that.  By making RemovePgTempFiles() happen
> before starting the syslogger, then the test complains again about the
> assertion without my previous patch applied of course.  With the patch
> applied, I get no complains.

Hm, should we separate the cleanup of the root pgsql_tmp/ from the rest of
what RemovePgTempFiles does?  I'm still feeling like we should be trying
to launch the syslogger as soon as possible.

                        regards, tom lane


Reply | Threaded
Open this post in threaded view
|

Re: BUG #15804: Assertion failure when using logging_collector with EXEC_BACKEND

Michael Paquier-2
On Thu, May 16, 2019 at 12:51:34AM -0400, Tom Lane wrote:
> Hm, should we separate the cleanup of the root pgsql_tmp/ from the rest of
> what RemovePgTempFiles does?  I'm still feeling like we should be trying
> to launch the syslogger as soon as possible.

Indeed, this should be safe, the syslogger is not going to touch any
of the other temp files anyway.  If we begin to do that, we had better
update the comments at the top of RemovePgTempFiles()'s call, and
expose RemovePgTempFilesInDir().

Anyway, I have to admit that I am not really enthusiastic about
relaxing the assertions for the win32 and the sysv interfaces, and do
that inconsistently on top of it.

As far as my first investigations have gone, assertions in
pgwin32_ReserveSharedMemoryRegion() and
PGSharedMemoryReAttach():sysv_shmem.c would need relaxing, but we rely
on that since a7e5878, which is quite some time ago.  And, actually,
the failure for pgwin32_ReserveSharedMemoryRegion() should never
happen at all, because we need to call reset_shared() before starting
the syslogger as well, no?  If we want to be able to log the creation
socket messages with the logging collector, it seems to me that we
would need first to remove the dependency with the port number in
PGSharedMemoryCreate() to find a free IPC key.  Thoughts?
--
Michael

signature.asc (849 bytes) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: BUG #15804: Assertion failure when using logging_collector with EXEC_BACKEND

Michael Paquier-2
On Thu, May 16, 2019 at 03:06:28PM +0900, Michael Paquier wrote:
> As far as my first investigations have gone, assertions in
> pgwin32_ReserveSharedMemoryRegion() and
> PGSharedMemoryReAttach():sysv_shmem.c would need relaxing, but we rely
> on that since a7e5878, which is quite some time ago.  And, actually,
> the failure for pgwin32_ReserveSharedMemoryRegion() should never
> happen at all, because we need to call reset_shared() before starting
> the syslogger as well, no?

Trying to avoid calling pgwin32_ReserveSharedMemoryRegion() for
processes which are never going to connect to shared memory is
actually an interesting option, because this way there is no need to
relax the assertions it includes, and there is no point in reserving a
shared memory area if it is never going to use it.  This would not
cause a process trying to reattach to cause conflicts as well, which
is why we pre-allocate the region in the first place.

So, I have been hacking my way through this idea, and finished with
the attached.  I have added the changes to src/tools/msvc/ and
postgresql.conf.sample (enabling logging_collector by default) to test
the viability of the change, and after fixing the first assertion
issues and the second one with the temporary files, I have bumped into
a third issue causing recovery/t/017_shm.pl to remain stuck on Linux
with -DEXEC_BACKEND as the syslogger keeps starting and stopping, as
it enters.  This gets masked because of the temporary file issues
and/or the assertion problems.

This may sound rough, but based on the timing and the time I can spend
on that, I really think that we should revert 57431a9, as this has not
been completely thought through for EXEC_BACKEND, and the patch I am
attaching is a set of rough counter-measures which keep piling up,
giving a result I am not comfortable with at all.  I am not sure that
I could actually go to the bottom of it without reworking the way we
attach, detach and reattach to shared memory a process, particularly
on Windows where we don't actually need to allocate a region for a
process we know will never use shared memory (right?).  And this is no
time to do so for v12.  Last year I did the same kind of mistake by
underestimating the Windows-related stuff for v11 just before a
release which has resulted in an urgent revert as of df8b5f3.  I don't
want to do the same mistake again.
--
Michael

shmem-asserts-v3.patch (6K) Download Attachment
signature.asc (849 bytes) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: BUG #15804: Assertion failure when using logging_collector with EXEC_BACKEND

Andres Freund
Hi,

Peter, it'd be really good if you could chime in here PDQ.


On 2019-05-17 13:25:36 +0900, Michael Paquier wrote:

> This may sound rough, but based on the timing and the time I can spend
> on that, I really think that we should revert 57431a9, as this has not
> been completely thought through for EXEC_BACKEND, and the patch I am
> attaching is a set of rough counter-measures which keep piling up,
> giving a result I am not comfortable with at all.  I am not sure that
> I could actually go to the bottom of it without reworking the way we
> attach, detach and reattach to shared memory a process, particularly
> on Windows where we don't actually need to allocate a region for a
> process we know will never use shared memory (right?).  And this is no
> time to do so for v12.

I'm not sure I see this as sufficient reason to revert.


> Last year I did the same kind of mistake by
> underestimating the Windows-related stuff for v11 just before a
> release which has resulted in an urgent revert as of df8b5f3.  I don't
> want to do the same mistake again.

I don't think a windows-only patch is the same thing as a feature, one
that makes debugging easier to boot, that's applicable to all platforms.



> diff --git a/src/tools/msvc/config_default.pl b/src/tools/msvc/config_default.pl
> index d7a9fc5039..de070e26e5 100644
> --- a/src/tools/msvc/config_default.pl
> +++ b/src/tools/msvc/config_default.pl
> @@ -3,7 +3,7 @@ use strict;
>  use warnings;
>
>  our $config = {
> - asserts => 0,    # --enable-cassert
> + asserts => 1,    # --enable-cassert
>       # float4byval=>1,         # --disable-float4-byval, on by default
>
>   # float8byval=> $platformbits == 64, # --disable-float8-byval,
> @@ -17,7 +17,7 @@ our $config = {
>   gss       => undef,    # --with-gssapi=<path>
>   icu       => undef,    # --with-icu=<path>
>   nls       => undef,    # --enable-nls=<path>
> - tap_tests => undef,    # --enable-tap-tests
> + tap_tests => 1,    # --enable-tap-tests
>   tcl       => undef,    # --with-tcl=<path>
>   perl      => undef,    # --with-perl
>   python    => undef,    # --with-python=<path>

Hm? Is this to coerce cfbot?

Greetings,

Andres Freund


Reply | Threaded
Open this post in threaded view
|

Re: BUG #15804: Assertion failure when using logging_collector with EXEC_BACKEND

Michael Paquier-2
On Fri, May 17, 2019 at 01:24:43PM -0700, Andres Freund wrote:
> I don't think a windows-only patch is the same thing as a feature, one
> that makes debugging easier to boot, that's applicable to all
> platforms.

We are discussing here a set of problems which breaks the logging
collector to work on Windows, requiring mostly a Windows patch.  And I
am not much willing to mess up with the area of shared memory handling
on Windows a couple of days before a beta.

>> diff --git a/src/tools/msvc/config_default.pl b/src/tools/msvc/config_default.pl
>> @@ -17,7 +17,7 @@ our $config = {
>>   gss       => undef,    # --with-gssapi=<path>
>>   icu       => undef,    # --with-icu=<path>
>>   nls       => undef,    # --enable-nls=<path>
>> - tap_tests => undef,    # --enable-tap-tests
>> + tap_tests => 1,    # --enable-tap-tests
>>   tcl       => undef,    # --with-tcl=<path>
>>   perl      => undef,    # --with-perl
>>   python    => undef,    # --with-python=<path>
>
> Hm? Is this to coerce cfbot?
This is part of the patch only because it has proved to be useful for
testing with MSVC on my VMs.  I am not suggesting to add that in this
patch if this gets merged into HEAD.  It is actually a good idea to
enforce that sometimes for the cfbot..  I'll remember your suggestion
for future patches.
--
Michael

signature.asc (849 bytes) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: BUG #15804: Assertion failure when using logging_collector with EXEC_BACKEND

Tom Lane-2
Michael Paquier <[hidden email]> writes:
> On Fri, May 17, 2019 at 01:24:43PM -0700, Andres Freund wrote:
>> I don't think a windows-only patch is the same thing as a feature, one
>> that makes debugging easier to boot, that's applicable to all
>> platforms.

> We are discussing here a set of problems which breaks the logging
> collector to work on Windows, requiring mostly a Windows patch.  And I
> am not much willing to mess up with the area of shared memory handling
> on Windows a couple of days before a beta.

Yeah.  I started digging around this finally, and I agree that the
Windows side of it is a mess --- in particular, on Windows
internal_forkexec() calls pgwin32_ReserveSharedMemoryRegion, which
means we have a hard-wired assumption there that no postmaster child
process is launched till after shared memory exists.  EXEC_BACKEND
on Unix will not reproduce that.

At this point I agree with Michael that reverting 57431a911 is the most
prudent thing to do for beta1.  While it's probably not *that* hard to
fix, it will require Windows-specific testing.  Moreover, the fact that
there are race conditions involved means that one person's testing might
not catch everything.  So I do not think we have enough time to get a
trustworthy fix in place by Monday.  And we do have other things to
worry about.

(I'm also getting annoyed that Peter, as the original author of this
commit, isn't doing anything about the issue.)

I wouldn't be totally opposed to un-reverting after beta1, if a
better-tested patch emerges.  But we don't have such a patch today
and I don't see how we'll get there this weekend.

                        regards, tom lane


Reply | Threaded
Open this post in threaded view
|

Re: BUG #15804: Assertion failure when using logging_collector with EXEC_BACKEND

Tom Lane-2
In reply to this post by Michael Paquier-2
[ some notes for whoever tries to fix this in the future ]

Michael Paquier <[hidden email]> writes:
> Trying to avoid calling pgwin32_ReserveSharedMemoryRegion() for
> processes which are never going to connect to shared memory is
> actually an interesting option, because this way there is no need to
> relax the assertions it includes,

Yeah, I'd come to the same conclusion while thinking about this,
but I don't think you're going about it in quite the right way.

In the first place, it's a mistake to try to drive the what-to-do-about-
shared-memory decision strictly off the type of the subprocess.  In the
case of the syslogger, in particular, we have not only the initial launch
to think about but also a possible relaunch if the first syslogger child
crashes.  In that scenario there *would* be shared memory to detach from.

So the sketch that's emerging in my mind is

1. The postmaster should have its own state variable, say
"static bool shared_memory_exists", which it sets at the appropriate
time.  (We could rely on testing UsedShmemSegAddr for that, but I don't
think that'd be good design; UsedShmemSegAddr is really private state of
the particular shmem implementation.)  Since internal_forkexec is
inside postmaster.c and runs in the postmaster process, it could just
test that state variable directly to see if it needs to call
pgwin32_ReserveSharedMemoryRegion.

2. I think we'd be well advised to get rid of the process-type-specific
tests in SubPostmasterMain, too.  The idea in my mind is to have the
postmaster pass down an additional switch explicitly saying what to do
about shared memory, so that what SubPostmasterMain does is roughly

        if (strcmp(argv[2], "--reattach") == 0)
                PGSharedMemoryReAttach();
        else if (strcmp(argv[2], "--noreattach") == 0)
                PGSharedMemoryNoReAttach();
        else
                /* do nothing, shared memory does not exist yet */ ;

Then the problem for the syslogger is solved by passing --noreattach
or not depending on whether shared_memory_exists is true yet.

I've not worked that out in any detail; getting the switches included
properly might be too much of a pain for this to be an improvement.
But for sure I don't want there to be multiple copies of that list of
which subprocess types use shared memory.

A variant idea is to include shared_memory_exists in what's passed down
by the BackendParameters mechanism, and then subprocesses can adjust
their behavior for themselves.

In any case, the thrust of all of this is that we shouldn't touch any
of the assertions in the shmem support files; rather, the way to fix
it is to improve the logic in postmaster.c so that we don't call those
functions at all in child processes that are launched before shmem
exists.

                        regards, tom lane


Reply | Threaded
Open this post in threaded view
|

Re: BUG #15804: Assertion failure when using logging_collector with EXEC_BACKEND

Michael Paquier-2
In reply to this post by Tom Lane-2
On Sat, May 18, 2019 at 12:18:28PM -0400, Tom Lane wrote:
> At this point I agree with Michael that reverting 57431a911 is the most
> prudent thing to do for beta1.  While it's probably not *that* hard to
> fix, it will require Windows-specific testing.  Moreover, the fact that
> there are race conditions involved means that one person's testing might
> not catch everything.  So I do not think we have enough time to get a
> trustworthy fix in place by Monday.  And we do have other things to
> worry about.

Thanks for confirming.  There is little time for doing the revert, and
it would have been nice seeing comments from Peter.  Anyway, I should
be able to do that myself in 24 hours or so after coming back in
front of my desk so as the buildfarm can have a couple of cycles for
its work.  If you prefer doing that yourself, that's of course fine by
me.

> I wouldn't be totally opposed to un-reverting after beta1, if a
> better-tested patch emerges.  But we don't have such a patch today
> and I don't see how we'll get there this weekend.

Possible.  My recommendation is to keep this as potential v13 work and
focus on v12 stability.  I have no doubt that after beta1 emerges we
will get enough reports to work on.
--
Michael

signature.asc (849 bytes) Download Attachment
12