Using XLogFileNameP in critical section

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
12 messages Options
Reply | Threaded
Open this post in threaded view
|

Using XLogFileNameP in critical section

Masahiko Sawada-2
Hi,

I encountered that the assertion error is reported instead of a proper
PANIC message when failed to fsync WAL. The cause is that there are
multiple places where we call XLogFileNameP function that calls palloc
during critical section, for example XLogWrite function.

TRAP: FailedAssertion("CritSectionCount == 0 ||
(context)->allowInCritSection", File: "mcxt.c", Line: 956)

As far as I can see there are five places we need to fix.I've attached a patch.

Regards,

--
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

fix_XLogFileNameP.patch (4K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Using XLogFileNameP in critical section

Michael Paquier-2
On Fri, Nov 29, 2019 at 06:44:58PM +0100, Masahiko Sawada wrote:

> I encountered that the assertion error is reported instead of a proper
> PANIC message when failed to fsync WAL. The cause is that there are
> multiple places where we call XLogFileNameP function that calls palloc
> during critical section, for example XLogWrite function.
>
> TRAP: FailedAssertion("CritSectionCount == 0 ||
> (context)->allowInCritSection", File: "mcxt.c", Line: 956)
>
> As far as I can see there are five places we need to fix.  I've
> attached a patch.
+    msg = "could not fdatasync file \"%s\": %m";
Missing some translations, no?

You are missing a couple of code paths in walreceiver.c,
XLogWalRcvWrite(), where XLogFileNameP is used on a PANIC.  This
brings me the following points:
1) If you look closely, all the callers of XLogFileNameP() are used
for the generation of error strings.
2) I am ready to bet that we'll have the same discussion in the future
because somebody will make the same mistake for a new code path.

I think that we had better just get rid of XLogFileNameP() (on HEAD)
and just change those code paths so as they use a buffer of size
MAXFNAMELEN, with XLogFileName() generating the file name.  This leads
actually to some simplifications, see for example XLogWalRcvWrite..
--
Michael

signature.asc (849 bytes) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Using XLogFileNameP in critical section

Alvaro Herrera-9
On 2019-Dec-02, Michael Paquier wrote:

> I think that we had better just get rid of XLogFileNameP() (on HEAD)
> and just change those code paths so as they use a buffer of size
> MAXFNAMELEN, with XLogFileName() generating the file name.  This leads
> actually to some simplifications, see for example XLogWalRcvWrite..

+1, that function is a footgun and there's no possible way to fix it so
that it's never misused(*).  Also, put a comment atop XLogFileName
indicating never to create such a "helper" function again.

(*) unless there was a way to have a StaticAssert() that the function is
not called in a critical section -- have the compiler work for us.  I
don't think it's possible to write such a thing.

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Reply | Threaded
Open this post in threaded view
|

Re: Using XLogFileNameP in critical section

Masahiko Sawada-2
On Mon, 2 Dec 2019 at 16:56, Alvaro Herrera <[hidden email]> wrote:

>
> On 2019-Dec-02, Michael Paquier wrote:
>
> > I think that we had better just get rid of XLogFileNameP() (on HEAD)
> > and just change those code paths so as they use a buffer of size
> > MAXFNAMELEN, with XLogFileName() generating the file name.  This leads
> > actually to some simplifications, see for example XLogWalRcvWrite..
>
> +1, that function is a footgun and there's no possible way to fix it so
> that it's never misused(*).  Also, put a comment atop XLogFileName
> indicating never to create such a "helper" function again.
>
> (*) unless there was a way to have a StaticAssert() that the function is
> not called in a critical section -- have the compiler work for us.  I
> don't think it's possible to write such a thing.
Agreed. I've attached the updated version patch. Please review it.

Regards,

--
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

fix_XLogFileNameP_v2.patch (14K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Using XLogFileNameP in critical section

Michael Paquier-2
On Mon, Dec 02, 2019 at 10:14:58PM +0100, Masahiko Sawada wrote:
> Agreed. I've attached the updated version patch. Please review it.

Thanks, applied on HEAD after a few edits.  gettext() does not set
errno, so the new style of issue_xlog_fsync() is actually fine.
Please note that there was one mistake in the patch: you forgot to
assign back errno in assign_xlog_sync_method() after generating the
file name.

Per the low probability of the failures, I did not backpatch that
stuff.  I quickly looked at applying that further down, and attached
is a version for v12 FWIW, and I suspect much more conflicts the more
you go down (wal segment size added in 11, different code paths for
replication, etc.).
--
Michael

signature.asc (849 bytes) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Using XLogFileNameP in critical section

Alvaro Herrera-9
On 2019-Dec-03, Michael Paquier wrote:

> Per the low probability of the failures, I did not backpatch that
> stuff.  I quickly looked at applying that further down, and attached
> is a version for v12 FWIW, and I suspect much more conflicts the more
> you go down (wal segment size added in 11, different code paths for
> replication, etc.).

You didn't attach anything, but I concur about the low probability
aspect: the assertion failure does not occur in production builds
(obviously); and only an out-of-memory situation is a real problem when
an fsync fails.  Anyway this should be a very localized fix, right?

I'm not sure that the internationalization stuff in issue_xlog_fsync is
correct.  I think the _() should be gettext_noop(), or alternatively the
errmsg() should be errmsg_internal(); otherwise the translation is
invoked twice.  (I didn't verify this.)

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Reply | Threaded
Open this post in threaded view
|

Re: Using XLogFileNameP in critical section

Tom Lane-2
Alvaro Herrera <[hidden email]> writes:
> I'm not sure that the internationalization stuff in issue_xlog_fsync is
> correct.  I think the _() should be gettext_noop(), or alternatively the
> errmsg() should be errmsg_internal(); otherwise the translation is
> invoked twice.  (I didn't verify this.)

Also, buildfarm member drongo is not happy:

postgres.def : error LNK2001: unresolved external symbol XLogFileNameP [C:\prog\bf\root\HEAD\pgsql.build\postgres.vcxproj]
Release/postgres/postgres.lib : fatal error LNK1120: 1 unresolved externals [C:\prog\bf\root\HEAD\pgsql.build\postgres.vcxproj]

I'm guessing you missed a reference someplace.

                        regards, tom lane


Reply | Threaded
Open this post in threaded view
|

Re: Using XLogFileNameP in critical section

Tom Lane-2
I wrote:
> Also, buildfarm member drongo is not happy:
> postgres.def : error LNK2001: unresolved external symbol XLogFileNameP [C:\prog\bf\root\HEAD\pgsql.build\postgres.vcxproj]
> Release/postgres/postgres.lib : fatal error LNK1120: 1 unresolved externals [C:\prog\bf\root\HEAD\pgsql.build\postgres.vcxproj]
> I'm guessing you missed a reference someplace.

Hm ... grep swears up and down that there is no remaining instance
of the string "XLogFileNameP" anywhere in the tree.  So this doesn't
seem to be the fault of 9989d37d1 per se.  What my eye now falls on
is this, a bit further up in the build log [1]:

...
PreLinkEvent:
  Generate DEF file
  perl src\tools\msvc\gendef.pl Release\postgres x64
  :VCEnd
  Not re-generating POSTGRES.DEF, file already exists.
Link:
...

So it seems that the problem might really be a faulty rule in our
MSVC build script about when postgres.def needs to be regenerated?
Or else it's some weird caching problem on drongo --- the lack of
complaints from other Windows critters might point the finger
that way.

                        regards, tom lane

[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=drongo&dt=2019-12-03%2007%3A30%3A01


Reply | Threaded
Open this post in threaded view
|

Re: Using XLogFileNameP in critical section

Masahiko Sawada-2
In reply to this post by Michael Paquier-2
On Tue, 3 Dec 2019 at 07:09, Michael Paquier <[hidden email]> wrote:
>
> On Mon, Dec 02, 2019 at 10:14:58PM +0100, Masahiko Sawada wrote:
> > Agreed. I've attached the updated version patch. Please review it.
>
> Thanks, applied on HEAD after a few edits.  gettext() does not set
> errno, so the new style of issue_xlog_fsync() is actually fine.

Thanks!

> Please note that there was one mistake in the patch: you forgot to
> assign back errno in assign_xlog_sync_method() after generating the
> file name.

My bad. Thank you for fixing it.

Regards,

--
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Reply | Threaded
Open this post in threaded view
|

Re: Using XLogFileNameP in critical section

Michael Paquier-2
In reply to this post by Tom Lane-2
On Tue, Dec 03, 2019 at 11:24:57AM -0500, Tom Lane wrote:
> So it seems that the problem might really be a faulty rule in our
> MSVC build script about when postgres.def needs to be regenerated?
> Or else it's some weird caching problem on drongo --- the lack of
> complaints from other Windows critters might point the finger
> that way.

Yes, I saw the failure from the buildfarm logs, but I got to the
conclusion that the animal just got crazy with a portion of its
caching because there are no more references to routine removed.  So I
did not bother much.

FWIW, I have seen sometimes similar warnings regarding conflicting
projects when doing a compilation on MSVC, applying a patch and then
recompiling (because that's just faster than recompiling the whole).
Perhaps we could do better here.  I am not completely sure what
though, it's not a problem I have spent much brain-time on.
--
Michael

signature.asc (849 bytes) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Using XLogFileNameP in critical section

Michael Paquier-2
In reply to this post by Alvaro Herrera-9
On Tue, Dec 03, 2019 at 09:35:00AM -0300, Alvaro Herrera wrote:
> You didn't attach anything, but I concur about the low probability
> aspect: the assertion failure does not occur in production builds
> (obviously); and only an out-of-memory situation is a real problem
> when
> an fsync fails.  Anyway this should be a very localized fix, right?

Sorry.  You get something like the attached.  The recent refactoring
work you committed in this area causes already conflicts on
REL_12_STABLE.

> I'm not sure that the internationalization stuff in issue_xlog_fsync
> is correct.  I think the _() should be gettext_noop(), or alternatively
> the errmsg() should be errmsg_internal(); otherwise the translation is
> invoked twice.  (I didn't verify this.)

Hmm.  We actually do both in tablecmds.c:ATWrongRelkindError(), and
that's the code I was looking at yesterday when thinking about the
problem..  However, parse_agg.c, parse_expr.c and parse_func.c among
others like vacuumlazy.c use directly errmsg_internal() without
translating the string first.  So there is indeed duplicated work for
both.  Does the attached patch look correct to you?
--
Michael

gettext-fix.patch (936 bytes) Download Attachment
attachment1 (11K) Download Attachment
signature.asc (849 bytes) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Using XLogFileNameP in critical section

Andrew Dunstan
In reply to this post by Tom Lane-2

On 12/3/19 11:24 AM, Tom Lane wrote:

> I wrote:
>> Also, buildfarm member drongo is not happy:
>> postgres.def : error LNK2001: unresolved external symbol XLogFileNameP [C:\prog\bf\root\HEAD\pgsql.build\postgres.vcxproj]
>> Release/postgres/postgres.lib : fatal error LNK1120: 1 unresolved externals [C:\prog\bf\root\HEAD\pgsql.build\postgres.vcxproj]
>> I'm guessing you missed a reference someplace.
> Hm ... grep swears up and down that there is no remaining instance
> of the string "XLogFileNameP" anywhere in the tree.  So this doesn't
> seem to be the fault of 9989d37d1 per se.  What my eye now falls on
> is this, a bit further up in the build log [1]:
>
> ...
> PreLinkEvent:
>   Generate DEF file
>   perl src\tools\msvc\gendef.pl Release\postgres x64
>   :VCEnd
>   Not re-generating POSTGRES.DEF, file already exists.
> Link:
> ...
>
> So it seems that the problem might really be a faulty rule in our
> MSVC build script about when postgres.def needs to be regenerated?
> Or else it's some weird caching problem on drongo --- the lack of
> complaints from other Windows critters might point the finger
> that way.
>
> regards, tom lane
>
> [1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=drongo&dt=2019-12-03%2007%3A30%3A01
>


this was pilot error on my part. Should be fixed now.


cheers


andrew