BUG #16039: PANIC when activating replication slots in Postgres 12.0 64bit under Windows

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
11 messages Options
Reply | Threaded
Open this post in threaded view
|

BUG #16039: PANIC when activating replication slots in Postgres 12.0 64bit under Windows

PG Bug reporting form
The following bug has been logged on the website:

Bug reference:      16039
Logged by:          Hans Buschmann
Email address:      [hidden email]
PostgreSQL version: 12.0
Operating system:   Windows Server 2019 64bit
Description:        

We just moved our production cluster from pg 11.5 to pg 12.0 under windows
using pg_dump/initdb/pg_restore.

When we activated the replication slots by  

SELECT * FROM pg_create_physical_replication_slot('sam_repli_3');

and tried restarting the server, we got a PANIC in error log:

CPS PRD 2019-10-04 19:10:07 CEST  00000  1:> LOG:  database system was shut
down at 2019-10-04 19:10:02 CEST
CPS PRD 2019-10-04 19:10:07 CEST  XX000  2:> PANIC:  could not fsync file
"pg_replslot/sam_repli_3/state": Bad file descriptor
CPS PRD 2019-10-04 19:10:07 CEST  00000  6:> LOG:  startup process (PID
4028) was terminated by exception 0xC0000409
CPS PRD 2019-10-04 19:10:07 CEST  00000  7:> HINT:  See C include file
"ntstatus.h" for a description of the hexadecimal value.
CPS PRD 2019-10-04 19:10:07 CEST  00000  8:> LOG:  aborting startup due to
startup process failure
CPS PRD 2019-10-04 19:10:07 CEST  00000  9:> LOG:  database system is shut
down

We use the EDB distribution from the website under Windows Server 2019
(September 2019 patch level).

select version ();
                          version
------------------------------------------------------------
 PostgreSQL 12.0, compiled by Visual C++ build 1914, 64-bit
(1 Zeile)

This seems to me like a fatal bug which makes the streaming replication
unusable under Windows x64 /pg12.

The same configuration worked flawlessly under pg 11.x until pg 11.5

By searching on google we encountered a similar error from 2015 under pg
9.4.1 reported under BUG #13287:

https://www.postgresql.org/message-id/flat/20150514105514.2691.67352%40wrigleys.postgresql.org

We were able to remove the replication slot by deleting the directory
pg_replslot/sam_repli_3.

The server seems to be working fine now, but without streaming replication
of course.


BTW: Is it necessary to restart the cluster after creating the replication
slot or is it a consequence of the error, that the
newly created slot is not active automatically?

select * from pg_replication_slots;
  slot_name  | plugin | slot_type | datoid | database | temporary | active |
active_pid | xmin | catalog_xmin | restart_lsn | confirmed_flush_lsn
-------------+--------+-----------+--------+----------+-----------+--------+------------+------+--------------+-------------+---------------------
 sam_repli_3 |        | physical  |        |          | f         | f      |
           |      |              |             |
 sam_repli_5 |        | physical  |        |          | f         | f      |
           |      |              |             |
(2 Zeilen)


Many thanks!


Hans Buschmann

Reply | Threaded
Open this post in threaded view
|

Re: BUG #16039: PANIC when activating replication slots in Postgres 12.0 64bit under Windows

Andres Freund
Hi,

Thanks for the report!

On 2019-10-04 19:28:28 +0000, PG Bug reporting form wrote:

> We just moved our production cluster from pg 11.5 to pg 12.0 under windows
> using pg_dump/initdb/pg_restore.
>
> When we activated the replication slots by
>
> SELECT * FROM pg_create_physical_replication_slot('sam_repli_3');
>
> and tried restarting the server, we got a PANIC in error log:
>
> CPS PRD 2019-10-04 19:10:07 CEST  00000  1:> LOG:  database system was shut
> down at 2019-10-04 19:10:02 CEST
> CPS PRD 2019-10-04 19:10:07 CEST  XX000  2:> PANIC:  could not fsync file
> "pg_replslot/sam_repli_3/state": Bad file descriptor
> CPS PRD 2019-10-04 19:10:07 CEST  00000  6:> LOG:  startup process (PID
> 4028) was terminated by exception 0xC0000409
> CPS PRD 2019-10-04 19:10:07 CEST  00000  7:> HINT:  See C include file
> "ntstatus.h" for a description of the hexadecimal value.
> CPS PRD 2019-10-04 19:10:07 CEST  00000  8:> LOG:  aborting startup due to
> startup process failure
> CPS PRD 2019-10-04 19:10:07 CEST  00000  9:> LOG:  database system is shut
> down
>
> We use the EDB distribution from the website under Windows Server 2019
> (September 2019 patch level).
>
> select version ();
>                           version
> ------------------------------------------------------------
>  PostgreSQL 12.0, compiled by Visual C++ build 1914, 64-bit
> (1 Zeile)
>
> This seems to me like a fatal bug which makes the streaming replication
> unusable under Windows x64 /pg12.
>
> The same configuration worked flawlessly under pg 11.x until pg 11.5
>
> By searching on google we encountered a similar error from 2015 under pg
> 9.4.1 reported under BUG #13287:
>
> https://www.postgresql.org/message-id/flat/20150514105514.2691.67352%40wrigleys.postgresql.org

Uh, Michael? You just reintroduced this bug in

commit 82a5649fb9dbef12d04cd24799be6bf298d889a6
Author: Michael Paquier <[hidden email]>
Date:   2019-03-09 08:50:55 +0900

    Tighten use of OpenTransientFile and CloseTransientFile

    This fixes two sets of issues related to the use of transient files in
    the backend:
    1) OpenTransientFile() has been used in some code paths with read-write
    flags while read-only is sufficient, so switch those calls to be
    read-only where necessary.  These have been reported by Joe Conway.

You pretty much entirely reverted:

commit dfbaed459754e71e01bb0cc90a12802bba3f9786
Author: Andres Freund <[hidden email]>
Date:   2015-04-28 00:12:38 +0200

    Use a fd opened for read/write when syncing slots during startup.

    Some operating systems, including the reporter's windows, return EBADFD
    or similar when fsync() is invoked on a O_RDONLY file descriptor.
    Unfortunately RestoreSlotFromDisk() does exactly that; which causes
    failures after restarts in at least some scenarios.

    If you hit the bug the error message will be something like
    ERROR: could not fsync file "pg_replslot/$name/state": Bad file descriptor

    Simply use O_RDWR instead of O_RDONLY when opening the relevant file
    descriptor to fix the bug.  Unfortunately I have no way of verifying the
    fix, but we've seen similar problems in the past.

    This bug goes back to 9.4 where slots were introduced. Backpatch
    accordingly.

    Reported-By: Patrice Drolet
    Bug: #13143:
    Discussion: [hidden email]

I realize I perhaps should have added a comment explaining this, but
this is far from the only location that knows we have to know open fds
r/w to be able to fsync them.

What were you even trying to fix by changing this?


Seems also pretty clear that we need a few animals running with fsync
enabled. Not sure how we best can write test infrastructure to make it
easy to set that for all tests. Guess I best start a thread about it on
-hackers.

Greetings,

Andres Freund


Reply | Threaded
Open this post in threaded view
|

Re: BUG #16039: PANIC when activating replication slots in Postgres 12.0 64bit under Windows

Andres Freund
Hi Hans,

On 2019-10-04 13:06:05 -0700, Andres Freund wrote:
> On 2019-10-04 19:28:28 +0000, PG Bug reporting form wrote:
> > CPS PRD 2019-10-04 19:10:07 CEST  XX000  2:> PANIC:  could not fsync file
> > "pg_replslot/sam_repli_3/state": Bad file descriptor
> > CPS PRD 2019-10-04 19:10:07 CEST  00000  6:> LOG:  startup process (PID
> > 4028) was terminated by exception 0xC0000409

Thanks again for the report! I've committed the fix. Unfortunately,
unless you can compile yourself, you'll have to wait until the next
minor release for the fix.  Given this is the .0 release of 12, and that
we already have a number of bugs to fix, I'm sure that won't be too far
away however.

If you want to work around the problem, you can, but it's not entirely
unproblematic: If, just when restarting, you set fsync to false, you'll
avoid the problem. But then you'd have to immediately re-enable
afterwards, and there's still a small data loss window.  If however all
you want is to continue testing, fsync=off might be a reasonable
workaround.

Greetings,

Andres Freund


Reply | Threaded
Open this post in threaded view
|

AW: BUG #16039: PANIC when activating replication slots in Postgres 12.0 64bit under Windows

Hans Buschmann

________________________________________
Von: Andres Freund <[hidden email]>
Gesendet: Freitag, 4. Oktober 2019 22:49
An: Hans Buschmann; [hidden email]; Michael Paquier
Betreff: Re: BUG #16039: PANIC when activating replication slots in Postgres 12.0 64bit under Windows

>If you want to work around the problem, you can, but it's not entirely
>unproblematic: If, just when restarting, you set fsync to false, you'll
>avoid the problem. But then you'd have to immediately re-enable
>afterwards, and there's still a small data loss window.  If however all
>you want is to continue testing, fsync=off might be a reasonable
>workaround.

Hi Andres,

Thanks for the quick fix. I reported it immediately to get it soon fixed before 12.1.

In the meantime we continue to use pg 12 on the production System (quite confident in overall Software Quality!).

I can't reenable it because there may be unexpected restarts (power failures for example) and then the unattended restart would not succeed.

In quite a Long journey I have got deeper Knowledge of the whole System. So now I am able to recompile from source BUT:

It is VERY difficult to get the whole Environment under Windows!

I use Visual Studio Express 2019 (Version 16.3 from Oct 2019) and got the core (source tarball) compiled without error.

But I have not managed to get all the additional packages downloaded and adopted to the most recent Visual Studio (e.g zlib, XML, ssl etc.)

The Manual only states the Locations for download, but doesn't help in compiling.

It would be VERY helpfull to provide a .ZIP file for download on the Postgres Website containing all the additional packages in a usable Directory tree and some config file for compiling. This (mostly) should exist already for the buildfarm or for Building the release.

Because the upstream Software doesn't Change often, it would (as quick alternative) be sufficient to have the include files and the compiled libraries to easily compile Postgres from source.

Thanks for considering this proposal!

Hans Buschmann

PS Uppercase Problems are from the mail System!




Reply | Threaded
Open this post in threaded view
|

Re: BUG #16039: PANIC when activating replication slots in Postgres 12.0 64bit under Windows

Michael Paquier-2
In reply to this post by Andres Freund
On Fri, Oct 04, 2019 at 01:06:05PM -0700, Andres Freund wrote:
> I realize I perhaps should have added a comment explaining this, but
> this is far from the only location that knows we have to know open fds
> r/w to be able to fsync them.

Sorry for the late reply here.  It looks like I messed up here, my
apologies for that.  And thanks for fixing the issue.

It would have been nice to add some sanity checks based on fcntl() but
directory handling in pg_fsync() makes that annoying.  Anyway, I have
checked the code with a little trick, and I have spotted a second bug:
CheckPointLogicalRewriteHeap() fsyncs a logical rewrite mapping file
with RDONLY.  This is incorrect since b89e151.

> What were you even trying to fix by changing this?

Hardening of the code.  Some code paths clearly relied on the
operations to be read-only.

> Seems also pretty clear that we need a few animals running with fsync
> enabled. Not sure how we best can write test infrastructure to make it
> easy to set that for all tests. Guess I best start a thread about it on
> -hackers.

I think that we would need more infrastructure here for TAP tests, aka
how to be able to enforce some parameters when setting the
configuration of a new node.

Attached are two patches: the actual bug fix and an extra patch with
the trick I have used to find it out (contrib/test_decoding/ was the
part which has blown up).
--
Michael

fsync-chkpt-fix.patch (581 bytes) Download Attachment
fsync-trick.patch (1K) Download Attachment
signature.asc (849 bytes) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: BUG #16039: PANIC when activating replication slots in Postgres 12.0 64bit under Windows

Michael Paquier-2
In reply to this post by Hans Buschmann
On Sat, Oct 05, 2019 at 07:40:00AM +0000, Hans Buschmann wrote:
> Thanks for the quick fix. I reported it immediately to get it soon
> fixed before 12.1.

Sorry that you got annoyed by that :/
You can blame me about it..

> But I have not managed to get all the additional packages downloaded
> and adopted to the most recent Visual Studio (e.g zlib, XML, ssl
> etc.)
>
> The Manual only states the Locations for download, but doesn't help
> in compiling.

There is documentation on the matter:
https://www.postgresql.org/docs/devel/install-windows-full.html#id-1.6.4.8.10

> It would be VERY helpful to provide a .ZIP file for download on the
> Postgres Website containing all the additional packages in a usable
> Directory tree and some config file for compiling. This (mostly)
> should exist already for the buildfarm or for Building the release.

I think that this would be difficult, per licensing issues and such,
no?  The maintenance of such a tarball would be annoying as well, for
example OpenSSL version changes quite often and needs to be bundled
with the installer so as DLLs loading it are able to work.
--
Michael

signature.asc (849 bytes) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: BUG #16039: PANIC when activating replication slots in Postgres 12.0 64bit under Windows

Michael Paquier-2
In reply to this post by Michael Paquier-2
On Sun, Oct 06, 2019 at 01:55:48PM +0900, Michael Paquier wrote:
> It would have been nice to add some sanity checks based on fcntl() but
> directory handling in pg_fsync() makes that annoying.  Anyway, I have
> checked the code with a little trick, and I have spotted a second bug:
> CheckPointLogicalRewriteHeap() fsyncs a logical rewrite mapping file
> with RDONLY.  This is incorrect since b89e151.

Andres, others, any thoughts about this issue?  Are there any
objections if I just fix it?
--
Michael

signature.asc (849 bytes) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: BUG #16039: PANIC when activating replication slots in Postgres 12.0 64bit under Windows

Andres Freund
Hi,

On 2019-10-08 09:32:40 +0900, Michael Paquier wrote:
> On Sun, Oct 06, 2019 at 01:55:48PM +0900, Michael Paquier wrote:
> > It would have been nice to add some sanity checks based on fcntl() but
> > directory handling in pg_fsync() makes that annoying.

I wondered about adding something like that too. Not sure what you mean
by directory handling problems? Couldn't that just be solved by doing an
fstat()?


> > Anyway, I have checked the code with a little trick, and I have
> > spotted a second bug: CheckPointLogicalRewriteHeap() fsyncs a
> > logical rewrite mapping file with RDONLY.  This is incorrect since
> > b89e151.

Yuck :(. Luckily that's a pretty narrow case to hit.  We really need
windows coverage for this stuff. And also just general buildfarm
coverage, it's not like we're immune from bugs on unixoid OSs etiher.


> Andres, others, any thoughts about this issue?  Are there any
> objections if I just fix it?

Not here.

Greetings,

Andres Freund


Reply | Threaded
Open this post in threaded view
|

Re: BUG #16039: PANIC when activating replication slots in Postgres 12.0 64bit under Windows

Michael Paquier-2
On Tue, Oct 08, 2019 at 09:09:53AM -0700, Andres Freund wrote:
> I wondered about adding something like that too. Not sure what you mean
> by directory handling problems? Couldn't that just be solved by doing an
> fstat()?

Yeah, we should just do that.

> Yuck :(. Luckily that's a pretty narrow case to hit.  We really need
> windows coverage for this stuff. And also just general buildfarm
> coverage, it's not like we're immune from bugs on unixoid OSs etiher.

test_decoding hits that easily for me, and we test that on Windows for
some time now as there is a buildfarm module.  That's a bit depressing.

>> Andres, others, any thoughts about this issue?  Are there any
>> objections if I just fix it?
>
> Not here.

Done this one.
--
Michael

signature.asc (849 bytes) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: BUG #16039: PANIC when activating replication slots in Postgres 12.0 64bit under Windows

Andres Freund
Hi,

On 2019-10-09 13:42:07 +0900, Michael Paquier wrote:
> > Yuck :(. Luckily that's a pretty narrow case to hit.  We really need
> > windows coverage for this stuff. And also just general buildfarm
> > coverage, it's not like we're immune from bugs on unixoid OSs etiher.
>
> test_decoding hits that easily for me, and we test that on Windows for
> some time now as there is a buildfarm module.  That's a bit depressing.

Well, that's because it intentionally tries to hit that case ;)

But yea, it's not good.

I think we really ought to remove the -F from pg_regress.c, and leave
that up to the caller. Right now it's just about impossible to
configure without patching the source, unless I miss something.


> >> Andres, others, any thoughts about this issue?  Are there any
> >> objections if I just fix it?
> >
> > Not here.
>
> Done this one.

Thanks.

Greetings,

Andres Freund


Reply | Threaded
Open this post in threaded view
|

Re: BUG #16039: PANIC when activating replication slots in Postgres 12.0 64bit under Windows

Michael Paquier-2
On Thu, Oct 10, 2019 at 08:25:15AM -0700, Andres Freund wrote:
> I think we really ought to remove the -F from pg_regress.c, and leave
> that up to the caller. Right now it's just about impossible to
> configure without patching the source, unless I miss something.

Well, I have just posted something about that:
https://www.postgresql.org/message-id/20191009062640.GB21379@...

Perhaps there are better ideas for that particular problem, let's
see.
--
Michael

signature.asc (849 bytes) Download Attachment