Standby corruption after master is restarted

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
16 messages Options
Reply | Threaded
Open this post in threaded view
|

Standby corruption after master is restarted

hasegeli
One of our standby servers on a testing environment at InnoGames broke
today.  The master and the other standby server is working fine.  I
noticed that it happened right after the master had restarted.  It
looks to me like the WAL file is corrupted.  I wanted to check, if it
can be a bug.  These were on the log file:

> 2017-08-24 12:19:51 UTC [20351-2] LOG:  replication terminated by primary server
> 2017-08-24 12:19:51 UTC [20351-3] DETAIL:  End of WAL reached on timeline 1 at 0/9B9F5FF0.
> 2017-08-24 12:19:51 UTC [20351-4] FATAL:  could not send end-of-streaming message to primary: no COPY in progress
> 2017-08-24 12:19:51 UTC [26673-7] FATAL:  invalid memory alloc request size 1298694144
> 2017-08-24 12:19:51 UTC [26672-3] LOG:  startup process (PID 26673) exited with exit code 1
> 2017-08-24 12:19:51 UTC [26672-4] LOG:  terminating any other active server processes

I tried to start it again to get more logs:

> 2017-08-24 14:07:46 UTC [28861-12698] LOCATION:  KnownAssignedXidsRemovePreceding, procarray.c:3738
> 2017-08-24 14:07:46 UTC [28861-12699] DEBUG:  00000: transaction ID wrap limit is 2147484807, limited by database with OID 1
> 2017-08-24 14:07:46 UTC [28861-12700] CONTEXT:  xlog redo at 0/9B9F5F80 for XLOG/CHECKPOINT_SHUTDOWN: redo 0/9B9F5F80; tli 1; prev tli 1; fpw true; xid 0:5869245; oid 20756; multi 1; offset 0; oldest xid 1160 in DB 1; oldest multi 1 in DB 1; oldest/newest commit timestamp xid: 0/0; oldest running xid 0; shutdown
> 2017-08-24 14:07:46 UTC [28861-12701] LOCATION:  SetTransactionIdLimit, varsup.c:347
> 2017-08-24 14:07:46 UTC [28861-12702] DEBUG:  00000: prune KnownAssignedXids to 5869245
> 2017-08-24 14:07:46 UTC [28861-12703] CONTEXT:  xlog redo at 0/9B9F5F80 for XLOG/CHECKPOINT_SHUTDOWN: redo 0/9B9F5F80; tli 1; prev tli 1; fpw true; xid 0:5869245; oid 20756; multi 1; offset 0; oldest xid 1160 in DB 1; oldest multi 1 in DB 1; oldest/newest commit timestamp xid: 0/0; oldest running xid 0; shutdown
> 2017-08-24 14:07:46 UTC [28861-12704] LOCATION:  KnownAssignedXidsRemovePreceding, procarray.c:3738
> 2017-08-24 14:07:46 UTC [28861-12705] FATAL:  XX000: invalid memory alloc request size 1298694144
> 2017-08-24 14:07:46 UTC [28861-12706] LOCATION:  palloc_extended, mcxt.c:961

These from pg_controldata:

> pg_control version number:            960
> Catalog version number:               201608131
> Database system identifier:           6423674571818176107
> Database cluster state:               in archive recovery
> pg_control last modified:             Thu 24 Aug 2017 02:01:42 PM UTC
> Latest checkpoint location:           0/9B9C1188
> Prior checkpoint location:            0/9B9C1188
> Latest checkpoint's REDO location:    0/9B9BEC60
> Latest checkpoint's REDO WAL file:    00000001000000000000009B
> Latest checkpoint's TimeLineID:       1
> Latest checkpoint's PrevTimeLineID:   1
> Latest checkpoint's full_page_writes: on
> Latest checkpoint's NextXID:          0:5868410
> Latest checkpoint's NextOID:          28526
> Latest checkpoint's NextMultiXactId:  1
> Latest checkpoint's NextMultiOffset:  0
> Latest checkpoint's oldestXID:        1160
> Latest checkpoint's oldestXID's DB:   1
> Latest checkpoint's oldestActiveXID:  5868410
> Latest checkpoint's oldestMultiXid:   1
> Latest checkpoint's oldestMulti's DB: 1
> Latest checkpoint's oldestCommitTsXid:0
> Latest checkpoint's newestCommitTsXid:0
> Time of latest checkpoint:            Thu 24 Aug 2017 12:02:56 PM UTC
> Fake LSN counter for unlogged rels:   0/1
> Minimum recovery ending location:     0/9B9E13C0
> Min recovery ending loc's timeline:   1
> Backup start location:                0/0
> Backup end location:                  0/0
> End-of-backup record required:        no
> wal_level setting:                    replica
> wal_log_hints setting:                on
> max_connections setting:              400
> max_worker_processes setting:         8
> max_prepared_xacts setting:           0
> max_locks_per_xact setting:           64
> track_commit_timestamp setting:       off
> Maximum data alignment:               8
> Database block size:                  8192
> Blocks per segment of large relation: 131072
> WAL block size:                       8192
> Bytes per WAL segment:                16777216
> Maximum length of identifiers:        64
> Maximum columns in an index:          32
> Maximum size of a TOAST chunk:        1996
> Size of a large-object chunk:         2048
> Date/time type storage:               64-bit integers
> Float4 argument passing:              by value
> Float8 argument passing:              by value
> Data page checksum version:           1

I tried pg_xlogdump and it failed on the spot:

> rmgr: XLOG        len (rec/tot):    106/   106, tx:          0, lsn: 0/9B9F5F80, prev 0/9B9F5F48, desc: CHECKPOINT_SHUTDOWN redo 0/9B9F5F80; tli 1; prev tli 1; fpw true; xid 0:5869245; oid 20756; multi 1; offset 0; oldest xid 1160 in DB 1; oldest multi 1 in DB 1; oldest/newest commit timestamp xid: 0/0; oldest running xid 0; shutdown
> pg_xlogdump: FATAL:  error in WAL record at 0/9B9F5F80: unexpected pageaddr 0/979F6000 in log segment 00000000000000000000009B, offset 10444800

This is from the master at the same place:

> rmgr: XLOG        len (rec/tot):    106/   106, tx:          0, lsn: 0/9B9F5F80, prev 0/9B9F5F48, desc: CHECKPOINT_SHUTDOWN redo 0/9B9F5F80; tli 1; prev tli 1; fpw true; xid 0:5869245; oid 20756; multi 1; offset 0; oldest xid 1160 in DB 1; oldest multi 1 in DB 1; oldest/newest commit timestamp xid: 0/0; oldest running xid 0; shutdown
> rmgr: XLOG        len (rec/tot):     49/  2805, tx:          0, lsn: 0/9B9F5FF0, prev 0/9B9F5F80, desc: FPI_FOR_HINT , blkref #0: rel 1663/16396/16398 blk 0 FPW
> rmgr: Heap        len (rec/tot):     54/  5046, tx:    5869245, lsn: 0/9B9F6B00, prev 0/9B9F5FF0, desc: INSERT off 54, blkref #0: rel 1663/16396/16427 blk 64203 FPW
> rmgr: Btree       len (rec/tot):     53/  7961, tx:    5869245, lsn: 0/9B9F7EB8, prev 0/9B9F6B00, desc: INSERT_LEAF off 281, blkref #0: rel 1663/16396/16446 blk 21754 FPW
> rmgr: Transaction len (rec/tot):     34/    34, tx:    5869245, lsn: 0/9B9F9DF0, prev 0/9B9F7EB8, desc: COMMIT 2017-08-24 12:19:59.560138 UTC

I still have the database available.


--
Sent via pgsql-bugs mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs
Reply | Threaded
Open this post in threaded view
|

Re: Standby corruption after master is restarted

Andres Freund
Hi,

On 2017-08-24 16:50:59 +0200, Emre Hasegeli wrote:
> One of our standby servers on a testing environment at InnoGames broke
> today.  The master and the other standby server is working fine.  I
> noticed that it happened right after the master had restarted.  It
> looks to me like the WAL file is corrupted.  I wanted to check, if it
> can be a bug.  These were on the log file:

To clarify: Is fsync=off used anywhere, and was this an OS level /
hardware level restart? And just on the master, not a standby?

Greetings,

Andres Freund


--
Sent via pgsql-bugs mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs
Reply | Threaded
Open this post in threaded view
|

Re: Standby corruption after master is restarted

hasegeli
> To clarify: Is fsync=off used anywhere

No.

> and was this an OS level / hardware level restart?

It was only Postgres restart.

> And just on the master, not a standby?

Yes, only master.


--
Sent via pgsql-bugs mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs
Reply | Threaded
Open this post in threaded view
|

Re: Standby corruption after master is restarted

hasegeli
In reply to this post by hasegeli
We experienced this issue again, this time on production.  The primary
instance was in a loop of being killed by Linux OOM-killer and being
restarted in 1 minute intervals.  The corruption only happened on one
of the two standbys.  The primary and the other standby have no
problems.  Only the was killed and restarted, the standbys were not.
There weren't any unusual settings, "fsync" was not disabled.  Here is
the information I collected.

The logs at the time standby broke:

> 2018-03-28 14:00:30 UTC [3693-67] LOG:  invalid resource manager ID 39 at 1DFB/D43BE688
> 2018-03-28 14:00:30 UTC [25347-1] LOG:  started streaming WAL from primary at 1DFB/D4000000 on timeline 5
> 2018-03-28 14:00:59 UTC [3748-357177] LOG:  restartpoint starting: time
> 2018-03-28 14:01:23 UTC [25347-2] FATAL:  could not receive data from WAL stream: SSL SYSCALL error: EOF detected
> 2018-03-28 14:01:24 UTC [3693-68] FATAL:  invalid memory alloc request size 1916035072

And from the next try:

> 2018-03-28 14:02:15 UTC [26808-5] LOG:  consistent recovery state reached at 1DFB/D6BDDFF8
> 2018-03-28 14:02:15 UTC [26808-6] FATAL:  invalid memory alloc request size 191603507

The output of pg_controldata:

> pg_control version number:            942
> Catalog version number:               201510051
> Database system identifier:           6223303827582122631
> Database cluster state:               in archive recovery
> pg_control last modified:             2018-03-28T15:25:51 UTC
> Latest checkpoint location:           1DFB/D1E17FD8
> Prior checkpoint location:            1DFB/D1E17FD8
> Latest checkpoint's REDO location:    1DFB/D1E17FD8
> Latest checkpoint's REDO WAL file:    0000000500001DFB000000D1
> Latest checkpoint's TimeLineID:       5
> Latest checkpoint's PrevTimeLineID:   5
> Latest checkpoint's full_page_writes: on
> Latest checkpoint's NextXID:          2/3522301360
> Latest checkpoint's NextOID:          141900
> Latest checkpoint's NextMultiXactId:  3434
> Latest checkpoint's NextMultiOffset:  7985
> Latest checkpoint's oldestXID:        3329390243
> Latest checkpoint's oldestXID's DB:   16385
> Latest checkpoint's oldestActiveXID:  0
> Latest checkpoint's oldestMultiXid:   1
> Latest checkpoint's oldestMulti's DB: 16385
> Latest checkpoint's oldestCommitTsXid:0
> Latest checkpoint's newestCommitTsXid:0
> Time of latest checkpoint:            2018-03-28T13:55:52 UTC
> Fake LSN counter for unlogged rels:   0/1
> Minimum recovery ending location:     1DFB/D6BDDFF8
> Min recovery ending loc's timeline:   5
> Backup start location:                0/0
> Backup end location:                  0/0
> End-of-backup record required:        no
> wal_level setting:                    hot_standby
> wal_log_hints setting:                on
> max_connections setting:              500
> max_worker_processes setting:         8
> max_prepared_xacts setting:           0
> max_locks_per_xact setting:           64
> track_commit_timestamp setting:       off
> Maximum data alignment:               8
> Database block size:                  8192
> Blocks per segment of large relation: 131072
> WAL block size:                       8192
> Bytes per WAL segment:                16777216
> Maximum length of identifiers:        64
> Maximum columns in an index:          32
> Maximum size of a TOAST chunk:        1996
> Size of a large-object chunk:         2048
> Date/time type storage:               64-bit integers
> Float4 argument passing:              by value
> Float8 argument passing:              by value
> Data page checksum version:           0

Hexdump of the WAL file at the position on standby:

0bddfc0 0002 0003 0004 0005 0006 0007 0008 0009
0bddfd0 000a 000b 000c 000d 000e 000f 0010 0011
0bddfe0 0012 0013 0014 0015 0016 0017 0000 0000
0bddff0 0000 0017 0000 0000 5f32 7234 4b55 7045
0bde000 d087 0005 0005 0000 e000 66bd 1dfb 0000
0bde010 1931 0000 0000 0000 5a43 7746 7166 6e34
0bde020 304e 764e 9c32 0158 5400 e709 0900 6f66
0bde030 0765 7375 6111 646e 6f72 6469 370d 312e

Hexdump at the same position on the master:

0bddfc0 0002 0003 0004 0005 0006 0007 0008 0009
0bddfd0 000a 000b 000c 000d 000e 000f 0010 0011
0bddfe0 0012 0013 0014 0015 0016 0017 0000 0000
0bddff0 0000 0017 0000 0000 006a 0000 0000 0000
0bde000 d087 0005 0005 0000 e000 d6bd 1dfb 0000
0bde010 0062 0000 0000 0000 df90 d6bd 1dfb 0000
0bde020 0000 0000 63d3 aa61 50ff dff8 d6bd 1dfb
0bde030 0000 0005 0000 0005 0000 0001 0000 0002

Reply | Threaded
Open this post in threaded view
|

Re: Standby corruption after master is restarted

Tomas Vondra-4
Hi Emre,

On 03/28/2018 07:50 PM, Emre Hasegeli wrote:
> We experienced this issue again, this time on production.  The primary
> instance was in a loop of being killed by Linux OOM-killer and being
> restarted in 1 minute intervals.  The corruption only happened on one
> of the two standbys.  The primary and the other standby have no
> problems.  Only the was killed and restarted, the standbys were not.
> There weren't any unusual settings, "fsync" was not disabled.  Here is
> the information I collected.
>

I've been trying to reproduce this by running a master with a couple of
replicas, and randomly restarting the master (while pgbench is running).
But so far no luck, so I guess something else is required to reproduce
the behavior ...

> The logs at the time standby broke:
>
>> 2018-03-28 14:00:30 UTC [3693-67] LOG:  invalid resource manager ID 39 at 1DFB/D43BE688
>> 2018-03-28 14:00:30 UTC [25347-1] LOG:  started streaming WAL from primary at 1DFB/D4000000 on timeline 5
>> 2018-03-28 14:00:59 UTC [3748-357177] LOG:  restartpoint starting: time
>> 2018-03-28 14:01:23 UTC [25347-2] FATAL:  could not receive data from WAL stream: SSL SYSCALL error: EOF detected
>> 2018-03-28 14:01:24 UTC [3693-68] FATAL:  invalid memory alloc request size 1916035072
>
> And from the next try:
>
>> 2018-03-28 14:02:15 UTC [26808-5] LOG:  consistent recovery state reached at 1DFB/D6BDDFF8
>> 2018-03-28 14:02:15 UTC [26808-6] FATAL:  invalid memory alloc request size 191603507
>

In the initial report (from August 2018) you shared pg_xlogdump output,
showing that the corrupted WAL record is an FPI_FOR_HINT right after
CHECKPOINT_SHUTDOWN. Was it the same case this time?

BTW which versions are we talking about? I see the initial report
mentioned catversion 201608131, this one mentions 201510051, so I'm
guessing 9.6 and 9.5. Which minor versions?

Is the master under load (accepting writes) before shutdown?

How was it restarted, actually? I see you're mentioning OOM killer, so I
guess "kill -9". What about the first report - was it the same case, or
was it restarted "nicely" using pg_ctl?

Could the replica receive the WAL in some other way - say, from a WAL
archive? What archive/restore commands you use?


regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Reply | Threaded
Open this post in threaded view
|

Re: Standby corruption after master is restarted

Tomas Vondra-4
On 04/14/2018 07:38 PM, Tomas Vondra wrote:
>
> How was it restarted, actually? I see you're mentioning OOM killer, so I
> guess "kill -9". What about the first report - was it the same case, or
> was it restarted "nicely" using pg_ctl?
>

Also, which process gets killed by the OOM killer? A random backend,
walsender or some other process?

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Reply | Threaded
Open this post in threaded view
|

Re: Standby corruption after master is restarted

hasegeli
In reply to this post by Tomas Vondra-4
> In the initial report (from August 2018) you shared pg_xlogdump output,
> showing that the corrupted WAL record is an FPI_FOR_HINT right after
> CHECKPOINT_SHUTDOWN. Was it the same case this time?

No, this time the xlogdump on master looks like this:

> rmgr: Heap2       len (rec/tot):    110/   110, tx:          0, lsn: 1DFB/D6BDDF20, prev 1DFB/D6BDDEE0, desc: CLEAN remxid 0, blkref #0: rel 1663/16385/141744 blk 331540
> rmgr: Heap2       len (rec/tot):    100/   100, tx:          0, lsn: 1DFB/D6BDDF90, prev 1DFB/D6BDDF20, desc: CLEAN remxid 0, blkref #0: rel 1663/16385/141744 blk 331541
> rmgr: XLOG        len (rec/tot):    106/   106, tx:          0, lsn: 1DFB/D6BDDFF8, prev 1DFB/D6BDDF90, desc: CHECKPOINT_SHUTDOWN redo 1DFB/D6BDDFF8; tli 5; prev tli 5; fpw true; xid 2/3522301709; oid 141900; multi 3434; offset 7985; oldest xid 3329390243 in DB 16385; oldest multi 1 in DB 16385; oldest/newest commit timestamp xid: 0/0; oldest running xid 0; shutdown
> rmgr: Heap        len (rec/tot):     56/  2511, tx: 3522301709, lsn: 1DFB/D6BDE080, prev 1DFB/D6BDDFF8, desc: INSERT off 93, blkref #0: rel 1663/40796/40826 blk 0 FPW
> rmgr: Transaction len (rec/tot):     34/    34, tx: 3522301709, lsn: 1DFB/D6BDEA50, prev 1DFB/D6BDE080, desc: COMMIT 2018-03-28 14:01:36.800811 UTC
> rmgr: Heap        len (rec/tot):     56/  1257, tx: 3522301710, lsn: 1DFB/D6BDEA78, prev 1DFB/D6BDEA50, desc: INSERT off 41, blkref #0: rel 1663/40796/40820 blk 345119 FPW
> rmgr: Btree       len (rec/tot):     55/  1857, tx: 3522301710, lsn: 1DFB/D6BDEF68, prev 1DFB/D6BDEA78, desc: INSERT_LEAF off 125, blkref #0: rel 1663/40796/40839 blk 107903 FPW
> rmgr: Transaction len (rec/tot):     34/    34, tx: 3522301710, lsn: 1DFB/D6BDF6B0, prev 1DFB/D6BDEF68, desc: COMMIT 2018-03-28 14:01:37.361010 UTC

> BTW which versions are we talking about? I see the initial report
> mentioned catversion 201608131, this one mentions 201510051, so I'm
> guessing 9.6 and 9.5. Which minor versions?

The first incident was on 9.6.3, the second 9.5.7.

> Is the master under load (accepting writes) before shutdown?

Yes.  Especially on the second incident it was being written heavily.
The master gets up to 10k INSERT transactions per second before being
killed.

> How was it restarted, actually? I see you're mentioning OOM killer, so I
> guess "kill -9". What about the first report - was it the same case, or
> was it restarted "nicely" using pg_ctl?

On the first incident the master was being restarted as part of
upgrade from 9.6.3 to 9.6.4.  It should have been nicely done by the
Debian upgrade scripts.

> Could the replica receive the WAL in some other way - say, from a WAL
> archive? What archive/restore commands you use?

We have repmgr + barman setup:

> archive_command = 'test $(psql -Atc "select failed_count from pg_stat_archiver") -gt 10 || rsync --quiet --contimeout=2 --timeout=2 %p BARMAN_HOST::barman/'
> restore_command = 'barman-wal-restore --gzip BARMAN_HOST BARMAN_CLUSTER %f %p'

Reply | Threaded
Open this post in threaded view
|

Re: Standby corruption after master is restarted

hasegeli
In reply to this post by Tomas Vondra-4
> Also, which process gets killed by the OOM killer? A random backend,
> walsender or some other process?

It was always a SELECT query that was killed.  I assume it was because
of the GIN index we have with huge pending list.  Postgres was
"terminating other connections because of crash of another server
process".

Reply | Threaded
Open this post in threaded view
|

Re: Standby corruption after master is restarted

Tomas Vondra-4
In reply to this post by hasegeli


On 04/16/2018 10:55 AM, Emre Hasegeli wrote:

>> In the initial report (from August 2018) you shared pg_xlogdump output,
>> showing that the corrupted WAL record is an FPI_FOR_HINT right after
>> CHECKPOINT_SHUTDOWN. Was it the same case this time?
>
> No, this time the xlogdump on master looks like this:
>
>> rmgr: Heap2       len (rec/tot):    110/   110, tx:          0, lsn: 1DFB/D6BDDF20, prev 1DFB/D6BDDEE0, desc: CLEAN remxid 0, blkref #0: rel 1663/16385/141744 blk 331540
>> rmgr: Heap2       len (rec/tot):    100/   100, tx:          0, lsn: 1DFB/D6BDDF90, prev 1DFB/D6BDDF20, desc: CLEAN remxid 0, blkref #0: rel 1663/16385/141744 blk 331541
>> rmgr: XLOG        len (rec/tot):    106/   106, tx:          0, lsn: 1DFB/D6BDDFF8, prev 1DFB/D6BDDF90, desc: CHECKPOINT_SHUTDOWN redo 1DFB/D6BDDFF8; tli 5; prev tli 5; fpw true; xid 2/3522301709; oid 141900; multi 3434; offset 7985; oldest xid 3329390243 in DB 16385; oldest multi 1 in DB 16385; oldest/newest commit timestamp xid: 0/0; oldest running xid 0; shutdown
>> rmgr: Heap        len (rec/tot):     56/  2511, tx: 3522301709, lsn: 1DFB/D6BDE080, prev 1DFB/D6BDDFF8, desc: INSERT off 93, blkref #0: rel 1663/40796/40826 blk 0 FPW
>> rmgr: Transaction len (rec/tot):     34/    34, tx: 3522301709, lsn: 1DFB/D6BDEA50, prev 1DFB/D6BDE080, desc: COMMIT 2018-03-28 14:01:36.800811 UTC
>> rmgr: Heap        len (rec/tot):     56/  1257, tx: 3522301710, lsn: 1DFB/D6BDEA78, prev 1DFB/D6BDEA50, desc: INSERT off 41, blkref #0: rel 1663/40796/40820 blk 345119 FPW
>> rmgr: Btree       len (rec/tot):     55/  1857, tx: 3522301710, lsn: 1DFB/D6BDEF68, prev 1DFB/D6BDEA78, desc: INSERT_LEAF off 125, blkref #0: rel 1663/40796/40839 blk 107903 FPW
>> rmgr: Transaction len (rec/tot):     34/    34, tx: 3522301710, lsn: 1DFB/D6BDF6B0, prev 1DFB/D6BDEF68, desc: COMMIT 2018-03-28 14:01:37.361010 UTC
>

OK, and on the standby? I suppose it breaks right after the
CHECKPOINT_SHUTDOWN record, right?

>> BTW which versions are we talking about? I see the initial report
>> mentioned catversion 201608131, this one mentions 201510051, so I'm
>> guessing 9.6 and 9.5. Which minor versions?
>
> The first incident was on 9.6.3, the second 9.5.7.
>
>> Is the master under load (accepting writes) before shutdown?
>
> Yes.  Especially on the second incident it was being written heavily.
> The master gets up to 10k INSERT transactions per second before being
> killed.
>
>> How was it restarted, actually? I see you're mentioning OOM killer, so I
>> guess "kill -9". What about the first report - was it the same case, or
>> was it restarted "nicely" using pg_ctl?
>
> On the first incident the master was being restarted as part of
> upgrade from 9.6.3 to 9.6.4.  It should have been nicely done by the
> Debian upgrade scripts.
>

Hmmm. I'm not sure what exactly debian upgrade script (and IIRC it
depends on packages). I wonder if it does fast/smart/immediate.

>> Could the replica receive the WAL in some other way - say, from a WAL
>> archive? What archive/restore commands you use?
>
> We have repmgr + barman setup:
>
>> archive_command = 'test $(psql -Atc "select failed_count from pg_stat_archiver") -gt 10 || rsync --quiet --contimeout=2 --timeout=2 %p BARMAN_HOST::barman/'
>> restore_command = 'barman-wal-restore --gzip BARMAN_HOST BARMAN_CLUSTER %f %p'

And can you determine how the broken WAL got to the standby? Was it by
directly fetching from the primary, or from the barman archive? Is it
broken in the archive too?

I think one hypothesis is that the standby starts fetching a WAL segment
from the primary, and while doing that it gets recycled and overwritten
(so the standby gets bogus somewhere half-way through). That should not
be possible, of course, but perhaps there's something broken right after
a restart? A race condition or something like that?

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Reply | Threaded
Open this post in threaded view
|

Re: Standby corruption after master is restarted

hasegeli
> OK, and on the standby? I suppose it breaks right after the
> CHECKPOINT_SHUTDOWN record, right?

Actually pg_xlogdump fails right at the SHUTDOWN record:

> rmgr: Heap2       len (rec/tot):     58/    58, tx:          0, lsn: 1DFB/D6BDDEA0, prev 1DFB/D6BDDE60, desc: CLEAN remxid 0, blkref #0: rel 1663/16385/141744 blk 331538
> rmgr: Heap2       len (rec/tot):     58/    58, tx:          0, lsn: 1DFB/D6BDDEE0, prev 1DFB/D6BDDEA0, desc: CLEAN remxid 0, blkref #0: rel 1663/16385/141744 blk 331539
> rmgr: Heap2       len (rec/tot):    110/   110, tx:          0, lsn: 1DFB/D6BDDF20, prev 1DFB/D6BDDEE0, desc: CLEAN remxid 0, blkref #0: rel 1663/16385/141744 blk 331540
> rmgr: Heap2       len (rec/tot):    100/   100, tx:          0, lsn: 1DFB/D6BDDF90, prev 1DFB/D6BDDF20, desc: CLEAN remxid 0, blkref #0: rel 1663/16385/141744 blk 331541
> pg_xlogdump: FATAL:  error in WAL record at 1DFB/D6BDDF90: unexpected pageaddr 1DFB/66BDE000 in log segment 0000000000001DFB000000D6, offset 12443648

> Hmmm. I'm not sure what exactly debian upgrade script (and IIRC it
> depends on packages). I wonder if it does fast/smart/immediate.

I couldn't find it quickly.

> And can you determine how the broken WAL got to the standby? Was it by
> directly fetching from the primary, or from the barman archive? Is it
> broken in the archive too?

It must be streamed from the primary.  The log was on on the archive
during the incident.  The good log file came there later.  Such logs
are repeated on the standby before the error:

> 2018-03-28 14:00:30 UTC [3693-67] LOG:  invalid resource manager ID 39 at 1DFB/D43BE688
> 2018-03-28 14:00:30 UTC [25347-1] LOG:  started streaming WAL from primary at 1DFB/D4000000 on timeline 5
> 2018-03-28 14:00:59 UTC [3748-357177] LOG:  restartpoint starting: time
> WARNING:  terminating connection because of crash of another server process
> DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared m
emory.
> HINT:  In a moment you should be able to reconnect to the database and repeat your command.
> 2018-03-28 14:01:23 UTC [25347-2] FATAL:  could not receive data from WAL stream: SSL SYSCALL error: EOF detected

> I think one hypothesis is that the standby starts fetching a WAL segment
> from the primary, and while doing that it gets recycled and overwritten
> (so the standby gets bogus somewhere half-way through). That should not
> be possible, of course, but perhaps there's something broken right after
> a restart? A race condition or something like that?

If I haven't missed anything on the hexdumps I posted, the logs of the
master and the standby are the same until the position on
pg_controldata.  I cannot understand how it can fail for something
later.  Is it possible that this is triggered if the replication is
interrupted on a certain point?

Reply | Threaded
Open this post in threaded view
|

Re: Standby corruption after master is restarted

Tomas Vondra-4


On 04/16/2018 04:55 PM, Emre Hasegeli wrote:

>> OK, and on the standby? I suppose it breaks right after the
>> CHECKPOINT_SHUTDOWN record, right?
>
> Actually pg_xlogdump fails right at the SHUTDOWN record:
>
>> rmgr: Heap2       len (rec/tot):     58/    58, tx:          0, lsn: 1DFB/D6BDDEA0, prev 1DFB/D6BDDE60, desc: CLEAN remxid 0, blkref #0: rel 1663/16385/141744 blk 331538
>> rmgr: Heap2       len (rec/tot):     58/    58, tx:          0, lsn: 1DFB/D6BDDEE0, prev 1DFB/D6BDDEA0, desc: CLEAN remxid 0, blkref #0: rel 1663/16385/141744 blk 331539
>> rmgr: Heap2       len (rec/tot):    110/   110, tx:          0, lsn: 1DFB/D6BDDF20, prev 1DFB/D6BDDEE0, desc: CLEAN remxid 0, blkref #0: rel 1663/16385/141744 blk 331540
>> rmgr: Heap2       len (rec/tot):    100/   100, tx:          0, lsn: 1DFB/D6BDDF90, prev 1DFB/D6BDDF20, desc: CLEAN remxid 0, blkref #0: rel 1663/16385/141744 blk 331541
>> pg_xlogdump: FATAL:  error in WAL record at 1DFB/D6BDDF90: unexpected pageaddr 1DFB/66BDE000 in log segment 0000000000001DFB000000D6, offset 12443648
>
>> Hmmm. I'm not sure what exactly debian upgrade script (and IIRC it
>> depends on packages). I wonder if it does fast/smart/immediate.
>
> I couldn't find it quickly.
>
>> And can you determine how the broken WAL got to the standby? Was it by
>> directly fetching from the primary, or from the barman archive? Is it
>> broken in the archive too?
>
> It must be streamed from the primary.  The log was on on the archive
> during the incident.  The good log file came there later.  Such logs
> are repeated on the standby before the error:
>

OK, so the standby connects to the primary, gets a (corrupted) WAL
segment from it, which then gets archived? That would contradict the
theory of race condition between recycling the WAL segment and copying
it to the standby, because we never recycle segments that have not been
archived yet (obviously).

But I'm not sure that's the exact sequence of events, of course.

Can you check if the "incorrect" part of the WAL segment matches some
previous segment? Verifying that shouldn't be very difficult (just cut a
bunch of bytes using hexdump, compare to the incorrect data). Assuming
you still have the WAL archive, of course. That would tell us that the
corrupted part comes from an old recycled segment.

>> 2018-03-28 14:00:30 UTC [3693-67] LOG:  invalid resource manager ID 39 at 1DFB/D43BE688
>> 2018-03-28 14:00:30 UTC [25347-1] LOG:  started streaming WAL from primary at 1DFB/D4000000 on timeline 5
>> 2018-03-28 14:00:59 UTC [3748-357177] LOG:  restartpoint starting: time
>> WARNING:  terminating connection because of crash of another server process
>> DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared m
> emory.
>> HINT:  In a moment you should be able to reconnect to the database and repeat your command.
>> 2018-03-28 14:01:23 UTC [25347-2] FATAL:  could not receive data from WAL stream: SSL SYSCALL error: EOF detected
>

Hmmm, I see you're using SSL. I don't think that could break affect
anything, but maybe I should try mimicking this aspect too.

>> I think one hypothesis is that the standby starts fetching a WAL segment
>> from the primary, and while doing that it gets recycled and overwritten
>> (so the standby gets bogus somewhere half-way through). That should not
>> be possible, of course, but perhaps there's something broken right after
>> a restart? A race condition or something like that?
>
> If I haven't missed anything on the hexdumps I posted, the logs of the
> master and the standby are the same until the position on
> pg_controldata.  I cannot understand how it can fail for something
> later.  Is it possible that this is triggered if the replication is
> interrupted on a certain point?
>

Which position? The controldata shows location of the last checkpoint,
which says where the replay should start. Or am I missing something?

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Reply | Threaded
Open this post in threaded view
|

Re: Standby corruption after master is restarted

hasegeli
> Can you check if the "incorrect" part of the WAL segment matches some
> previous segment? Verifying that shouldn't be very difficult (just cut a
> bunch of bytes using hexdump, compare to the incorrect data). Assuming
> you still have the WAL archive, of course. That would tell us that the
> corrupted part comes from an old recycled segment.

I had found and saved the recycled WAL file from the archive after the
incident.  Here is the hexdump of it at the same position:

0bddfc0 3253 4830 616f 5034 5243 4d79 664f 6164
0bddfd0 3967 592d 7963 7967 5541 4a59 3066 4f50
0bddfe0 2d55 346e 4254 3559 6a4e 726b 4e30 6f52
0bddff0 3876 4751 4a38 5956 5f32 7234 4b55 7045
0bde000 d087 0005 0005 0000 e000 66bd 1dfb 0000
0bde010 1931 0000 0000 0000 5a43 7746 7166 6e34
0bde020 304e 764e 9c32 0158 5400 e709 0900 6f66
0bde030 0765 7375 6111 646e 6f72 6469 370d 312e

If you compare it with the other 2 I have posted, you would notice
that the corrupted file on standby is combination of the two.  The
data on it starts with the data on the master, and continues with the
data of the recycled file.  The switch is at the position 0bddff8
which is the position printed as "Minimum recovery ending location" by
pg_controldata.

> Hmmm, I see you're using SSL. I don't think that could break affect
> anything, but maybe I should try mimicking this aspect too.

This is the connection information.  Although the master shows SSL
compression is disabled in despite of being explicitly asked for.

> primary_conninfo = 'host=MASTER_NODE port=5432 dbname=repmgr user=repmgr connect_timeout=10 sslcompression=1'

Reply | Threaded
Open this post in threaded view
|

Re: Standby corruption after master is restarted

Tomas Vondra-4


On 04/17/2018 10:55 AM, Emre Hasegeli wrote:

>> Can you check if the "incorrect" part of the WAL segment matches some
>> previous segment? Verifying that shouldn't be very difficult (just cut a
>> bunch of bytes using hexdump, compare to the incorrect data). Assuming
>> you still have the WAL archive, of course. That would tell us that the
>> corrupted part comes from an old recycled segment.
>
> I had found and saved the recycled WAL file from the archive after the
> incident.  Here is the hexdump of it at the same position:
>
> 0bddfc0 3253 4830 616f 5034 5243 4d79 664f 6164
> 0bddfd0 3967 592d 7963 7967 5541 4a59 3066 4f50
> 0bddfe0 2d55 346e 4254 3559 6a4e 726b 4e30 6f52
> 0bddff0 3876 4751 4a38 5956 5f32 7234 4b55 7045
> 0bde000 d087 0005 0005 0000 e000 66bd 1dfb 0000
> 0bde010 1931 0000 0000 0000 5a43 7746 7166 6e34
> 0bde020 304e 764e 9c32 0158 5400 e709 0900 6f66
> 0bde030 0765 7375 6111 646e 6f72 6469 370d 312e
>
> If you compare it with the other 2 I have posted, you would notice
> that the corrupted file on standby is combination of the two.  The
> data on it starts with the data on the master, and continues with the
> data of the recycled file.  The switch is at the position 0bddff8
> which is the position printed as "Minimum recovery ending location" by
> pg_controldata.
>

OK, this seems to confirm the theory that there's a race condition
between segment recycling and replicating. It's likely limited to short
period after a crash, otherwise we'd probably see many more reports.

But it's still just  hunch - someone needs to read through the code and
check how it behaves in these situations. Not sure when I'll have time
for that.

>> Hmmm, I see you're using SSL. I don't think that could break affect
>> anything, but maybe I should try mimicking this aspect too.
>
> This is the connection information.  Although the master shows SSL
> compression is disabled in despite of being explicitly asked for.
>
>> primary_conninfo = 'host=MASTER_NODE port=5432 dbname=repmgr user=repmgr connect_timeout=10 sslcompression=1'

Hmmm, that seems like a separate issue. When you say 'master shows SSL
compression is disabled' where do you see that?

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Reply | Threaded
Open this post in threaded view
|

Re: Standby corruption after master is restarted

hasegeli
> OK, this seems to confirm the theory that there's a race condition between
> segment recycling and replicating. It's likely limited to short period after
> a crash, otherwise we'd probably see many more reports.

I am suspicious about this part.  It cannot be happening 2 times to us
and never to anybody else.  Maybe people do not report it because it
is easy to deal with the problem.  You just delete the corrupted WAL
file.  Or maybe there is something special in our infrastructure.

Thank you for your help so far.

>> This is the connection information.  Although the master shows SSL
>> compression is disabled in despite of being explicitly asked for.

> Hmmm, that seems like a separate issue. When you say 'master shows SSL
> compression is disabled' where do you see that?

I thought it might be related first, so I mentioned.  Then I found out
on pg_stat_ssl view on the server that it is actually disabled.
Probably the configuration of OpenSSL is disabling it.

Reply | Threaded
Open this post in threaded view
|

Re: Standby corruption after master is restarted

Kyotaro HORIGUCHI-2
Hello. I found how this occurs.

# added -hackers

I've seen similar case with inadequite operation but this time I
found that it can happen under normal operation.

At Tue, 17 Apr 2018 17:14:05 +0200, Emre Hasegeli <[hidden email]> wrote in <CAE2gYzwB0+MM3L6-zdK0JKO_aXqHNCKU7VAiDJk_kKRRJ4B=[hidden email]>

> > OK, this seems to confirm the theory that there's a race condition between
> > segment recycling and replicating. It's likely limited to short period after
> > a crash, otherwise we'd probably see many more reports.
>
> I am suspicious about this part.  It cannot be happening 2 times to us
> and never to anybody else.  Maybe people do not report it because it
> is easy to deal with the problem.  You just delete the corrupted WAL
> file.  Or maybe there is something special in our infrastructure.
>
> Thank you for your help so far.
>
> >> This is the connection information.  Although the master shows SSL
> >> compression is disabled in despite of being explicitly asked for.
>
> > Hmmm, that seems like a separate issue. When you say 'master shows SSL
> > compression is disabled' where do you see that?
>
> I thought it might be related first, so I mentioned.  Then I found out
> on pg_stat_ssl view on the server that it is actually disabled.
> Probably the configuration of OpenSSL is disabling it.
A segment is not cleard on recycling. walreceiver writes WAL
record by record so startup process can see arbitrary byte
sequence after the last valid record when replication connection
is lost or standby is restarted.

The following scenario results in the similar situation.

1. create master and standby and run.

   It makes happen this easily if wal_keep_segments is set large
   (20 or so) on master and 0 on standby.

2. Write WAL to recycle happens on standby. Explicit checkpoints
   on standby make it faster. May be required to run several
   rounds before we see recycled segment on standby.

   maybe_loop {
     master:
       create table t (a int);
       insert into t (select a from generate_series(0, 150000) a);
       delete from t;
       checkpoint;

     standby:
       checkpoint;
       <check for recycled segments in pg_wal>
   }

3. stop master

4. standby starts to complain that master is missing.

  At this time, standby complains for several kinds of failure. I
  saw 'invalid record length' and 'incorrect prev-link' this
  time. I saw 'invalid resource manager ID' when mixing different
  size records. If XLogReadRecord saw a record with impossibly
  large tot_length there, it will causes the palloc failure and
  startup process dies.


5. If you see 'zero length record', it's nothing interesting.
  Repeat 3 and 4 to see another.


This doesn't seem to happen on master since XLogWrite writes in
units of page so redo always sees zero length record or wrong
magic, or pageaddr at page boundaries after a crash. On the other
hand walreceiver writes in unit of records so redo can see
arbitrary byte sequence as xl_tot_len, xl_prev, xl_rmid while
fetching the next record. Espcially the first among them can
cause FATAL with "invalid memory alloc".

I think this behavior is a bug. XLogReadRecord is considering the
case but palloc_extended() breaks it. So in the attached, add a
new flag MCXT_ALLOC_NO_PARAMERR to palloc_extended() and
allocate_recordbuf calls it with the new flag. That alone fixes
the problem. However, the patch frees read state buffer facing
errorneous records since such records can leave a too-large
buffer allocated.

After applying this patch, the behavior for the situation changes
as follows.

> LOG:  started streaming WAL from primary at 0/83000000 on timeline 1
> LOG:  record length 1298694144 at 0/83C17B70 too long
> FATAL:  terminating walreceiver process due to administrator command
> LOG:  record length 1298694144 at 0/83C17B70 too long
> LOG:  record length 1298694144 at 0/83C17B70 too long
> LOG:  received promote request
> LOG:  redo done at 0/83C17B38
> LOG:  last completed transaction was at log time 2018-04-26 19:10:12.360253+09
> LOG:  selected new timeline ID: 2
> LOG:  archive recovery complete
> LOG:  database system is ready to accept connections

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 3a86f3497e..02c26224fc 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -165,7 +165,8 @@ allocate_recordbuf(XLogReaderState *state, uint32 reclength)
  if (state->readRecordBuf)
  pfree(state->readRecordBuf);
  state->readRecordBuf =
- (char *) palloc_extended(newSize, MCXT_ALLOC_NO_OOM);
+ (char *) palloc_extended(newSize,
+ MCXT_ALLOC_NO_OOM | MCXT_ALLOC_NO_PARAMERR);
  if (state->readRecordBuf == NULL)
  {
  state->readRecordBufSize = 0;
@@ -601,10 +602,19 @@ err:
 
 /*
  * Invalidate the xlogreader's read state to force a re-read.
+ *
+ * read state might hold too large buffer on reading invalid record so release
+ * the buffer as well.
  */
 void
 XLogReaderInvalReadState(XLogReaderState *state)
 {
+ if (state->readRecordBuf != NULL)
+ {
+ pfree(state->readRecordBuf);
+ state->readRecordBuf = NULL;
+ state->readRecordBufSize = 0;
+ }
  state->readSegNo = 0;
  state->readOff = 0;
  state->readLen = 0;
diff --git a/src/backend/utils/mmgr/mcxt.c b/src/backend/utils/mmgr/mcxt.c
index ebe0342f18..6e7f95d11f 100644
--- a/src/backend/utils/mmgr/mcxt.c
+++ b/src/backend/utils/mmgr/mcxt.c
@@ -996,7 +996,12 @@ palloc_extended(Size size, int flags)
 
  if (((flags & MCXT_ALLOC_HUGE) != 0 && !AllocHugeSizeIsValid(size)) ||
  ((flags & MCXT_ALLOC_HUGE) == 0 && !AllocSizeIsValid(size)))
+ {
+ if (flags & MCXT_ALLOC_NO_PARAMERR)
+ return NULL;
+
  elog(ERROR, "invalid memory alloc request size %zu", size);
+ }
 
  context->isReset = false;
 
diff --git a/src/include/common/fe_memutils.h b/src/include/common/fe_memutils.h
index 458743dd40..5859ec087f 100644
--- a/src/include/common/fe_memutils.h
+++ b/src/include/common/fe_memutils.h
@@ -16,7 +16,8 @@
 #define MCXT_ALLOC_HUGE 0x01 /* allow huge allocation (> 1 GB) not
  * actually used for frontends */
 #define MCXT_ALLOC_NO_OOM 0x02 /* no failure if out-of-memory */
-#define MCXT_ALLOC_ZERO 0x04 /* zero allocated memory */
+#define MCXT_ALLOC_NO_PARAMERR 0x04 /* no failure on parameter error */
+#define MCXT_ALLOC_ZERO 0x08 /* zero allocated memory */
 
 /*
  * "Safe" memory allocation functions --- these exit(1) on failure
diff --git a/src/include/utils/palloc.h b/src/include/utils/palloc.h
index 781e948f69..eb1ecc9ee6 100644
--- a/src/include/utils/palloc.h
+++ b/src/include/utils/palloc.h
@@ -63,7 +63,8 @@ extern PGDLLIMPORT MemoryContext CurrentMemoryContext;
  */
 #define MCXT_ALLOC_HUGE 0x01 /* allow huge allocation (> 1 GB) */
 #define MCXT_ALLOC_NO_OOM 0x02 /* no failure if out-of-memory */
-#define MCXT_ALLOC_ZERO 0x04 /* zero allocated memory */
+#define MCXT_ALLOC_NO_PARAMERR 0x04 /* no failure on parameter error */
+#define MCXT_ALLOC_ZERO 0x08 /* zero allocated memory */
 
 /*
  * Fundamental memory-allocation operations (more are in utils/memutils.h)
Reply | Threaded
Open this post in threaded view
|

Re: Standby corruption after master is restarted

Michael Paquier-2
On Thu, Apr 26, 2018 at 07:53:04PM +0900, Kyotaro HORIGUCHI wrote:
> I think this behavior is a bug. XLogReadRecord is considering the
> case but palloc_extended() breaks it. So in the attached, add a
> new flag MCXT_ALLOC_NO_PARAMERR to palloc_extended() and
> allocate_recordbuf calls it with the new flag. That alone fixes
> the problem. However, the patch frees read state buffer facing
> errorneous records since such records can leave a too-large
> buffer allocated.

This problem is already discussed here:
https://commitfest.postgresql.org/18/1516/

And here is the thread:
https://www.postgresql.org/message-id/flat/0A3221C70F24FB45833433255569204D1F8B57AD@G01JPEXMBYT05

Tsunakawa-san and I discussed a couple of approaches.  Extending
palloc_extended so as an incorrect length does not result in an error is
also something that crossed by mind, but the length handling is
different on the backend and the frontend, so I discarded the idea you
are proposing here and instead relied on a check with AllocSizeIsValid,
which gives a more simple patch:
https://www.postgresql.org/message-id/20180314052753.GA16179%40paquier.xyz

This got no interest from committers yet unfortunately, but I think that
this is a real problem which should be back-patched :(
--
Michael

signature.asc (849 bytes) Download Attachment