12.3 replicas falling over during WAL redo

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
19 messages Options
Reply | Threaded
Open this post in threaded view
|

12.3 replicas falling over during WAL redo

Ben Chobot
We have a few hundred postgres servers in AWS EC2, all of which do streaming replication to at least two replicas. As we've transitioned our fleet to from 9.5 to 12.3, we've noticed an alarming increase in the frequency of a streaming replica dying during replay. Postgres will log something like:

2020-07-31T16:55:22.602488+00:00 hostA postgres[31875]: [19137-1] db=,user= LOG:  restartpoint starting: time
2020-07-31T16:55:24.637150+00:00 hostA postgres[24076]: [15754-1] db=,user= FATAL:  incorrect index offsets supplied
2020-07-31T16:55:24.637261+00:00 hostA postgres[24076]: [15754-2] db=,user= CONTEXT:  WAL redo at BCC/CB7AF8B0 for Btree/VACUUM: lastBlockVacuumed 1720
2020-07-31T16:55:24.642877+00:00 hostA postgres[24074]: [8-1] db=,user= LOG:  startup process (PID 24076) exited with exit code 1

...or:

2020-07-31T10:43:35.520066+00:00 hostB postgres[11836]: [12-1] db=,user= WARNING:  will not overwrite a used ItemId
2020-07-31T10:43:35.520144+00:00 hostB postgres[11836]: [12-2] db=,user= CONTEXT:  WAL redo at AD1/6948E9E0 for Heap/HOT_UPDATE: off 4 xmax 128308492 flags 0x10 ; new off 2 xmax 0
2020-07-31T10:43:35.520180+00:00 hostB postgres[11836]: [13-1] db=,user= PANIC:  failed to add tuple
2020-07-31T10:43:35.520220+00:00 hostB postgres[11836]: [13-2] db=,user= CONTEXT:  WAL redo at AD1/6948E9E0 for Heap/HOT_UPDATE: off 4 xmax 128308492 flags 0x10 ; new off 2 xmax 0

...or:
2020-07-30T14:59:36.839243+00:00 hostC postgres[24338]: [45253-1] db=,user= WARNING:  specified item offset is too large
2020-07-30T14:59:36.839307+00:00 hostC postgres[24338]: [45253-2] db=,user= CONTEXT:  WAL redo at A0A/AC4204A0 for Btree/INSERT_LEAF: off 48
2020-07-30T14:59:36.839337+00:00 hostC postgres[24338]: [45254-1] db=,user= PANIC:  btree_xlog_insert: failed to add item
2020-07-30T14:59:36.839366+00:00 hostC postgres[24338]: [45254-2] db=,user= CONTEXT:  WAL redo at A0A/AC4204A0 for Btree/INSERT_LEAF: off 48
2020-07-30T14:59:37.587173+00:00 hostC postgres[24337]: [11-1] db=,user= LOG:  startup process (PID 24338) was terminated by signal 6: Aborted

Each time, a simple restart of the postgres service will bring the database back to a happy state and it will merrily replicate past the LSN where it had died before. This has never (yet) happened on a primary db, and (so far) always on only one one of the replicas the primary is replicating to, leading me to think there isn't anything actually wrong with my data. Still, this is no way to run a database. We never saw this problem in 9.5, but it happened 3 times just on Friday. We have taken the opportunity to enable checksuming with our move to 12, but this doesn't appear to be related to that, to my untrained eyes.

A lot of suggestions I've heard to fix problems which sound like this involve reindexing, and while we haven't yet tried that, it doesn't seem likely to help. We upgraded from 9.5 to 12 using pglogical, so all our indices were created from 12.3 code. That said, we were running pg_repack in an automated fashion for a bit, which seemed to be causing issues in 12 that we haven't had time to track down and so have currently disabled.

With the non-durability of the problem, I'm out of ideas of what to look for. Suggestions?
Reply | Threaded
Open this post in threaded view
|

Re: 12.3 replicas falling over during WAL redo

Alvaro Herrera-9
On 2020-Aug-01, Ben Chobot wrote:

> We have a few hundred postgres servers in AWS EC2, all of which do streaming
> replication to at least two replicas. As we've transitioned our fleet to
> from 9.5 to 12.3, we've noticed an alarming increase in the frequency of a
> streaming replica dying during replay. Postgres will log something like:
>
> |2020-07-31T16:55:22.602488+00:00 hostA postgres[31875]: [19137-1] db=,user=
> LOG: restartpoint starting: time 2020-07-31T16:55:24.637150+00:00 hostA
> postgres[24076]: [15754-1] db=,user= FATAL: incorrect index offsets supplied
> 2020-07-31T16:55:24.637261+00:00 hostA postgres[24076]: [15754-2] db=,user=
> CONTEXT: WAL redo at BCC/CB7AF8B0 for Btree/VACUUM: lastBlockVacuumed 1720
> 2020-07-31T16:55:24.642877+00:00 hostA postgres[24074]: [8-1] db=,user= LOG:
> startup process (PID 24076) exited with exit code 1|

I've never seen this one.

Can you find out what the index is being modified by those LSNs -- is it
always the same index?  Can you have a look at nearby WAL records that
touch the same page of the same index in each case?

One possibility is that the storage forgot a previous write.

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Reply | Threaded
Open this post in threaded view
|

Re: 12.3 replicas falling over during WAL redo

Ben Chobot


Alvaro Herrera wrote on 8/1/20 9:35 AM:

> On 2020-Aug-01, Ben Chobot wrote:
>
>> We have a few hundred postgres servers in AWS EC2, all of which do streaming
>> replication to at least two replicas. As we've transitioned our fleet to
>> from 9.5 to 12.3, we've noticed an alarming increase in the frequency of a
>> streaming replica dying during replay. Postgres will log something like:
>>
>> |2020-07-31T16:55:22.602488+00:00 hostA postgres[31875]: [19137-1] db=,user=
>> LOG: restartpoint starting: time 2020-07-31T16:55:24.637150+00:00 hostA
>> postgres[24076]: [15754-1] db=,user= FATAL: incorrect index offsets supplied
>> 2020-07-31T16:55:24.637261+00:00 hostA postgres[24076]: [15754-2] db=,user=
>> CONTEXT: WAL redo at BCC/CB7AF8B0 for Btree/VACUUM: lastBlockVacuumed 1720
>> 2020-07-31T16:55:24.642877+00:00 hostA postgres[24074]: [8-1] db=,user= LOG:
>> startup process (PID 24076) exited with exit code 1|
> I've never seen this one.
>
> Can you find out what the index is being modified by those LSNs -- is it
> always the same index?  Can you have a look at nearby WAL records that
> touch the same page of the same index in each case?
>
> One possibility is that the storage forgot a previous write.

I'd be happy to, if you tell me how. :)

We're using xfs for our postgres filesystem, on ubuntu bionic. Of course
it's always possible there's something wrong in the filesystem or the
EBS layer, but that is one thing we have not changed in the migration
from 9.5 to 12.3.


Reply | Threaded
Open this post in threaded view
|

Re: 12.3 replicas falling over during WAL redo

Kyotaro Horiguchi-4
At Sat, 1 Aug 2020 09:58:05 -0700, Ben Chobot <[hidden email]> wrote in

>
>
> Alvaro Herrera wrote on 8/1/20 9:35 AM:
> > On 2020-Aug-01, Ben Chobot wrote:
> >
> >> We have a few hundred postgres servers in AWS EC2, all of which do
> >> streaming
> >> replication to at least two replicas. As we've transitioned our fleet
> >> to
> >> from 9.5 to 12.3, we've noticed an alarming increase in the frequency
> >> of a
> >> streaming replica dying during replay. Postgres will log something
> >> like:
> >>
> >> |2020-07-31T16:55:22.602488+00:00 hostA postgres[31875]: [19137-1]
> >> |db=,user=
> >> LOG: restartpoint starting: time 2020-07-31T16:55:24.637150+00:00
> >> hostA
> >> postgres[24076]: [15754-1] db=,user= FATAL: incorrect index offsets
> >> supplied
> >> 2020-07-31T16:55:24.637261+00:00 hostA postgres[24076]: [15754-2]
> >> db=,user=
> >> CONTEXT: WAL redo at BCC/CB7AF8B0 for Btree/VACUUM: lastBlockVacuumed
> >> 1720
> >> 2020-07-31T16:55:24.642877+00:00 hostA postgres[24074]: [8-1]
> >> db=,user= LOG:
> >> startup process (PID 24076) exited with exit code 1|
> > I've never seen this one.
> >
> > Can you find out what the index is being modified by those LSNs -- is
> > it
> > always the same index?  Can you have a look at nearby WAL records that
> > touch the same page of the same index in each case?
> >
> > One possibility is that the storage forgot a previous write.
>
> I'd be happy to, if you tell me how. :)
>
> We're using xfs for our postgres filesystem, on ubuntu bionic. Of
> course it's always possible there's something wrong in the filesystem
> or the EBS layer, but that is one thing we have not changed in the
> migration from 9.5 to 12.3.

All of the cited log lines seem suggesting relation with deleted btree
page items. As a possibility I can guess, that can happen if the pages
were flushed out during a vacuum after the last checkpoint and
full-page-writes didn't restored the page to the state before the
index-item deletion happened(that is, if full_page_writes were set to
off.). (If it found to be the cause, I'm not sure why that didn't
happen on 9.5.)

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center


Reply | Threaded
Open this post in threaded view
|

Re: 12.3 replicas falling over during WAL redo

Ben Chobot
Kyotaro Horiguchi wrote on 8/2/20 9:39 PM:

> At Sat, 1 Aug 2020 09:58:05 -0700, Ben Chobot <[hidden email]> wrote in
> All of the cited log lines seem suggesting relation with deleted btree
> page items. As a possibility I can guess, that can happen if the pages
> were flushed out during a vacuum after the last checkpoint and
> full-page-writes didn't restored the page to the state before the
> index-item deletion happened(that is, if full_page_writes were set to
> off.). (If it found to be the cause, I'm not sure why that didn't
> happen on 9.5.)
>
> regards.

We have always had full_page_writes enabled.


Reply | Threaded
Open this post in threaded view
|

Re: 12.3 replicas falling over during WAL redo

Peter Geoghegan-4
In reply to this post by Kyotaro Horiguchi-4
On Sun, Aug 2, 2020 at 9:39 PM Kyotaro Horiguchi
<[hidden email]> wrote:
> All of the cited log lines seem suggesting relation with deleted btree
> page items. As a possibility I can guess, that can happen if the pages
> were flushed out during a vacuum after the last checkpoint and
> full-page-writes didn't restored the page to the state before the
> index-item deletion happened(that is, if full_page_writes were set to
> off.). (If it found to be the cause, I'm not sure why that didn't
> happen on 9.5.)

There is also a Heap/HOT_UPDATE log line with similar errors.

--
Peter Geoghegan


Reply | Threaded
Open this post in threaded view
|

Re: 12.3 replicas falling over during WAL redo

Ben Chobot
In reply to this post by Ben Chobot
Ben Chobot wrote on 8/1/20 9:58 AM:
Alvaro Herrera wrote on 8/1/20 9:35 AM:
On 2020-Aug-01, Ben Chobot wrote:

Can you find out what the index is being modified by those LSNs -- is it
always the same index?  Can you have a look at nearby WAL records that
touch the same page of the same index in each case?

One possibility is that the storage forgot a previous write.

So here is me fumbling how to do this. Hopefully this is useful?

The error for this host was:

2020-07-30T14:59:36.839243+00:00 host postgres[24338]: [45253-1] db=,user= WARNING:  specified item offset is too large
2020-07-30T14:59:36.839307+00:00 host postgres[24338]: [45253-2] db=,user= CONTEXT:  WAL redo at A0A/AC4204A0 for Btree/INSERT_LEAF: off 48
2020-07-30T14:59:36.839337+00:00 host postgres[24338]: [45254-1] db=,user= PANIC:  btree_xlog_insert: failed to add item
2020-07-30T14:59:36.839366+00:00 host postgres[24338]: [45254-2] db=,user= CONTEXT:  WAL redo at A0A/AC4204A0 for Btree/INSERT_LEAF: off 48
2020-07-30T14:59:37.587173+00:00 host postgres[24337]: [11-1] db=,user= LOG:  startup process (PID 24338) was terminated by signal 6: Aborted
2020-07-30T14:59:37.587266+00:00 host postgres[24337]: [12-1] db=,user= LOG:  terminating any other active server processes

So, starting from a filesystem snapshot where the db had crashed and not yet been started, I did:

/usr/lib/postgresql/12/bin/pg_waldump -p /var/lib/postgresql/12/main/pg_wal/ 0000000100000A0A000000AB 0000000100000A0A000000AD | grep A0A/AC4204A0

This shows me:

rmgr: Btree       len (rec/tot):     72/    72, tx:   76396065, lsn: A0A/AC4204A0, prev A0A/AC420450, desc: INSERT_LEAF off 48, blkref #0: rel 16605/16613/60529051 blk 6501

So then I did:

/usr/lib/postgresql/12/bin/pg_waldump -p /var/lib/postgresql/12/main/pg_wal/ 0000000100000A0A000000AB 0000000100000A0A000000AD | grep 16605/16613/60529051
rmgr: Btree       len (rec/tot):     53/   813, tx:   76393363, lsn: A0A/AB289778, prev A0A/AB287F30, desc: INSERT_LEAF off 26, blkref #0: rel 16605/16613/60529051 blk 6089 FPW
rmgr: Btree       len (rec/tot):     72/    72, tx:   76393390, lsn: A0A/AB2BE870, prev A0A/AB2BD6E0, desc: INSERT_LEAF off 128, blkref #0: rel 16605/16613/60529051 blk 6519
rmgr: Btree       len (rec/tot):     72/    72, tx:   76393394, lsn: A0A/AB2C43D0, prev A0A/AB2C4378, desc: INSERT_LEAF off 41, blkref #0: rel 16605/16613/60529051 blk 6501
rmgr: Btree       len (rec/tot):     72/    72, tx:   76393992, lsn: A0A/AB5FA848, prev A0A/AB5FA7F0, desc: INSERT_LEAF off 133, blkref #0: rel 16605/16613/60529051 blk 5999
rmgr: Btree       len (rec/tot):     72/    72, tx:   76394938, lsn: A0A/ABE49738, prev A0A/ABE47AF8, desc: INSERT_LEAF off 189, blkref #0: rel 16605/16613/60529051 blk 6519
rmgr: Btree       len (rec/tot):     72/    72, tx:   76394977, lsn: A0A/ABEB0330, prev A0A/ABEB02D8, desc: INSERT_LEAF off 204, blkref #0: rel 16605/16613/60529051 blk 6519
rmgr: Btree       len (rec/tot):     72/    72, tx:   76396065, lsn: A0A/AC4204A0, prev A0A/AC420450, desc: INSERT_LEAF off 48, blkref #0: rel 16605/16613/60529051 blk 6501
rmgr: Btree       len (rec/tot):     72/    72, tx:   76396075, lsn: A0A/AC427518, prev A0A/AC4274C0, desc: INSERT_LEAF off 211, blkref #0: rel 16605/16613/60529051 blk 6519
rmgr: Btree       len (rec/tot):     53/  2529, tx:   76396250, lsn: A0A/AC5352F8, prev A0A/AC533A00, desc: INSERT_LEAF off 87, blkref #0: rel 16605/16613/60529051 blk 5639 FPW
pg_waldump: fatal: error in WAL record at A0A/AC5411B0: invalid resource manager ID 110 at A0A/AC5411E0

...and I have no idea what I'm looking at. I assume/hope the error at the end is due to the db shutting down, and nothing to be particularly worried about?

I have the specific index that postgres tripped on here and I'll spend some time today trying to see if the other crashes were the same index.
Reply | Threaded
Open this post in threaded view
|

Re: 12.3 replicas falling over during WAL redo

Alvaro Herrera-9
On 2020-Aug-03, Ben Chobot wrote:

> rmgr: Btree       len (rec/tot):     72/    72, tx:   76396065, lsn:
> A0A/AC4204A0, prev A0A/AC420450, desc: INSERT_LEAF off 48, blkref #0: rel
> 16605/16613/60529051 blk 6501
>
> So then I did:
>
> /usr/lib/postgresql/12/bin/pg_waldump -p /var/lib/postgresql/12/main/pg_wal/
> 0000000100000A0A000000AB 0000000100000A0A000000AD | grep
> 16605/16613/60529051

Yep. Looking at the ones in block 6501,

> rmgr: Btree       len (rec/tot):     72/    72, tx:   76393394, lsn:
> A0A/AB2C43D0, prev A0A/AB2C4378, desc: INSERT_LEAF off 41, blkref #0: rel
> 16605/16613/60529051 blk 6501

> rmgr: Btree       len (rec/tot):     72/    72, tx:   76396065, lsn:
> A0A/AC4204A0, prev A0A/AC420450, desc: INSERT_LEAF off 48, blkref #0: rel
> 16605/16613/60529051 blk 6501

My question was whether the block has received the update that added the
item in offset 41; that is, is the LSN in the crashed copy of the page
equal to A0A/AB2C43D0?  If it's an older value, then the write above was
lost for some reason.

> pg_waldump: fatal: error in WAL record at A0A/AC5411B0: invalid resource
> manager ID 110 at A0A/AC5411E0
>
> ...and I have no idea what I'm looking at. I assume/hope the error at the
> end is due to the db shutting down, and nothing to be particularly worried
> about?

Yeah.

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Reply | Threaded
Open this post in threaded view
|

Re: 12.3 replicas falling over during WAL redo

Ben Chobot
Alvaro Herrera wrote on 8/3/20 12:34 PM:

> On 2020-Aug-03, Ben Chobot wrote:
>
> Yep. Looking at the ones in block 6501,
>
>> rmgr: Btree       len (rec/tot):     72/    72, tx:   76393394, lsn:
>> A0A/AB2C43D0, prev A0A/AB2C4378, desc: INSERT_LEAF off 41, blkref #0: rel
>> 16605/16613/60529051 blk 6501
>> rmgr: Btree       len (rec/tot):     72/    72, tx:   76396065, lsn:
>> A0A/AC4204A0, prev A0A/AC420450, desc: INSERT_LEAF off 48, blkref #0: rel
>> 16605/16613/60529051 blk 6501
> My question was whether the block has received the update that added the
> item in offset 41; that is, is the LSN in the crashed copy of the page
> equal to A0A/AB2C43D0?  If it's an older value, then the write above was
> lost for some reason.

How do I tell?


Reply | Threaded
Open this post in threaded view
|

Re: 12.3 replicas falling over during WAL redo

Ben Chobot
In reply to this post by Peter Geoghegan-4
Peter Geoghegan wrote on 8/3/20 11:25 AM:

> On Sun, Aug 2, 2020 at 9:39 PM Kyotaro Horiguchi
> <[hidden email]> wrote:
>> All of the cited log lines seem suggesting relation with deleted btree
>> page items. As a possibility I can guess, that can happen if the pages
>> were flushed out during a vacuum after the last checkpoint and
>> full-page-writes didn't restored the page to the state before the
>> index-item deletion happened(that is, if full_page_writes were set to
>> off.). (If it found to be the cause, I'm not sure why that didn't
>> happen on 9.5.)
> There is also a Heap/HOT_UPDATE log line with similar errors.

Yes, and I have the pg_waldump output for it. But, that table is quite
large, and the transaction that contains the LSN in the error log is
1,752 waldump lines long. I'm happy to share what would be useful to
help debug it but I'm guessing it should be a subset of that.




Reply | Threaded
Open this post in threaded view
|

Re: 12.3 replicas falling over during WAL redo

Ben Chobot
In reply to this post by Ben Chobot
Alvaro Herrera wrote on 8/1/20 9:35 AM:
>>> On 2020-Aug-01, Ben Chobot wrote:
>>>
>>> Can you find out what the index is being modified by those LSNs --
>>> is it
>>> always the same index?  Can you have a look at nearby WAL records that
>>> touch the same page of the same index in each case?

They turn out to be different indices on different tables.



Reply | Threaded
Open this post in threaded view
|

Re: 12.3 replicas falling over during WAL redo

Alvaro Herrera-9
In reply to this post by Ben Chobot
On 2020-Aug-03, Ben Chobot wrote:

> Alvaro Herrera wrote on 8/3/20 12:34 PM:
> > On 2020-Aug-03, Ben Chobot wrote:
> >
> > Yep. Looking at the ones in block 6501,
> >
> > > rmgr: Btree       len (rec/tot):     72/    72, tx:   76393394, lsn:
> > > A0A/AB2C43D0, prev A0A/AB2C4378, desc: INSERT_LEAF off 41, blkref #0: rel
> > > 16605/16613/60529051 blk 6501
> > > rmgr: Btree       len (rec/tot):     72/    72, tx:   76396065, lsn:
> > > A0A/AC4204A0, prev A0A/AC420450, desc: INSERT_LEAF off 48, blkref #0: rel
> > > 16605/16613/60529051 blk 6501
> > My question was whether the block has received the update that added the
> > item in offset 41; that is, is the LSN in the crashed copy of the page
> > equal to A0A/AB2C43D0?  If it's an older value, then the write above was
> > lost for some reason.
>
> How do I tell?

You can use pageinspect's page_header() function to obtain the page's
LSN.  You can use dd to obtain the page from the file,

dd if=16605/16613/60529051 bs=8192 count=1 seek=6501 of=/tmp/page.6501

then put that binary file in a bytea column, perhaps like

create table page (raw bytea);
insert into page select pg_read_binary_file('/tmp/page');

and with that you can run page_header:

create extension pageinspect;
select h.* from page, page_header(raw) h;

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Reply | Threaded
Open this post in threaded view
|

Re: 12.3 replicas falling over during WAL redo

Peter Geoghegan-4
On Mon, Aug 3, 2020 at 2:35 PM Alvaro Herrera <[hidden email]> wrote:
> You can use pageinspect's page_header() function to obtain the page's
> LSN.  You can use dd to obtain the page from the file,
>
> dd if=16605/16613/60529051 bs=8192 count=1 seek=6501 of=/tmp/page.6501

Ben might find this approach to dumping out a single page image
easier, since it doesn't involve relfilenodes or filesystem files:

https://wiki.postgresql.org/wiki/Getting_a_stack_trace_of_a_running_PostgreSQL_backend_on_Linux/BSD#contrib.2Fpageinspect_page_dump

--
Peter Geoghegan


Reply | Threaded
Open this post in threaded view
|

Re: 12.3 replicas falling over during WAL redo

Ben Chobot
Peter Geoghegan wrote on 8/3/20 3:04 PM:
On Mon, Aug 3, 2020 at 2:35 PM Alvaro Herrera [hidden email] wrote:
You can use pageinspect's page_header() function to obtain the page's
LSN.  You can use dd to obtain the page from the file,

dd if=16605/16613/60529051 bs=8192 count=1 seek=6501 of=/tmp/page.6501
Ben might find this approach to dumping out a single page image
easier, since it doesn't involve relfilenodes or filesystem files:

https://wiki.postgresql.org/wiki/Getting_a_stack_trace_of_a_running_PostgreSQL_backend_on_Linux/BSD#contrib.2Fpageinspect_page_dump

I don't think that works, because it sounds like this requires I connect to the db, but that means I have to start it up. I am (wrongly?) assuming we should get what we can while the db is still in a just-crashed state.

Still, I seem to have mangled Alvaro's suggestion somehow, because this doesn't look right:

# select h.* from public.page , page_header(raw) h;
 lsn | checksum | flags | lower | upper | special | pagesize | version | prune_xid
-----+----------+-------+-------+-------+---------+----------+---------+-----------
 0/0 |        0 |     0 |     0 |     0 |       0 |        0 |       0 |         0
(1 row)

I'll keep trying though.
Reply | Threaded
Open this post in threaded view
|

Re: 12.3 replicas falling over during WAL redo

Ben Chobot
In reply to this post by Alvaro Herrera-9
Alvaro Herrera wrote on 8/3/20 2:34 PM:
On 2020-Aug-03, Ben Chobot wrote:

Alvaro Herrera wrote on 8/3/20 12:34 PM:
On 2020-Aug-03, Ben Chobot wrote:

Yep. Looking at the ones in block 6501,

rmgr: Btree       len (rec/tot):     72/    72, tx:   76393394, lsn:
A0A/AB2C43D0, prev A0A/AB2C4378, desc: INSERT_LEAF off 41, blkref #0: rel
16605/16613/60529051 blk 6501
rmgr: Btree       len (rec/tot):     72/    72, tx:   76396065, lsn:
A0A/AC4204A0, prev A0A/AC420450, desc: INSERT_LEAF off 48, blkref #0: rel
16605/16613/60529051 blk 6501
My question was whether the block has received the update that added the
item in offset 41; that is, is the LSN in the crashed copy of the page
equal to A0A/AB2C43D0?  If it's an older value, then the write above was
lost for some reason.
How do I tell?
You can use pageinspect's page_header() function to obtain the page's
LSN.  You can use dd to obtain the page from the file,

dd if=16605/16613/60529051 bs=8192 count=1 seek=6501 of=/tmp/page.6501 

If I use skip instead of seek....

then put that binary file in a bytea column, perhaps like

create table page (raw bytea);
insert into page select pg_read_binary_file('/tmp/page');

and with that you can run page_header:

create extension pageinspect;
select h.* from page, page_header(raw) h;

     lsn      | checksum | flags | lower | upper | special | pagesize | version | prune_xid
--------------+----------+-------+-------+-------+---------+----------+---------+-----------
 A0A/99BA11F8 |     -215 |     0 |   180 |  7240 |    8176 |     8192 |       4 |         0

As I understand what we're looking at, this means the WAL stream was assuming this page was last touched by A0A/AB2C43D0, but the page itself thinks it was last touched by A0A/99BA11F8, which means at least one write to the page is missing?
Reply | Threaded
Open this post in threaded view
|

Re: 12.3 replicas falling over during WAL redo

Alvaro Herrera-9
On 2020-Aug-03, Ben Chobot wrote:

> Alvaro Herrera wrote on 8/3/20 2:34 PM:
> > On 2020-Aug-03, Ben Chobot wrote:

> > dd if=16605/16613/60529051 bs=8192 count=1 seek=6501 of=/tmp/page.6501
>
> If I use skip instead of seek....

Argh, yes, I did correct that in my test and forgot to copy and paste.

>      lsn      | checksum | flags | lower | upper | special | pagesize |
> version | prune_xid
> --------------+----------+-------+-------+-------+---------+----------+---------+-----------
>  A0A/99BA11F8 |     -215 |     0 |   180 |  7240 |    8176 |     8192
> |       4 |         0
>
> As I understand what we're looking at, this means the WAL stream was
> assuming this page was last touched by A0A/AB2C43D0, but the page itself
> thinks it was last touched by A0A/99BA11F8, which means at least one write
> to the page is missing?

Yeah, that's exactly what we're seeing.  Somehow an older page version
was resurrected.  Of course, this should never happen.

So my theory has been proved.  What now?

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Reply | Threaded
Open this post in threaded view
|

Re: 12.3 replicas falling over during WAL redo

Alvaro Herrera-9
On 2020-Aug-03, Alvaro Herrera wrote:

> >      lsn      | checksum | flags | lower | upper | special | pagesize |
> > version | prune_xid
> > --------------+----------+-------+-------+-------+---------+----------+---------+-----------
> >  A0A/99BA11F8 |     -215 |     0 |   180 |  7240 |    8176 |     8192
> > |       4 |         0
> >
> > As I understand what we're looking at, this means the WAL stream was
> > assuming this page was last touched by A0A/AB2C43D0, but the page itself
> > thinks it was last touched by A0A/99BA11F8, which means at least one write
> > to the page is missing?
>
> Yeah, that's exactly what we're seeing.  Somehow an older page version
> was resurrected.  Of course, this should never happen.

... although, the block should have been in shared buffers, and it is
there that the previous WAL record would have updated -- not necessarily
flushed to disk.

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Reply | Threaded
Open this post in threaded view
|

Re: 12.3 replicas falling over during WAL redo

Kyotaro Horiguchi-4
At Tue, 4 Aug 2020 09:53:36 -0400, Alvaro Herrera <[hidden email]> wrote in

> On 2020-Aug-03, Alvaro Herrera wrote:
>
> > >      lsn      | checksum | flags | lower | upper | special | pagesize |
> > > version | prune_xid
> > > --------------+----------+-------+-------+-------+---------+----------+---------+-----------
> > >  A0A/99BA11F8 |     -215 |     0 |   180 |  7240 |    8176 |     8192
> > > |       4 |         0
> > >
> > > As I understand what we're looking at, this means the WAL stream was
> > > assuming this page was last touched by A0A/AB2C43D0, but the page itself
> > > thinks it was last touched by A0A/99BA11F8, which means at least one write
> > > to the page is missing?
> >
> > Yeah, that's exactly what we're seeing.  Somehow an older page version
> > was resurrected.  Of course, this should never happen.
>
> ... although, the block should have been in shared buffers, and it is
> there that the previous WAL record would have updated -- not necessarily
> flushed to disk.

Yeah. On the other hand, the WAL records shown upthread desn't have a FPW.

> rmgr: Btree       len (rec/tot):     72/    72, tx:   76393394, lsn:
> A0A/AB2C43D0, prev A0A/AB2C4378, desc: INSERT_LEAF off 41, blkref #0: rel
> 16605/16613/60529051 blk 6501

> rmgr: Btree       len (rec/tot):     72/    72, tx:   76396065, lsn:
> A0A/AC4204A0, prev A0A/AC420450, desc: INSERT_LEAF off 48, blkref #0: rel
> 16605/16613/60529051 blk 6501

There must be a record for the page 6501 conveying FPW after the last
checkpoint.  If it is not found, something wrong on deciding whether
to attach FPW.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center


Reply | Threaded
Open this post in threaded view
|

Re: 12.3 replicas falling over during WAL redo

Ben Chobot
In reply to this post by Alvaro Herrera-9
Alvaro Herrera wrote on 8/3/20 4:54 PM:
On 2020-Aug-03, Ben Chobot wrote:

Alvaro Herrera wrote on 8/3/20 2:34 PM:
On 2020-Aug-03, Ben Chobot wrote:

  
dd if=16605/16613/60529051 bs=8192 count=1 seek=6501 of=/tmp/page.6501
If I use skip instead of seek....
Argh, yes, I did correct that in my test and forgot to copy and paste.

     lsn      | checksum | flags | lower | upper | special | pagesize |
version | prune_xid
--------------+----------+-------+-------+-------+---------+----------+---------+-----------
 A0A/99BA11F8 |     -215 |     0 |   180 |  7240 |    8176 |     8192
|       4 |         0

As I understand what we're looking at, this means the WAL stream was
assuming this page was last touched by A0A/AB2C43D0, but the page itself
thinks it was last touched by A0A/99BA11F8, which means at least one write
to the page is missing?
Yeah, that's exactly what we're seeing.  Somehow an older page version
was resurrected.  Of course, this should never happen.

So my theory has been proved.  What now?

Just to close the loop on this, we haven't seen the issue since we've stopped expanding our filesystems by moving LVM extents between devices, so while I don't know exactly where the bug lies, I feel it's quite likely not in Postgres.