Reorderbuffer crash during recovery

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
6 messages Options
Reply | Threaded
Open this post in threaded view
|

Reorderbuffer crash during recovery

vignesh C
Hi,

I found couple of crashes in reorderbuffer while review/testing of
logical_work_mem and logical streaming of large in-progress
transactions. Stack trace of the same are given below:
Issue 1:
#0  0x00007f985c7d8337 in raise () from /lib64/libc.so.6
#1  0x00007f985c7d9a28 in abort () from /lib64/libc.so.6
#2  0x0000000000ec514d in ExceptionalCondition
(conditionName=0x10eab34 "!dlist_is_empty(head)", errorType=0x10eab24
"FailedAssertion",
    fileName=0x10eab00 "../../../../src/include/lib/ilist.h",
lineNumber=458) at assert.c:54
#3  0x0000000000b4fd13 in dlist_tail_element_off (head=0x338fe60,
off=48) at ../../../../src/include/lib/ilist.h:458
#4  0x0000000000b547b7 in ReorderBufferAbortOld (rb=0x32ae7a0,
oldestRunningXid=895) at reorderbuffer.c:1910
#5  0x0000000000b3cb5e in DecodeStandbyOp (ctx=0x33424b0,
buf=0x7fff7e7b1e40) at decode.c:332
#6  0x0000000000b3c363 in LogicalDecodingProcessRecord (ctx=0x33424b0,
record=0x3342770) at decode.c:121
#7  0x0000000000b704b2 in XLogSendLogical () at walsender.c:2845
#8  0x0000000000b6e9f8 in WalSndLoop (send_data=0xb7038b
<XLogSendLogical>) at walsender.c:2199
#9  0x0000000000b6bbf5 in StartLogicalReplication (cmd=0x33167a8) at
walsender.c:1128
#10 0x0000000000b6ce83 in exec_replication_command
(cmd_string=0x328a0a0 "START_REPLICATION SLOT \"sub1\" LOGICAL 0/0
(proto_version '1', publication_names '\"pub1\"')")
    at walsender.c:1545
#11 0x0000000000c39f85 in PostgresMain (argc=1, argv=0x32b51c0,
dbname=0x32b50e0 "testdb", username=0x32b50c0 "user1") at
postgres.c:4256
#12 0x0000000000b10dc7 in BackendRun (port=0x32ad890) at postmaster.c:4498
#13 0x0000000000b0ff3e in BackendStartup (port=0x32ad890) at postmaster.c:4189
#14 0x0000000000b08505 in ServerLoop () at postmaster.c:1727
#15 0x0000000000b0781a in PostmasterMain (argc=3, argv=0x3284cb0) at
postmaster.c:1400
#16 0x000000000097492d in main (argc=3, argv=0x3284cb0) at main.c:210

Issue 2:
#0  0x00007f1d7ddc4337 in raise () from /lib64/libc.so.6
#1  0x00007f1d7ddc5a28 in abort () from /lib64/libc.so.6
#2  0x0000000000ec4e1d in ExceptionalCondition
(conditionName=0x10ead30 "txn->final_lsn != InvalidXLogRecPtr",
errorType=0x10ea284 "FailedAssertion",
    fileName=0x10ea2d0 "reorderbuffer.c", lineNumber=3052) at assert.c:54
#3  0x0000000000b577e0 in ReorderBufferRestoreCleanup (rb=0x2ae36b0,
txn=0x2bafb08) at reorderbuffer.c:3052
#4  0x0000000000b52b1c in ReorderBufferCleanupTXN (rb=0x2ae36b0,
txn=0x2bafb08) at reorderbuffer.c:1318
#5  0x0000000000b5279d in ReorderBufferCleanupTXN (rb=0x2ae36b0,
txn=0x2b9d778) at reorderbuffer.c:1257
#6  0x0000000000b5475c in ReorderBufferAbortOld (rb=0x2ae36b0,
oldestRunningXid=3835) at reorderbuffer.c:1973
#7  0x0000000000b3ca03 in DecodeStandbyOp (ctx=0x2b676d0,
buf=0x7ffcbc74cc00) at decode.c:332
#8  0x0000000000b3c208 in LogicalDecodingProcessRecord (ctx=0x2b676d0,
record=0x2b67990) at decode.c:121
#9  0x0000000000b70b2b in XLogSendLogical () at walsender.c:2845

From initial analysis it looks like:
Issue1 it seems like if all the reorderbuffer has been flushed and
then the server restarts. This problem occurs.
Issue 2 it seems like if there are many subtransactions present and
then the server restarts. This problem occurs. The subtransaction's
final_lsn is not being set and when ReorderBufferRestoreCleanup is
called the assert fails. May be for this we might have to set the
subtransaction's final_lsn before cleanup(not sure).

I could not reproduce this issue consistently with a test case, But I
felt this looks like a problem from review.

For issue1, I could reproduce by the following steps:
1) Change ReorderBufferCheckSerializeTXN so that it gets flushed always.
2) Have many open transactions with subtransactions open.
3) Attach one of the transaction from gdb and call abort().

I'm not sure of the fix for this. If I get time I will try to spend
more time to find out the fix.
Thoughts?

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com


Reply | Threaded
Open this post in threaded view
|

Re: Reorderbuffer crash during recovery

Dilip Kumar-2
On Wed, Nov 6, 2019 at 5:20 PM vignesh C <[hidden email]> wrote:

>
> Hi,
>
> I found couple of crashes in reorderbuffer while review/testing of
> logical_work_mem and logical streaming of large in-progress
> transactions. Stack trace of the same are given below:
> Issue 1:
> #0  0x00007f985c7d8337 in raise () from /lib64/libc.so.6
> #1  0x00007f985c7d9a28 in abort () from /lib64/libc.so.6
> #2  0x0000000000ec514d in ExceptionalCondition
> (conditionName=0x10eab34 "!dlist_is_empty(head)", errorType=0x10eab24
> "FailedAssertion",
>     fileName=0x10eab00 "../../../../src/include/lib/ilist.h",
> lineNumber=458) at assert.c:54
> #3  0x0000000000b4fd13 in dlist_tail_element_off (head=0x338fe60,
> off=48) at ../../../../src/include/lib/ilist.h:458
> #4  0x0000000000b547b7 in ReorderBufferAbortOld (rb=0x32ae7a0,
> oldestRunningXid=895) at reorderbuffer.c:1910
> #5  0x0000000000b3cb5e in DecodeStandbyOp (ctx=0x33424b0,
> buf=0x7fff7e7b1e40) at decode.c:332
> #6  0x0000000000b3c363 in LogicalDecodingProcessRecord (ctx=0x33424b0,
> record=0x3342770) at decode.c:121
> #7  0x0000000000b704b2 in XLogSendLogical () at walsender.c:2845
> #8  0x0000000000b6e9f8 in WalSndLoop (send_data=0xb7038b
> <XLogSendLogical>) at walsender.c:2199
> #9  0x0000000000b6bbf5 in StartLogicalReplication (cmd=0x33167a8) at
> walsender.c:1128
> #10 0x0000000000b6ce83 in exec_replication_command
> (cmd_string=0x328a0a0 "START_REPLICATION SLOT \"sub1\" LOGICAL 0/0
> (proto_version '1', publication_names '\"pub1\"')")
>     at walsender.c:1545
> #11 0x0000000000c39f85 in PostgresMain (argc=1, argv=0x32b51c0,
> dbname=0x32b50e0 "testdb", username=0x32b50c0 "user1") at
> postgres.c:4256
> #12 0x0000000000b10dc7 in BackendRun (port=0x32ad890) at postmaster.c:4498
> #13 0x0000000000b0ff3e in BackendStartup (port=0x32ad890) at postmaster.c:4189
> #14 0x0000000000b08505 in ServerLoop () at postmaster.c:1727
> #15 0x0000000000b0781a in PostmasterMain (argc=3, argv=0x3284cb0) at
> postmaster.c:1400
> #16 0x000000000097492d in main (argc=3, argv=0x3284cb0) at main.c:210
>
> Issue 2:
> #0  0x00007f1d7ddc4337 in raise () from /lib64/libc.so.6
> #1  0x00007f1d7ddc5a28 in abort () from /lib64/libc.so.6
> #2  0x0000000000ec4e1d in ExceptionalCondition
> (conditionName=0x10ead30 "txn->final_lsn != InvalidXLogRecPtr",
> errorType=0x10ea284 "FailedAssertion",
>     fileName=0x10ea2d0 "reorderbuffer.c", lineNumber=3052) at assert.c:54
> #3  0x0000000000b577e0 in ReorderBufferRestoreCleanup (rb=0x2ae36b0,
> txn=0x2bafb08) at reorderbuffer.c:3052
> #4  0x0000000000b52b1c in ReorderBufferCleanupTXN (rb=0x2ae36b0,
> txn=0x2bafb08) at reorderbuffer.c:1318
> #5  0x0000000000b5279d in ReorderBufferCleanupTXN (rb=0x2ae36b0,
> txn=0x2b9d778) at reorderbuffer.c:1257
> #6  0x0000000000b5475c in ReorderBufferAbortOld (rb=0x2ae36b0,
> oldestRunningXid=3835) at reorderbuffer.c:1973
> #7  0x0000000000b3ca03 in DecodeStandbyOp (ctx=0x2b676d0,
> buf=0x7ffcbc74cc00) at decode.c:332
> #8  0x0000000000b3c208 in LogicalDecodingProcessRecord (ctx=0x2b676d0,
> record=0x2b67990) at decode.c:121
> #9  0x0000000000b70b2b in XLogSendLogical () at walsender.c:2845
>
> From initial analysis it looks like:
> Issue1 it seems like if all the reorderbuffer has been flushed and
> then the server restarts. This problem occurs.
> Issue 2 it seems like if there are many subtransactions present and
> then the server restarts. This problem occurs. The subtransaction's
> final_lsn is not being set and when ReorderBufferRestoreCleanup is
> called the assert fails. May be for this we might have to set the
> subtransaction's final_lsn before cleanup(not sure).
>
> I could not reproduce this issue consistently with a test case, But I
> felt this looks like a problem from review.
>
> For issue1, I could reproduce by the following steps:
> 1) Change ReorderBufferCheckSerializeTXN so that it gets flushed always.
> 2) Have many open transactions with subtransactions open.
> 3) Attach one of the transaction from gdb and call abort().

Do you need subtransactions for the issue1? It appears that after the
restart if the changes list is empty it will hit the assert.  Am I
missing something?

>
> I'm not sure of the fix for this. If I get time I will try to spend
> more time to find out the fix.
> Thoughts?
>


--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


Reply | Threaded
Open this post in threaded view
|

Re: Reorderbuffer crash during recovery

vignesh C
On Wed, Nov 6, 2019 at 5:41 PM Dilip Kumar <[hidden email]> wrote:

>
> On Wed, Nov 6, 2019 at 5:20 PM vignesh C <[hidden email]> wrote:
> >
> > Hi,
> >
> > I found couple of crashes in reorderbuffer while review/testing of
> > logical_work_mem and logical streaming of large in-progress
> > transactions. Stack trace of the same are given below:
> > Issue 1:
> > #0  0x00007f985c7d8337 in raise () from /lib64/libc.so.6
> > #1  0x00007f985c7d9a28 in abort () from /lib64/libc.so.6
> > #2  0x0000000000ec514d in ExceptionalCondition
> > (conditionName=0x10eab34 "!dlist_is_empty(head)", errorType=0x10eab24
> > "FailedAssertion",
> >     fileName=0x10eab00 "../../../../src/include/lib/ilist.h",
> > lineNumber=458) at assert.c:54
> > #3  0x0000000000b4fd13 in dlist_tail_element_off (head=0x338fe60,
> > off=48) at ../../../../src/include/lib/ilist.h:458
> > #4  0x0000000000b547b7 in ReorderBufferAbortOld (rb=0x32ae7a0,
> > oldestRunningXid=895) at reorderbuffer.c:1910
> > #5  0x0000000000b3cb5e in DecodeStandbyOp (ctx=0x33424b0,
> > buf=0x7fff7e7b1e40) at decode.c:332
> > #6  0x0000000000b3c363 in LogicalDecodingProcessRecord (ctx=0x33424b0,
> > record=0x3342770) at decode.c:121
> > #7  0x0000000000b704b2 in XLogSendLogical () at walsender.c:2845
> > #8  0x0000000000b6e9f8 in WalSndLoop (send_data=0xb7038b
> > <XLogSendLogical>) at walsender.c:2199
> > #9  0x0000000000b6bbf5 in StartLogicalReplication (cmd=0x33167a8) at
> > walsender.c:1128
> > #10 0x0000000000b6ce83 in exec_replication_command
> > (cmd_string=0x328a0a0 "START_REPLICATION SLOT \"sub1\" LOGICAL 0/0
> > (proto_version '1', publication_names '\"pub1\"')")
> >     at walsender.c:1545
> > #11 0x0000000000c39f85 in PostgresMain (argc=1, argv=0x32b51c0,
> > dbname=0x32b50e0 "testdb", username=0x32b50c0 "user1") at
> > postgres.c:4256
> > #12 0x0000000000b10dc7 in BackendRun (port=0x32ad890) at postmaster.c:4498
> > #13 0x0000000000b0ff3e in BackendStartup (port=0x32ad890) at postmaster.c:4189
> > #14 0x0000000000b08505 in ServerLoop () at postmaster.c:1727
> > #15 0x0000000000b0781a in PostmasterMain (argc=3, argv=0x3284cb0) at
> > postmaster.c:1400
> > #16 0x000000000097492d in main (argc=3, argv=0x3284cb0) at main.c:210
> >
> > Issue 2:
> > #0  0x00007f1d7ddc4337 in raise () from /lib64/libc.so.6
> > #1  0x00007f1d7ddc5a28 in abort () from /lib64/libc.so.6
> > #2  0x0000000000ec4e1d in ExceptionalCondition
> > (conditionName=0x10ead30 "txn->final_lsn != InvalidXLogRecPtr",
> > errorType=0x10ea284 "FailedAssertion",
> >     fileName=0x10ea2d0 "reorderbuffer.c", lineNumber=3052) at assert.c:54
> > #3  0x0000000000b577e0 in ReorderBufferRestoreCleanup (rb=0x2ae36b0,
> > txn=0x2bafb08) at reorderbuffer.c:3052
> > #4  0x0000000000b52b1c in ReorderBufferCleanupTXN (rb=0x2ae36b0,
> > txn=0x2bafb08) at reorderbuffer.c:1318
> > #5  0x0000000000b5279d in ReorderBufferCleanupTXN (rb=0x2ae36b0,
> > txn=0x2b9d778) at reorderbuffer.c:1257
> > #6  0x0000000000b5475c in ReorderBufferAbortOld (rb=0x2ae36b0,
> > oldestRunningXid=3835) at reorderbuffer.c:1973
> > #7  0x0000000000b3ca03 in DecodeStandbyOp (ctx=0x2b676d0,
> > buf=0x7ffcbc74cc00) at decode.c:332
> > #8  0x0000000000b3c208 in LogicalDecodingProcessRecord (ctx=0x2b676d0,
> > record=0x2b67990) at decode.c:121
> > #9  0x0000000000b70b2b in XLogSendLogical () at walsender.c:2845
> >
> > From initial analysis it looks like:
> > Issue1 it seems like if all the reorderbuffer has been flushed and
> > then the server restarts. This problem occurs.
> > Issue 2 it seems like if there are many subtransactions present and
> > then the server restarts. This problem occurs. The subtransaction's
> > final_lsn is not being set and when ReorderBufferRestoreCleanup is
> > called the assert fails. May be for this we might have to set the
> > subtransaction's final_lsn before cleanup(not sure).
> >
> > I could not reproduce this issue consistently with a test case, But I
> > felt this looks like a problem from review.
> >
> > For issue1, I could reproduce by the following steps:
> > 1) Change ReorderBufferCheckSerializeTXN so that it gets flushed always.
> > 2) Have many open transactions with subtransactions open.
> > 3) Attach one of the transaction from gdb and call abort().
>
> Do you need subtransactions for the issue1? It appears that after the
> restart if the changes list is empty it will hit the assert.  Am I
> missing something?
>

When I had reported this issue I could reproduce this issue with
sub-transactions. Now I have tried without using sub-transactions and
could still reproduce this issue. You are right Issue 1 will appear in
both the cases with and without subtransactions.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com


Reply | Threaded
Open this post in threaded view
|

Re: Reorderbuffer crash during recovery

Tomas Vondra-4
On Thu, Nov 07, 2019 at 11:01:17AM +0530, Dilip Kumar wrote:

>On Thu, Nov 7, 2019 at 9:55 AM vignesh C <[hidden email]> wrote:
>>
>> On Wed, Nov 6, 2019 at 5:41 PM Dilip Kumar <[hidden email]> wrote:
>> >
>> > On Wed, Nov 6, 2019 at 5:20 PM vignesh C <[hidden email]> wrote:
>> > >
>> > > Hi,
>> > >
>> > > ...
>> > >
>> > > Issue1 it seems like if all the reorderbuffer has been flushed and
>> > > then the server restarts. This problem occurs.
>> > > Issue 2 it seems like if there are many subtransactions present and
>> > > then the server restarts. This problem occurs. The subtransaction's
>> > > final_lsn is not being set and when ReorderBufferRestoreCleanup is
>> > > called the assert fails. May be for this we might have to set the
>> > > subtransaction's final_lsn before cleanup(not sure).
>> > >
>> > > I could not reproduce this issue consistently with a test case, But I
>> > > felt this looks like a problem from review.
>> > >
>> > > For issue1, I could reproduce by the following steps:
>> > > 1) Change ReorderBufferCheckSerializeTXN so that it gets flushed always.
>> > > 2) Have many open transactions with subtransactions open.
>> > > 3) Attach one of the transaction from gdb and call abort().
>> >
>> > Do you need subtransactions for the issue1? It appears that after the
>> > restart if the changes list is empty it will hit the assert.  Am I
>> > missing something?
>> >
>>
>> When I had reported this issue I could reproduce this issue with
>> sub-transactions. Now I have tried without using sub-transactions and
>> could still reproduce this issue. You are right Issue 1 will appear in
>> both the cases with and without subtransactions.
>
>Okay, thanks for the confirmation.
>

I'm a bit confused - does this happen only with the logical_work_mem
patches, or with clean master too? If only with the patches, which
version exactly?

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Reply | Threaded
Open this post in threaded view
|

Re: Reorderbuffer crash during recovery

akapila
On Thu, Nov 7, 2019 at 4:48 PM Tomas Vondra
<[hidden email]> wrote:
>
> I'm a bit confused - does this happen only with the logical_work_mem
> patches, or with clean master too?
>

This occurs with the clean master.  This is a base code problem
revealed while doing stress testing of logical_work_mem patches.


--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Reply | Threaded
Open this post in threaded view
|

Re: Reorderbuffer crash during recovery

Andres Freund
Hi,

On 2019-11-07 17:03:44 +0530, Amit Kapila wrote:
> On Thu, Nov 7, 2019 at 4:48 PM Tomas Vondra
> <[hidden email]> wrote:
> >
> > I'm a bit confused - does this happen only with the logical_work_mem
> > patches, or with clean master too?
> >
>
> This occurs with the clean master.  This is a base code problem
> revealed while doing stress testing of logical_work_mem patches.

As far as I can tell there are no repro steps included? Any chance to
get those?

Greetings,

Andres Freund