BUG #15727: PANIC: cannot abort transaction 295144144, it was already committed

classic Classic list List threaded Threaded
46 messages Options
123
Reply | Threaded
Open this post in threaded view
|

BUG #15727: PANIC: cannot abort transaction 295144144, it was already committed

PG Bug reporting form
The following bug has been logged on the website:

Bug reference:      15727
Logged by:          Roman Zharkov
Email address:      [hidden email]
PostgreSQL version: Unsupported/Unknown
Operating system:   Ubuntu 5.4.0-6ubuntu1~16.04.10
Description:        

Hello,
Our test server crashed while benchmarking connection pooler called Yandex
Odyssey.
There is no debugging info in the code dump now, because it takes few hours
to reproduce the error. We will try to reproduce the error with debugging
info.

Server version - PostgreSQL 12devel-master, commit - bb76134 on
x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 5.4.0-6ubuntu1~16.04.10) 5.4.0
20160609, 64-bit

The error does not reproduce on PostgreSQL 12devel-master, commit -
2fadf24

2019-04-01 15:27:38.829 +07 [7524] ERROR:  unexpected table_lock_tuple
status: 1
2019-04-01 15:27:38.829 +07 [7524] STATEMENT:  UPDATE pgbench_accounts SET
abalance = 1 WHERE aid = 1;
2019-04-01 15:27:38.829 +07 [7524] PANIC:  cannot abort transaction
400048276, it was already committed
2019-04-01 15:27:39.001 +07 [7499] ERROR:  unexpected table_lock_tuple
status: 1
2019-04-01 15:27:39.001 +07 [7499] STATEMENT:  UPDATE pgbench_accounts SET
abalance = 1 WHERE aid = 2;
2019-04-01 15:27:39.001 +07 [7499] PANIC:  cannot abort transaction
400051561, it was already committed
2019-04-01 15:27:41.193 +07 [7509] ERROR:  unexpected table_lock_tuple
status: 1
2019-04-01 15:27:41.193 +07 [7509] STATEMENT:  UPDATE pgbench_accounts SET
abalance = 1 WHERE aid = 1;
2019-04-01 15:27:41.193 +07 [7509] PANIC:  cannot abort transaction
400088425, it was already committed
2019-04-01 15:41:01.176 +07 [3559] LOG:  server process (PID 7499) was
terminated by signal 6: Aborted
2019-04-01 15:41:01.176 +07 [3559] DETAIL:  Failed process was running:
UPDATE pgbench_accounts SET abalance = 1 WHERE aid = 2;
2019-04-01 15:41:01.176 +07 [3559] LOG:  terminating any other active server
processes

[New LWP 7509]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `postgres: zharkov test 127.0.0.1(42666) UPDATE      
'.
Program terminated with signal SIGABRT, Aborted.
#0  0x00007f1f4c6c7428 in __GI_raise (sig=sig@entry=6) at
../sysdeps/unix/sysv/linux/raise.c:54
54      ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) bt
#0  0x00007f1f4c6c7428 in __GI_raise (sig=sig@entry=6) at
../sysdeps/unix/sysv/linux/raise.c:54
#1  0x00007f1f4c6c902a in __GI_abort () at abort.c:89
#2  0x000000000085faac in errfinish ()
#3  0x00000000008633e4 in elog_finish ()
#4  0x000000000050266d in RecordTransactionAbort ()
#5  0x0000000000502788 in AbortTransaction ()
#6  0x0000000000502fb5 in AbortCurrentTransaction ()
#7  0x000000000075029f in PostgresMain ()
#8  0x0000000000480593 in ServerLoop ()
#9  0x00000000006dfc04 in PostmasterMain ()
#10 0x00000000004813b9 in main ()

Reply | Threaded
Open this post in threaded view
|

Re: BUG #15727: PANIC: cannot abort transaction 295144144, it was already committed

Andres Freund
Hi,

On 2019-04-02 10:30:15 +0000, PG Bug reporting form wrote:

> The following bug has been logged on the website:
>
> Bug reference:      15727
> Logged by:          Roman Zharkov
> Email address:      [hidden email]
> PostgreSQL version: Unsupported/Unknown
> Operating system:   Ubuntu 5.4.0-6ubuntu1~16.04.10
> Description:        
>
> Hello,
> Our test server crashed while benchmarking connection pooler called Yandex
> Odyssey.

What's the exact workload? Just pgbench?


> There is no debugging info in the code dump now, because it takes few hours
> to reproduce the error. We will try to reproduce the error with debugging
> info.

Unfortunately a backtrace is probably not going to help up much, I'm
afraid. The issue likely is triggered at an earlier state than the error
thrown here.


> Server version - PostgreSQL 12devel-master, commit - bb76134 on
> x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 5.4.0-6ubuntu1~16.04.10) 5.4.0
> 20160609, 64-bit
>
> The error does not reproduce on PostgreSQL 12devel-master, commit -
> 2fadf24
>
> 2019-04-01 15:27:38.829 +07 [7524] ERROR:  unexpected table_lock_tuple
> status: 1

Hm, so that appears to be related to my changes. Could you run this with
log_error_verbosity=verbose, so we know which of the codepaths this
hits?  Apparently we're encountering an invisible tuple somewhere, which
I thought was not possible (hence the error) - previously we just
ignored that.


> 2019-04-01 15:27:38.829 +07 [7524] STATEMENT:  UPDATE pgbench_accounts SET
> abalance = 1 WHERE aid = 1;
> 2019-04-01 15:27:38.829 +07 [7524] PANIC:  cannot abort transaction
> 400048276, it was already committed

But that's probably a separate issue.

Greetings,

Andres Freund


Reply | Threaded
Open this post in threaded view
|

Re: BUG #15727: PANIC: cannot abort transaction 295144144, it was already committed

Tom Lane-2
Andres Freund <[hidden email]> writes:
>> 2019-04-01 15:27:38.829 +07 [7524] STATEMENT:  UPDATE pgbench_accounts SET
>> abalance = 1 WHERE aid = 1;
>> 2019-04-01 15:27:38.829 +07 [7524] PANIC:  cannot abort transaction
>> 400048276, it was already committed

> But that's probably a separate issue.

What that seems to indicate is that the "unexpected table_lock_tuple
status" error was thrown during commit, which seems pretty odd.

                        regards, tom lane


Reply | Threaded
Open this post in threaded view
|

Re: BUG #15727: PANIC: cannot abort transaction 295144144, it was already committed

Andres Freund
Hi,

On 2019-04-02 11:44:26 -0400, Tom Lane wrote:

> Andres Freund <[hidden email]> writes:
> >> 2019-04-01 15:27:38.829 +07 [7524] STATEMENT:  UPDATE pgbench_accounts SET
> >> abalance = 1 WHERE aid = 1;
> >> 2019-04-01 15:27:38.829 +07 [7524] PANIC:  cannot abort transaction
> >> 400048276, it was already committed
>
> > But that's probably a separate issue.
>
> What that seems to indicate is that the "unexpected table_lock_tuple
> status" error was thrown during commit, which seems pretty odd.

I suspect that's a deferred trigger. But that code obviously could throw
errors, so we gotta handle that correctly.

Greetings,

Andres Freund


Reply | Threaded
Open this post in threaded view
|

Re: BUG #15727: PANIC: cannot abort transaction 295144144, it was already committed

Alvaro Herrera-9
In reply to this post by Tom Lane-2
On 2019-Apr-02, Tom Lane wrote:

> Andres Freund <[hidden email]> writes:
> >> 2019-04-01 15:27:38.829 +07 [7524] STATEMENT:  UPDATE pgbench_accounts SET
> >> abalance = 1 WHERE aid = 1;
> >> 2019-04-01 15:27:38.829 +07 [7524] PANIC:  cannot abort transaction
> >> 400048276, it was already committed
>
> > But that's probably a separate issue.
>
> What that seems to indicate is that the "unexpected table_lock_tuple
> status" error was thrown during commit, which seems pretty odd.

AFAICS this error can only come from ExecDelete(), because the value 1
is TM_Invisible and the other callsites where the "unexpected
table_lock_tuple" error appears use different wording for that one.

Maybe it's the result of a deferred constraint being checked at that
time ... maybe it's trying to honor an "on cascade delete" setting for
an FK, and the affected tuple has already been updated or deleted?

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Reply | Threaded
Open this post in threaded view
|

Re: BUG #15727: PANIC: cannot abort transaction 295144144, it was already committed

Tom Lane-2
In reply to this post by Andres Freund
Andres Freund <[hidden email]> writes:
> On 2019-04-02 11:44:26 -0400, Tom Lane wrote:
>> What that seems to indicate is that the "unexpected table_lock_tuple
>> status" error was thrown during commit, which seems pretty odd.

> I suspect that's a deferred trigger. But that code obviously could throw
> errors, so we gotta handle that correctly.

Deferred triggers execute before the real transaction commit, not during
the critical section surrounding where we set the bit in pg_clog.
But this error is seemingly getting thrown from within that critical
section, otherwise we wouldn't have inconsistent xact status afterwards.

                        regards, tom lane


Reply | Threaded
Open this post in threaded view
|

Re: BUG #15727: PANIC: cannot abort transaction 295144144, it was already committed

Andres Freund
Hi,

On 2019-04-02 11:51:38 -0400, Tom Lane wrote:

> Andres Freund <[hidden email]> writes:
> > On 2019-04-02 11:44:26 -0400, Tom Lane wrote:
> >> What that seems to indicate is that the "unexpected table_lock_tuple
> >> status" error was thrown during commit, which seems pretty odd.
>
> > I suspect that's a deferred trigger. But that code obviously could throw
> > errors, so we gotta handle that correctly.
>
> Deferred triggers execute before the real transaction commit, not during
> the critical section surrounding where we set the bit in pg_clog.
> But this error is seemingly getting thrown from within that critical
> section, otherwise we wouldn't have inconsistent xact status afterwards.

I don't think it was thrown during the critical section - we'd otherwise
see the ERROR about unexpected lock status converted to a PANIC (and the
"cannot abort transaction, already committed" is a PANIC itself).  I
suspect it's more likely that we corrupt the xact.c status somewhere -
but it's hard to say without knowing more about the workload.

Greetings,

Andres Freund


Reply | Threaded
Open this post in threaded view
|

Re: BUG #15727: PANIC: cannot abort transaction 295144144, it was already committed

Andres Freund
In reply to this post by Alvaro Herrera-9
Hi,

On 2019-04-02 12:51:08 -0300, Alvaro Herrera wrote:

> On 2019-Apr-02, Tom Lane wrote:
>
> > Andres Freund <[hidden email]> writes:
> > >> 2019-04-01 15:27:38.829 +07 [7524] STATEMENT:  UPDATE pgbench_accounts SET
> > >> abalance = 1 WHERE aid = 1;
> > >> 2019-04-01 15:27:38.829 +07 [7524] PANIC:  cannot abort transaction
> > >> 400048276, it was already committed
> >
> > > But that's probably a separate issue.
> >
> > What that seems to indicate is that the "unexpected table_lock_tuple
> > status" error was thrown during commit, which seems pretty odd.
>
> AFAICS this error can only come from ExecDelete(), because the value 1
> is TM_Invisible and the other callsites where the "unexpected
> table_lock_tuple" error appears use different wording for that one.

Hm? Why couldn't it be the ExecUpdate() case?


> Maybe it's the result of a deferred constraint being checked at that
> time ... maybe it's trying to honor an "on cascade delete" setting for
> an FK, and the affected tuple has already been updated or deleted?

Then it ought to get TM_Deleted, no? We ought to wait till that
transaction commits, and then roll back. So there's something odd
happening here.  I suspect there has to be some additional log output or
such to explain this.

Greetings,

Andres Freund


Reply | Threaded
Open this post in threaded view
|

Re: BUG #15727: PANIC: cannot abort transaction 295144144, it was already committed

Alvaro Herrera-9
On 2019-Apr-02, Andres Freund wrote:

> On 2019-04-02 12:51:08 -0300, Alvaro Herrera wrote:

> > AFAICS this error can only come from ExecDelete(), because the value 1
> > is TM_Invisible and the other callsites where the "unexpected
> > table_lock_tuple" error appears use different wording for that one.
>
> Hm? Why couldn't it be the ExecUpdate() case?

You're right, that one too.

> > Maybe it's the result of a deferred constraint being checked at that
> > time ... maybe it's trying to honor an "on cascade delete" setting for
> > an FK, and the affected tuple has already been updated or deleted?
>
> Then it ought to get TM_Deleted, no? We ought to wait till that
> transaction commits, and then roll back.

I was thinking that this would have happened in the same transaction;
but yeah I didn't spend too much time analyzing the exact code flow.
Anyway I agree that there's something odd going on, and perhaps you just
unmasked an earlier bug.

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Reply | Threaded
Open this post in threaded view
|

Re: BUG #15727: PANIC: cannot abort transaction 295144144, it was already committed

r.zharkov
In reply to this post by Andres Freund
On 2019-04-02 23:13, Andres Freund wrote:

> Hi,
>
> On 2019-04-02 12:51:08 -0300, Alvaro Herrera wrote:
>> On 2019-Apr-02, Tom Lane wrote:
>>
>> > Andres Freund <[hidden email]> writes:
>> > >> 2019-04-01 15:27:38.829 +07 [7524] STATEMENT:  UPDATE pgbench_accounts SET
>> > >> abalance = 1 WHERE aid = 1;
>> > >> 2019-04-01 15:27:38.829 +07 [7524] PANIC:  cannot abort transaction
>> > >> 400048276, it was already committed
>> >
>> > > But that's probably a separate issue.
>> >
>> > What that seems to indicate is that the "unexpected table_lock_tuple
>> > status" error was thrown during commit, which seems pretty odd.
>>
>> AFAICS this error can only come from ExecDelete(), because the value 1
>> is TM_Invisible and the other callsites where the "unexpected
>> table_lock_tuple" error appears use different wording for that one.
>
> Hm? Why couldn't it be the ExecUpdate() case?
>
>
>> Maybe it's the result of a deferred constraint being checked at that
>> time ... maybe it's trying to honor an "on cascade delete" setting for
>> an FK, and the affected tuple has already been updated or deleted?
>
> Then it ought to get TM_Deleted, no? We ought to wait till that
> transaction commits, and then roll back. So there's something odd
> happening here.  I suspect there has to be some additional log output
> or
> such to explain this.
>
> Greetings,
>
> Andres Freund

I made the new core file:

[New LWP 5967]
[Thread debugging using libthread_db enabled]
Using host libthread_db library
"/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `postgres: zharkov test 127.0.0.1(35888) UPDATE    
     '.
Program terminated with signal SIGABRT, Aborted.
#0  0x00007fe194769428 in __GI_raise (sig=sig@entry=6) at
../sysdeps/unix/sysv/linux/raise.c:54
54      ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) bt
#0  0x00007fe194769428 in __GI_raise (sig=sig@entry=6) at
../sysdeps/unix/sysv/linux/raise.c:54
#1  0x00007fe19476b02a in __GI_abort () at abort.c:89
#2  0x0000000000a9c0ac in errfinish (dummy=0) at elog.c:552
#3  0x0000000000a9e653 in elog_finish (elevel=22, fmt=0xb695a0 "cannot
abort transaction %u, it was already committed") at elog.c:1365
#4  0x0000000000562c87 in RecordTransactionAbort (isSubXact=false) at
xact.c:1644
#5  0x0000000000563c93 in AbortTransaction () at xact.c:2670
#6  0x0000000000564516 in AbortCurrentTransaction () at xact.c:3191
#7  0x00000000009145b9 in PostgresMain (argc=1, argv=0x255d970,
dbname=0x255d950 "test", username=0x252dd88 "zharkov") at
postgres.c:4033
#8  0x000000000086406c in BackendRun (port=0x2556020) at
postmaster.c:4399
#9  0x00000000008637cd in BackendStartup (port=0x2556020) at
postmaster.c:4090
#10 0x000000000085facf in ServerLoop () at postmaster.c:1703
#11 0x000000000085f2dd in PostmasterMain (argc=3, argv=0x252bcf0) at
postmaster.c:1376
#12 0x000000000077aeba in main (argc=3, argv=0x252bcf0) at main.c:228

Log:
2019-04-02 21:51:19.639 +07 [5967] STATEMENT:  UPDATE pgbench_accounts
SET abalance = 1 WHERE aid = 1;
2019-04-02 21:51:19.639 +07 [5967] PANIC:  cannot abort transaction
200011850, it was already committed
2019-04-02 21:53:57.924 +07 [2473] LOG:  server process (PID 5967) was
terminated by signal 6: Aborted
2019-04-02 21:53:57.924 +07 [2473] DETAIL:  Failed process was running:
UPDATE pgbench_accounts SET abalance = 1 WHERE aid = 1;
2019-04-02 21:53:57.924 +07 [2473] LOG:  terminating any other active
server processes

We run pgbench with two custom scripts.

First:
\set i random_zipfian(1, 100000 * :scale, 1.01)
SELECT abalance FROM pgbench_accounts WHERE aid = :i;

Second:
\set i random_zipfian(1, 100000 * :scale, 1.01)
UPDATE pgbench_accounts SET abalance = 1 WHERE aid = :i;

pgbench connects to the database through the connection pooller called
Odyssey.
We use pool of 16 or 32 backends.
We run four series of pgbench tests with different connections count (
100, 150, 200, ... , 1000 )
Three series with pool size 16 runs normally.
But fourth series fails with 200-300 connections. It took about five
hours to reproduce the error this time.

Here is the little piece of the pgbench log:

Bench start
Tue Apr  2 14:43:12 UTC 2019
Timestamp 1554216192
transaction type: multiple scripts
scaling factor: 1
query mode: simple
number of clients: 200
number of threads: 32
duration: 300 s
number of transactions actually processed: 8062981
latency average = 7.442 ms
tps = 26875.375100 (including connections establishing)
tps = 26875.694564 (excluding connections establishing)
SQL script 1: ./sql/ycsb_read_zipf.sql
  - weight: 1 (targets 50.0% of total)
  - 4019635 transactions (49.9% of total, tps = 13398.171023)
  - latency average = 6.955 ms
  - latency stddev = 8.366 ms
SQL script 2: ./sql/ycsb_update_zipf.sql
  - weight: 1 (targets 50.0% of total)
  - 4021508 transactions (49.9% of total, tps = 13404.414071)
  - latency average = 7.909 ms
  - latency stddev = 8.478 ms
Bench end
Tue Apr  2 14:48:12 UTC 2019
Timestamp 1554216492
Bench start
Tue Apr  2 14:48:12 UTC 2019
Timestamp 1554216492
client 220 script 1 aborted in command 1 query 0: PANIC:  cannot abort
transaction 200011850, it was already committed
WARNING:  terminating connection because of crash of another server
process

--
regards,

Roman Zharkov


Reply | Threaded
Open this post in threaded view
|

Re: BUG #15727: PANIC: cannot abort transaction 295144144, it was already committed

Andres Freund
Hi,

On 2019-04-03 00:04:15 +0700, [hidden email] wrote:

> pgbench connects to the database through the connection pooller called
> Odyssey.
> We use pool of 16 or 32 backends.
> We run four series of pgbench tests with different connections count ( 100,
> 150, 200, ... , 1000 )
> Three series with pool size 16 runs normally.
> But fourth series fails with 200-300 connections. It took about five hours
> to reproduce the error this time.
>
> Here is the little piece of the pgbench log:
>
> Bench start
> Tue Apr  2 14:43:12 UTC 2019
> Timestamp 1554216192
> transaction type: multiple scripts
> scaling factor: 1

Is this the actual scaling factor? Did you create the pgbench database
with foreign keys?

Greetings,

Andres Freund


Reply | Threaded
Open this post in threaded view
|

Re: BUG #15727: PANIC: cannot abort transaction 295144144, it was already committed

Tom Lane-2
In reply to this post by r.zharkov
[hidden email] writes:
> I made the new core file:

Thanks, but this isn't much help --- it just shows what we already
know, ie the "cannot abort transaction" error occurs after some other
error was thrown.

What would be more helpful would be to adjust the places that
can throw the "unexpected table_lock_tuple status" error text to be
PANIC rather than ERROR, so that the core dump would show a stack
trace showing how we got to whichever of those places this is.
Or, run the test with a higher log verbosity, so that you can find
out which of those places is throwing the error to begin with;
then just PANIC-ize that one.

FWIW, I see six potential candidates, not two:

pgsql/src/backend/commands/trigger.c: 3380: elog(ERROR, "unexpected table_lock_tuple status: %u", test);
pgsql/src/backend/executor/nodeLockRows.c: 232: elog(ERROR, "unexpected table_lock_tuple status: %u",
pgsql/src/backend/executor/nodeModifyTable.c: 881: elog(ERROR, "unexpected table_lock_tuple status: %u",
pgsql/src/backend/executor/nodeModifyTable.c: 1384: elog(ERROR, "unexpected table_lock_tuple status: %u",
pgsql/src/backend/executor/execReplication.c: 211: elog(ERROR, "unexpected table_lock_tuple status: %u", res);
pgsql/src/backend/executor/execReplication.c: 375: elog(ERROR, "unexpected table_lock_tuple status: %u", res);


                        regards, tom lane


Reply | Threaded
Open this post in threaded view
|

Re: BUG #15727: PANIC: cannot abort transaction 295144144, it was already committed

Andres Freund
Hi,

On 2019-04-02 13:13:58 -0400, Tom Lane wrote:
> FWIW, I see six potential candidates, not two:
>
> pgsql/src/backend/commands/trigger.c: 3380: elog(ERROR, "unexpected table_lock_tuple status: %u", test);
> pgsql/src/backend/executor/nodeLockRows.c: 232: elog(ERROR, "unexpected table_lock_tuple status: %u",
> pgsql/src/backend/executor/execReplication.c: 211: elog(ERROR, "unexpected table_lock_tuple status: %u", res);
> pgsql/src/backend/executor/execReplication.c: 375: elog(ERROR, "unexpected table_lock_tuple status: %u", res);

These have an explicit case / separate error for TM_Invisible (= 1 -
which is the status we're seeing according to the error message)
though. So afaict they can't be relevant here.

Greetings,

Andres Freund


Reply | Threaded
Open this post in threaded view
|

Re: BUG #15727: PANIC: cannot abort transaction 295144144, it was already committed

Alvaro Herrera-9
In reply to this post by Tom Lane-2
On 2019-Apr-02, Tom Lane wrote:

> FWIW, I see six potential candidates, not two:
>
> pgsql/src/backend/commands/trigger.c: 3380: elog(ERROR, "unexpected table_lock_tuple status: %u", test);
> pgsql/src/backend/executor/nodeLockRows.c: 232: elog(ERROR, "unexpected table_lock_tuple status: %u",
> pgsql/src/backend/executor/nodeModifyTable.c: 881: elog(ERROR, "unexpected table_lock_tuple status: %u",
> pgsql/src/backend/executor/nodeModifyTable.c: 1384: elog(ERROR, "unexpected table_lock_tuple status: %u",
> pgsql/src/backend/executor/execReplication.c: 211: elog(ERROR, "unexpected table_lock_tuple status: %u", res);
> pgsql/src/backend/executor/execReplication.c: 375: elog(ERROR, "unexpected table_lock_tuple status: %u", res);

The other four handle the TM_Invisible case with a different error
message.  Anyway, the six of them are unexpected so changing them all to
PANIC ought to be good enough.

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Reply | Threaded
Open this post in threaded view
|

Re: BUG #15727: PANIC: cannot abort transaction 295144144, it was already committed

Tom Lane-2
In reply to this post by Andres Freund
Andres Freund <[hidden email]> writes:
> On 2019-04-02 13:13:58 -0400, Tom Lane wrote:
>> FWIW, I see six potential candidates, not two:

> These have an explicit case / separate error for TM_Invisible (= 1 -
> which is the status we're seeing according to the error message)
> though. So afaict they can't be relevant here.

Oh, right, so it has to be from one of the calls in nodeModifyTable.c.
But it's hardly plausible that that runs post-commit.  So now I'm
thinking that what we really need to know is why a TM_Invisible
result was returned.  That's going to be a bit harder ...

                        regards, tom lane


Reply | Threaded
Open this post in threaded view
|

Re: BUG #15727: PANIC: cannot abort transaction 295144144, it was already committed

r.zharkov
In reply to this post by Tom Lane-2


> 3 апр. 2019 г., в 0:13, Tom Lane <[hidden email]> написал(а):
>
> [hidden email] writes:
>> I made the new core file:
>
> Thanks, but this isn't much help --- it just shows what we already
> know, ie the "cannot abort transaction" error occurs after some other
> error was thrown.
>
> What would be more helpful would be to adjust the places that
> can throw the "unexpected table_lock_tuple status" error text to be
> PANIC rather than ERROR, so that the core dump would show a stack
> trace showing how we got to whichever of those places this is.
> Or, run the test with a higher log verbosity, so that you can find
> out which of those places is throwing the error to begin with;
> then just PANIC-ize that one.
>
> FWIW, I see six potential candidates, not two:
>
> pgsql/src/backend/commands/trigger.c: 3380:                elog(ERROR, "unexpected table_lock_tuple status: %u", test);
> pgsql/src/backend/executor/nodeLockRows.c: 232:                elog(ERROR, "unexpected table_lock_tuple status: %u",
> pgsql/src/backend/executor/nodeModifyTable.c: 881:                            elog(ERROR, "unexpected table_lock_tuple status: %u",
> pgsql/src/backend/executor/nodeModifyTable.c: 1384:                            elog(ERROR, "unexpected table_lock_tuple status: %u",
> pgsql/src/backend/executor/execReplication.c: 211:                elog(ERROR, "unexpected table_lock_tuple status: %u", res);
> pgsql/src/backend/executor/execReplication.c: 375:                elog(ERROR, "unexpected table_lock_tuple status: %u", res);
>
>
>            regards, tom lane

Ok. I will made it tomorrow morning.

regards,
Roman Zharkov


Reply | Threaded
Open this post in threaded view
|

Re: BUG #15727: PANIC: cannot abort transaction 295144144, it was already committed

Andres Freund
In reply to this post by Tom Lane-2
Hi,

On 2019-04-02 13:30:56 -0400, Tom Lane wrote:
> But it's hardly plausible that that runs post-commit.

Yea, one surely would hope that that's not happening. I suspect it's
possible it's running pre-commit and somehow the failure handling is
screwed up leading to the 'already committed' error.


> So now I'm thinking that what we really need to know is why a
> TM_Invisible result was returned.  That's going to be a bit harder ...

Yea. I re-skimmed through the code and can't immediately see anything
wrong - not that such subtle issues would necessarily be noticed without
more information.

I wonder, is there a chance that you're using an extension with xact
hooks, or a modified version of postgres?  Could you show us the exact
schema used, and the config?

Greetings,

Andres Freund


Reply | Threaded
Open this post in threaded view
|

Re: BUG #15727: PANIC: cannot abort transaction 295144144, it was already committed

r.zharkov
In reply to this post by Tom Lane-2
On 2019-04-03 00:13, Tom Lane wrote:

> [hidden email] writes:
>> I made the new core file:
>
> Thanks, but this isn't much help --- it just shows what we already
> know, ie the "cannot abort transaction" error occurs after some other
> error was thrown.
>
> What would be more helpful would be to adjust the places that
> can throw the "unexpected table_lock_tuple status" error text to be
> PANIC rather than ERROR, so that the core dump would show a stack
> trace showing how we got to whichever of those places this is.
> Or, run the test with a higher log verbosity, so that you can find
> out which of those places is throwing the error to begin with;
> then just PANIC-ize that one.
>
> FWIW, I see six potential candidates, not two:
>
> pgsql/src/backend/commands/trigger.c: 3380: elog(ERROR,
> "unexpected table_lock_tuple status: %u", test);
> pgsql/src/backend/executor/nodeLockRows.c: 232: elog(ERROR,
> "unexpected table_lock_tuple status: %u",
> pgsql/src/backend/executor/nodeModifyTable.c: 881: elog(ERROR,
> "unexpected table_lock_tuple status: %u",
> pgsql/src/backend/executor/nodeModifyTable.c: 1384: elog(ERROR,
> "unexpected table_lock_tuple status: %u",
> pgsql/src/backend/executor/execReplication.c: 211: elog(ERROR,
> "unexpected table_lock_tuple status: %u", res);
> pgsql/src/backend/executor/execReplication.c: 375: elog(ERROR,
> "unexpected table_lock_tuple status: %u", res);
>
>
> regards, tom lane

Hello,
I have made the new core dump. Here are the results:
PostgreSQL 12devel-master/bb76134

Log:
TRAP: FailedAssertion("!((((tuple->t_data->t_infomask) & 0x0080) ||
(((tuple->t_data->t_infomask) & (0x1000 | ((0x0040 | 0x0010) | 0x0040 |
0x0010))) == 0x0040)))", File: "heapam.c", Line:
2019-04-03 15:13:36.668 +07 [23163] LOG:  server process (PID 27904) was
terminated by signal 6: Aborted
2019-04-03 15:13:36.668 +07 [23163] DETAIL:  Failed process was running:
UPDATE pgbench_accounts SET abalance = 1 WHERE aid = 2;
2019-04-03 15:13:36.668 +07 [23163] LOG:  terminating any other active
server processes
2019-04-03 15:13:36.669 +07 [27907] WARNING:  terminating connection
because of crash of another server process

Backtrace:
[New LWP 27904]
[Thread debugging using libthread_db enabled]
Using host libthread_db library
"/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `postgres: zharkov test 127.0.0.1(44582) UPDATE    
     '.
Program terminated with signal SIGABRT, Aborted.
#0  0x00007f5e0dc7b428 in __GI_raise (sig=sig@entry=6) at
../sysdeps/unix/sysv/linux/raise.c:54
54      ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) bt
#0  0x00007f5e0dc7b428 in __GI_raise (sig=sig@entry=6) at
../sysdeps/unix/sysv/linux/raise.c:54
#1  0x00007f5e0dc7d02a in __GI_abort () at abort.c:89
#2  0x0000000000a9b7ce in ExceptionalCondition (
     conditionName=0xb3de08 "!((((tuple->t_data->t_infomask) & 0x0080) ||
(((tuple->t_data->t_infomask) & (0x1000 | ((0x0040 | 0x0010) | 0x0040 |
0x0010))) == 0x0040)))",
     errorType=0xb39a69 "FailedAssertion", fileName=0xb39ae8 "heapam.c",
lineNumber=4286) at assert.c:54
#3  0x00000000004ea024 in heap_lock_tuple (relation=0x7f5e0f252d80,
tuple=0x30f4dd0, cid=0, mode=LockTupleNoKeyExclusive,
wait_policy=LockWaitBlock, follow_updates=false,
     buffer=0x7ffed968cbb4, tmfd=0x7ffed968cd80) at heapam.c:4286
#4  0x00000000004f5903 in heapam_tuple_lock (relation=0x7f5e0f252d80,
tid=0x7ffed968cdf0, snapshot=0x303cef0, slot=0x30f4d80, cid=0,
mode=LockTupleNoKeyExclusive,
     wait_policy=LockWaitBlock, flags=2 '\002', tmfd=0x7ffed968cd80) at
heapam_handler.c:352
#5  0x000000000074055a in table_lock_tuple (rel=0x7f5e0f252d80,
tid=0x7ffed968cdf0, snapshot=0x303cef0, slot=0x30f4d80, cid=0,
mode=LockTupleNoKeyExclusive, wait_policy=LockWaitBlock,
     flags=2 '\002', tmfd=0x7ffed968cd80) at
../../../src/include/access/tableam.h:993
#6  0x00000000007421cc in ExecUpdate (mtstate=0x30eb780,
tupleid=0x7ffed968cdf0, oldtuple=0x0, slot=0x30ed1c8,
planSlot=0x30ed050, epqstate=0x30eb878, estate=0x30eb3d8,
canSetTag=true)
     at nodeModifyTable.c:1242
#7  0x0000000000743756 in ExecModifyTable (pstate=0x30eb780) at
nodeModifyTable.c:2070
#8  0x0000000000712671 in ExecProcNodeFirst (node=0x30eb780) at
execProcnode.c:445
#9  0x000000000070748c in ExecProcNode (node=0x30eb780) at
../../../src/include/executor/executor.h:239
#10 0x0000000000709d11 in ExecutePlan (estate=0x30eb3d8,
planstate=0x30eb780, use_parallel_mode=false, operation=CMD_UPDATE,
sendTuples=false, numberTuples=0,
     direction=ForwardScanDirection, dest=0x30ee508, execute_once=true)
at execMain.c:1643
#11 0x0000000000707ac9 in standard_ExecutorRun (queryDesc=0x303faf8,
direction=ForwardScanDirection, count=0, execute_once=true) at
execMain.c:362
#12 0x0000000000707905 in ExecutorRun (queryDesc=0x303faf8,
direction=ForwardScanDirection, count=0, execute_once=true) at
execMain.c:306
#13 0x0000000000915950 in ProcessQuery (plan=0x30df120,
sourceText=0x30161d8 "UPDATE pgbench_accounts SET abalance = 1 WHERE aid
= 2;", params=0x0, queryEnv=0x0, dest=0x30ee508,
     completionTag=0x7ffed968d270 "") at pquery.c:161
#14 0x000000000091739c in PortalRunMulti (portal=0x3098078,
isTopLevel=true, setHoldSnapshot=false, dest=0x30ee508,
altdest=0x30ee508, completionTag=0x7ffed968d270 "") at pquery.c:1283
#15 0x00000000009168e8 in PortalRun (portal=0x3098078,
count=9223372036854775807, isTopLevel=true, run_once=true,
dest=0x30ee508, altdest=0x30ee508, completionTag=0x7ffed968d270 "")
     at pquery.c:796
#16 0x000000000091027a in exec_simple_query (query_string=0x30161d8
"UPDATE pgbench_accounts SET abalance = 1 WHERE aid = 2;") at
postgres.c:1215
#17 0x00000000009148ff in PostgresMain (argc=1, argv=0x3042980,
dbname=0x3042960 "test", username=0x3012d98 "zharkov") at
postgres.c:4247
#18 0x000000000086406c in BackendRun (port=0x303b030) at
postmaster.c:4399
#19 0x00000000008637cd in BackendStartup (port=0x303b030) at
postmaster.c:4090
#20 0x000000000085facf in ServerLoop () at postmaster.c:1703
#21 0x000000000085f2dd in PostmasterMain (argc=3, argv=0x3010d00) at
postmaster.c:1376
#22 0x000000000077aeba in main (argc=3, argv=0x3010d00) at main.c:228


The changes I made:
diff --git a/src/backend/commands/trigger.c
b/src/backend/commands/trigger.c
index e03ffdd..ab3bace 100644
--- a/src/backend/commands/trigger.c
+++ b/src/backend/commands/trigger.c
@@ -3358,7 +3358,7 @@ GetTupleForTrigger(EState *estate,
                                         ereport(ERROR,
                                                         
(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
                                                          errmsg("could
not serialize access due to concurrent update")));
-                               elog(ERROR, "unexpected table_lock_tuple
status: %u", test);
+                               elog(PANIC, "unexpected table_lock_tuple
status: %u", test);
                                 break;

                         case TM_Deleted:
diff --git a/src/backend/executor/execReplication.c
b/src/backend/executor/execReplication.c
index d8b48c6..d06cbe0 100644
--- a/src/backend/executor/execReplication.c
+++ b/src/backend/executor/execReplication.c
@@ -207,7 +207,7 @@ retry:
                                 elog(ERROR, "attempted to lock invisible
tuple");
                                 break;
                         default:
-                               elog(ERROR, "unexpected table_lock_tuple
status: %u", res);
+                               elog(PANIC, "unexpected table_lock_tuple
status: %u", res);
                                 break;
                 }
         }
@@ -371,7 +371,7 @@ retry:
                                 elog(ERROR, "attempted to lock invisible
tuple");
                                 break;
                         default:
-                               elog(ERROR, "unexpected table_lock_tuple
status: %u", res);
+                               elog(PANIC, "unexpected table_lock_tuple
status: %u", res);
                                 break;
                 }
         }
diff --git a/src/backend/executor/nodeLockRows.c
b/src/backend/executor/nodeLockRows.c
index 7674ac8..c7a261c 100644
--- a/src/backend/executor/nodeLockRows.c
+++ b/src/backend/executor/nodeLockRows.c
@@ -229,7 +229,7 @@ lnext:
                                         ereport(ERROR,
                                                         
(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
                                                          errmsg("could
not serialize access due to concurrent update")));
-                               elog(ERROR, "unexpected table_lock_tuple
status: %u",
+                               elog(PANIC, "unexpected table_lock_tuple
status: %u",
                                          test);
                                 break;

diff --git a/src/backend/executor/nodeModifyTable.c
b/src/backend/executor/nodeModifyTable.c
index 7be0e77..b15b7a0 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -780,7 +780,7 @@ ldelete:;
                                                          * locking the
latest version via
                                                          *
TUPLE_LOCK_FLAG_FIND_LAST_VERSION.
                                                          */
-                                                       elog(ERROR,
"unexpected table_lock_tuple status: %u",
+                                                       elog(PANIC,
"unexpected table_lock_tuple status: %u",
                                                                 
result);
                                                         return NULL;
                                         }
@@ -1269,7 +1269,7 @@ lreplace:;

                                                 default:
                                                         /* see
table_lock_tuple call in ExecDelete() */
-                                                       elog(ERROR,
"unexpected table_lock_tuple status: %u",
+                                                       elog(PANIC,
"unexpected table_lock_tuple status: %u",
                                                                 
result);
                                                         return NULL;
                                         }


regards,

Roman Zharkov


Reply | Threaded
Open this post in threaded view
|

Re: BUG #15727: PANIC: cannot abort transaction 295144144, it was already committed

r.zharkov
In reply to this post by Andres Freund
On 2019-04-03 00:11, Andres Freund wrote:

> Hi,
>
> On 2019-04-03 00:04:15 +0700, [hidden email] wrote:
>> pgbench connects to the database through the connection pooller called
>> Odyssey.
>> We use pool of 16 or 32 backends.
>> We run four series of pgbench tests with different connections count (
>> 100,
>> 150, 200, ... , 1000 )
>> Three series with pool size 16 runs normally.
>> But fourth series fails with 200-300 connections. It took about five
>> hours
>> to reproduce the error this time.
>>
>> Here is the little piece of the pgbench log:
>>
>> Bench start
>> Tue Apr  2 14:43:12 UTC 2019
>> Timestamp 1554216192
>> transaction type: multiple scripts
>> scaling factor: 1
>
> Is this the actual scaling factor? Did you create the pgbench database
> with foreign keys?
>
> Greetings,
>
> Andres Freund
Hi,
Yes, 1 is the actual scaling factor.

My script to create database:
/db/zharkov/.vanila/bin/pg_ctl -D /db/zharkov/vanila -l
/db/zharkov/vanila/log.log stop
rm -rf /db/zharkov/vanila/*
/db/zharkov/.vanila/bin/initdb -D /db/zharkov/vanila --data-checksums
#/db/zharkov/.vanila/bin/initdb -D /db/zharkov/vanila
cp postgresql.conf /db/zharkov/vanila/
ulimit -c unlimited
/db/zharkov/.vanila/bin/pg_ctl -D /db/zharkov/vanila -c -l
/db/zharkov/vanila/log.log start
/db/zharkov/.vanila/bin/psql postgres -c 'create database test'
/db/zharkov/.vanila/bin/psql test -c 'create extension pg_prewarm;'
/db/zharkov/.vanila/bin/pgbench -i test -s 1
#/db/zharkov/.vanila/bin/pgbench -i test -s 1000
/db/zharkov/.vanila/bin/psql test -c "select pg_prewarm(
'pgbench_accounts' );"
/db/zharkov/.vanila/bin/psql test -c "select pg_prewarm(
'pgbench_branches' );"
/db/zharkov/.vanila/bin/psql test -c "select pg_prewarm(
'pgbench_history' );"
/db/zharkov/.vanila/bin/psql test -c "select pg_prewarm(
'pgbench_tellers' );"

The dump of the server settings is in the attachment.

--
regards,

Roman Zharkov

settings.txt (10K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: BUG #15727: PANIC: cannot abort transaction 295144144, it was already committed

Alvaro Herrera-9
In reply to this post by r.zharkov
On 2019-Apr-03, [hidden email] wrote:

> #0  0x00007f5e0dc7b428 in __GI_raise (sig=sig@entry=6) at
> ../sysdeps/unix/sysv/linux/raise.c:54
> #1  0x00007f5e0dc7d02a in __GI_abort () at abort.c:89
> #2  0x0000000000a9b7ce in ExceptionalCondition (
>     conditionName=0xb3de08 "!((((tuple->t_data->t_infomask) & 0x0080) ||
> (((tuple->t_data->t_infomask) & (0x1000 | ((0x0040 | 0x0010) | 0x0040 |
> 0x0010))) == 0x0040)))",
>     errorType=0xb39a69 "FailedAssertion", fileName=0xb39ae8 "heapam.c",
> lineNumber=4286) at assert.c:54
> #3  0x00000000004ea024 in heap_lock_tuple (relation=0x7f5e0f252d80,
> tuple=0x30f4dd0, cid=0, mode=LockTupleNoKeyExclusive,
> wait_policy=LockWaitBlock, follow_updates=false,
>     buffer=0x7ffed968cbb4, tmfd=0x7ffed968cd80) at heapam.c:4286

Uhm ... this assertion failure is doesn't seem related to the other
problem you were reporting.  Here, it's failing because it thinks the
tuple should be marked as only locked when the current transaction is
the only Xmax for the tuple.

                /*
                 * As a check independent from those above, we can also avoid sleeping
                 * if the current transaction is the sole locker of the tuple.  Note
                 * that the strength of the lock already held is irrelevant; this is
                 * not about recording the lock in Xmax (which will be done regardless
                 * of this optimization, below).  Also, note that the cases where we
                 * hold a lock stronger than we are requesting are already handled
                 * above by not doing anything.
                 *
                 * Note we only deal with the non-multixact case here; MultiXactIdWait
                 * is well equipped to deal with this situation on its own.
                 */
                if (require_sleep && !(infomask & HEAP_XMAX_IS_MULTI) &&
                        TransactionIdIsCurrentTransactionId(xwait))
                {
                        /* ... but if the xmax changed in the meantime, start over */
                        LockBuffer(*buffer, BUFFER_LOCK_EXCLUSIVE);
                        if (xmax_infomask_changed(tuple->t_data->t_infomask, infomask) ||
                                !TransactionIdEquals(HeapTupleHeaderGetRawXmax(tuple->t_data),
                                                                         xwait))
                                goto l3;
                        Assert(HEAP_XMAX_IS_LOCKED_ONLY(tuple->t_data->t_infomask));    //  <--- failed assertion
                        require_sleep = false;
                }

I'm not quite sure what's up with that.

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


123