BUG #15672: PostgreSQL 11.1/11.2 crashed after dropping a partition table

classic Classic list List threaded Threaded
8 messages Options
Reply | Threaded
Open this post in threaded view
|

BUG #15672: PostgreSQL 11.1/11.2 crashed after dropping a partition table

PG Bug reporting form
The following bug has been logged on the website:

Bug reference:      15672
Logged by:          Jianing Yang
Email address:      [hidden email]
PostgreSQL version: 11.2
Operating system:   Ubuntu 18.10
Description:        

Reproduce steps:

1. create a partition table with the following constraints
    a. with a unique key on partition key and a varchar type field
    b. using hash partition
2. alter the  length of the varchar type field
3. drop the partition table in a transaction
4. crash

Screenshot:
→ pgcli -h localhost -p 5555 -U postgres -W
Server: PostgreSQL 11.2
Version: 2.0.1
Chat: https://gitter.im/dbcli/pgcli
Mail: https://groups.google.com/forum/#!forum/pgcli
Home: http://pgcli.com

postgres@localhost:postgres> show server_version;                          
                               
+-------------------------------+
| server_version                |
|-------------------------------|
| 11.2 (Debian 11.2-1.pgdg90+1) |
+-------------------------------+
SHOW
Time: 0.011s
postgres@localhost:postgres> create table users(user_id int, name varchar
(64), unique (user_id, name)) partition by hash(user_id);                  
                                                               
CREATE TABLE
Time: 0.004s
postgres@localhost:postgres> create table users_000 partition of users for
values with (modulus 2, remainder 0);                                      
                                                               
CREATE TABLE
Time: 0.012s
postgres@localhost:postgres> create table users_001 partition of users for
values with (modulus 2, remainder 1);                                      
                                                               
CREATE TABLE
Time: 0.012s
postgres@localhost:postgres> alter table users alter column name type
varchar(127);                        
You're about to run a destructive command.
Do you want to proceed? (y/n): y
Your call!
ALTER TABLE
Time: 0.007s
postgres@localhost:postgres> \d users;                                      
                               
+----------+------------------------+-------------+
| Column   | Type                   | Modifiers   |
|----------+------------------------+-------------|
| user_id  | integer                |             |
| name     | character varying(127) |             |
+----------+------------------------+-------------+
Indexes:
    "users_user_id_name_key" UNIQUE CONSTRAINT, btree (user_id, name)
Partition key: HASH (user_id)
Number of partitions 2: (Use \d+ to list them.)

Time: 0.012s
postgres@localhost:postgres> begin;                                        
                               
BEGIN
Time: 0.001s
postgres@localhost:postgres> drop table users;                              
                               
You're about to run a destructive command.
Do you want to proceed? (y/n): y
Your call!
DROP TABLE
Time: 0.002s
postgres@localhost:postgres> commit;  
Connection reset. Reconnect (Y/n):


Server Log:

2019-03-06 14:59:26.101 UTC [61] ERROR:  SMgrRelation hashtable corrupted
2019-03-06 14:59:26.101 UTC [61] STATEMENT:  commit
2019-03-06 14:59:26.101 UTC [61] WARNING:  AbortTransaction while in COMMIT
state
2019-03-06 14:59:26.101 UTC [61] PANIC:  cannot abort transaction 573, it
was already committed
2019-03-06 14:59:26.178 UTC [1] LOG:  server process (PID 61) was terminated
by signal 6: Aborted
2019-03-06 14:59:26.178 UTC [1] DETAIL:  Failed process was running:
commit
2019-03-06 14:59:26.178 UTC [1] LOG:  terminating any other active server
processes
2019-03-06 14:59:26.178 UTC [58] WARNING:  terminating connection because of
crash of another server process
2019-03-06 14:59:26.178 UTC [58] DETAIL:  The postmaster has commanded this
server process to roll back the current transaction and exit, because
another server process exited abnormally and possibly corrupted shared
memory.
2019-03-06 14:59:26.178 UTC [58] HINT:  In a moment you should be able to
reconnect to the database and repeat your command.
2019-03-06 14:59:26.179 UTC [1] LOG:  all server processes terminated;
reinitializing
2019-03-06 14:59:26.186 UTC [68] LOG:  database system was interrupted; last
known up at 2019-03-06 14:58:30 UTC
2019-03-06 14:59:26.212 UTC [68] LOG:  database system was not properly shut
down; automatic recovery in progress
2019-03-06 14:59:26.214 UTC [68] LOG:  redo starts at 0/1650278
2019-03-06 14:59:26.215 UTC [68] FATAL:  SMgrRelation hashtable corrupted
2019-03-06 14:59:26.215 UTC [68] CONTEXT:  WAL redo at 0/16768C8 for
Transaction/COMMIT: 2019-03-06 14:59:26.099739+00; rels: base/13067/16389
base/13067/16387 base/13067/16394 base/13067/16387; inval msgs: catcache 41
catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7
catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6
catcache 7 catcache 6 catcache 50 catcache 49 catcache 19 catcache 32
catcache 7 catcache 6 catcache 7 catcache 6 catcache 50 catcache 49 catcache
74 catcache 73 catcache 74 catcache 73 catcache 7 catcache 6 catcache 7
catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6
catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache
50 catcache 49 catcache 19 catcache 32 catcache 7 catcache 6 catcache 7
catcache 6 catcache 50 catcache 49 catcache 74 catcache 73 catcache 74
catcache 73 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache
6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache
7 catcache 6 catcache 7 catcache 6 catcache 50 catcache 49 catcache 19
catcache 32 catcache 7 catcache 6 catcache 7 catcache 6 catcache 50 catcache
49 catcache 74 catcache 73 catcache 74 catcache 73 relcache 16384 snapshot
2608 snapshot 2608 relcache 16399 relcache 16384 snapshot 2608 snapshot 2608
snapshot 2608 relcache 16389 relcache 16384 snapshot 2608 snapshot 2608
relcache 16401 relcache 16389 snapshot 2608 snapshot 2608 snapshot 2608
relcache 16394 relcache 16384 snapshot 2608 snapshot 2608 relcache 16403
relcache 16394 snapshot 2608 snapshot 2608 snapshot 2608
2019-03-06 14:59:26.216 UTC [1] LOG:  startup process (PID 68) exited with
exit code 1
2019-03-06 14:59:26.216 UTC [1] LOG:  aborting startup due to startup
process failure
2019-03-06 14:59:26.217 UTC [1] LOG:  database system is shut down


Affected Server Version:

11.1
11.2

Reply | Threaded
Open this post in threaded view
|

Re: BUG #15672: PostgreSQL 11.1/11.2 crashed after dropping a partition table

Michael Paquier-2
On Wed, Mar 06, 2019 at 03:06:53PM +0000, PG Bug reporting form wrote:
> 1. create a partition table with the following constraints
>     a. with a unique key on partition key and a varchar type field
>     b. using hash partition
> 2. alter the  length of the varchar type field
> 3. drop the partition table in a transaction
> 4. crash

I can reproduce the failure easily, not on HEAD but with
REL_11_STABLE:
(gdb) bt
#0  __GI_raise (sig=sig@entry=6) at
../sysdeps/unix/sysv/linux/raise.c:50
#1  0x00007f585729b535 in __GI_abort () at abort.c:79
#2  0x000055eef597e60a in errfinish (dummy=0) at elog.c:555
#3  0x000055eef5980c50 in elog_finish (elevel=22, fmt=0x55eef5a41408
"cannot abort transaction %u, it was already committed") at
elog.c:1376
#4  0x000055eef5479647 in RecordTransactionAbort (isSubXact=false) at
xact.c:1580
#5  0x000055eef547a6c0 in AbortTransaction () at xact.c:2602
#6  0x000055eef547aef4 in AbortCurrentTransaction () at xact.c:3104

That's worth an investigation, SMgrRelationHash is getting messed up
which causes the transaction commit to fail where it should not.
--
Michael

signature.asc (849 bytes) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: BUG #15672: PostgreSQL 11.1/11.2 crashed after dropping a partition table

Amit Langote-2
Hi,

On 2019/03/07 9:03, Michael Paquier wrote:

> On Wed, Mar 06, 2019 at 03:06:53PM +0000, PG Bug reporting form wrote:
>> 1. create a partition table with the following constraints
>>     a. with a unique key on partition key and a varchar type field
>>     b. using hash partition
>> 2. alter the  length of the varchar type field
>> 3. drop the partition table in a transaction
>> 4. crash
>
> I can reproduce the failure easily, not on HEAD but with
> REL_11_STABLE:

Same here.  I could reproduce it with 11.0.

> (gdb) bt
> #0  __GI_raise (sig=sig@entry=6) at
> ../sysdeps/unix/sysv/linux/raise.c:50
> #1  0x00007f585729b535 in __GI_abort () at abort.c:79
> #2  0x000055eef597e60a in errfinish (dummy=0) at elog.c:555
> #3  0x000055eef5980c50 in elog_finish (elevel=22, fmt=0x55eef5a41408
> "cannot abort transaction %u, it was already committed") at
> elog.c:1376
> #4  0x000055eef5479647 in RecordTransactionAbort (isSubXact=false) at
> xact.c:1580
> #5  0x000055eef547a6c0 in AbortTransaction () at xact.c:2602
> #6  0x000055eef547aef4 in AbortCurrentTransaction () at xact.c:3104
>
> That's worth an investigation, SMgrRelationHash is getting messed up
> which causes the transaction commit to fail where it should not.

Looking at what was causing the SMgrRelationHash corruption, it seems
there were entries with same/duplicated relnode value in pendingDeletes list.

The problem start when ALTER TABLE users ALTER COLUMN is executed.
create table users(user_id int, name varchar(64), unique (user_id, name))
partition by list(user_id);

create table users_000 partition of users for values in(0);
create table users_001 partition of users for values in(1);
select relname, relfilenode from pg_class where relname like 'users%';
          relname           │ relfilenode
────────────────────────────┼─────────────
 users                      │       16441
 users_000                  │       16446
 users_000_user_id_name_key │       16449
 users_001                  │       16451
 users_001_user_id_name_key │       16454
 users_user_id_name_key     │       16444
(6 rows)

alter table users alter column name type varchar(127);
select relname, relfilenode from pg_class where relname like 'users%';
          relname           │ relfilenode
────────────────────────────┼─────────────
 users                      │       16441
 users_000                  │       16446
 users_000_user_id_name_key │       16444    <===  duplicated
 users_001                  │       16451
 users_001_user_id_name_key │       16444    <===  duplicated
 users_user_id_name_key     │       16444    <===  duplicated
(6 rows)

Ran out off time...

Thanks,
Amit


Reply | Threaded
Open this post in threaded view
|

Re: BUG #15672: PostgreSQL 11.1/11.2 crashed after dropping a partition table

Amit Langote
On Thu, Mar 7, 2019 at 11:17 AM Amit Langote
<[hidden email]> wrote:

> The problem start when ALTER TABLE users ALTER COLUMN is executed.
> create table users(user_id int, name varchar(64), unique (user_id, name))
> partition by list(user_id);
>
> create table users_000 partition of users for values in(0);
> create table users_001 partition of users for values in(1);
> select relname, relfilenode from pg_class where relname like 'users%';
>           relname           │ relfilenode
> ────────────────────────────┼─────────────
>  users                      │       16441
>  users_000                  │       16446
>  users_000_user_id_name_key │       16449
>  users_001                  │       16451
>  users_001_user_id_name_key │       16454
>  users_user_id_name_key     │       16444
> (6 rows)
>
> alter table users alter column name type varchar(127);
> select relname, relfilenode from pg_class where relname like 'users%';
>           relname           │ relfilenode
> ────────────────────────────┼─────────────
>  users                      │       16441
>  users_000                  │       16446
>  users_000_user_id_name_key │       16444    <===  duplicated
>  users_001                  │       16451
>  users_001_user_id_name_key │       16444    <===  duplicated
>  users_user_id_name_key     │       16444    <===  duplicated
> (6 rows)

I checked why users_000's and user_0001's indexes end up reusing
users_user_id_name_key's relfilenode.  At the surface, it's because
DefineIndex(<parent's-index-to-be-recreated>) is carrying oldNode =
<old-parents-index's-relfilenode> in IndexStmt, which is recursively
passed down to DefineIndex(<child-indexes-to-be-recreated>).  This
DefineIndex() chain is running due to ATPostAlterTypeCleanup() on the
parent rel.  This surface problem may be solved in DefineIndex() by
just resetting oldNode in each child IndexStmt before recursing, but
that means child indexes are recreated with new relfilenodes.  That
solves the immediate problem of relfilenodes being wrongly duplicated,
that's leading to madness such as SMgrRelationHash corruption being
seen in the original bug report.

But, the root problem seems to be that ATPostAlterTypeCleanup() on
child tables isn't setting up their own
DefineIndex(<child-index-to-be-rewritten>) step.  That's because the
parent's ATPostAlterTypeCleanup() dropped child copies of the UNIQUE
constraint due to dependencies (+ CCI).  So, ATExecAlterColumnType()
on child relations isn't able to find the constraint on the individual
child relations to turn into their own
DefineIndex(<child-index-to-be-rewritten>). If we manage to handle
each relation's ATPostAlterTypeCleanup() independently, child's
recreated indexes will be able to reuse their old relfilenodes and
everything will be fine.  But maybe that will require significant
overhaul of how this post-alter-type-cleanup occurs?

Thanks,
Amit

Reply | Threaded
Open this post in threaded view
|

Re: BUG #15672: PostgreSQL 11.1/11.2 crashed after dropping a partition table

Amit Langote-2
On 2019/03/07 20:36, Amit Langote wrote:

> On Thu, Mar 7, 2019 at 11:17 AM Amit Langote
> <[hidden email]> wrote:
>> The problem start when ALTER TABLE users ALTER COLUMN is executed.
>> create table users(user_id int, name varchar(64), unique (user_id, name))
>> partition by list(user_id);
>>
>> create table users_000 partition of users for values in(0);
>> create table users_001 partition of users for values in(1);
>> select relname, relfilenode from pg_class where relname like 'users%';
>>           relname           │ relfilenode
>> ────────────────────────────┼─────────────
>>  users                      │       16441
>>  users_000                  │       16446
>>  users_000_user_id_name_key │       16449
>>  users_001                  │       16451
>>  users_001_user_id_name_key │       16454
>>  users_user_id_name_key     │       16444
>> (6 rows)
>>
>> alter table users alter column name type varchar(127);
>> select relname, relfilenode from pg_class where relname like 'users%';
>>           relname           │ relfilenode
>> ────────────────────────────┼─────────────
>>  users                      │       16441
>>  users_000                  │       16446
>>  users_000_user_id_name_key │       16444    <===  duplicated
>>  users_001                  │       16451
>>  users_001_user_id_name_key │       16444    <===  duplicated
>>  users_user_id_name_key     │       16444    <===  duplicated
>> (6 rows)
>
> I checked why users_000's and user_0001's indexes end up reusing
> users_user_id_name_key's relfilenode.  At the surface, it's because
> DefineIndex(<parent's-index-to-be-recreated>) is carrying oldNode =
> <old-parents-index's-relfilenode> in IndexStmt, which is recursively
> passed down to DefineIndex(<child-indexes-to-be-recreated>).  This
> DefineIndex() chain is running due to ATPostAlterTypeCleanup() on the
> parent rel.  This surface problem may be solved in DefineIndex() by
> just resetting oldNode in each child IndexStmt before recursing, but
> that means child indexes are recreated with new relfilenodes.  That
> solves the immediate problem of relfilenodes being wrongly duplicated,
> that's leading to madness such as SMgrRelationHash corruption being
> seen in the original bug report.
This doesn't happen in HEAD, because in HEAD we got 807ae415c5, which
changed things so that partitioned relations always have their relfilenode
set to 0.  So, there's no question of parent's relfilenode being passed to
children and hence being duplicated.

But that also means child indexes are unnecessarily rewritten, that is,
with new relfilenodes.

> But, the root problem seems to be that ATPostAlterTypeCleanup() on
> child tables isn't setting up their own
> DefineIndex(<child-index-to-be-rewritten>) step.  That's because the
> parent's ATPostAlterTypeCleanup() dropped child copies of the UNIQUE
> constraint due to dependencies (+ CCI).  So, ATExecAlterColumnType()
> on child relations isn't able to find the constraint on the individual
> child relations to turn into their own
> DefineIndex(<child-index-to-be-rewritten>). If we manage to handle
> each relation's ATPostAlterTypeCleanup() independently, child's
> recreated indexes will be able to reuse their old relfilenodes and
> everything will be fine.  But maybe that will require significant
> overhaul of how this post-alter-type-cleanup occurs?
We could try to solve this dividing ATPostAlterTypeCleanup processing into
two functions:

1 The first one runs right after ATExecAlterColumnType() is finished for a
given table (like it does today), which then runs ATPostAlterTypeParse to
generate commands for constraints and/or indexes to re-add.  This part
won't drop the old constraints/indexes just yet, so child
constraints/indexes will remain for ATExecAlterColumnType to see when
executed for the children.

2. Dropping the old constraints/indexes is the responsibility of the 2nd
part, which runs just before executing ATExecReAddIndex or
ATExecAddConstraint (with is_readd = true), so that the new constraints
don't collide with the existing ones.

This arrangement allows step 1 to generate the commands to recreate even
the child indexes such that the old relfilenode can be be preserved by
setting IndexStmt.oldNode.

Attached patch is a very rough sketch, which fails some regression tests,
but I ran out of time today.

Thanks,
Amit

wip-post-alter-type-cleanup-divide.patch (7K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: BUG #15672: PostgreSQL 11.1/11.2 crashed after dropping a partition table

Amit Langote-2
On 2019/03/08 19:22, Amit Langote wrote:

> On 2019/03/07 20:36, Amit Langote wrote:
>> On Thu, Mar 7, 2019 at 11:17 AM Amit Langote
>> <[hidden email]> wrote:
>>> The problem start when ALTER TABLE users ALTER COLUMN is executed.
>>> create table users(user_id int, name varchar(64), unique (user_id, name))
>>> partition by list(user_id);
>>>
>>> create table users_000 partition of users for values in(0);
>>> create table users_001 partition of users for values in(1);
>>> select relname, relfilenode from pg_class where relname like 'users%';
>>>           relname           │ relfilenode
>>> ────────────────────────────┼─────────────
>>>  users                      │       16441
>>>  users_000                  │       16446
>>>  users_000_user_id_name_key │       16449
>>>  users_001                  │       16451
>>>  users_001_user_id_name_key │       16454
>>>  users_user_id_name_key     │       16444
>>> (6 rows)
>>>
>>> alter table users alter column name type varchar(127);
>>> select relname, relfilenode from pg_class where relname like 'users%';
>>>           relname           │ relfilenode
>>> ────────────────────────────┼─────────────
>>>  users                      │       16441
>>>  users_000                  │       16446
>>>  users_000_user_id_name_key │       16444    <===  duplicated
>>>  users_001                  │       16451
>>>  users_001_user_id_name_key │       16444    <===  duplicated
>>>  users_user_id_name_key     │       16444    <===  duplicated
>>> (6 rows)
>>
>> I checked why users_000's and user_0001's indexes end up reusing
>> users_user_id_name_key's relfilenode.  At the surface, it's because
>> DefineIndex(<parent's-index-to-be-recreated>) is carrying oldNode =
>> <old-parents-index's-relfilenode> in IndexStmt, which is recursively
>> passed down to DefineIndex(<child-indexes-to-be-recreated>).  This
>> DefineIndex() chain is running due to ATPostAlterTypeCleanup() on the
>> parent rel.  This surface problem may be solved in DefineIndex() by
>> just resetting oldNode in each child IndexStmt before recursing, but
>> that means child indexes are recreated with new relfilenodes.  That
>> solves the immediate problem of relfilenodes being wrongly duplicated,
>> that's leading to madness such as SMgrRelationHash corruption being
>> seen in the original bug report.
>
> This doesn't happen in HEAD, because in HEAD we got 807ae415c5, which
> changed things so that partitioned relations always have their relfilenode
> set to 0.  So, there's no question of parent's relfilenode being passed to
> children and hence being duplicated.
>
> But that also means child indexes are unnecessarily rewritten, that is,
> with new relfilenodes.
>
>> But, the root problem seems to be that ATPostAlterTypeCleanup() on
>> child tables isn't setting up their own
>> DefineIndex(<child-index-to-be-rewritten>) step.  That's because the
>> parent's ATPostAlterTypeCleanup() dropped child copies of the UNIQUE
>> constraint due to dependencies (+ CCI).  So, ATExecAlterColumnType()
>> on child relations isn't able to find the constraint on the individual
>> child relations to turn into their own
>> DefineIndex(<child-index-to-be-rewritten>). If we manage to handle
>> each relation's ATPostAlterTypeCleanup() independently, child's
>> recreated indexes will be able to reuse their old relfilenodes and
>> everything will be fine.  But maybe that will require significant
>> overhaul of how this post-alter-type-cleanup occurs?
>
> We could try to solve this dividing ATPostAlterTypeCleanup processing into
> two functions:
>
> 1 The first one runs right after ATExecAlterColumnType() is finished for a
> given table (like it does today), which then runs ATPostAlterTypeParse to
> generate commands for constraints and/or indexes to re-add.  This part
> won't drop the old constraints/indexes just yet, so child
> constraints/indexes will remain for ATExecAlterColumnType to see when
> executed for the children.
>
> 2. Dropping the old constraints/indexes is the responsibility of the 2nd
> part, which runs just before executing ATExecReAddIndex or
> ATExecAddConstraint (with is_readd = true), so that the new constraints
> don't collide with the existing ones.
>
> This arrangement allows step 1 to generate the commands to recreate even
> the child indexes such that the old relfilenode can be be preserved by
> setting IndexStmt.oldNode.
>
> Attached patch is a very rough sketch, which fails some regression tests,
> but I ran out of time today.
I'm thinking of adding this to open items under Older Bugs.  Attached the
patch that I had posted on -bugs, but it's only a rough sketch as
described above, not a full fix.

Link to the original bug report:
https://www.postgresql.org/message-id/flat/15672-b9fa7db32698269f%40postgresql.org

Thanks,
Amit

wip-post-alter-type-cleanup-divide.patch (7K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: BUG #15672: PostgreSQL 11.1/11.2 crashed after dropping a partition table

Michael Paquier-2
On Wed, Mar 27, 2019 at 11:40:12AM +0900, Amit Langote wrote:
> I'm thinking of adding this to open items under Older Bugs.  Attached the
> patch that I had posted on -bugs, but it's only a rough sketch as
> described above, not a full fix.

Adding it to the section for older bugs sounds fine to me.  Thanks for
doing so.
--
Michael

signature.asc (849 bytes) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: BUG #15672: PostgreSQL 11.1/11.2 crashed after dropping a partition table

Amit Langote-2
On 2019/04/11 15:57, Michael Paquier wrote:
> On Wed, Mar 27, 2019 at 11:56:20AM +0900, Michael Paquier wrote:
>> Adding it to the section for older bugs sounds fine to me.  Thanks for
>> doing so.
>
> I have begun looking at this issue.  Hopefully I'll be able to provide
> an update soon.

Great, thanks.

Regards,
Amit