WAL logging problem in 9.4.3?

classic Classic list List threaded Threaded
116 messages Options
1234 ... 6
Reply | Threaded
Open this post in threaded view
|

WAL logging problem in 9.4.3?

Martijn van Oosterhout
Hoi,

I ran into this in our CI setup and I thinks it's an actual bug. The
issue appears to be that when a table is created *and truncated* in a
single transaction, that the WAL log is logging a truncate record it
shouldn't, such that if the database crashes you have a broken index.
It would also lose any data that was in the table at commit time.

I didn't test 9.4.4 yet, though I don't see anything in the release
notes that resemble this.

Detail:

=== Start with an empty database

martijn@martijn-jessie:$ psql ctmp -h localhost -U username
Password for user username:
psql (9.4.3)
Type "help" for help.

ctmp=# begin;
BEGIN
ctmp=# create table test(id serial primary key);
CREATE TABLE
ctmp=# truncate table test;
TRUNCATE TABLE
ctmp=# commit;
COMMIT
ctmp=# select relname, relfilenode from pg_class where relname like 'test%';
   relname   | relfilenode
-------------+-------------
 test        |       16389
 test_id_seq |       16387
 test_pkey   |       16393
(3 rows)

=== Note: if you do a CHECKPOINT here the issue doesn't happen
=== obviously.

ctmp=# \q
martijn@martijn-jessie:$ sudo ls -l /data/postgres/base/16385/{16389,16387,16393}
[sudo] password for martijn:
-rw------- 1 messagebus ssl-cert 8192 Jul  2 23:34 /data/postgres/base/16385/16387
-rw------- 1 messagebus ssl-cert    0 Jul  2 23:34 /data/postgres/base/16385/16389
-rw------- 1 messagebus ssl-cert 8192 Jul  2 23:34 /data/postgres/base/16385/16393

=== Note the index file is 8KB.
=== At this point nuke the database server (in this case it was simply
=== destroying the container it was running in.

=== Dump the xlogs just to show what got recorded. Note there's a
=== truncate for the data file and the index file.

martijn@martijn-jessie:$ sudo /usr/lib/postgresql/9.4/bin/pg_xlogdump -p /data/postgres/pg_xlog/ 000000010000000000000001 |grep -wE '16389|16387|16393'
rmgr: XLOG        len (rec/tot):     72/   104, tx:          0, lsn: 0/016A9240, prev 0/016A9200, bkp: 0000, desc: checkpoint: redo 0/16A9240; tli 1; prev tli 1; fpw true; xid 0/686; oid 16387; multi 1; offset 0; oldest xid 673 in DB 1; oldest multi 1 in DB 1; oldest running xid 0; shutdown
rmgr: Storage     len (rec/tot):     16/    48, tx:          0, lsn: 0/016A92D0, prev 0/016A92A8, bkp: 0000, desc: file create: base/16385/16387
rmgr: Sequence    len (rec/tot):    158/   190, tx:        686, lsn: 0/016B5E50, prev 0/016B5D88, bkp: 0000, desc: log: rel 1663/16385/16387
rmgr: Storage     len (rec/tot):     16/    48, tx:        686, lsn: 0/016B5F10, prev 0/016B5E50, bkp: 0000, desc: file create: base/16385/16389
rmgr: Storage     len (rec/tot):     16/    48, tx:        686, lsn: 0/016BB028, prev 0/016BAFD8, bkp: 0000, desc: file create: base/16385/16393
rmgr: Sequence    len (rec/tot):    158/   190, tx:        686, lsn: 0/016BE4F8, prev 0/016BE440, bkp: 0000, desc: log: rel 1663/16385/16387
rmgr: Storage     len (rec/tot):     16/    48, tx:        686, lsn: 0/016BE6B0, prev 0/016BE660, bkp: 0000, desc: file truncate: base/16385/16389 to 0 blocks
rmgr: Storage     len (rec/tot):     16/    48, tx:        686, lsn: 0/016BE6E0, prev 0/016BE6B0, bkp: 0000, desc: file truncate: base/16385/16393 to 0 blocks
pg_xlogdump: FATAL:  error in WAL record at 0/16BE710: record with zero length at 0/16BE740

=== Start the DB up again

database_1 | LOG:  database system was interrupted; last known up at 2015-07-02 21:08:05 UTC
database_1 | LOG:  database system was not properly shut down; automatic recovery in progress
database_1 | LOG:  redo starts at 0/16A92A8
database_1 | LOG:  record with zero length at 0/16BE740
database_1 | LOG:  redo done at 0/16BE710
database_1 | LOG:  last completed transaction was at log time 2015-07-02 21:34:45.664989+00
database_1 | LOG:  database system is ready to accept connections
database_1 | LOG:  autovacuum launcher started

=== Oops, the index file is empty now

martijn@martijn-jessie:$ sudo ls -l /data/postgres/base/16385/{16389,16387,16393}
-rw------- 1 messagebus ssl-cert 8192 Jul  2 23:37 /data/postgres/base/16385/16387
-rw------- 1 messagebus ssl-cert    0 Jul  2 23:34 /data/postgres/base/16385/16389
-rw------- 1 messagebus ssl-cert    0 Jul  2 23:37 /data/postgres/base/16385/16393

martijn@martijn-jessie:$ psql ctmp -h localhost -U username
Password for user username:
psql (9.4.3)
Type "help" for help.

=== And now the index is broken. I think the only reason it doesn't
=== complain about the data file is because zero bytes there is OK.  But if
=== the table had data before it would be gone now.

ctmp=# select * from test;
ERROR:  could not read block 0 in file "base/16385/16393": read only 0 of 8192 bytes

ctmp=# select version();
                                            version                                            
-----------------------------------------------------------------------------------------------
 PostgreSQL 9.4.3 on x86_64-unknown-linux-gnu, compiled by gcc (Debian 4.9.2-10) 4.9.2, 64-bit
(1 row)

Hope this helps.
--
Martijn van Oosterhout   <[hidden email]>   http://svana.org/kleptog/
> He who writes carelessly confesses thereby at the very outset that he does
> not attach much importance to his own thoughts.
   -- Arthur Schopenhauer

signature.asc (845 bytes) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: WAL logging problem in 9.4.3?

Andres Freund
Hi,

On 2015-07-03 00:05:24 +0200, Martijn van Oosterhout wrote:
> === Start with an empty database

My guess is you have wal_level = minimal?

> ctmp=# begin;
> BEGIN
> ctmp=# create table test(id serial primary key);
> CREATE TABLE
> ctmp=# truncate table test;
> TRUNCATE TABLE
> ctmp=# commit;
> COMMIT
> ctmp=# select relname, relfilenode from pg_class where relname like 'test%';
>    relname   | relfilenode
> -------------+-------------
>  test        |       16389
>  test_id_seq |       16387
>  test_pkey   |       16393
> (3 rows)
>

> === Note the index file is 8KB.
> === At this point nuke the database server (in this case it was simply
> === destroying the container it was running in.

How did you continue from there? The container has persistent storage?
Or are you repapplying the WAL to somewhere else?

> === Dump the xlogs just to show what got recorded. Note there's a
> === truncate for the data file and the index file.

That should be ok.

> martijn@martijn-jessie:$ sudo /usr/lib/postgresql/9.4/bin/pg_xlogdump -p /data/postgres/pg_xlog/ 000000010000000000000001 |grep -wE '16389|16387|16393'
> rmgr: XLOG        len (rec/tot):     72/   104, tx:          0, lsn: 0/016A9240, prev 0/016A9200, bkp: 0000, desc: checkpoint: redo 0/16A9240; tli 1; prev tli 1; fpw true; xid 0/686; oid 16387; multi 1; offset 0; oldest xid 673 in DB 1; oldest multi 1 in DB 1; oldest running xid 0; shutdown
> rmgr: Storage     len (rec/tot):     16/    48, tx:          0, lsn: 0/016A92D0, prev 0/016A92A8, bkp: 0000, desc: file create: base/16385/16387
> rmgr: Sequence    len (rec/tot):    158/   190, tx:        686, lsn: 0/016B5E50, prev 0/016B5D88, bkp: 0000, desc: log: rel 1663/16385/16387
> rmgr: Storage     len (rec/tot):     16/    48, tx:        686, lsn: 0/016B5F10, prev 0/016B5E50, bkp: 0000, desc: file create: base/16385/16389
> rmgr: Storage     len (rec/tot):     16/    48, tx:        686, lsn: 0/016BB028, prev 0/016BAFD8, bkp: 0000, desc: file create: base/16385/16393
> rmgr: Sequence    len (rec/tot):    158/   190, tx:        686, lsn: 0/016BE4F8, prev 0/016BE440, bkp: 0000, desc: log: rel 1663/16385/16387
> rmgr: Storage     len (rec/tot):     16/    48, tx:        686, lsn: 0/016BE6B0, prev 0/016BE660, bkp: 0000, desc: file truncate: base/16385/16389 to 0 blocks
> rmgr: Storage     len (rec/tot):     16/    48, tx:        686, lsn: 0/016BE6E0, prev 0/016BE6B0, bkp: 0000, desc: file truncate: base/16385/16393 to 0 blocks
> pg_xlogdump: FATAL:  error in WAL record at 0/16BE710: record with zero length at 0/16BE740

Note that the truncate will lead to a new, different, relfilenode.

> === Start the DB up again
>
> database_1 | LOG:  database system was interrupted; last known up at 2015-07-02 21:08:05 UTC
> database_1 | LOG:  database system was not properly shut down; automatic recovery in progress
> database_1 | LOG:  redo starts at 0/16A92A8
> database_1 | LOG:  record with zero length at 0/16BE740
> database_1 | LOG:  redo done at 0/16BE710
> database_1 | LOG:  last completed transaction was at log time 2015-07-02 21:34:45.664989+00
> database_1 | LOG:  database system is ready to accept connections
> database_1 | LOG:  autovacuum launcher started
>
> === Oops, the index file is empty now

That's probably just the old index file?

> martijn@martijn-jessie:$ sudo ls -l /data/postgres/base/16385/{16389,16387,16393}
> -rw------- 1 messagebus ssl-cert 8192 Jul  2 23:37 /data/postgres/base/16385/16387
> -rw------- 1 messagebus ssl-cert    0 Jul  2 23:34 /data/postgres/base/16385/16389
> -rw------- 1 messagebus ssl-cert    0 Jul  2 23:37 /data/postgres/base/16385/16393
>
> martijn@martijn-jessie:$ psql ctmp -h localhost -U username
> Password for user username:
> psql (9.4.3)
> Type "help" for help.
>
> === And now the index is broken. I think the only reason it doesn't
> === complain about the data file is because zero bytes there is OK.  But if
> === the table had data before it would be gone now.
>
> ctmp=# select * from test;
> ERROR:  could not read block 0 in file "base/16385/16393": read only 0 of 8192 bytes

Hm. I can't reproduce this. Can you include a bit more details about how
to reproduce?

Greetings,

Andres Freund


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|

Re: WAL logging problem in 9.4.3?

Martijn van Oosterhout
On Fri, Jul 03, 2015 at 12:21:02AM +0200, Andres Freund wrote:
> Hi,
>
> On 2015-07-03 00:05:24 +0200, Martijn van Oosterhout wrote:
> > === Start with an empty database
>
> My guess is you have wal_level = minimal?

Default config, was just initdb'd. So yes, the default wal_level =
minimal.

> > === Note the index file is 8KB.
> > === At this point nuke the database server (in this case it was simply
> > === destroying the container it was running in.
>
> How did you continue from there? The container has persistent storage?
> Or are you repapplying the WAL to somewhere else?

The container has persistant storage on the host. What I think is
actually unusual is that the script that started postgres was missing
an 'exec" so postgres never gets the signal to shutdown.

> > martijn@martijn-jessie:$ sudo /usr/lib/postgresql/9.4/bin/pg_xlogdump -p /data/postgres/pg_xlog/ 000000010000000000000001 |grep -wE '16389|16387|16393'
> > rmgr: XLOG        len (rec/tot):     72/   104, tx:          0, lsn: 0/016A9240, prev 0/016A9200, bkp: 0000, desc: checkpoint: redo 0/16A9240; tli 1; prev tli 1; fpw true; xid 0/686; oid 16387; multi 1; offset 0; oldest xid 673 in DB 1; oldest multi 1 in DB 1; oldest running xid 0; shutdown
> > rmgr: Storage     len (rec/tot):     16/    48, tx:          0, lsn: 0/016A92D0, prev 0/016A92A8, bkp: 0000, desc: file create: base/16385/16387
> > rmgr: Sequence    len (rec/tot):    158/   190, tx:        686, lsn: 0/016B5E50, prev 0/016B5D88, bkp: 0000, desc: log: rel 1663/16385/16387
> > rmgr: Storage     len (rec/tot):     16/    48, tx:        686, lsn: 0/016B5F10, prev 0/016B5E50, bkp: 0000, desc: file create: base/16385/16389
> > rmgr: Storage     len (rec/tot):     16/    48, tx:        686, lsn: 0/016BB028, prev 0/016BAFD8, bkp: 0000, desc: file create: base/16385/16393
> > rmgr: Sequence    len (rec/tot):    158/   190, tx:        686, lsn: 0/016BE4F8, prev 0/016BE440, bkp: 0000, desc: log: rel 1663/16385/16387
> > rmgr: Storage     len (rec/tot):     16/    48, tx:        686, lsn: 0/016BE6B0, prev 0/016BE660, bkp: 0000, desc: file truncate: base/16385/16389 to 0 blocks
> > rmgr: Storage     len (rec/tot):     16/    48, tx:        686, lsn: 0/016BE6E0, prev 0/016BE6B0, bkp: 0000, desc: file truncate: base/16385/16393 to 0 blocks
> > pg_xlogdump: FATAL:  error in WAL record at 0/16BE710: record with zero length at 0/16BE740
>
> Note that the truncate will lead to a new, different, relfilenode.
Really? Comparing the relfilenodes gives the same values before and
after the truncate.
>
> > ctmp=# select * from test;
> > ERROR:  could not read block 0 in file "base/16385/16393": read only 0 of 8192 bytes
>
> Hm. I can't reproduce this. Can you include a bit more details about how
> to reproduce?

Hmm, for me it is 100% reproducable. Are you familiar with Docker? I
can probably construct a Dockerfile that reproduces it pretty reliably.

Have a nice day,
--
Martijn van Oosterhout   <[hidden email]>   http://svana.org/kleptog/
> He who writes carelessly confesses thereby at the very outset that he does
> not attach much importance to his own thoughts.
   -- Arthur Schopenhauer

signature.asc (845 bytes) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: WAL logging problem in 9.4.3?

Fujii Masao-2
On Fri, Jul 3, 2015 at 2:20 PM, Martijn van Oosterhout
<[hidden email]> wrote:

> On Fri, Jul 03, 2015 at 12:21:02AM +0200, Andres Freund wrote:
>> Hi,
>>
>> On 2015-07-03 00:05:24 +0200, Martijn van Oosterhout wrote:
>> > === Start with an empty database
>>
>> My guess is you have wal_level = minimal?
>
> Default config, was just initdb'd. So yes, the default wal_level =
> minimal.
>
>> > === Note the index file is 8KB.
>> > === At this point nuke the database server (in this case it was simply
>> > === destroying the container it was running in.
>>
>> How did you continue from there? The container has persistent storage?
>> Or are you repapplying the WAL to somewhere else?
>
> The container has persistant storage on the host. What I think is
> actually unusual is that the script that started postgres was missing
> an 'exec" so postgres never gets the signal to shutdown.
>
>> > martijn@martijn-jessie:$ sudo /usr/lib/postgresql/9.4/bin/pg_xlogdump -p /data/postgres/pg_xlog/ 000000010000000000000001 |grep -wE '16389|16387|16393'
>> > rmgr: XLOG        len (rec/tot):     72/   104, tx:          0, lsn: 0/016A9240, prev 0/016A9200, bkp: 0000, desc: checkpoint: redo 0/16A9240; tli 1; prev tli 1; fpw true; xid 0/686; oid 16387; multi 1; offset 0; oldest xid 673 in DB 1; oldest multi 1 in DB 1; oldest running xid 0; shutdown
>> > rmgr: Storage     len (rec/tot):     16/    48, tx:          0, lsn: 0/016A92D0, prev 0/016A92A8, bkp: 0000, desc: file create: base/16385/16387
>> > rmgr: Sequence    len (rec/tot):    158/   190, tx:        686, lsn: 0/016B5E50, prev 0/016B5D88, bkp: 0000, desc: log: rel 1663/16385/16387
>> > rmgr: Storage     len (rec/tot):     16/    48, tx:        686, lsn: 0/016B5F10, prev 0/016B5E50, bkp: 0000, desc: file create: base/16385/16389
>> > rmgr: Storage     len (rec/tot):     16/    48, tx:        686, lsn: 0/016BB028, prev 0/016BAFD8, bkp: 0000, desc: file create: base/16385/16393
>> > rmgr: Sequence    len (rec/tot):    158/   190, tx:        686, lsn: 0/016BE4F8, prev 0/016BE440, bkp: 0000, desc: log: rel 1663/16385/16387
>> > rmgr: Storage     len (rec/tot):     16/    48, tx:        686, lsn: 0/016BE6B0, prev 0/016BE660, bkp: 0000, desc: file truncate: base/16385/16389 to 0 blocks
>> > rmgr: Storage     len (rec/tot):     16/    48, tx:        686, lsn: 0/016BE6E0, prev 0/016BE6B0, bkp: 0000, desc: file truncate: base/16385/16393 to 0 blocks
>> > pg_xlogdump: FATAL:  error in WAL record at 0/16BE710: record with zero length at 0/16BE740
>>
>> Note that the truncate will lead to a new, different, relfilenode.
>
> Really? Comparing the relfilenodes gives the same values before and
> after the truncate.

Yep, the relfilenodes are not changed in this case because CREATE TABLE and
TRUNCATE were executed in the same transaction block.

>> > ctmp=# select * from test;
>> > ERROR:  could not read block 0 in file "base/16385/16393": read only 0 of 8192 bytes
>>
>> Hm. I can't reproduce this. Can you include a bit more details about how
>> to reproduce?
>
> Hmm, for me it is 100% reproducable. Are you familiar with Docker? I
> can probably construct a Dockerfile that reproduces it pretty reliably.

I could reproduce the problem in the master branch by doing
the following steps.

1. start the PostgreSQL server with wal_level = minimal
2. execute the following SQL statements
     begin;
     create table test(id serial primary key);
     truncate table test;
     commit;
3. shutdown the server with immediate mode
4. restart the server (crash recovery occurs)
5. execute the following SQL statement
    select * from test;

The optimization of TRUNCATE opereation that we can use when
CREATE TABLE and TRUNCATE are executed in the same transaction block
seems to cause the problem. In this case, only index file truncation is
logged, and index creation in btbuild() is not logged because wal_level
is minimal. Then at the subsequent crash recovery, index file is truncated
to 0 byte... Very simple fix is to log an index creation in that case,
but not sure if that's ok to do..

Regards,

--
Fujii Masao


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|

Re: WAL logging problem in 9.4.3?

Martijn van Oosterhout
On Fri, Jul 03, 2015 at 02:34:44PM +0900, Fujii Masao wrote:
> > Hmm, for me it is 100% reproducable. Are you familiar with Docker? I
> > can probably construct a Dockerfile that reproduces it pretty reliably.
>
> I could reproduce the problem in the master branch by doing
> the following steps.

Thank you, I wasn't sure if you could kill the server fast enough
without containers, but it looks like immediate mode is enough.

> 1. start the PostgreSQL server with wal_level = minimal
> 2. execute the following SQL statements
>      begin;
>      create table test(id serial primary key);
>      truncate table test;
>      commit;
> 3. shutdown the server with immediate mode
> 4. restart the server (crash recovery occurs)
> 5. execute the following SQL statement
>     select * from test;
>
> The optimization of TRUNCATE opereation that we can use when
> CREATE TABLE and TRUNCATE are executed in the same transaction block
> seems to cause the problem. In this case, only index file truncation is
> logged, and index creation in btbuild() is not logged because wal_level
> is minimal. Then at the subsequent crash recovery, index file is truncated
> to 0 byte... Very simple fix is to log an index creation in that case,
> but not sure if that's ok to do..
Looks plausible to me.

For reference I attach a small tarball for reproduction with docker.

1. Unpack tarball into empty dir (it has three small files)
2. docker build -t test .
3. docker run -v /tmp/pgtest:/data test
4. docker run -v /tmp/pgtest:/data test

Data dir is in /tmp/pgtest

Have a nice day,
--
Martijn van Oosterhout   <[hidden email]>   http://svana.org/kleptog/
> He who writes carelessly confesses thereby at the very outset that he does
> not attach much importance to his own thoughts.
   -- Arthur Schopenhauer

postgresql-test.tgz (1K) Download Attachment
signature.asc (845 bytes) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: WAL logging problem in 9.4.3?

Fujii Masao-2
On Fri, Jul 3, 2015 at 3:01 PM, Martijn van Oosterhout
<[hidden email]> wrote:

> On Fri, Jul 03, 2015 at 02:34:44PM +0900, Fujii Masao wrote:
>> > Hmm, for me it is 100% reproducable. Are you familiar with Docker? I
>> > can probably construct a Dockerfile that reproduces it pretty reliably.
>>
>> I could reproduce the problem in the master branch by doing
>> the following steps.
>
> Thank you, I wasn't sure if you could kill the server fast enough
> without containers, but it looks like immediate mode is enough.
>
>> 1. start the PostgreSQL server with wal_level = minimal
>> 2. execute the following SQL statements
>>      begin;
>>      create table test(id serial primary key);
>>      truncate table test;
>>      commit;
>> 3. shutdown the server with immediate mode
>> 4. restart the server (crash recovery occurs)
>> 5. execute the following SQL statement
>>     select * from test;
>>
>> The optimization of TRUNCATE opereation that we can use when
>> CREATE TABLE and TRUNCATE are executed in the same transaction block
>> seems to cause the problem. In this case, only index file truncation is
>> logged, and index creation in btbuild() is not logged because wal_level
>> is minimal. Then at the subsequent crash recovery, index file is truncated
>> to 0 byte... Very simple fix is to log an index creation in that case,
>> but not sure if that's ok to do..

In 9.2 or before, this problem doesn't occur because no such error is thrown
even if an index file size is zero. But in 9.3 or later, since the planner
tries to read a meta page of an index to get the height of the btree tree,
an empty index file causes such error. The planner was changed that way by
commit 31f38f28, and the problem seems to be an oversight of that commit.

I'm not familiar with that change of the planner, but ISTM that we can
simply change _bt_getrootheight() so that 0 is returned if an index file is
empty, i.e., meta page cannot be read, in order to work around the problem.
Thought?

Regards,

--
Fujii Masao


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|

Re: WAL logging problem in 9.4.3?

Tom Lane-2
Fujii Masao <[hidden email]> writes:
> The optimization of TRUNCATE opereation that we can use when
> CREATE TABLE and TRUNCATE are executed in the same transaction block
> seems to cause the problem. In this case, only index file truncation is
> logged, and index creation in btbuild() is not logged because wal_level
> is minimal. Then at the subsequent crash recovery, index file is truncated
> to 0 byte... Very simple fix is to log an index creation in that case,
> but not sure if that's ok to do..

> In 9.2 or before, this problem doesn't occur because no such error is thrown
> even if an index file size is zero. But in 9.3 or later, since the planner
> tries to read a meta page of an index to get the height of the btree tree,
> an empty index file causes such error. The planner was changed that way by
> commit 31f38f28, and the problem seems to be an oversight of that commit.

What?  You want to blame the planner for failing because the index was
left corrupt by broken WAL replay?  A failure would occur anyway at
execution.

                        regards, tom lane


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|

Re: WAL logging problem in 9.4.3?

Fujii Masao-2
On Fri, Jul 3, 2015 at 11:52 PM, Tom Lane <[hidden email]> wrote:

> Fujii Masao <[hidden email]> writes:
>> The optimization of TRUNCATE opereation that we can use when
>> CREATE TABLE and TRUNCATE are executed in the same transaction block
>> seems to cause the problem. In this case, only index file truncation is
>> logged, and index creation in btbuild() is not logged because wal_level
>> is minimal. Then at the subsequent crash recovery, index file is truncated
>> to 0 byte... Very simple fix is to log an index creation in that case,
>> but not sure if that's ok to do..
>
>> In 9.2 or before, this problem doesn't occur because no such error is thrown
>> even if an index file size is zero. But in 9.3 or later, since the planner
>> tries to read a meta page of an index to get the height of the btree tree,
>> an empty index file causes such error. The planner was changed that way by
>> commit 31f38f28, and the problem seems to be an oversight of that commit.
>
> What?  You want to blame the planner for failing because the index was
> left corrupt by broken WAL replay?  A failure would occur anyway at
> execution.

Yep, right. I was not thinking that such index with file size 0 is corrupted
because the reported problem didn't happen before that commit was added.
But that's my fault. Such index can cause an error even in other code paths.

Okay, so probably we need to change WAL replay of TRUNCATE so that
the index file is truncated to one containing only meta page instead of
empty one. That is, the WAL replay of TRUNCATE would need to call
index_build() after smgrtruncate() maybe.

Then how should we implement that? Invent new WAL record type that
calls smgrtruncate() and index_build() during WAL replay? Or add the
special flag to XLOG_SMGR_TRUNCATE record, and make WAL replay
call index_build() only if the flag is found? Any other good idea?
Anyway ISTM that we might need to add or modify WAL record.

Regards,

--
Fujii Masao


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|

Re: WAL logging problem in 9.4.3?

Andres Freund
On 2015-07-04 01:39:42 +0900, Fujii Masao wrote:

> Okay, so probably we need to change WAL replay of TRUNCATE so that
> the index file is truncated to one containing only meta page instead of
> empty one. That is, the WAL replay of TRUNCATE would need to call
> index_build() after smgrtruncate() maybe.
>
> Then how should we implement that? Invent new WAL record type that
> calls smgrtruncate() and index_build() during WAL replay? Or add the
> special flag to XLOG_SMGR_TRUNCATE record, and make WAL replay
> call index_build() only if the flag is found? Any other good idea?
> Anyway ISTM that we might need to add or modify WAL record.

It's easy enough to log something like a metapage with
log_newpage().

But the more interesting question is why that's not hhappening
today. RelationTruncateIndexes() does call the index_build() which
should end up WAL logging the index creation.



--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|

Re: WAL logging problem in 9.4.3?

Tom Lane-2
In reply to this post by Fujii Masao-2
Fujii Masao <[hidden email]> writes:
> Okay, so probably we need to change WAL replay of TRUNCATE so that
> the index file is truncated to one containing only meta page instead of
> empty one. That is, the WAL replay of TRUNCATE would need to call
> index_build() after smgrtruncate() maybe.

That seems completely unworkable.  For one thing, index_build would expect
to be able to do catalog lookups, but we can't assume that the catalogs
are in a good state yet.

I think the responsibility has to be on the WAL-writing end to emit WAL
instructions that lead to a correct on-disk state.  Putting complex
behavior into the reading side is fundamentally misguided.

                        regards, tom lane


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|

Re: WAL logging problem in 9.4.3?

Andres Freund
In reply to this post by Andres Freund
On 2015-07-03 18:49:31 +0200, Andres Freund wrote:
> But the more interesting question is why that's not hhappening
> today. RelationTruncateIndexes() does call the index_build() which
> should end up WAL logging the index creation.

So that's because there's an XLogIsNeeded() preventing it.

Maybe I'm just daft right now (35C outside, 32 inside, so ...), but I'm
right now missing how the whole "skip wal logging if relation has just
been truncated" optimization can ever actually be crashsafe unless we
use a new relfilenode (which we don't!).

Sure, we do an heap_sync() at the the end of the transaction. That's
nice and all. But it doesn't help if we crash and re-start WAL apply
from a checkpoint before the table was created. Because that'll replay
the truncation.

That's much worse than just the indexes - the rows added by a COPY
without WAL logging will also be truncated away, no?


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|

Re: WAL logging problem in 9.4.3?

Martijn van Oosterhout
In reply to this post by Tom Lane-2
On Fri, Jul 03, 2015 at 12:53:56PM -0400, Tom Lane wrote:

> Fujii Masao <[hidden email]> writes:
> > Okay, so probably we need to change WAL replay of TRUNCATE so that
> > the index file is truncated to one containing only meta page instead of
> > empty one. That is, the WAL replay of TRUNCATE would need to call
> > index_build() after smgrtruncate() maybe.
>
> That seems completely unworkable.  For one thing, index_build would expect
> to be able to do catalog lookups, but we can't assume that the catalogs
> are in a good state yet.
>
> I think the responsibility has to be on the WAL-writing end to emit WAL
> instructions that lead to a correct on-disk state.  Putting complex
> behavior into the reading side is fundamentally misguided.
Am I missing something. ISTM that if the truncate record was simply not
logged at all everything would work fine. The whole point is that the
table was created in this transaction and so if it exists the table on
disk must be the correct representation.

The broken index is just one symptom. The heap also shouldn't be
truncated at all. If you insert a row before commit then after replay
the tuple should be there still.

Have a nice day,
--
Martijn van Oosterhout   <[hidden email]>   http://svana.org/kleptog/
> He who writes carelessly confesses thereby at the very outset that he does
> not attach much importance to his own thoughts.
   -- Arthur Schopenhauer

signature.asc (845 bytes) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: WAL logging problem in 9.4.3?

Andres Freund
On 2015-07-03 19:14:26 +0200, Martijn van Oosterhout wrote:
> Am I missing something. ISTM that if the truncate record was simply not
> logged at all everything would work fine. The whole point is that the
> table was created in this transaction and so if it exists the table on
> disk must be the correct representation.

That'd not work either. Consider:

BEGIN;
CREATE TABLE ...
INSERT;
TRUNCATE;
INSERT;
COMMIT;

If you replay that without a truncation wal record the second INSERT
will try to add stuff to already occupied space. And they can have
different lengths and stuff, so you cannot just ignore that fact.

> The broken index is just one symptom.

Agreed. I think the problem is something else though. Namely that we
reuse the relfilenode for heap_truncate_one_rel(). That's just entirely
broken afaics. We need to allocate a new relfilenode and write stuff
into that. Then we can forgo WAL logging the truncation record.

> If you insert a row before commit then after replay the tuple should be there still.

The insert would be WAL logged. COPY skips wal logging tho.


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|

Re: WAL logging problem in 9.4.3?

Andres Freund
In reply to this post by Andres Freund
On 2015-07-03 19:02:29 +0200, Andres Freund wrote:
> Maybe I'm just daft right now (35C outside, 32 inside, so ...), but I'm
> right now missing how the whole "skip wal logging if relation has just
> been truncated" optimization can ever actually be crashsafe unless we
> use a new relfilenode (which we don't!).

We actually used to use a different relfilenode, but optimized that
away: cab9a0656c36739f59277b34fea8ab9438395869

commit cab9a0656c36739f59277b34fea8ab9438395869
Author: Tom Lane <[hidden email]>
Date:   Sun Aug 23 19:23:41 2009 +0000

    Make TRUNCATE do truncate-in-place when processing a relation that was created
    or previously truncated in the current (sub)transaction.  This is safe since
    if the (sub)transaction later rolls back, we'd just discard the rel's current
    physical file anyway.  This avoids unreasonable growth in the number of
    transient files when a relation is repeatedly truncated.  Per a performance
    gripe a couple weeks ago from Todd Cook.

to me the reasoning here looks flawed.


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|

Re: WAL logging problem in 9.4.3?

Martijn van Oosterhout
In reply to this post by Andres Freund
On Fri, Jul 03, 2015 at 07:21:21PM +0200, Andres Freund wrote:

> On 2015-07-03 19:14:26 +0200, Martijn van Oosterhout wrote:
> > Am I missing something. ISTM that if the truncate record was simply not
> > logged at all everything would work fine. The whole point is that the
> > table was created in this transaction and so if it exists the table on
> > disk must be the correct representation.
>
> That'd not work either. Consider:
>
> BEGIN;
> CREATE TABLE ...
> INSERT;
> TRUNCATE;
> INSERT;
> COMMIT;
>
> If you replay that without a truncation wal record the second INSERT
> will try to add stuff to already occupied space. And they can have
> different lengths and stuff, so you cannot just ignore that fact.
I was about to disagree with you by suggesting that if the table was
created in this transaction then WAL logging is skipped. But testing
shows that inserts are indeed logged, as you point out.

With inserts the WAL records look as follows (relfilenodes changed):

martijn@martijn-jessie:~/git/ctm/docker$ sudo /usr/lib/postgresql/9.4/bin/pg_xlogdump -p /tmp/pgtest/postgres/pg_xlog/ 000000010000000000000001 |grep -wE '16386|16384|16390'
rmgr: Storage     len (rec/tot):     16/    48, tx:          0, lsn: 0/016A79C8, prev 0/016A79A0, bkp: 0000, desc: file create: base/12139/16384
rmgr: Sequence    len (rec/tot):    158/   190, tx:        683, lsn: 0/016B4258, prev 0/016B2508, bkp: 0000, desc: log: rel 1663/12139/16384
rmgr: Storage     len (rec/tot):     16/    48, tx:        683, lsn: 0/016B4318, prev 0/016B4258, bkp: 0000, desc: file create: base/12139/16386
rmgr: Storage     len (rec/tot):     16/    48, tx:        683, lsn: 0/016B9468, prev 0/016B9418, bkp: 0000, desc: file create: base/12139/16390
rmgr: Sequence    len (rec/tot):    158/   190, tx:        683, lsn: 0/016BC938, prev 0/016BC880, bkp: 0000, desc: log: rel 1663/12139/16384
rmgr: Sequence    len (rec/tot):    158/   190, tx:        683, lsn: 0/016BCAF0, prev 0/016BCAA0, bkp: 0000, desc: log: rel 1663/12139/16384
rmgr: Heap        len (rec/tot):     35/    67, tx:        683, lsn: 0/016BCBB0, prev 0/016BCAF0, bkp: 0000, desc: insert(init): rel 1663/12139/16386; tid 0/1
rmgr: Btree       len (rec/tot):     20/    52, tx:        683, lsn: 0/016BCBF8, prev 0/016BCBB0, bkp: 0000, desc: newroot: rel 1663/12139/16390; root 1 lev 0
rmgr: Btree       len (rec/tot):     34/    66, tx:        683, lsn: 0/016BCC30, prev 0/016BCBF8, bkp: 0000, desc: insert: rel 1663/12139/16390; tid 1/1
rmgr: Storage     len (rec/tot):     16/    48, tx:        683, lsn: 0/016BCC78, prev 0/016BCC30, bkp: 0000, desc: file truncate: base/12139/16386 to 0 blocks
rmgr: Storage     len (rec/tot):     16/    48, tx:        683, lsn: 0/016BCCA8, prev 0/016BCC78, bkp: 0000, desc: file truncate: base/12139/16390 to 0 blocks
rmgr: Heap        len (rec/tot):     35/    67, tx:        683, lsn: 0/016BCCD8, prev 0/016BCCA8, bkp: 0000, desc: insert(init): rel 1663/12139/16386; tid 0/1
rmgr: Btree       len (rec/tot):     20/    52, tx:        683, lsn: 0/016BCD20, prev 0/016BCCD8, bkp: 0000, desc: newroot: rel 1663/12139/16390; root 1 lev 0
rmgr: Btree       len (rec/tot):     34/    66, tx:        683, lsn: 0/016BCD58, prev 0/016BCD20, bkp: 0000, desc: insert: rel 1663/12139/16390; tid 1/1
 
   relname   | relfilenode
-------------+-------------
 test        |       16386
 test_id_seq |       16384
 test_pkey   |       16390
(3 rows)

And amazingly, the database cluster successfuly recovers and there's no
error now.  So the problem is *only* because there is no data in the
table at commit time.  Which indicates that it's the 'newroot" record
that saves the day normally.  And it's apparently generated by the
first insert.

> Agreed. I think the problem is something else though. Namely that we
> reuse the relfilenode for heap_truncate_one_rel(). That's just entirely
> broken afaics. We need to allocate a new relfilenode and write stuff
> into that. Then we can forgo WAL logging the truncation record.

Would that properly initialise the index though?

Anyway, this is way outside my expertise, so I'll bow out now. Let me
know if I can be of more assistance.

Have a nice day,
--
Martijn van Oosterhout   <[hidden email]>   http://svana.org/kleptog/
> He who writes carelessly confesses thereby at the very outset that he does
> not attach much importance to his own thoughts.
   -- Arthur Schopenhauer

signature.asc (845 bytes) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: WAL logging problem in 9.4.3?

Tom Lane-2
Martijn van Oosterhout <[hidden email]> writes:
> With inserts the WAL records look as follows (relfilenodes changed):
> ...
> And amazingly, the database cluster successfuly recovers and there's no
> error now.  So the problem is *only* because there is no data in the
> table at commit time.  Which indicates that it's the 'newroot" record
> that saves the day normally.  And it's apparently generated by the
> first insert.

Yeah, because the correct "empty" state of a btree index is to have a
metapage but no root page, so the first insert forces creation of a root
page.  And, by chance, btree_xlog_newroot restores the metapage from
scratch, so this works even if the metapage had been missing or corrupt.

However, things would still break if the first access to the index was
a read attempt rather than an insert.

                        regards, tom lane


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|

Re: WAL logging problem in 9.4.3?

Andres Freund
In reply to this post by Andres Freund
On 2015-07-03 19:26:05 +0200, Andres Freund wrote:

> On 2015-07-03 19:02:29 +0200, Andres Freund wrote:
> > Maybe I'm just daft right now (35C outside, 32 inside, so ...), but I'm
> > right now missing how the whole "skip wal logging if relation has just
> > been truncated" optimization can ever actually be crashsafe unless we
> > use a new relfilenode (which we don't!).
>
> We actually used to use a different relfilenode, but optimized that
> away: cab9a0656c36739f59277b34fea8ab9438395869
>
> commit cab9a0656c36739f59277b34fea8ab9438395869
> Author: Tom Lane <[hidden email]>
> Date:   Sun Aug 23 19:23:41 2009 +0000
>
>     Make TRUNCATE do truncate-in-place when processing a relation that was created
>     or previously truncated in the current (sub)transaction.  This is safe since
>     if the (sub)transaction later rolls back, we'd just discard the rel's current
>     physical file anyway.  This avoids unreasonable growth in the number of
>     transient files when a relation is repeatedly truncated.  Per a performance
>     gripe a couple weeks ago from Todd Cook.
>
> to me the reasoning here looks flawed.

It looks to me we need to re-neg on this a bit. I think we can still be
more efficient than the general codepath: We can drop the old
relfilenode immediately. But pg_class.relfilenode has to differ from the
old after the truncation.


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|

Re: WAL logging problem in 9.4.3?

Tom Lane-2
Andres Freund <[hidden email]> writes:

> On 2015-07-03 19:26:05 +0200, Andres Freund wrote:
>> commit cab9a0656c36739f59277b34fea8ab9438395869
>> Author: Tom Lane <[hidden email]>
>> Date:   Sun Aug 23 19:23:41 2009 +0000
>>
>> Make TRUNCATE do truncate-in-place when processing a relation that was created
>> or previously truncated in the current (sub)transaction.  This is safe since
>> if the (sub)transaction later rolls back, we'd just discard the rel's current
>> physical file anyway.  This avoids unreasonable growth in the number of
>> transient files when a relation is repeatedly truncated.  Per a performance
>> gripe a couple weeks ago from Todd Cook.
>>
>> to me the reasoning here looks flawed.

> It looks to me we need to re-neg on this a bit. I think we can still be
> more efficient than the general codepath: We can drop the old
> relfilenode immediately. But pg_class.relfilenode has to differ from the
> old after the truncation.

Why exactly?  The first truncation in the (sub)xact would have assigned a
new relfilenode, why do we need another one?  The file in question will
go away on crash/rollback in any case, and no other transaction can see
it yet.

I'm prepared to believe that some bit of logic is doing the wrong thing in
this state, but I do not agree that truncate-in-place is unworkable.

                        regards, tom lane


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|

Re: WAL logging problem in 9.4.3?

Andres Freund
On 2015-07-03 18:38:37 -0400, Tom Lane wrote:
> > Why exactly?  The first truncation in the (sub)xact would have assigned a
> new relfilenode, why do we need another one?  The file in question will
> go away on crash/rollback in any case, and no other transaction can see
> it yet.

Consider:

BEGIN;
CREATE TABLE;
INSERT largeval;
TRUNCATE;
INSERT 1;
COPY;
INSERT 2;
COMMIT;

INSERT 1 is going to be WAL logged. For that to work correctly TRUNCATE
has to be WAL logged, as otherwise there'll be conflicting/overlapping
tuples on the target page.

But:

The truncation itself is not fully wal logged, neither is the COPY. Both
rely on heap_sync()/immedsync(). For that to be correct the current
relfilenode's truncation may *not* be wal-logged, because the contents
of the COPY or the truncation itself will only be on-disk, not in the
WAL.

Only being on-disk but not in the WAL is a problem if we crash and
replay the truncate record.

> I'm prepared to believe that some bit of logic is doing the wrong
> thing in this state, but I do not agree that truncate-in-place is
> unworkable.

Unless we're prepared to make everything that potentially WAL logs
something do the rel->rd_createSubid == mySubid && dance, I can't see
that working.


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|

Re: WAL logging problem in 9.4.3?

Fujii Masao-2
In reply to this post by Andres Freund
On Sat, Jul 4, 2015 at 2:26 AM, Andres Freund <[hidden email]> wrote:
> On 2015-07-03 19:02:29 +0200, Andres Freund wrote:
>> Maybe I'm just daft right now (35C outside, 32 inside, so ...), but I'm
>> right now missing how the whole "skip wal logging if relation has just
>> been truncated" optimization can ever actually be crashsafe unless we
>> use a new relfilenode (which we don't!).

Agreed... When I ran the following test scenario, I found that
the loaded data disappeared after the crash recovery.

1. start PostgreSQL server with wal_level = minimal
2. execute the following SQL statements
\copy (SELECT num FROM generate_series(1,10) num) to /tmp/num.csv with csv
BEGIN;
CREATE TABLE test (i int primary key);
TRUNCATE TABLE test;
\copy test from /tmp/num.csv with csv
COMMIT;
SELECT COUNT(*) FROM test;  -- returns 10

3. shutdown the server with immediate mode
4. restart the server
5. execute the following SQL statement after crash recovery ends
SELECT COUNT(*) FROM test;  -- returns 0..

In #2, 10 rows were copied and the transaction was committed.
The subsequent statement of "select count(*)" obviously returned 10.
However, after crash recovery, in #5, the same statement returned 0.
That is, the loaded (+ committed) 10 data was lost after the crash.

> We actually used to use a different relfilenode, but optimized that
> away: cab9a0656c36739f59277b34fea8ab9438395869
>
> commit cab9a0656c36739f59277b34fea8ab9438395869
> Author: Tom Lane <[hidden email]>
> Date:   Sun Aug 23 19:23:41 2009 +0000
>
>     Make TRUNCATE do truncate-in-place when processing a relation that was created
>     or previously truncated in the current (sub)transaction.  This is safe since
>     if the (sub)transaction later rolls back, we'd just discard the rel's current
>     physical file anyway.  This avoids unreasonable growth in the number of
>     transient files when a relation is repeatedly truncated.  Per a performance
>     gripe a couple weeks ago from Todd Cook.
>
> to me the reasoning here looks flawed.

Before this commit, when I ran the above test scenario, no data loss happened.

Regards,

--
Fujii Masao


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
1234 ... 6