REINDEX INDEX results in a crash for an index of pg_class since 9.6

REINDEX INDEX results in a crash for an index of pg_class since 9.6

Michael Paquier-2
Hi all,

Fujii-san has sent me a report offline about REINDEX.  And since 9.6,
trying to REINDEX directly an index of pg_class fails lamentably on an
assertion failure (mbsync failure if bypassing the assert):
#2  0x000055a9c5bfcc2c in ExceptionalCondition
 (conditionName=0x55a9c5ca9750
 "!(!ReindexIsProcessingIndex(((indexRelation)->rd_id)))",
     errorType=0x55a9c5ca969f "FailedAssertion",
 fileName=0x55a9c5ca9680 "indexam.c", lineNumber=204) at assert.c:54
#3  0x000055a9c5686dcd in index_insert (indexRelation=0x7f58402fe5d8,
 values=0x7fff450c3270, isnull=0x7fff450c3250,
 heap_t_ctid=0x55a9c7e2c05c,
     heapRelation=0x7f584031eb68, checkUnique=UNIQUE_CHECK_YES,
 indexInfo=0x55a9c7e30520) at indexam.c:204
#4  0x000055a9c5714a12 in CatalogIndexInsert
 (indstate=0x55a9c7e30408, heapTuple=0x55a9c7e2c058) at indexing.c:140
#5  0x000055a9c5714b1d in CatalogTupleUpdate (heapRel=0x7f584031eb68,
 otid=0x55a9c7e2c05c, tup=0x55a9c7e2c058) at indexing.c:215
#6  0x000055a9c5beda8a in RelationSetNewRelfilenode
 (relation=0x7f58402fe5d8, persistence=112 'p') at relcache.c:3531

Doing a REINDEX TABLE directly on pg_class proves to work correctly,
and CONCURRENTLY is not supported for catalog tables.

Bisecting my way through it, the first commit causing the breakage is
that:
commit: 01e386a325549b7755739f31308de4be8eea110d
author: Tom Lane <[hidden email]>
date: Wed, 23 Dec 2015 20:09:01 -0500
Avoid VACUUM FULL altogether in initdb.

Commit ed7b3b3811c5836a purported to remove initdb's use of VACUUM
FULL,
as had been agreed to in a pghackers discussion back in Dec 2014.
But it missed this one ...

The reason why this does not work is that CatalogIndexInsert() tries
to do an index_insert directly on the index being worked on.  And the reason
why this works at table level is that we have tweaks in
reindex_relation() to enforce the list of valid indexes in the
relation cache with RelationSetIndexList().  It seems to me that the
logic in reindex_index() is wrong from the start, and that all the
index list handling done in reindex_relation() should just be in
reindex_index() so that REINDEX INDEX gets the right call.

Thoughts?
--
Michael


Re: REINDEX INDEX results in a crash for an index of pg_class since 9.6

Michael Paquier-2
On Thu, Apr 18, 2019 at 10:14:30AM +0900, Michael Paquier wrote:
> Doing a REINDEX TABLE directly on pg_class proves to work correctly,
> and CONCURRENTLY is not supported for catalog tables.
>
> Bisecting my way through it, the first commit causing the breakage is
> that:
> commit: 01e386a325549b7755739f31308de4be8eea110d
> author: Tom Lane <[hidden email]>
> date: Wed, 23 Dec 2015 20:09:01 -0500
> Avoid VACUUM FULL altogether in initdb.

This brings us to a first, simple solution, which is to issue a
VACUUM FULL on pg_class at the end of make_template0() in initdb.c to
avoid any subsequent problems when trying to issue a REINDEX on anything
related to pg_class, though it won't fix any existing deployments:
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -2042,6 +2042,11 @@ make_template0(FILE *cmdfd)
  * Finally vacuum to clean up dead rows in pg_database
  */
  "VACUUM pg_database;\n\n",
+
+ /*
+ * And rebuild pg_class.
+ */
+ "VACUUM FULL pg_class;\n\n",
  NULL
  };
Now...

> The reason why this does not work is that CatalogIndexInsert() tries
> to do an index_insert directly on the index worked on.  And the reason
> why this works at table level is that we have tweaks in
> reindex_relation() to enforce the list of valid indexes in the
> relation cache with RelationSetIndexList().  It seems to me that the
> logic in reindex_index() is wrong from the start, and that all the
> index list handling done in reindex_relation() should just be in
> reindex_index() so as REINDEX INDEX gets the right call.

I got to wonder if this dance with the relation cache is actually
necessary, because we could directly tell CatalogIndexInsert() to not
insert a tuple into an index which is being rebuilt, and the index
rebuild would cause an entry to be added to pg_class anyway thanks to
RelationSetNewRelfilenode().  This can obviously only happen for
pg_class indexes.
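
For illustration, a rough sketch of that idea inside CatalogIndexInsert()'s
insertion loop could look like the following (illustrative only, not the
attached patch):

    for (i = 0; i < numIndexes; i++)
    {
        Relation    index = relationDescs[i];

        /*
         * Sketch: skip any index that REINDEX is currently rebuilding, so
         * that the pg_class update done by RelationSetNewRelfilenode() does
         * not try to insert into the half-built index.  The rebuild itself
         * then picks up the new tuple when it scans the heap.
         */
        if (ReindexIsProcessingIndex(RelationGetRelid(index)))
            continue;

        index_insert(index, values, isnull,
                     &(heapTuple->t_self),
                     heapRelation,
                     index->rd_index->indisunique ?
                     UNIQUE_CHECK_YES : UNIQUE_CHECK_NO,
                     indexInfoArray[i]);
    }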

Any thoughts about both approaches?
--
Michael

reindex-pgclass-v1.patch (6K)

Re: REINDEX INDEX results in a crash for an index of pg_class since 9.6

Tom Lane-2
In reply to this post by Michael Paquier-2
Michael Paquier <[hidden email]> writes:
> Fujii-san has sent me a report offline about REINDEX.  And since 9.6,
> trying to REINDEX directly an index of pg_class fails lamentably on an
> assertion failure (mbsync failure if bypassing the assert):

So ... I can't reproduce this on HEAD.  Nor the back branches.

regression=# \d pg_class
...
Indexes:
    "pg_class_oid_index" UNIQUE, btree (oid)
    "pg_class_relname_nsp_index" UNIQUE, btree (relname, relnamespace)
    "pg_class_tblspc_relfilenode_index" btree (reltablespace, relfilenode)

regression=# reindex index pg_class_relname_nsp_index;
REINDEX
regression=# reindex index pg_class_oid_index;
REINDEX
regression=# reindex index pg_class_tblspc_relfilenode_index;
REINDEX
regression=# reindex table pg_class;        
REINDEX
regression=# reindex index pg_class_tblspc_relfilenode_index;
REINDEX

Is there some precondition you're not mentioning?

                        regards, tom lane



Re: REINDEX INDEX results in a crash for an index of pg_class since 9.6

Alvaro Herrera-9
In reply to this post by Michael Paquier-2
On 2019-Apr-18, Michael Paquier wrote:

> Fujii-san has sent me a report offline about REINDEX.  And since 9.6,
> trying to REINDEX directly an index of pg_class fails lamentably on an
> assertion failure (mbsync failure if bypassing the assert):

Hmm, yeah, I ran into this crash too, more than a year ago, but I don't
recall what came out of the investigation, and my search-fu is failing
me.  I'll have a look at my previous laptop's drive ...

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: REINDEX INDEX results in a crash for an index of pg_class since 9.6

Michael Paquier-2
In reply to this post by Tom Lane-2
On Tue, Apr 23, 2019 at 04:47:19PM -0400, Tom Lane wrote:

> regression=# reindex index pg_class_relname_nsp_index;
> REINDEX
> regression=# reindex index pg_class_oid_index;
> REINDEX
> regression=# reindex index pg_class_tblspc_relfilenode_index;
> REINDEX
> regression=# reindex table pg_class;        
> REINDEX
> regression=# reindex index pg_class_tblspc_relfilenode_index;
> REINDEX
>
> Is there some precondition you're not mentioning?
Hm.  In my own init scripts, I create a new database just after
starting the instance.  That seems to help in reproducing the
failure, because each time I create a new database, connect to it and
reindex, then I can see the crash.  If I do a reindex of pg_class
first, I don't see a crash, as some rebuilds already happened, but if I
directly do a reindex of one of the indexes first, then the failure is
plain.  If I also add some regression tests, say in create_index.sql,
to stress a reindex of pg_class and its indexes, the crash also shows
up.  If I apply my previous patch to make CatalogIndexInsert() not do
an insert on a catalog index being rebuilt, then things turn out to be
fine.
--
Michael


Re: REINDEX INDEX results in a crash for an index of pg_class since 9.6

Tom Lane-2
Michael Paquier <[hidden email]> writes:
> On Tue, Apr 23, 2019 at 04:47:19PM -0400, Tom Lane wrote:
>> Is there some precondition you're not mentioning?

> Hm.  In my own init scripts, I create a new database just after
> starting the instance.

Ah, there we go:

regression=# create database d1;
CREATE DATABASE
regression=# \c d1
You are now connected to database "d1" as user "postgres".
d1=# reindex index pg_class_relname_nsp_index;
psql: server closed the connection unexpectedly

log shows

TRAP: FailedAssertion("!(!ReindexIsProcessingIndex(((indexRelation)->rd_id)))", File: "indexam.c", Line: 204)

#2  0x00000000008c74ed in ExceptionalCondition (
    conditionName=<value optimized out>, errorType=<value optimized out>,
    fileName=<value optimized out>, lineNumber=<value optimized out>)
    at assert.c:54
#3  0x00000000004e4f8c in index_insert (indexRelation=0x7f80f849a5d8,
    values=0x7ffc4f65b030, isnull=0x7ffc4f65b130, heap_t_ctid=0x2842c0c,
    heapRelation=0x7f80f84bab68, checkUnique=UNIQUE_CHECK_YES,
    indexInfo=0x2843230) at indexam.c:204
#4  0x000000000054c290 in CatalogIndexInsert (indstate=<value optimized out>,
    heapTuple=0x2842c08) at indexing.c:140
#5  0x000000000054c472 in CatalogTupleUpdate (heapRel=0x7f80f84bab68,
    otid=0x2842c0c, tup=0x2842c08) at indexing.c:215
#6  0x00000000008bca77 in RelationSetNewRelfilenode (relation=0x7f80f849a5d8,
    persistence=112 'p') at relcache.c:3531
#7  0x0000000000548b3a in reindex_index (indexId=2663,
    skip_constraint_checks=false, persistence=112 'p', options=0)
    at index.c:3339
#8  0x00000000005ed099 in ReindexIndex (indexRelation=<value optimized out>,
    options=0, concurrent=false) at indexcmds.c:2304
#9  0x00000000007b5925 in standard_ProcessUtility (pstmt=0x281fd70,

> If I apply my previous patch to make CatalogIndexInsert() not do
> an insert on a catalog index being rebuilt, then things turn out to be
> fine.

That seems like pretty much of a hack :-(, in particular I'm not
convinced that we'd not end up with a missing index entry afterwards.
Maybe it's the only way, but I think first we need to trace down
exactly why this broke.  I remember we had some special-case code
for reindexing pg_class, maybe something broke that?

It also seems quite odd that it doesn't fail every time; surely it's
not conditional whether we'll try to insert a new pg_class tuple or not?
We need to understand that, too.  Maybe the old code never really
worked in all cases?  It seems clear that the commit you bisected to
just allowed a pre-existing misbehavior to become exposed (more easily).

No time to look closer right now.

                        regards, tom lane



Re: REINDEX INDEX results in a crash for an index of pg_class since 9.6

Tom Lane-2
I wrote:
> It also seems quite odd that it doesn't fail every time; surely it's
> not conditional whether we'll try to insert a new pg_class tuple or not?
> We need to understand that, too.

Oh!  One gets you ten it "works" as long as the pg_class update is a
HOT update, so that we don't actually end up touching the indexes.
This explains why the crash is less likely to happen in a database
where one's done some work (and, probably, created some dead space in
pg_class).  On the other hand, it doesn't quite fit the observation
that a VACUUM FULL masked the problem ... wouldn't that have ended up
with densely packed pg_class?  Maybe not, if it rebuilt everything
else after pg_class...

                        regards, tom lane



Re: REINDEX INDEX results in a crash for an index of pg_class since 9.6

Michael Paquier-2
In reply to this post by Tom Lane-2
On Tue, Apr 23, 2019 at 07:54:52PM -0400, Tom Lane wrote:
> That seems like pretty much of a hack :-(, in particular I'm not
> convinced that we'd not end up with a missing index entry afterwards.
> Maybe it's the only way, but I think first we need to trace down
> exactly why this broke.  I remember we had some special-case code
> for reindexing pg_class, maybe something broke that?

Yes, reindex_relation() has some infrastructure to enforce the list of
indexes in the cache for pg_class, which was introduced by a56a016 as
far as that goes.

> It also seems quite odd that it doesn't fail every time; surely it's
> not conditional whether we'll try to insert a new pg_class tuple or not?
> We need to understand that, too.  Maybe the old code never really
> worked in all cases?  It seems clear that the commit you bisected to
> just allowed a pre-existing misbehavior to become exposed (more easily).
>
> No time to look closer right now.

Yeah, there is a fishy smell underneath which comes from 9.6.  When
testing with 9.5 or older, creating a database does not cause any
crash on a subsequent reindex.  Not sure I'll have the time to look at
that more today, perhaps tomorrow depending on the odds.
--
Michael


Re: REINDEX INDEX results in a crash for an index of pg_class since 9.6

Shaoqi Bai
In reply to this post by Tom Lane-2
On Wed, Apr 24, 2019 at 7:55 AM Tom Lane <[hidden email]> wrote:
> Michael Paquier <[hidden email]> writes:
> > On Tue, Apr 23, 2019 at 04:47:19PM -0400, Tom Lane wrote:
> >> Is there some precondition you're not mentioning?
>
> > Hm.  In my own init scripts, I create a new database just after
> > starting the instance.
>
> Ah, there we go:
>
> regression=# create database d1;
> CREATE DATABASE
> regression=# \c d1
> You are now connected to database "d1" as user "postgres".
> d1=# reindex index pg_class_relname_nsp_index;
> psql: server closed the connection unexpectedly
>
> log shows
>
> TRAP: FailedAssertion("!(!ReindexIsProcessingIndex(((indexRelation)->rd_id)))", File: "indexam.c", Line: 204)

I could reproduce TRAP: FailedAssertion("!(!ReindexIsProcessingIndex(((indexRelation)->rd_id)))", File: "indexam.c", Line: 204) in the postgres log file.
#2  0x00000000008c74ed in ExceptionalCondition (
    conditionName=<value optimized out>, errorType=<value optimized out>,
    fileName=<value optimized out>, lineNumber=<value optimized out>)
    at assert.c:54
#3  0x00000000004e4f8c in index_insert (indexRelation=0x7f80f849a5d8,
    values=0x7ffc4f65b030, isnull=0x7ffc4f65b130, heap_t_ctid=0x2842c0c,
    heapRelation=0x7f80f84bab68, checkUnique=UNIQUE_CHECK_YES,
    indexInfo=0x2843230) at indexam.c:204
#4  0x000000000054c290 in CatalogIndexInsert (indstate=<value optimized out>,
    heapTuple=0x2842c08) at indexing.c:140
#5  0x000000000054c472 in CatalogTupleUpdate (heapRel=0x7f80f84bab68,
    otid=0x2842c0c, tup=0x2842c08) at indexing.c:215
#6  0x00000000008bca77 in RelationSetNewRelfilenode (relation=0x7f80f849a5d8,
    persistence=112 'p') at relcache.c:3531
#7  0x0000000000548b3a in reindex_index (indexId=2663,
    skip_constraint_checks=false, persistence=112 'p', options=0)
    at index.c:3339
#8  0x00000000005ed099 in ReindexIndex (indexRelation=<value optimized out>,
    options=0, concurrent=false) at indexcmds.c:2304
#9  0x00000000007b5925 in standard_ProcessUtility (pstmt=0x281fd70,
But I could only see these stack frames in lldb -c corefile after typing bt.  Is there a way to also print these stack frames in the postgres log file, and how?

Re: REINDEX INDEX results in a crash for an index of pg_class since 9.6

Michael Paquier-2
In reply to this post by Tom Lane-2
On Tue, Apr 23, 2019 at 08:03:37PM -0400, Tom Lane wrote:
> Oh!  One gets you ten it "works" as long as the pg_class update is a
> HOT update, so that we don't actually end up touching the indexes.
> This explains why the crash is less likely to happen in a database
> where one's done some work (and, probably, created some dead space in
> pg_class).  On the other hand, it doesn't quite fit the observation
> that a VACUUM FULL masked the problem ... wouldn't that have ended up
> with densely packed pg_class?  Maybe not, if it rebuilt everything
> else after pg_class...

I have been able to spend a bit more time testing and looking at the
root of the problem, and I have found two things:
1) The problem is reproducible with REL9_5_STABLE.
2) Bisecting between the merge base points of REL9_4_STABLE/master and
REL9_5_STABLE/master, I am being pointed to the introduction of
replication origins:
commit: 5aa2350426c4fdb3d04568b65aadac397012bbcb
author: Andres Freund <[hidden email]>
date: Wed, 29 Apr 2015 19:30:53 +0200
Introduce replication progress tracking infrastructure.

In order to see the problem, one also needs to patch initdb.c so that
the final VACUUM FULL on pg_database is replaced by a plain VACUUM, as
on 9.6~.  The root of the problem is actually surprising, but when
manually testing on commit 5aa2350 and on 5aa2350~1 the difference
shows up, as the issue is easily reproducible here.
--
Michael


Re: REINDEX INDEX results in a crash for an index of pg_class since 9.6

Alvaro Herrera-9
On 2019-Apr-25, Michael Paquier wrote:

> 2) Bisecting between the merge base points of REL9_4_STABLE/master and
> REL9_5_STABLE/master, I am being pointed to the introduction of
> replication origins:
> commit: 5aa2350426c4fdb3d04568b65aadac397012bbcb
> author: Andres Freund <[hidden email]>
> date: Wed, 29 Apr 2015 19:30:53 +0200
> Introduce replication progress tracking infrastructure.
>
> In order to see the problem, one also needs to patch initdb.c so that
> the final VACUUM FULL on pg_database is replaced by a plain VACUUM, as
> on 9.6~.  The root of the problem is actually surprising, but when
> manually testing on commit 5aa2350 and on 5aa2350~1 the difference
> shows up, as the issue is easily reproducible here.

Hmm ... I suspect the problem is even older, and that this commit made
it possible to see it as a side effect of changing the catalog contents
(since it creates one more view and does a REVOKE, which becomes an
update on pg_class).

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: REINDEX INDEX results in a crash for an index of pg_class since 9.6

Tom Lane-2
In reply to this post by Michael Paquier-2
Michael Paquier <[hidden email]> writes:
> On Tue, Apr 23, 2019 at 08:03:37PM -0400, Tom Lane wrote:
>> Oh!  One gets you ten it "works" as long as the pg_class update is a
>> HOT update, so that we don't actually end up touching the indexes.

> I have been able to spend a bit more time testing and looking at the
> root of the problem, and I have found two things:
> 1) The problem is reproducible with REL9_5_STABLE.

Actually, as far as I can tell, this has been broken since day 1.
I can reproduce the assertion failure back to 9.1, and I think the
only reason it doesn't happen in older branches is that they lack
the ReindexIsProcessingIndex() check in RELATION_CHECKS :-(.

What you have to do to get it to crash is to ensure that
RelationSetNewRelfilenode's update of pg_class will be a non-HOT
update.  You can try to set that up with "vacuum full pg_class"
but it turns out that that tends to leave the pg_class entries
for pg_class's indexes in the last page of the relation, which
is usually not totally full, so that a HOT update works and the
bug doesn't manifest.

A recipe like the following breaks every branch, by ensuring that
the page containing pg_class_relname_nsp_index's entry is full:

regression=# vacuum full pg_class;
VACUUM
regression=# do $$ begin                              
for i in 100 .. 150 loop
execute 'create table dummy'||i||'(f1 int)';
end loop;
end $$;
DO
regression=# reindex index pg_class_relname_nsp_index;
psql: server closed the connection unexpectedly


As for an actual fix, I tried just moving reindex_index's
SetReindexProcessing call from where it is down to after
RelationSetNewRelfilenode, but that isn't sufficient:

regression=# reindex index pg_class_relname_nsp_index;
psql: ERROR:  could not read block 3 in file "base/16384/41119": read only 0 of 8192 bytes

#0  errfinish (dummy=0) at elog.c:411
#1  0x00000000007a9453 in mdread (reln=<value optimized out>,
    forknum=<value optimized out>, blocknum=<value optimized out>,
    buffer=0x7f608e6a7d00 "") at md.c:633
#2  0x000000000077a9af in ReadBuffer_common (smgr=<value optimized out>,
    relpersistence=112 'p', forkNum=MAIN_FORKNUM, blockNum=3, mode=RBM_NORMAL,
    strategy=0x0, hit=0x7fff6a7452ef) at bufmgr.c:896
#3  0x000000000077b67e in ReadBufferExtended (reln=0x7f608db5d670,
    forkNum=MAIN_FORKNUM, blockNum=3, mode=<value optimized out>,
    strategy=<value optimized out>) at bufmgr.c:664
#4  0x00000000004ea95a in _bt_getbuf (rel=0x7f608db5d670,
    blkno=<value optimized out>, access=1) at nbtpage.c:805
#5  0x00000000004eb67a in _bt_getroot (rel=0x7f608db5d670, access=2)
    at nbtpage.c:323
#6  0x00000000004f2237 in _bt_search (rel=0x7f608db5d670, key=0x1d5a0c0,
    bufP=0x7fff6a7456a8, access=2, snapshot=0x0) at nbtsearch.c:99
#7  0x00000000004e8caf in _bt_doinsert (rel=0x7f608db5d670, itup=0x1c85e58,
    checkUnique=UNIQUE_CHECK_YES, heapRel=0x1ccb8d0) at nbtinsert.c:219
#8  0x00000000004efc17 in btinsert (rel=0x7f608db5d670,
    values=<value optimized out>, isnull=<value optimized out>,
    ht_ctid=0x1d12dc4, heapRel=0x1ccb8d0, checkUnique=UNIQUE_CHECK_YES,
    indexInfo=0x1c857f8) at nbtree.c:205
#9  0x000000000054c320 in CatalogIndexInsert (indstate=<value optimized out>,
    heapTuple=0x1d12dc0) at indexing.c:140
#10 0x000000000054c502 in CatalogTupleUpdate (heapRel=0x1ccb8d0,
    otid=0x1d12dc4, tup=0x1d12dc0) at indexing.c:215
#11 0x00000000008bcba7 in RelationSetNewRelfilenode (relation=0x7f608db5d670,
    persistence=112 'p') at relcache.c:3531
#12 0x0000000000548b16 in reindex_index (indexId=2663,
    skip_constraint_checks=false, persistence=112 'p', options=0)
    at index.c:3336
#13 0x00000000005ed129 in ReindexIndex (indexRelation=<value optimized out>,
    options=0, concurrent=false) at indexcmds.c:2304
#14 0x00000000007b5a45 in standard_ProcessUtility (pstmt=0x1c66d70,
    queryString=0x1c65f68 "reindex index pg_class_relname_nsp_index;",
    context=PROCESS_UTILITY_TOPLEVEL, params=0x0, queryEnv=0x0,
    dest=0x1c66e68, completionTag=0x7fff6a745e40 "") at utility.c:787

The problem here is that RelationSetNewRelfilenode is aggressively
changing the index's relcache entry before it's written out the
updated tuple, so that the tuple update tries to make an index
entry in the new storage which isn't filled yet.  I think we can
fix it by *not* doing that, but leaving it to the relcache inval
during the CommandCounterIncrement call to update the relcache
entry.  However, it looks like that will take some API refactoring,
because the storage-creation functions expect to get the new
relfilenode out of the relcache entry, and they'll have to be
changed to not do it that way.
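
To illustrate the ordering difference in simplified form (illustration
only, not patch code):

    /*
     * Current behavior for a pg_class index, roughly:
     *   1. the relcache entry is pointed at the new, empty relfilenode
     *   2. CatalogTupleUpdate(pg_class, ...) runs, and its index_insert
     *      lands in the new storage, which isn't filled yet
     *
     * Idea sketched above:
     *   1. CatalogTupleUpdate(pg_class, ...) runs against the old storage
     *      still recorded in the relcache entry
     *   2. CommandCounterIncrement() fires the relcache inval, which
     *      brings the entry in sync with the new relfilenode
     */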

I'll work on a patch ...

                        regards, tom lane



Re: REINDEX INDEX results in a crash for an index of pg_class since 9.6

Tom Lane-2
I wrote:

> The problem here is that RelationSetNewRelfilenode is aggressively
> changing the index's relcache entry before it's written out the
> updated tuple, so that the tuple update tries to make an index
> entry in the new storage which isn't filled yet.  I think we can
> fix it by *not* doing that, but leaving it to the relcache inval
> during the CommandCounterIncrement call to update the relcache
> entry.  However, it looks like that will take some API refactoring,
> because the storage-creation functions expect to get the new
> relfilenode out of the relcache entry, and they'll have to be
> changed to not do it that way.

So looking at that, it seems like the table_relation_set_new_filenode
API is pretty darn ill-designed.  It assumes that it's passed an
already-entirely-valid relcache entry, but it also supposes that
it can pass back information that needs to go into the relation's
pg_class entry.  One or the other side of that has to give, unless
you want to doom everything to updating pg_class twice.

I'm not really sure what's the point of giving the tableam control
of relfrozenxid+relminmxid at all, and I notice that index_create
for one is just Asserting that constant values are returned.

I think we need to do one or possibly both of these things:

* split table_relation_set_new_filenode into two functions,
one that doesn't take a relcache entry at all and returns
appropriate relfrozenxid+relminmxid for a new rel, and then
one that just creates storage without dealing with the xid
values;

* change table_relation_set_new_filenode so that it is told
the relfilenode etc to use without assuming that it has a
valid relcache entry to work with.
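
Purely for illustration, the two options could have signatures along
these lines (names and argument lists are hypothetical, not a concrete
proposal):

    /* Option 1: split into an xid-horizon callback plus a storage callback */
    void        (*relation_new_horizons) (char relpersistence,
                                          TransactionId *freezeXid,
                                          MultiXactId *minmulti);
    void        (*relation_create_storage) (RelFileNode newrnode,
                                            char relpersistence);

    /*
     * Option 2: one callback, told the relfilenode etc. explicitly, with no
     * assumption of a valid relcache entry
     */
    void        (*relation_set_new_filenode) (RelFileNode newrnode,
                                              char relpersistence,
                                              TransactionId *freezeXid,
                                              MultiXactId *minmulti);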

Thoughts?

                        regards, tom lane



Re: REINDEX INDEX results in a crash for an index of pg_class since 9.6

Andres Freund
Hi,

On 2019-04-25 12:29:16 -0400, Tom Lane wrote:

> I wrote:
> > The problem here is that RelationSetNewRelfilenode is aggressively
> > changing the index's relcache entry before it's written out the
> > updated tuple, so that the tuple update tries to make an index
> > entry in the new storage which isn't filled yet.  I think we can
> > fix it by *not* doing that, but leaving it to the relcache inval
> > during the CommandCounterIncrement call to update the relcache
> > entry.  However, it looks like that will take some API refactoring,
> > because the storage-creation functions expect to get the new
> > relfilenode out of the relcache entry, and they'll have to be
> > changed to not do it that way.
>
> So looking at that, it seems like the table_relation_set_new_filenode
> API is pretty darn ill-designed.  It assumes that it's passed an
> already-entirely-valid relcache entry, but it also supposes that
> it can pass back information that needs to go into the relation's
> pg_class entry.  One or the other side of that has to give, unless
> you want to doom everything to updating pg_class twice.

I'm not super happy about it either - but I think that's somewhat of an
outgrowth of how this worked before.  I mean, there are two differences:

1) Previously the RelationCreateStorage() was called unconditionally,
now it's

                case RELKIND_INDEX:
                case RELKIND_SEQUENCE:
                        RelationCreateStorage(relation->rd_node, persistence);
                        RelationOpenSmgr(relation);
                        break;

                case RELKIND_RELATION:
                case RELKIND_TOASTVALUE:
                case RELKIND_MATVIEW:
                        table_relation_set_new_filenode(relation, persistence,
                                                                                        &freezeXid, &minmulti);
                        break;
        }

That seems pretty obviously necessary.


2) Previously AddNewRelationTuple() determined the initial horizon for
table-like things:
        /* Initialize relfrozenxid and relminmxid */
        if (relkind == RELKIND_RELATION ||
                relkind == RELKIND_MATVIEW ||
                relkind == RELKIND_TOASTVALUE)
        {
                /*
                 * Initialize to the minimum XID that could put tuples in the table.
                 * We know that no xacts older than RecentXmin are still running, so
                 * that will do.
                 */
                new_rel_reltup->relfrozenxid = RecentXmin;

                /*
                 * Similarly, initialize the minimum Multixact to the first value that
                 * could possibly be stored in tuples in the table.  Running
                 * transactions could reuse values from their local cache, so we are
                 * careful to consider all currently running multis.
                 *
                 * XXX this could be refined further, but is it worth the hassle?
                 */
                new_rel_reltup->relminmxid = GetOldestMultiXactId();
        }

and inserted that.  Now it's determined earlier, below heap_create(),
and passed as an argument to AddNewRelationTuple().

and similarly the caller of RelationSetNewRelfilenode() determined the
new horizons, but they were also just written into the relcache entry
and then updated:

11:
        classform->relfrozenxid = freezeXid;
        classform->relminmxid = minmulti;
        classform->relpersistence = persistence;

        CatalogTupleUpdate(pg_class, &tuple->t_self, tuple);
master:
        classform->relfrozenxid = freezeXid;
        classform->relminmxid = minmulti;
        classform->relpersistence = persistence;

        CatalogTupleUpdate(pg_class, &tuple->t_self, tuple);


I'm not quite sure why the current situation is any worse?


Perhaps that's because I don't quite understand what you mean with "It
assumes that it's passed an already-entirely-valid relcache entry". What
do you mean by that / where does it assume that?  I guess we could warn
a bit more about the underlying tuple not necessarily existing yet in
the callback's docs, but other than that?  Previously heap.c also was
dealing with a relcache entry without a backing pg_class entry but with
existing storage, no?


> I'm not really sure what's the point of giving the tableam control
> of relfrozenxid+relminmxid at all

Well, because not every AM is going to need those. It'd make very little
sense to e.g. set them for an undo based design like zheap's - there is
no need to freeze, ever. The need for each page to be rewritten multiple
times (original write, hint bit sets, freezing for heap) imo is one of
the major reasons people are working on alternative AMs.  That seems to
fundamentally require AMs having control over the relfrozenxid.


> and I notice that index_create for one is just Asserting that constant
> values are returned.

Well, that's not going to call into tableam at all?  Those asserts
previously were in RelationSetNewRelfilenode() itself:

        /* Indexes, sequences must have Invalid frozenxid; other rels must not */
        Assert((relation->rd_rel->relkind == RELKIND_INDEX ||
                        relation->rd_rel->relkind == RELKIND_SEQUENCE) ?
                   freezeXid == InvalidTransactionId :
                   TransactionIdIsNormal(freezeXid));

but given that e.g. not every tableam is going to have those values

Greetings,

Andres Freund



Re: REINDEX INDEX results in a crash for an index of pg_class since 9.6

Tom Lane-2
Andres Freund <[hidden email]> writes:
> On 2019-04-25 12:29:16 -0400, Tom Lane wrote:
>> So looking at that, it seems like the table_relation_set_new_filenode
>> API is pretty darn ill-designed.  It assumes that it's passed an
>> already-entirely-valid relcache entry, but it also supposes that
>> it can pass back information that needs to go into the relation's
>> pg_class entry.  One or the other side of that has to give, unless
>> you want to doom everything to updating pg_class twice.

> I'm not super happy about it either - but I think that's somewhat of an
> outgrowth of how this worked before.

I'm not saying that the previous code was nice; I'm just saying that
what is there in HEAD needs to be factored differently so that we
can solve this problem in a reasonable way.

> Perhaps that's because I don't quite understand what you mean with "It
> assumes that it's passed an already-entirely-valid relcache entry". What
> do you mean by that / where does it assume that?

Well, I can see heapam_relation_set_new_filenode touching all of these
fields right now:

rel->rd_node
rel->rd_rel->relpersistence (and why is it looking at that rather than
the passed-in persistence???)
rel->rd_rel->relkind
whatever RelationOpenSmgr touches
rel->rd_smgr

As far as I can see, there is no API restriction on what parts of the
relcache entry it may presume are valid.  It *certainly* thinks that
rd_rel is valid, which is rather at odds with the fact that this has
to be called before the pg_class entry exists at all (for the creation
case) or has been updated (for the set-new-relfilenode case).  Unless
you want to redefine things so that we create/update the pg_class
entry, put it into rel->rd_rel, call relation_set_new_filenode, and
then update the pg_class entry again with what that function gives back
for the xmin fields.

That's obviously stupid, of course.  But my point is that we need to
restrict what the function can touch or assume valid, if it's going
to be called before the pg_class update happens.  And I'd rather that
we did so by restricting its argument list so that it hasn't even got
access to stuff we don't want it assuming valid.

Also, in order to fix this problem, we cannot change the actual
relcache entry contents until after we've performed the tuple update.
So if we want the tableam code to be getting the relfilenode or
persistence info out of the relcache entry, rather than passing
those as standalone parameters, the call can't happen till after
the tuple update and CCI call.  That's why I was thinking about
splitting it into two functions.  Get the xid values, update the
pg_class tuple, CCI, then do relation_set_new_filenode with the
updated relcache entry would work.
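
For illustration, that ordering would look roughly like this inside
RelationSetNewRelfilenode() (a sketch only; the xid-fetching step stands
in for whatever split-out call ends up existing, and the relfilenode
assignment is elided):

    /* step 1: obtain freezeXid/minmulti from the AM (split-out call) */

    /*
     * step 2: update the pg_class tuple while the relcache entry still
     * describes the old storage
     */
    classform->relfrozenxid = freezeXid;
    classform->relminmxid = minmulti;
    classform->relpersistence = persistence;
    CatalogTupleUpdate(pg_class, &tuple->t_self, tuple);

    /* step 3: the relcache inval during CCI syncs the relcache entry */
    CommandCounterIncrement();

    /* step 4: create storage based on the now-updated relcache entry */
    table_relation_set_new_filenode(relation, persistence, ...);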

                        regards, tom lane



Re: REINDEX INDEX results in a crash for an index of pg_class since 9.6

Andres Freund
Hi,

On 2019-04-25 14:50:09 -0400, Tom Lane wrote:

> Andres Freund <[hidden email]> writes:
> > Perhaps that's because I don't quite understand what you mean with "It
> > assumes that it's passed an already-entirely-valid relcache entry". What
> > do you mean by that / where does it assume that?
>
> Well, I can see heapam_relation_set_new_filenode touching all of these
> fields right now:
>
> rel->rd_node
> rel->rd_rel->relpersistence (and why is it looking at that rather than
> the passed-in persistence???)

Ugh.

> rel->rd_rel->relkind
> whatever RelationOpenSmgr touches
> rel->rd_smgr


> As far as I can see, there is no API restriction on what parts of the
> relcache entry it may presume are valid.  It *certainly* thinks that
> rd_rel is valid, which is rather at odds with the fact that this has
> to be called before the pg_class entry exists at all (for the creation
> case) or has been updated (for the set-new-relfilenode case).

Well, that's just what we did before. And given the way the code is
structured, I am not sure I see a decent alternative that's not a
disproportionate amount of work.  I mean, heap.c's heap_create() and
heap_create_with_catalog() basically work off the Relation after the
RelationBuildLocalRelation() call, a good bit before the underlying
storage is valid.


> But my point is that we need to restrict what the function can touch
> or assume valid, if it's going to be called before the pg_class update
> happens.  And I'd rather that we did so by restricting its argument
> list so that it hasn't even got access to stuff we don't want it
> assuming valid.

OTOH, that'd mean we'd need to separately look up the amhandler, pass in
a lot more arguments etc. ISTM it'd be easier to just declare that only
the fields RelationBuildLocalRelation() sets are to be considered valid.
See the end of the email for a proposal.


> Also, in order to fix this problem, we cannot change the actual
> relcache entry contents until after we've performed the tuple update.
> So if we want the tableam code to be getting the relfilenode or
> persistence info out of the relcache entry, rather than passing
> those as standalone parameters, the call can't happen till after
> the tuple update and CCI call.  That's why I was thinking about
> splitting it into two functions.  Get the xid values, update the
> pg_class tuple, CCI, then do relation_set_new_filenode with the
> updated relcache entry would work.

I think that'd be hard for the initial relation creation. At the moment
we intentionally create the storage for the new relation before
inserting the catalog contents.

Currently the only thing that table_relation_set_new_filenode() accesses
that already is updated is the RelFileNode. I wonder if we shouldn't
change the API so that table_relation_set_new_filenode() will get a
relcache entry *without* any updates passed in, then internally does
GetNewRelFileNode() (if so desired by the AM), and returns the new rnode
via a new out parameter.  So RelationSetNewRelfilenode() would basically
work like this:

    switch (relation->rd_rel->relkind)
    {
        case RELKIND_INDEX:
        case RELKIND_SEQUENCE:
            newrelfilenode = GetNewRelFileNode(...);
            RelationCreateStorage(newrelfilenode, persistence);
            RelationOpenSmgr(relation);
            break;
        case RELKIND_RELATION:
        case RELKIND_TOASTVALUE:
        case RELKIND_MATVIEW:
            table_relation_set_new_filenode(relation, persistence,
                                            &newrnode, &freezeXid, &minmulti);
            break;
    }

    /* Now update the pg_class row. */
    if (relation->rd_rel->relkind != RELKIND_SEQUENCE)
    {
        classform->relpages = 0;    /* it's empty until further notice */
        classform->reltuples = 0;
        classform->relallvisible = 0;
    }
    classform->relfrozenxid = freezeXid;
    classform->relminmxid = minmulti;
    classform->relpersistence = persistence;

    /*
     * If we're dealing with a mapped index, pg_class.relfilenode doesn't
     * change; instead we'll have to send the update to the relation mapper.
     * But we can do so only after doing the catalog update, otherwise the
     * contents of the old data is going to be invalid.
     *
     * XXX: Can this actually validly be reached for a mapped table?
     */
    if (!RelationIsMapped(relation))
        classform->relfilenode = newrelfilenode;

    CatalogTupleUpdate(pg_class, &tuple->t_self, tuple);

    /* now that the catalog is updated, update relmapper if necessary */
    if (RelationIsMapped(relation))
        RelationMapUpdateMap(RelationGetRelid(relation),
                             newrelfilenode,
                             relation->rd_rel->relisshared,
                             true);

    /*
     * Make the pg_class row change visible, as well as the relation map
     * change if any.  This will cause the relcache entry to get updated, too.
     */
    CommandCounterIncrement();

    // XXX: Previously we called RelationInitPhysicalAddr() in this routine
    // but I don't think that should be needed due to the CCI?

and the table AM would do the necessary
    *newrelfilenode = GetNewRelFileNode(...);
    RelationCreateStorage(*newrelfilenode, persistence);
    RelationOpenSmgr(relation);
dance *iff* it wants to do so.

That seems like it'd go towards allowing a fair bit more flexibility for
table AMs (and possibly index AMs in the future).

We'd also have to make the equivalent set of changes to
ATExecSetTableSpace(). But that seems like it'd architecturally be
cleaner anyway. I think we'd move the FlushRelationBuffers(),
GetNewRelFileNode() logic into the callback.  It'd probably make sense
to have ATExecSetTableSpace() first call
table_relation_set_new_filenode() and then call
table_relation_copy_data() with the new relfilenode, but not yet updated
relcache entry.

We don't currently allow that, but as far as I can see the current
coding of ATExecSetTableSpace() also has bad problems with system
catalog updates. It copies the data and *then* does
CatalogTupleUpdate(), but *without* updating the relcache - which just
would cause the update to be lost.


I could come up with a patch for that if you want me to.


Greetings,

Andres Freund



Re: REINDEX INDEX results in a crash for an index of pg_class since 9.6

Tom Lane-2
Andres Freund <[hidden email]> writes:
> On 2019-04-25 14:50:09 -0400, Tom Lane wrote:
>> As far as I can see, there is no API restriction on what parts of the
>> relcache entry it may presume are valid.  It *certainly* thinks that
>> rd_rel is valid, which is rather at odds with the fact that this has
>> to be called before the pg_class entry exists all (for the creation
>> case) or has been updated (for the set-new-relfilenode case).

> Well, that's just what we did before. And given the way the code is
> structured, I am not sure I see a decent alternative that's not a
> disproportionate amount of work.  I mean, heap.c's heap_create() and
> heap_create_with_catalog() basically work off the Relation after the
> RelationBuildLocalRelation() call, a good bit before the underlying
> storage is valid.

You could imagine restructuring that ... but I agree it'd be a lot
of work.

> Currently the only thing that table_relation_set_new_filenode() accesses
> that already is updated is the RelFileNode. I wonder if we shouldn't
> change the API so that table_relation_set_new_filenode() will get a
> relcache entry *without* any updates passed in, then internally does
> GetNewRelFileNode() (if so desired by the AM), and returns the new rnode
> via a new out parameter.

That could work.  The important API spec is then that the relcache entry
reflects the *previous* state of the relation, and is not to be modified
by the tableam call.  After the call, we perform the pg_class update and
do CCI, and the relcache inval seen by the CCI causes the relcache entry
to be brought into sync with the new reality.  So relcache entries change
at CCI boundaries, not in between.

In the creation case, it works more or less the same, with the
understanding that the "previous state" is some possibly-partly-dummy
state set up by RelationBuildLocalRelation.  But we might need to add
a CCI call that wasn't there before; not sure.

> We don't currently allow that, but as far as I can see the current
> coding of ATExecSetTableSpace() also has bad problems with system
> catalog updates. It copies the data and *then* does
> CatalogTupleUpdate(), but *without* updating the relcache - which just
> would cause the update to be lost.

Well, I imagine it's expecting the subsequent CCI to update the relcache
entry, which I think is correct behavior in this worldview.  We're
basically trying to make the relcache state follow transaction/command
boundary semantics.

> I could come up with a patch for that if you want me to.

I'm happy to let you take a whack at it if you want.

                        regards, tom lane



Re: REINDEX INDEX results in a crash for an index of pg_class since 9.6

Andres Freund
Hi,

On 2019-04-25 16:02:03 -0400, Tom Lane wrote:

> > Currently the only thing that table_relation_set_new_filenode() accesses
> > that already is updated is the RelFileNode. I wonder if we shouldn't
> > change the API so that table_relation_set_new_filenode() will get a
> > relcache entry *without* any updates passed in, then internally does
> > GetNewRelFileNode() (if so desired by the AM), and returns the new rnode
> > via a new out parameter.
>
> That could work.  The important API spec is then that the relcache entry
> reflects the *previous* state of the relation, and is not to be modified
> by the tableam call.

Right.

I was wondering if we should just pass in the pg_class tuple as an "out"
argument, instead of pointers to relfilenode/relfrozenxid/relminmxid.


> > We don't currently allow that, but as far as I can see the current
> > coding of ATExecSetTableSpace() also has bad problems with system
> > catalog updates. It copies the data and *then* does
> CatalogTupleUpdate(), but *without* updating the relcache - which just
> > would cause the update to be lost.
>
> Well, I imagine it's expecting the subsequent CCI to update the relcache
> entry, which I think is correct behavior in this worldview.  We're
> basically trying to make the relcache state follow transaction/command
> boundary semantics.

My point was that given the current coding the code in
ATExecSetTableSpace() would make changes to the *old* relfilenode, after
having already copied the contents to the new relfilenode. Which means
that if ATExecSetTableSpace() is ever used on pg_class or one of its
indexes, it'd just lose those changes, afaict.


> > I could come up with a patch for that if you want me to.
>
> I'm happy to let you take a whack at it if you want.

I'll give it a whack (after writing one more email on a separate but
loosely related topic).

Greetings,

Andres Freund



Re: REINDEX INDEX results in a crash for an index of pg_class since 9.6

Tom Lane-2
Andres Freund <[hidden email]> writes:
> On 2019-04-25 16:02:03 -0400, Tom Lane wrote:
>> That could work.  The important API spec is then that the relcache entry
>> reflects the *previous* state of the relation, and is not to be modified
>> by the tableam call.

> Right.

> I was wondering if we should just pass in the pg_class tuple as an "out"
> argument, instead of pointers to relfilenode/relfrozenxid/relminmxid.

Yeah, possibly.  The whole business with xids is perhaps heapam specific,
so decoupling this function's signature from them would be good.

> My point was that given the current coding the code in
> ATExecSetTableSpace() would make changes to the *old* relfilenode, after
> having already copied the contents to the new relfilenode. Which means
> that if ATExecSetTableSpace() is ever used on pg_class or one of its
> indexes, it'd just lose those changes, afaict.

Hmm.

There's another reason why we'd like the relcache contents to only change
at CCI, which is that if we get a relcache invalidation somewhere before
we get to the CCI, relcache.c would proceed to reload it based on the
*current* catalog contents (ie, pre-update, thanks to the magic of MVCC),
so that the entry would revert back to its previous state until we did get
to CCI.  I wonder whether there's any demonstrable bug there.  Though
you'd think the CLOBBER_CACHE_ALWAYS animals would've found it if so.

                        regards, tom lane



Re: REINDEX INDEX results in a crash for an index of pg_class since 9.6

Andres Freund
Hi,

On 2019-04-25 17:12:33 -0400, Tom Lane wrote:
> Andres Freund <[hidden email]> writes:
> > My point was that given the current coding the code in
> > ATExecSetTableSpace() would make changes to the *old* relfilenode, after
> > having already copied the contents to the new relfilenode. Which means
> > that if ATExecSetTableSpace() is ever used on pg_class or one of it's
> > indexes, it'd just loose those changes, afaict.
>
> Hmm.

I think there's no live bug here because we a) require
allow_system_table_mods to modify system tables, and then b) have
another check

    /*
     * We cannot support moving mapped relations into different tablespaces.
     * (In particular this eliminates all shared catalogs.)
     */
    if (RelationIsMapped(rel))
        ereport(ERROR,
                (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
                 errmsg("cannot move system relation \"%s\"",
                        RelationGetRelationName(rel))));

that triggers even when allow_system_table_mods is off.


> There's another reason why we'd like the relcache contents to only change
> at CCI, which is that if we get a relcache invalidation somewhere before
> we get to the CCI, relcache.c would proceed to reload it based on the
> *current* catalog contents (ie, pre-update, thanks to the magic of MVCC),
> so that the entry would revert back to its previous state until we did get
> to CCI.  I wonder whether there's any demonstrable bug there.  Though
> you'd think the CLOBBER_CACHE_ALWAYS animals would've found it if so.

I think we basically assume that there's nothing triggering relcache
invals here in between updating the relcache entry and doing the actual
catalog modification.  That looks like it's currently somewhat OK in the
places we've talked about so far.

This made me look at cluster.c and damn, I'd forgotten about that.
Look at the following code in copy_table_data():

                /*
                 * When doing swap by content, any toast pointers written into NewHeap
                 * must use the old toast table's OID, because that's where the toast
                 * data will eventually be found.  Set this up by setting rd_toastoid.
                 * This also tells toast_save_datum() to preserve the toast value
                 * OIDs, which we want so as not to invalidate toast pointers in
                 * system catalog caches, and to avoid making multiple copies of a
                 * single toast value.
                 *
                 * Note that we must hold NewHeap open until we are done writing data,
                 * since the relcache will not guarantee to remember this setting once
                 * the relation is closed.  Also, this technique depends on the fact
                 * that no one will try to read from the NewHeap until after we've
                 * finished writing it and swapping the rels --- otherwise they could
                 * follow the toast pointers to the wrong place.  (It would actually
                 * work for values copied over from the old toast table, but not for
                 * any values that we toast which were previously not toasted.)
                 */
                NewHeap->rd_toastoid = OldHeap->rd_rel->reltoastrelid;
        }
        else
                *pSwapToastByContent = false;

which then goes on to do things like a full blown sort or index
scan. Sure, that's for the old relation, but that's so ugly. It works
because RelationClearRelation() copies over rd_toastoid :/.

Greetings,

Andres Freund

