found xmin from before relfrozenxid on pg_catalog.pg_authid

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
45 messages Options
123
Reply | Threaded
Open this post in threaded view
|

Re: found xmin from before relfrozenxid on pg_catalog.pg_authid

Álvaro Herrera

I admit I'm pretty surprised by this whole episode.  I have no useful
advice to offer here.

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Reply | Threaded
Open this post in threaded view
|

Re: found xmin from before relfrozenxid on pg_catalog.pg_authid

Peter Geoghegan-4
In reply to this post by jfinzel
On Thu, Mar 22, 2018 at 12:27 PM, Jeremy Finzel <[hidden email]> wrote:
> Thank you for the recommendation.  I ran both amcheck functions on all 4
> indexes of those 2 tables with heapallindexed = true, but no issues were
> found.

Probably wouldn't hurt to run it against all indexes, if you can make
time for that. If you can generalize from the example query that calls
the bt_index_check() function, but set
"heapallindexed=>i.indisprimary" and remove "n.nspname =
'pg_catalog'", as well as "LIMIT 10".  This will test tables and
indexes from all schemas, which might be interesting.
--
Peter Geoghegan

Reply | Threaded
Open this post in threaded view
|

Re: found xmin from before relfrozenxid on pg_catalog.pg_authid

jfinzel
On Thu, Mar 22, 2018 at 3:20 PM, Peter Geoghegan <[hidden email]> wrote:
On Thu, Mar 22, 2018 at 12:27 PM, Jeremy Finzel <[hidden email]> wrote:
> Thank you for the recommendation.  I ran both amcheck functions on all 4
> indexes of those 2 tables with heapallindexed = true, but no issues were
> found.

Probably wouldn't hurt to run it against all indexes, if you can make
time for that. If you can generalize from the example query that calls
the bt_index_check() function, but set
"heapallindexed=>i.indisprimary" and remove "n.nspname =
'pg_catalog'", as well as "LIMIT 10".  This will test tables and
indexes from all schemas, which might be interesting.
--
Peter Geoghegan

I am running this on a san snapshot of our production system.  I assume that this will give me a valid check for file-system-level corruption.  I am going to kick it off and see if I find anything interesting.

Thanks,
Jeremy
Reply | Threaded
Open this post in threaded view
|

Re: found xmin from before relfrozenxid on pg_catalog.pg_authid

Peter Geoghegan-4
On Thu, Mar 22, 2018 at 2:24 PM, Jeremy Finzel <[hidden email]> wrote:
> I am running this on a san snapshot of our production system.  I assume that
> this will give me a valid check for file-system-level corruption.  I am
> going to kick it off and see if I find anything interesting.

It might. Note that SAN snapshots might have corruption hazards,
though. Postgres expects crash consistency across all filesystems, so
you might run into trouble if you had a separate filesystem for WAL,
for example. I know that LVM snapshots only provide a consistent view
of a single logical volume, even though many LVM + Postgres setups
will involve multiple logical volumes. This makes it possible for a
small inconsistency across logical volumes to corrupt data.

I don't know anything about your SAN snapshotting, but this is at
least something to consider.

--
Peter Geoghegan

Reply | Threaded
Open this post in threaded view
|

Re: found xmin from before relfrozenxid on pg_catalog.pg_authid

Maxim Boguk-2
​Hi everyone,

I just got the same issue on 9.6.8:

2018-05-15 11:52:01 MSK 33558 @ from  [vxid:317/92895305 txid:0] [] ERROR:  found xmin 2808837517 from before relfrozenxid 248712603
2018-05-15 11:52:01 MSK 33558 @ from  [vxid:317/92895305 txid:0] [] CONTEXT:  automatic vacuum of table "template0.pg_catalog.pg_authid"

Additional details:

1) bt_index_check - no errors on both indexes.

2) SELECT pg_truncate_visibility_map('pg_authid'::regclass); + vacuum freeze - doesn't help.

3)Total 6 rows affected:
template1=# select oid, ctid, xmin, xmax, cmin, cmax from pg_authid where xmin::text::bigint > (select relfrozenxid::text::bigint from pg_class where relname='pg_authid');
    oid    |  ctid  |    xmin    | xmax | cmin | cmax
-----------+--------+------------+------+------+------
 183671986 | (0,90) | 3039161773 |    0 |    0 |    0
 183106699 | (1,48) | 2576823237 |    0 |    0 |    0
 183921770 | (1,50) | 3265971811 |    0 |    0 |    0
 183921914 | (1,52) | 3266122344 |    0 |    0 |    0
 187988173 | (1,58) | 4258893789 |    0 |    0 |    0
 182424977 | (1,62) | 2808837517 |    0 |    0 |    0
on total two pages.

4) template1=# select relfrozenxid from pg_catalog.pg_class where relname = 'pg_authid';
 relfrozenxid
--------------
   2548304492


5)Rows itself looks pretty valid and correspond to the actual database users.
7)No database/server crash happened last few years, no disk errors/problems.

I feel it could be related with vacuum skip locked pages patch + freeze + shared catalog combination, but cannot prove it yet.

Looking for possible course of action.
Probably simplest fix - drop and recreate these 6 affected users, but so far I willing spent some time research into this issue.

--
Maxim Boguk
Senior Postgresql DBA
http://dataegret.com/

Phone RU: +7  985 433 0000
Phone UA: +380 99 143 0000
Phone AU: +61  45 218 5678

LinkedIn: http://www.linkedin.com/pub/maksym-boguk/80/b99/b1b
Skype: maxim.boguk

"Доктор, вы мне советовали так не делать, но почему мне по-прежнему больно когда я так делаю ещё раз?"

Reply | Threaded
Open this post in threaded view
|

Re: found xmin from before relfrozenxid on pg_catalog.pg_authid

Andres Freund
On 2018-05-15 11:06:38 +0200, Maxim Boguk wrote:

> ​Hi everyone,
>
> I just got the same issue on 9.6.8:
>
> 2018-05-15 11:52:01 MSK 33558 @ from  [vxid:317/92895305 txid:0] [] ERROR:
> found xmin 2808837517 from before relfrozenxid 248712603
> 2018-05-15 11:52:01 MSK 33558 @ from  [vxid:317/92895305 txid:0] []
> CONTEXT:  automatic vacuum of table "template0.pg_catalog.pg_authid"
>
> Additional details:
>
> 1) bt_index_check - no errors on both indexes.
>
> 2) SELECT pg_truncate_visibility_map('pg_authid'::regclass); + vacuum
> freeze - doesn't help.

Yea, too late.


> 3)Total 6 rows affected:
> template1=# select oid, ctid, xmin, xmax, cmin, cmax from pg_authid where
> xmin::text::bigint > (select relfrozenxid::text::bigint from pg_class where
> relname='pg_authid');
>     oid    |  ctid  |    xmin    | xmax | cmin | cmax
> -----------+--------+------------+------+------+------
>  183671986 | (0,90) | 3039161773 |    0 |    0 |    0
>  183106699 | (1,48) | 2576823237 |    0 |    0 |    0
>  183921770 | (1,50) | 3265971811 |    0 |    0 |    0
>  183921914 | (1,52) | 3266122344 |    0 |    0 |    0
>  187988173 | (1,58) | 4258893789 |    0 |    0 |    0
>  182424977 | (1,62) | 2808837517 |    0 |    0 |    0
> on total two pages.
>
> 4) template1=# select relfrozenxid from pg_catalog.pg_class where relname =
> 'pg_authid';
>  relfrozenxid
> --------------
>    2548304492
>
>
> 5)Rows itself looks pretty valid and correspond to the actual database
> users.
> 7)No database/server crash happened last few years, no disk errors/problems.
>
> I feel it could be related with vacuum skip locked pages patch + freeze +
> shared catalog combination, but cannot prove it yet.
> ​
> Looking for possible course of action.
> Probably simplest fix - drop and recreate these 6 affected users, but so
> far I willing spent some time research into this issue.

Could you use pageinspect to get the infomasks for the affected tuples?

Greetings,

Andres Freund

Reply | Threaded
Open this post in threaded view
|

Re: found xmin from before relfrozenxid on pg_catalog.pg_authid

Maxim Boguk-2
Hi Andres,


> ​
> Looking for possible course of action.
> Probably simplest fix - drop and recreate these 6 affected users, but so
> far I willing spent some time research into this issue.

Could you use pageinspect to get the infomasks for the affected tuples?

Greetings,

Andres Freund


​For sample:

postgres=# vacuum pg_catalog.pg_authid;
ERROR:  found xmin 2894889518 from before relfrozenxid 248712603

select ctid, xmin, xmax, cmin, cmax from pg_catalog.pg_authid where xmin::text::bigint=2894889518;
  ctid  |    xmin    | xmax | cmin | cmax
--------+------------+------+------+------
 (1,26) | 2894889518 |    0 |    0 |    0


postgres=# SELECT * FROM heap_page_items(get_raw_page('pg_authid', 1)) where t_ctid::text='(1,26)';
-[ RECORD 1 ]-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
lp          | 26
lp_off      | 4656
lp_flags    | 1
lp_len      | 144
t_xmin      | 2894889518
t_xmax      | 0
t_field3    | 0
t_ctid      | (1,26)
t_infomask2 | 32779
t_infomask  | 10507
t_hoff      | 32
t_bits      | 1111111111000000
t_oid       | 189787727

Any new role created in DB instantly affected by this issue.


In the same time:

 select relfrozenxid from pg_class where relname='pg_authid';
 relfrozenxid
--------------
   2863429136

So it's interesting where value of " from before relfrozenxid 248712603" come from.




--
Maxim Boguk
Senior Postgresql DBA
http://dataegret.com/

Phone RU: +7  985 433 0000
Phone UA: +380 99 143 0000
Phone AU: +61  45 218 5678

LinkedIn: http://www.linkedin.com/pub/maksym-boguk/80/b99/b1b
Skype: maxim.boguk

"Доктор, вы мне советовали так не делать, но почему мне по-прежнему больно когда я так делаю ещё раз?"

Reply | Threaded
Open this post in threaded view
|

Re: found xmin from before relfrozenxid on pg_catalog.pg_authid

Andres Freund
Hi,

On 2018-05-22 21:30:43 +0300, Maxim Boguk wrote:

> ​For sample:
>
> postgres=# vacuum pg_catalog.pg_authid;
> ERROR:  found xmin 2894889518 from before relfrozenxid 248712603
>
> select ctid, xmin, xmax, cmin, cmax from pg_catalog.pg_authid where
> xmin::text::bigint=2894889518;
>   ctid  |    xmin    | xmax | cmin | cmax
> --------+------------+------+------+------
>  (1,26) | 2894889518 |    0 |    0 |    0
>
>
> postgres=# SELECT * FROM heap_page_items(get_raw_page('pg_authid', 1))
> where t_ctid::text='(1,26)';
> -[ RECORD 1
> ]-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
> lp          | 26
> lp_off      | 4656
> lp_flags    | 1
> lp_len      | 144
> t_xmin      | 2894889518
> t_xmax      | 0
> t_field3    | 0
> t_ctid      | (1,26)
> t_infomask2 | 32779
> t_infomask  | 10507
> t_hoff      | 32
> t_bits      | 1111111111000000
> t_oid       | 189787727
>

So this row has, if I didn't screw up decoding the following infomask
bits set::
HEAP_HASNULL
HEAP_HASVARWIDTH
HEAP_HASOID
HEAP_XMIN_COMMITTED
HEAP_XMAX_INVALID
HEAP_UPDATED

So it's not been previously frozen, which I was wondering about.


> Which indeed makes it
> Any new role created in DB instantly affected by this issue.

What's txid_current()?


>
> In the same time:
>
>  select relfrozenxid from pg_class where relname='pg_authid';
>  relfrozenxid
> --------------
>    2863429136
>
> So it's interesting where value of " from before relfrozenxid 248712603"
> come from.

Hm. That's indeed odd.  Could you get a backtrace of the error with "bt
full" of the error?

Greetings,

Andres Freund

Reply | Threaded
Open this post in threaded view
|

Re: found xmin from before relfrozenxid on pg_catalog.pg_authid

Maxim Boguk-2


On Tue, May 22, 2018 at 9:47 PM, Andres Freund <[hidden email]> wrote:
Hi,

On 2018-05-22 21:30:43 +0300, Maxim Boguk wrote:
> ​For sample:
>
> postgres=# vacuum pg_catalog.pg_authid;
> ERROR:  found xmin 2894889518 from before relfrozenxid 248712603
>
> select ctid, xmin, xmax, cmin, cmax from pg_catalog.pg_authid where
> xmin::text::bigint=2894889518;
>   ctid  |    xmin    | xmax | cmin | cmax
> --------+------------+------+------+------
>  (1,26) | 2894889518 |    0 |    0 |    0
>
>
> postgres=# SELECT * FROM heap_page_items(get_raw_page('pg_authid', 1))
> where t_ctid::text='(1,26)';
> -[ RECORD 1
> ]-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
> lp          | 26
> lp_off      | 4656
> lp_flags    | 1
> lp_len      | 144
> t_xmin      | 2894889518
> t_xmax      | 0
> t_field3    | 0
> t_ctid      | (1,26)
> t_infomask2 | 32779
> t_infomask  | 10507
> t_hoff      | 32
> t_bits      | 1111111111000000
> t_oid       | 189787727
>

So this row has, if I didn't screw up decoding the following infomask
bits set::
HEAP_HASNULL
HEAP_HASVARWIDTH
HEAP_HASOID
HEAP_XMIN_COMMITTED
HEAP_XMAX_INVALID
HEAP_UPDATED

So it's not been previously frozen, which I was wondering about.


> Which indeed makes it
> Any new role created in DB instantly affected by this issue.

What's txid_current()?


>
> In the same time:
>
>  select relfrozenxid from pg_class where relname='pg_authid';
>  relfrozenxid
> --------------
>    2863429136
>
> So it's interesting where value of " from before relfrozenxid 248712603"
> come from.

Hm. That's indeed odd.  Could you get a backtrace of the error with "bt
full" of the error?

Greetings,

Andres Freund

​select txid_current();
 txid_current
--------------
  41995913769
​About gdb bt - it's tricky because it is mission critical master db of huge project.
I'll will try promote backup replica and check is issue persist there and if yes - we will have our playground for a while, but it will require sometime to arrange.​


--
Maxim Boguk
Senior Postgresql DBA
http://dataegret.com/

Phone RU: +7  985 433 0000
Phone UA: +380 99 143 0000
Phone AU: +61  45 218 5678

LinkedIn: http://www.linkedin.com/pub/maksym-boguk/80/b99/b1b
Skype: maxim.boguk

"Доктор, вы мне советовали так не делать, но почему мне по-прежнему больно когда я так делаю ещё раз?"

Reply | Threaded
Open this post in threaded view
|

Re: found xmin from before relfrozenxid on pg_catalog.pg_authid

Andres Freund
Hi,

On 2018-05-22 22:18:15 +0300, Maxim Boguk wrote:
> On Tue, May 22, 2018 at 9:47 PM, Andres Freund <[hidden email]> wrote:
> > >  select relfrozenxid from pg_class where relname='pg_authid';
> > >  relfrozenxid
> > > --------------
> > >    2863429136

> ​select txid_current();
>  txid_current
> --------------
>   41995913769

So that's an xid of 3341208114, if you leave the epoch out. What's
autovacuum_freeze_max_age set to in that cluster?

Can you show pg_controldata output, and relminmxid from that cluster?

I might be daft here, but it's surely curious that the relfrozenxid from
the error and pg_catalog are really different (number of digits):
catalog: 2863429136
error:    248712603


> ​About gdb bt - it's tricky because it is mission critical master db of
> huge project.
> I'll will try promote backup replica and check is issue persist there and
> if yes - we will have our playground for a while, but it will require
> sometime to arrange.​

You should be ok to just bt that in the running cluster, but I
definitely understand if you don't want to do that...  I'd appreciate if
you set up the a playground, because this seems like something that'll
reappear.

Greetings,

Andres Freund

Reply | Threaded
Open this post in threaded view
|

Re: found xmin from before relfrozenxid on pg_catalog.pg_authid

Maxim Boguk-2


On Tue, May 22, 2018 at 10:30 PM, Andres Freund <[hidden email]> wrote:
Hi,

On 2018-05-22 22:18:15 +0300, Maxim Boguk wrote:
> On Tue, May 22, 2018 at 9:47 PM, Andres Freund <[hidden email]> wrote:
> > >  select relfrozenxid from pg_class where relname='pg_authid';
> > >  relfrozenxid
> > > --------------
> > >    2863429136

> ​select txid_current();
>  txid_current
> --------------
>   41995913769

So that's an xid of 3341208114, if you leave the epoch out. What's
​​
autovacuum_freeze_max_age set to in that cluster?

​postgres=# show autovacuum_freeze_max_age;
 autovacuum_freeze_max_age
---------------------------
 200000000
(default value I think)​

 
Can you show pg_controldata output, and
​​
relminmxid from that cluster?

​postgres@db:~$ /usr/lib/postgresql/9.6/bin/pg_controldata  -D /var/lib/postgresql/9.6/main
pg_control version number:            960
Catalog version number:               201608131
Database system identifier:           6469368654711450114
Database cluster state:               in production
pg_control last modified:             Tue 22 May 2018 10:20:14 PM MSK
Latest checkpoint location:           CCB5/F9C37950
Prior checkpoint location:            CCB0/43F316B0
Latest checkpoint's REDO location:    CCB1/6706BD88
Latest checkpoint's REDO WAL file:    000000010000CCB100000067
Latest checkpoint's TimeLineID:       1
Latest checkpoint's PrevTimeLineID:   1
Latest checkpoint's full_page_writes: on
Latest checkpoint's NextXID:          9:3341161759
Latest checkpoint's NextOID:          190071899
Latest checkpoint's NextMultiXactId:  59416233
Latest checkpoint's NextMultiOffset:  215588532
Latest checkpoint's oldestXID:        2814548646
Latest checkpoint's oldestXID's DB:   16400
Latest checkpoint's oldestActiveXID:  3341161759
Latest checkpoint's oldestMultiXid:   54264778
Latest checkpoint's oldestMulti's DB: 16400
Latest checkpoint's oldestCommitTsXid:2814548646
Latest checkpoint's newestCommitTsXid:3341161758
Time of latest checkpoint:            Tue 22 May 2018 10:05:16 PM MSK
Fake LSN counter for unlogged rels:   0/1
Minimum recovery ending location:     0/0
Min recovery ending loc's timeline:   0
Backup start location:                0/0
Backup end location:                  0/0
End-of-backup record required:        no
wal_level setting:                    replica
wal_log_hints setting:                on
max_connections setting:              2000
max_worker_processes setting:         8
max_prepared_xacts setting:           0
max_locks_per_xact setting:           64
track_commit_timestamp setting:       on
Maximum data alignment:               8
Database block size:                  8192
Blocks per segment of large relation: 131072
WAL block size:                       8192
Bytes per WAL segment:                16777216
Maximum length of identifiers:        64
Maximum columns in an index:          32
Maximum size of a TOAST chunk:        1996
Size of a large-object chunk:         2048
Date/time type storage:               64-bit integers
Float4 argument passing:              by value
Float8 argument passing:              by value
Data page checksum version:           0
postgres=# select datname,datfrozenxid,datminmxid from pg_database order by datname;
  datname  | datfrozenxid | datminmxid
-----------+--------------+------------
 **        |   2815939794 |   54265194
 postgres  |   2863429136 |   54280819
 template0 |   3148297669 |   59342106
 template1 |   2816765546 |   59261794



 

I might be daft here, but it's surely curious that the relfrozenxid from
the error and pg_catalog are really different (number of digits):
catalog: 2863429136
error:    248712603


> ​About gdb bt - it's tricky because it is mission critical master db of
> huge project.
> I'll will try promote backup replica and check is issue persist there and
> if yes - we will have our playground for a while, but it will require
> sometime to arrange.​

You should be ok to just bt that in the running cluster, but I
definitely understand if you don't want to do that...  I'd appreciate if
you set up the a playground, because this seems like something that'll
reappear.

​My gdb kunf-fu exceptionally rusty now, so I better play in sandbox.​
​And if error did't exist on fresh promoted replica it will give us useful info as well.​

--
Maxim Boguk
Senior Postgresql DBA
http://dataegret.com/

Phone RU: +7  985 433 0000
Phone UA: +380 99 143 0000
Phone AU: +61  45 218 5678

LinkedIn: http://www.linkedin.com/pub/maksym-boguk/80/b99/b1b
Skype: maxim.boguk

"Доктор, вы мне советовали так не делать, но почему мне по-прежнему больно когда я так делаю ещё раз?"

Reply | Threaded
Open this post in threaded view
|

Re: found xmin from before relfrozenxid on pg_catalog.pg_authid

jfinzel


On Tue, May 22, 2018 at 2:42 PM, Maxim Boguk <[hidden email]> wrote:


On Tue, May 22, 2018 at 10:30 PM, Andres Freund <[hidden email]> wrote:
Hi,

On 2018-05-22 22:18:15 +0300, Maxim Boguk wrote:
> On Tue, May 22, 2018 at 9:47 PM, Andres Freund <[hidden email]> wrote:
> > >  select relfrozenxid from pg_class where relname='pg_authid';
> > >  relfrozenxid
> > > --------------
> > >    2863429136

> ​select txid_current();
>  txid_current
> --------------
>   41995913769

So that's an xid of 3341208114, if you leave the epoch out. What's
​​
autovacuum_freeze_max_age set to in that cluster?

​postgres=# show autovacuum_freeze_max_age;
 autovacuum_freeze_max_age
---------------------------
 200000000
(default value I think)​

 
Can you show pg_controldata output, and
​​
relminmxid from that cluster?

​postgres@db:~$ /usr/lib/postgresql/9.6/bin/pg_controldata  -D /var/lib/postgresql/9.6/main
pg_control version number:            960
Catalog version number:               201608131
Database system identifier:           6469368654711450114
Database cluster state:               in production
pg_control last modified:             Tue 22 May 2018 10:20:14 PM MSK
Latest checkpoint location:           CCB5/F9C37950
Prior checkpoint location:            CCB0/43F316B0
Latest checkpoint's REDO location:    CCB1/6706BD88
Latest checkpoint's REDO WAL file:    000000010000CCB100000067
Latest checkpoint's TimeLineID:       1
Latest checkpoint's PrevTimeLineID:   1
Latest checkpoint's full_page_writes: on
Latest checkpoint's NextXID:          9:3341161759
Latest checkpoint's NextOID:          190071899
Latest checkpoint's NextMultiXactId:  59416233
Latest checkpoint's NextMultiOffset:  215588532
Latest checkpoint's oldestXID:        2814548646
Latest checkpoint's oldestXID's DB:   16400
Latest checkpoint's oldestActiveXID:  3341161759
Latest checkpoint's oldestMultiXid:   54264778
Latest checkpoint's oldestMulti's DB: 16400
Latest checkpoint's oldestCommitTsXid:2814548646
Latest checkpoint's newestCommitTsXid:3341161758
Time of latest checkpoint:            Tue 22 May 2018 10:05:16 PM MSK
Fake LSN counter for unlogged rels:   0/1
Minimum recovery ending location:     0/0
Min recovery ending loc's timeline:   0
Backup start location:                0/0
Backup end location:                  0/0
End-of-backup record required:        no
wal_level setting:                    replica
wal_log_hints setting:                on
max_connections setting:              2000
max_worker_processes setting:         8
max_prepared_xacts setting:           0
max_locks_per_xact setting:           64
track_commit_timestamp setting:       on
Maximum data alignment:               8
Database block size:                  8192
Blocks per segment of large relation: 131072
WAL block size:                       8192
Bytes per WAL segment:                16777216
Maximum length of identifiers:        64
Maximum columns in an index:          32
Maximum size of a TOAST chunk:        1996
Size of a large-object chunk:         2048
Date/time type storage:               64-bit integers
Float4 argument passing:              by value
Float8 argument passing:              by value
Data page checksum version:           0
postgres=# select datname,datfrozenxid,datminmxid from pg_database order by datname;
  datname  | datfrozenxid | datminmxid
-----------+--------------+------------
 **        |   2815939794 |   54265194
 postgres  |   2863429136 |   54280819
 template0 |   3148297669 |   59342106
 template1 |   2816765546 |   59261794



 

I might be daft here, but it's surely curious that the relfrozenxid from
the error and pg_catalog are really different (number of digits):
catalog: 2863429136
error:    248712603


> ​About gdb bt - it's tricky because it is mission critical master db of
> huge project.
> I'll will try promote backup replica and check is issue persist there and
> if yes - we will have our playground for a while, but it will require
> sometime to arrange.​

You should be ok to just bt that in the running cluster, but I
definitely understand if you don't want to do that...  I'd appreciate if
you set up the a playground, because this seems like something that'll
reappear.

​My gdb kunf-fu exceptionally rusty now, so I better play in sandbox.​
​And if error did't exist on fresh promoted replica it will give us useful info as well.​

--
Maxim Boguk
Senior Postgresql DBA
http://dataegret.com/

Phone RU: +7  985 433 0000
Phone UA: +380 99 143 0000
Phone AU: +61  45 218 5678

LinkedIn: http://www.linkedin.com/pub/maksym-boguk/80/b99/b1b
Skype: maxim.boguk

"Доктор, вы мне советовали так не делать, но почему мне по-прежнему больно когда я так делаю ещё раз?"


All - 

We just noticed that this exact same problem has now returned for us.  It has exactly the same symptoms as my original post (including the fact that a snapshot does not have the issue).

I am open to all suggestions as to troubleshooting this further!

Thank you,
Jeremy
Reply | Threaded
Open this post in threaded view
|

Re: found xmin from before relfrozenxid on pg_catalog.pg_authid

Maxim Boguk-2
In reply to this post by Andres Freund



> ​About gdb bt - it's tricky because it is mission critical master db of
> huge project.
> I'll will try promote backup replica and check is issue persist there and
> if yes - we will have our playground for a while, but it will require
> sometime to arrange.​

You should be ok to just bt that in the running cluster, but I
definitely understand if you don't want to do that...  I'd appreciate if
you set up the a playground, because this seems like something that'll
reappear.

Greetings,

Andres Freund

​Ok this issue reproducible on promoted replica.
So now I have my playground.
Will provide gdb info in few hours.​


--
Maxim Boguk
Senior Postgresql DBA
http://dataegret.com/

Phone RU: +7  985 433 0000
Phone UA: +380 99 143 0000
Phone AU: +61  45 218 5678

LinkedIn: http://www.linkedin.com/pub/maksym-boguk/80/b99/b1b
Skype: maxim.boguk

"Доктор, вы мне советовали так не делать, но почему мне по-прежнему больно когда я так делаю ещё раз?"

Reply | Threaded
Open this post in threaded view
|

Re: found xmin from before relfrozenxid on pg_catalog.pg_authid

Maxim Boguk-2


On Thu, May 24, 2018 at 12:38 PM, Maxim Boguk <[hidden email]> wrote:



> ​About gdb bt - it's tricky because it is mission critical master db of
> huge project.
> I'll will try promote backup replica and check is issue persist there and
> if yes - we will have our playground for a while, but it will require
> sometime to arrange.​

You should be ok to just bt that in the running cluster, but I
definitely understand if you don't want to do that...  I'd appreciate if
you set up the a playground, because this seems like something that'll
reappear.

Greetings,

Andres Freund

​bt full :

​#0  errmsg_internal (fmt=0x555b62e6eb70 "found xmin %u from before relfrozenxid %u") at /build/postgresql-9.6-9VrjLn/postgresql-9.6-9.6.8/build/../src/backend/utils/error/elog.c:828
        edata = <optimized out>
        __func__ = "errmsg_internal"
#1  0x0000555b62ad1cb9 in heap_prepare_freeze_tuple (tuple=<optimized out>, relfrozenxid=relfrozenxid@entry=248720453, relminmxid=relminmxid@entry=53644256, cutoff_xid=3485221679,
    cutoff_multi=<optimized out>, frz=frz@entry=0x555b640d1988, totally_frozen_p=0x7ffca32c0e90 "\001")
    at /build/postgresql-9.6-9VrjLn/postgresql-9.6-9.6.8/build/../src/backend/access/heap/heapam.c:6685
        changed = <optimized out>
        freeze_xmax = 0 '\000'
        xid = <optimized out>
        totally_frozen = <optimized out>
        __func__ = "heap_prepare_freeze_tuple"
#2  0x0000555b62bfd2b5 in lazy_scan_heap (aggressive=0 '\000', nindexes=2, Irel=0x555b64095948, vacrelstats=<optimized out>, options=26, onerel=0x555b64029498)
    at /build/postgresql-9.6-9VrjLn/postgresql-9.6-9.6.8/build/../src/backend/commands/vacuumlazy.c:1090
        tuple_totally_frozen = 1 '\001'
        itemid = 0x7f2b7bc5d4fc
        buf = 18138842
        page = <optimized out>
        offnum = 26
        maxoff = 26
        hastup = 1 '\001'
        nfrozen = 0
        freespace = <optimized out>
        all_frozen = 1 '\001'
        tupgone = 0 '\000'
        prev_dead_count = 0
        all_visible_according_to_vm = 0 '\000'
        all_visible = 1 '\001'
        has_dead_tuples = 0 '\000'
        visibility_cutoff_xid = 3490221678
        relname = 0x555b640296a8 "pg_authid"
        relminmxid = 53644256
        tups_vacuumed = 0
        indstats = 0x555b64095928
        relfrozenxid = 248720453
        skipping_blocks = 0 '\000'
        frozen = 0x555b640d1988
        initprog_val = {1, 2, 582}
        vacuumed_pages = 0
        num_tuples = 80
        nkeep = 0
        ru0 = {tv = {tv_sec = 1527175061, tv_usec = 739743}, ru = {ru_utime = {tv_sec = 0, tv_usec = 60000}, ru_stime = {tv_sec = 1, tv_usec = 416000}, {ru_maxrss = 9704,
              __ru_maxrss_word = 9704}, {ru_ixrss = 0, __ru_ixrss_word = 0}, {ru_idrss = 0, __ru_idrss_word = 0}, {ru_isrss = 0, __ru_isrss_word = 0}, {ru_minflt = 33982,
              __ru_minflt_word = 33982}, {ru_majflt = 0, __ru_majflt_word = 0}, {ru_nswap = 0, __ru_nswap_word = 0}, {ru_inblock = 0, __ru_inblock_word = 0}, {ru_oublock = 8,
              __ru_oublock_word = 8}, {ru_msgsnd = 0, __ru_msgsnd_word = 0}, {ru_msgrcv = 0, __ru_msgrcv_word = 0}, {ru_nsignals = 0, __ru_nsignals_word = 0}, {ru_nvcsw = 38446,
              __ru_nvcsw_word = 38446}, {ru_nivcsw = 2, __ru_nivcsw_word = 2}}}
        vmbuffer = 8763411
        empty_pages = 0
        nunused = 0
        i = <optimized out>
        next_unskippable_block = <optimized out>
        buf = {data = 0x1 <error: Cannot access memory at address 0x1>, len = -1557393520, maxlen = 32764, cursor = -1557393616}
        nblocks = 2
        blkno = <optimized out>
        tuple = {t_len = 144, t_self = {ip_blkid = {bi_hi = 0, bi_lo = 1}, ip_posid = 26}, t_tableOid = 1260, t_data = 0x7f2b7bc5e6b0}
        initprog_index = {0, 1, 5}
#3  lazy_vacuum_rel (onerel=onerel@entry=0x555b64029498, options=options@entry=1, params=params@entry=0x7ffca32c11b0, bstrategy=<optimized out>)
    at /build/postgresql-9.6-9VrjLn/postgresql-9.6-9.6.8/build/../src/backend/commands/vacuumlazy.c:253
        vacrelstats = <optimized out>
        Irel = 0x555b64095948
        nindexes = 2
        ru0 = {tv = {tv_sec = 93850993708032, tv_usec = 23936}, ru = {ru_utime = {tv_sec = 128, tv_usec = 93851007694584}, ru_stime = {tv_sec = 0, tv_usec = 0}, {ru_maxrss = 140723046059424,
              __ru_maxrss_word = 140723046059424}, {ru_ixrss = 93850989549501, __ru_ixrss_word = 93850989549501}, {ru_idrss = 16, __ru_idrss_word = 16}, {ru_isrss = 93851007694584,
              __ru_isrss_word = 93851007694584}, {ru_minflt = 140723046059472, __ru_minflt_word = 140723046059472}, {ru_majflt = 93850989559601, __ru_majflt_word = 93850989559601}, {
              ru_nswap = 93850994010032, __ru_nswap_word = 93850994010032}, {ru_inblock = 139833259949608, __ru_inblock_word = 139833259949608}, {ru_oublock = 93850994010032,
              __ru_oublock_word = 93850994010032}, {ru_msgsnd = 93851008239032, __ru_msgsnd_word = 93851008239032}, {ru_msgrcv = 140723046059904, __ru_msgrcv_word = 140723046059904}, {
              ru_nsignals = 93850989313526, __ru_nsignals_word = 93850989313526}, {ru_nvcsw = 345176855416, __ru_nvcsw_word = 345176855416}, {ru_nivcsw = 140723046059551,
              __ru_nivcsw_word = 140723046059551}}}
        starttime = 0
        secs = 1527175061
        usecs = 144
        read_rate = <optimized out>
        write_rate = <optimized out>
        aggressive = 0 '\000'
        scanned_all_unfrozen = <optimized out>
        xidFullScanLimit = 3440221679
        mxactFullScanLimit = 4204418425
        new_rel_pages = <optimized out>
        new_rel_tuples = <optimized out>
        new_rel_allvisible = 2737573328
        new_live_tuples = <optimized out>
        new_frozen_xid = <optimized out>
        new_min_multi = <optimized out>
        __func__ = "lazy_vacuum_rel"
#4  0x0000555b62bfa54d in vacuum_rel (relid=relid@entry=1260, relation=relation@entry=0x555b64047100, options=options@entry=1, params=params@entry=0x7ffca32c11b0)
    at /build/postgresql-9.6-9VrjLn/postgresql-9.6-9.6.8/build/../src/backend/commands/vacuum.c:1391
        lmode = 4
        onerel = 0x555b64029498
        onerelid = {relId = 1260, dbId = 0}
        toast_relid = 0
        save_userid = 10
        save_sec_context = 0
        save_nestlevel = 2
        __func__ = "vacuum_rel"
#5  0x0000555b62bfb545 in vacuum (options=1, relation=0x555b64047100, relid=relid@entry=0, params=params@entry=0x7ffca32c11b0, va_cols=0x0, bstrategy=<optimized out>, bstrategy@entry=0x0,
    isTopLevel=1 '\001') at /build/postgresql-9.6-9VrjLn/postgresql-9.6-9.6.8/build/../src/backend/commands/vacuum.c:303
        relid = 1260
        cur = 0x555b640998b8
        save_exception_stack = 0x7ffca32c1590
        save_context_stack = 0x0
        local_sigjmp_buf = {{__jmpbuf = {93851008735448, -4513934124599105488, 93850994009984, 93851008619616, 93851008619616, 93851008398480, -4513934124655728592, -7776366411743385552},
            __mask_was_saved = 0, __saved_mask = {__val = {0, 0, 0, 0, 1, 1, 1, 3473119737632784380, 139833388464563, 140723046060384, 12, 140723046060912, 93850990837571,
                4049721212982268568, 1, 140723046060432}}}}
        stmttype = <optimized out>
        in_outer_xact = 0 '\000'
        use_own_xacts = 1 '\001'
        relations = 0x555b640998d8
        in_vacuum = 1 '\001'
        __func__ = "vacuum"
#6  0x0000555b62bfb8ca in ExecVacuum (vacstmt=vacstmt@entry=0x555b64047150, isTopLevel=isTopLevel@entry=1 '\001')
    at /build/postgresql-9.6-9VrjLn/postgresql-9.6-9.6.8/build/../src/backend/commands/vacuum.c:121
        params = {freeze_min_age = -1, freeze_table_age = -1, multixact_freeze_min_age = -1, multixact_freeze_table_age = -1, is_wraparound = 0 '\000', log_min_duration = -1}
#7  0x0000555b62d27787 in standard_ProcessUtility (parsetree=0x555b64047150, queryString=0x555b64046688 "vacuum pg_catalog.pg_authid;", context=<optimized out>, params=0x0,
    dest=0x555b64047490, completionTag=0x7ffca32c19c0 "") at /build/postgresql-9.6-9VrjLn/postgresql-9.6-9.6.8/build/../src/backend/tcop/utility.c:655
        stmt = 0x555b64047150
        isTopLevel = <optimized out>
        __func__ = "standard_ProcessUtility"


--
Maxim Boguk
Senior Postgresql DBA
http://dataegret.com/

Phone RU: +7  985 433 0000
Phone UA: +380 99 143 0000
Phone AU: +61  45 218 5678

LinkedIn: http://www.linkedin.com/pub/maksym-boguk/80/b99/b1b
Skype: maxim.boguk

"Доктор, вы мне советовали так не делать, но почему мне по-прежнему больно когда я так делаю ещё раз?"




--
Maxim Boguk
Senior Postgresql DBA
http://dataegret.com/

Phone RU: +7  985 433 0000
Phone UA: +380 99 143 0000
Phone AU: +61  45 218 5678

LinkedIn: http://www.linkedin.com/pub/maksym-boguk/80/b99/b1b
Skype: maxim.boguk

"Доктор, вы мне советовали так не делать, но почему мне по-прежнему больно когда я так делаю ещё раз?"

Reply | Threaded
Open this post in threaded view
|

Re: found xmin from before relfrozenxid on pg_catalog.pg_authid

Alvaro Herrera-9
Hmm .. surely

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 5016181fd7..5d7fa1fb45 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -6690,7 +6690,7 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
  xid = HeapTupleHeaderGetXmin(tuple);
  xmin_frozen = ((xid == FrozenTransactionId) ||
    HeapTupleHeaderXminFrozen(tuple));
- if (TransactionIdIsNormal(xid))
+ if (!xmin_frozen && TransactionIdIsNormal(xid))
  {
  if (TransactionIdPrecedes(xid, relfrozenxid))
  ereport(ERROR,


??


--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Reply | Threaded
Open this post in threaded view
|

Re: found xmin from before relfrozenxid on pg_catalog.pg_authid

Andres Freund
On 2018-05-24 13:08:53 -0400, Alvaro Herrera wrote:

> Hmm .. surely
>
> diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
> index 5016181fd7..5d7fa1fb45 100644
> --- a/src/backend/access/heap/heapam.c
> +++ b/src/backend/access/heap/heapam.c
> @@ -6690,7 +6690,7 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
>   xid = HeapTupleHeaderGetXmin(tuple);
>   xmin_frozen = ((xid == FrozenTransactionId) ||
>     HeapTupleHeaderXminFrozen(tuple));
> - if (TransactionIdIsNormal(xid))
> + if (!xmin_frozen && TransactionIdIsNormal(xid))
>   {
>   if (TransactionIdPrecedes(xid, relfrozenxid))
>   ereport(ERROR,
>
>
> ??

I don't think that's necesary - HeapTupleHeaderGetXmin() returns
FrozenTransactionId if the tuple is frozen (note the
HeapTupleHeaderXminFrozen() within).

Greetings,

Andres Freund

Reply | Threaded
Open this post in threaded view
|

Re: found xmin from before relfrozenxid on pg_catalog.pg_authid

Andres Freund
On 2018-05-24 13:30:54 -0700, Andres Freund wrote:

> On 2018-05-24 13:08:53 -0400, Alvaro Herrera wrote:
> > Hmm .. surely
> >
> > diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
> > index 5016181fd7..5d7fa1fb45 100644
> > --- a/src/backend/access/heap/heapam.c
> > +++ b/src/backend/access/heap/heapam.c
> > @@ -6690,7 +6690,7 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
> >   xid = HeapTupleHeaderGetXmin(tuple);
> >   xmin_frozen = ((xid == FrozenTransactionId) ||
> >     HeapTupleHeaderXminFrozen(tuple));
> > - if (TransactionIdIsNormal(xid))
> > + if (!xmin_frozen && TransactionIdIsNormal(xid))
> >   {
> >   if (TransactionIdPrecedes(xid, relfrozenxid))
> >   ereport(ERROR,
> >
> >
> > ??
>
> I don't think that's necesary - HeapTupleHeaderGetXmin() returns
> FrozenTransactionId if the tuple is frozen (note the
> HeapTupleHeaderXminFrozen() within).

FWIW, even if that weren't the case: a) there'd be a lot more wrong with
this routine imo. b) some of the tuples affected clearly weren't
frozen...

Greetings,

Andres Freund

Reply | Threaded
Open this post in threaded view
|

Re: found xmin from before relfrozenxid on pg_catalog.pg_authid

Alvaro Herrera-9
In reply to this post by Andres Freund
On 2018-May-24, Andres Freund wrote:

> On 2018-05-24 13:08:53 -0400, Alvaro Herrera wrote:
> > Hmm .. surely

> >   xid = HeapTupleHeaderGetXmin(tuple);
> >   xmin_frozen = ((xid == FrozenTransactionId) ||
> >     HeapTupleHeaderXminFrozen(tuple));
> > - if (TransactionIdIsNormal(xid))
> > + if (!xmin_frozen && TransactionIdIsNormal(xid))

> I don't think that's necesary - HeapTupleHeaderGetXmin() returns
> FrozenTransactionId if the tuple is frozen (note the
> HeapTupleHeaderXminFrozen() within).

Ah, yeah ... I probably thought about this when writing it and removed
it for that reason.

BTW I think the definition of HeapTupleHeaderXminFrozen is seriously
confusing, by failing to return true if the xmin is numerically
FrozenXid (which it'll be if the database was pg_upgraded).  I wonder
about this one in HeapTupleSatisfiesMVCC:

        else
        {
                /* xmin is committed, but maybe not according to our snapshot */
                if (!HeapTupleHeaderXminFrozen(tuple) &&
                        XidInMVCCSnapshot(HeapTupleHeaderGetRawXmin(tuple), snapshot))
                        return false; /* treat as still in progress */
        }

I think this is not a bug only because XidInMVCCSnapshot does this

        /* Any xid < xmin is not in-progress */
        if (TransactionIdPrecedes(xid, snapshot->xmin))
                return false;

which makes it return false for FrozenXid, but seems more of an accident
than explicitly designed.

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Reply | Threaded
Open this post in threaded view
|

Re: found xmin from before relfrozenxid on pg_catalog.pg_authid

Alvaro Herrera-9
In reply to this post by Andres Freund
On 2018-May-24, Andres Freund wrote:

> FWIW, even if that weren't the case: a) there'd be a lot more wrong with
> this routine imo. b) some of the tuples affected clearly weren't
> frozen...

Right.

BTW is it just a coincidence or are all the affected tables pg_authid?
Maybe the problem is shared relations ..?  Maybe the fact that they have
separate relfrozenxid (!?) in different databases?

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Reply | Threaded
Open this post in threaded view
|

Re: found xmin from before relfrozenxid on pg_catalog.pg_authid

Andres Freund
On 2018-05-24 16:49:40 -0400, Alvaro Herrera wrote:
> BTW is it just a coincidence or are all the affected tables pg_authid?
> Maybe the problem is shared relations ..?  Maybe the fact that they have
> separate relfrozenxid (!?) in different databases?

Yes, that appears to be part of the problem. I've looked at a number of
shared relation related codepaths, but so far my theory is that the
relcache is wrong. Note that one of the reports in this thread clearly
had a different relcache relfrozenxid than in the catalog.

Then there's also:
http://archives.postgresql.org/message-id/1527193504642.36340%40amazon.com

Greetings,

Andres Freund

123