Odd, intermittent failure in contrib/pageinspect

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
11 messages Options
Reply | Threaded
Open this post in threaded view
|

Odd, intermittent failure in contrib/pageinspect

Tom Lane-2
whelk failed today [1] with this surprising symptom:

--- snip ---
diff -w -U3 C:/buildfarm/buildenv/HEAD/pgsql.build/contrib/pageinspect/expected/page.out C:/buildfarm/buildenv/HEAD/pgsql.build/contrib/pageinspect/results/page.out
--- C:/buildfarm/buildenv/HEAD/pgsql.build/contrib/pageinspect/expected/page.out 2020-03-08 09:00:35.036254700 +0100
+++ C:/buildfarm/buildenv/HEAD/pgsql.build/contrib/pageinspect/results/page.out 2021-01-18 22:10:10.889655500 +0100
@@ -90,8 +90,8 @@
 FROM heap_page_items(get_raw_page('test1', 0)),
      LATERAL heap_tuple_infomask_flags(t_infomask, t_infomask2);
  t_infomask | t_infomask2 |                         raw_flags                         |   combined_flags  
-------------+-------------+-----------------------------------------------------------+--------------------
-       2816 |           2 | {HEAP_XMIN_COMMITTED,HEAP_XMIN_INVALID,HEAP_XMAX_INVALID} | {HEAP_XMIN_FROZEN}
+------------+-------------+-----------------------------------------+----------------
+       2304 |           2 | {HEAP_XMIN_COMMITTED,HEAP_XMAX_INVALID} | {}
 (1 row)
 
 -- output the decoded flag HEAP_XMIN_FROZEN instead
@@ -99,8 +99,8 @@
 FROM heap_page_items(get_raw_page('test1', 0)),
      LATERAL heap_tuple_infomask_flags(t_infomask, t_infomask2);
  t_infomask | t_infomask2 |                         raw_flags                         |   combined_flags  
-------------+-------------+-----------------------------------------------------------+--------------------
-       2816 |           2 | {HEAP_XMIN_COMMITTED,HEAP_XMIN_INVALID,HEAP_XMAX_INVALID} | {HEAP_XMIN_FROZEN}
+------------+-------------+-----------------------------------------+----------------
+       2304 |           2 | {HEAP_XMIN_COMMITTED,HEAP_XMAX_INVALID} | {}
 (1 row)
 
 -- tests for decoding of combined flags
--- snip ---

Searching the buildfarm logs turned up exactly one previous occurrence,
also on whelk [2].  So I'm not sure what to make of it.  Could the
immediately preceding VACUUM FREEZE command have silently skipped this
page for some reason?  That'd be a bug I should think.

Also, not really a bug, but why is this test script running exactly
the same query twice in a row?  If that's of value, and not just a
copy-and-paste error, the comments sure don't explain why.  But what
it looks like is that these queries were different when first added,
and then 58b4cb30a5b made them the same when it probably should have
removed one.

                        regards, tom lane

[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=whelk&dt=2021-01-18%2020%3A42%3A13
[2] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=whelk&dt=2020-04-17%2023%3A42%3A10


Reply | Threaded
Open this post in threaded view
|

Re: Odd, intermittent failure in contrib/pageinspect

Álvaro Herrera
On 2021-Jan-18, Tom Lane wrote:

> Searching the buildfarm logs turned up exactly one previous occurrence,
> also on whelk [2].  So I'm not sure what to make of it.  Could the
> immediately preceding VACUUM FREEZE command have silently skipped this
> page for some reason?  That'd be a bug I should think.

Hmm, doesn't vacuum skip pages when they are pinned?  I don't think
VACUUM FREEZE would be treated especially -- only "aggressive"
wraparound would be an exception, IIRC.  This would reflect in the
relfrozenxid for the table after vacuum, but I'm not sure if there's a
decent way to make the regression tests reflect that.

> Also, not really a bug, but why is this test script running exactly
> the same query twice in a row?  If that's of value, and not just a
> copy-and-paste error, the comments sure don't explain why.  But what
> it looks like is that these queries were different when first added,
> and then 58b4cb30a5b made them the same when it probably should have
> removed one.

Agreed.

--
Álvaro Herrera                            39°49'30"S 73°17'W


Reply | Threaded
Open this post in threaded view
|

Re: Odd, intermittent failure in contrib/pageinspect

Tom Lane-2
Alvaro Herrera <[hidden email]> writes:
> On 2021-Jan-18, Tom Lane wrote:
>> Searching the buildfarm logs turned up exactly one previous occurrence,
>> also on whelk [2].  So I'm not sure what to make of it.  Could the
>> immediately preceding VACUUM FREEZE command have silently skipped this
>> page for some reason?  That'd be a bug I should think.

> Hmm, doesn't vacuum skip pages when they are pinned?  I don't think
> VACUUM FREEZE would be treated especially -- only "aggressive"
> wraparound would be an exception, IIRC.

Right.  If that's the explanation, then adding DISABLE_PAGE_SKIPPING
to the test's VACUUM options should fix it.  However, to believe that
theory you have to have some reason to think that some other process
might have the page pinned.  What would that be?  test1 only has one
small tuple in it, so it doesn't seem credible that autovacuum or
autoanalyze would have fired on it.

[ thinks for a bit... ]  Does the checkpointer pin pages it's writing
out?  I guess it'd have to ...

                        regards, tom lane


Reply | Threaded
Open this post in threaded view
|

Re: Odd, intermittent failure in contrib/pageinspect

Álvaro Herrera
On 2021-Jan-18, Tom Lane wrote:

> Right.  If that's the explanation, then adding DISABLE_PAGE_SKIPPING
> to the test's VACUUM options should fix it.  However, to believe that
> theory you have to have some reason to think that some other process
> might have the page pinned.  What would that be?  test1 only has one
> small tuple in it, so it doesn't seem credible that autovacuum or
> autoanalyze would have fired on it.

I guess the machine would have to be pretty constrained.  (It takes
almost seven minutes to go through the pg_upgrade test, so it does seems
small.)

> [ thinks for a bit... ]  Does the checkpointer pin pages it's writing
> out?  I guess it'd have to ...

It does, per SyncOneBuffer(), called from BufferSync(), called from
CheckPointBuffers().

--
Álvaro Herrera                            39°49'30"S 73°17'W


Reply | Threaded
Open this post in threaded view
|

Re: Odd, intermittent failure in contrib/pageinspect

Tom Lane-2
Alvaro Herrera <[hidden email]> writes:
> On 2021-Jan-18, Tom Lane wrote:
>> [ thinks for a bit... ]  Does the checkpointer pin pages it's writing
>> out?  I guess it'd have to ...

> It does, per SyncOneBuffer(), called from BufferSync(), called from
> CheckPointBuffers().

Right, then we don't need any strange theories about autovacuum,
just bad timing luck.  whelk does seem pretty slow, so it's not
much of a stretch to imagine that it's more susceptible to this
corner case than faster machines.

So, do we have any other tests that are invoking a manual vacuum
and assuming it won't skip any pages?  By this theory, they'd
all be failures waiting to happen.

                        regards, tom lane


Reply | Threaded
Open this post in threaded view
|

Re: Odd, intermittent failure in contrib/pageinspect

Michael Paquier-2
On Mon, Jan 18, 2021 at 05:47:40PM -0500, Tom Lane wrote:
> Right, then we don't need any strange theories about autovacuum,
> just bad timing luck.  whelk does seem pretty slow, so it's not
> much of a stretch to imagine that it's more susceptible to this
> corner case than faster machines.
>
> So, do we have any other tests that are invoking a manual vacuum
> and assuming it won't skip any pages?  By this theory, they'd
> all be failures waiting to happen.

That looks possible by looking at the code around lazy_scan_heap(),
but that's narrow.

check_heap.sql and heap_surgery.sql have one VACUUM FREEZE each and it
seems to me that we had better be sure that no pages are skipped for
their cases?

The duplicated query result looks to be an oversight from 58b4cb3 when
the thing got rewritten, so it can just go away.  Good catch.
--
Michael

signature.asc (849 bytes) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Odd, intermittent failure in contrib/pageinspect

Tom Lane-2
Michael Paquier <[hidden email]> writes:
> On Mon, Jan 18, 2021 at 05:47:40PM -0500, Tom Lane wrote:
>> So, do we have any other tests that are invoking a manual vacuum
>> and assuming it won't skip any pages?  By this theory, they'd
>> all be failures waiting to happen.

> check_heap.sql and heap_surgery.sql have one VACUUM FREEZE each and it
> seems to me that we had better be sure that no pages are skipped for
> their cases?

It looks to me like heap_surgery ought to be okay, because it's operating
on a temp table; if there are any page access conflicts on that, we've
got BIG trouble ;-)

Poking around, I found a few other places where it looked like a skipped
page could produce diffs in the expected output:
contrib/amcheck/t/001_verify_heapam.pl
contrib/pg_visibility/sql/pg_visibility.sql

There are lots of other vacuums of course, but they don't look like
a missed page would have any effect on the visible results, so I think
we should leave them alone.

In short I propose the attached patch, which also gets rid of
that duplicate query.

                        regards, tom lane


diff --git a/contrib/amcheck/expected/check_heap.out b/contrib/amcheck/expected/check_heap.out
index 882f853d56..1fb3823142 100644
--- a/contrib/amcheck/expected/check_heap.out
+++ b/contrib/amcheck/expected/check_heap.out
@@ -109,7 +109,7 @@ ERROR:  ending block number must be between 0 and 0
 SELECT * FROM verify_heapam(relation := 'heaptest', startblock := 10000, endblock := 11000);
 ERROR:  starting block number must be between 0 and 0
 -- Vacuum freeze to change the xids encountered in subsequent tests
-VACUUM FREEZE heaptest;
+VACUUM (FREEZE, DISABLE_PAGE_SKIPPING) heaptest;
 -- Check that valid options are not rejected nor corruption reported
 -- for a non-empty frozen table
 SELECT * FROM verify_heapam(relation := 'heaptest', skip := 'none');
diff --git a/contrib/amcheck/sql/check_heap.sql b/contrib/amcheck/sql/check_heap.sql
index c10a25f21c..298de6886a 100644
--- a/contrib/amcheck/sql/check_heap.sql
+++ b/contrib/amcheck/sql/check_heap.sql
@@ -51,7 +51,7 @@ SELECT * FROM verify_heapam(relation := 'heaptest', startblock := 0, endblock :=
 SELECT * FROM verify_heapam(relation := 'heaptest', startblock := 10000, endblock := 11000);
 
 -- Vacuum freeze to change the xids encountered in subsequent tests
-VACUUM FREEZE heaptest;
+VACUUM (FREEZE, DISABLE_PAGE_SKIPPING) heaptest;
 
 -- Check that valid options are not rejected nor corruption reported
 -- for a non-empty frozen table
diff --git a/contrib/amcheck/t/001_verify_heapam.pl b/contrib/amcheck/t/001_verify_heapam.pl
index 1581e51f3c..a2f65b826d 100644
--- a/contrib/amcheck/t/001_verify_heapam.pl
+++ b/contrib/amcheck/t/001_verify_heapam.pl
@@ -46,7 +46,7 @@ detects_heap_corruption(
 # Check a corrupt table with all-frozen data
 #
 fresh_test_table('test');
-$node->safe_psql('postgres', q(VACUUM FREEZE test));
+$node->safe_psql('postgres', q(VACUUM (FREEZE, DISABLE_PAGE_SKIPPING) test));
 corrupt_first_page('test');
 detects_heap_corruption("verify_heapam('test')",
  "all-frozen corrupted table");
diff --git a/contrib/pageinspect/expected/page.out b/contrib/pageinspect/expected/page.out
index 4cd0db8018..4da28f0a1d 100644
--- a/contrib/pageinspect/expected/page.out
+++ b/contrib/pageinspect/expected/page.out
@@ -1,7 +1,7 @@
 CREATE EXTENSION pageinspect;
 CREATE TABLE test1 (a int, b int);
 INSERT INTO test1 VALUES (16777217, 131584);
-VACUUM test1;  -- set up FSM
+VACUUM (DISABLE_PAGE_SKIPPING) test1;  -- set up FSM
 -- The page contents can vary, so just test that it can be read
 -- successfully, but don't keep the output.
 SELECT octet_length(get_raw_page('test1', 'main', 0)) AS main_0;
@@ -87,18 +87,8 @@ SELECT * FROM fsm_page_contents(get_raw_page('test1', 'fsm', 0));
 (1 row)
 
 -- If we freeze the only tuple on test1, the infomask should
--- always be the same in all test runs. we show raw flags by
--- default: HEAP_XMIN_COMMITTED and HEAP_XMIN_INVALID.
-VACUUM FREEZE test1;
-SELECT t_infomask, t_infomask2, raw_flags, combined_flags
-FROM heap_page_items(get_raw_page('test1', 0)),
-     LATERAL heap_tuple_infomask_flags(t_infomask, t_infomask2);
- t_infomask | t_infomask2 |                         raw_flags                         |   combined_flags  
-------------+-------------+-----------------------------------------------------------+--------------------
-       2816 |           2 | {HEAP_XMIN_COMMITTED,HEAP_XMIN_INVALID,HEAP_XMAX_INVALID} | {HEAP_XMIN_FROZEN}
-(1 row)
-
--- output the decoded flag HEAP_XMIN_FROZEN instead
+-- always be the same in all test runs.
+VACUUM (FREEZE, DISABLE_PAGE_SKIPPING) test1;
 SELECT t_infomask, t_infomask2, raw_flags, combined_flags
 FROM heap_page_items(get_raw_page('test1', 0)),
      LATERAL heap_tuple_infomask_flags(t_infomask, t_infomask2);
diff --git a/contrib/pageinspect/sql/page.sql b/contrib/pageinspect/sql/page.sql
index 01844cb629..d333b763d7 100644
--- a/contrib/pageinspect/sql/page.sql
+++ b/contrib/pageinspect/sql/page.sql
@@ -3,7 +3,7 @@ CREATE EXTENSION pageinspect;
 CREATE TABLE test1 (a int, b int);
 INSERT INTO test1 VALUES (16777217, 131584);
 
-VACUUM test1;  -- set up FSM
+VACUUM (DISABLE_PAGE_SKIPPING) test1;  -- set up FSM
 
 -- The page contents can vary, so just test that it can be read
 -- successfully, but don't keep the output.
@@ -34,15 +34,9 @@ SELECT tuple_data_split('test1'::regclass, t_data, t_infomask, t_infomask2, t_bi
 SELECT * FROM fsm_page_contents(get_raw_page('test1', 'fsm', 0));
 
 -- If we freeze the only tuple on test1, the infomask should
--- always be the same in all test runs. we show raw flags by
--- default: HEAP_XMIN_COMMITTED and HEAP_XMIN_INVALID.
-VACUUM FREEZE test1;
+-- always be the same in all test runs.
+VACUUM (FREEZE, DISABLE_PAGE_SKIPPING) test1;
 
-SELECT t_infomask, t_infomask2, raw_flags, combined_flags
-FROM heap_page_items(get_raw_page('test1', 0)),
-     LATERAL heap_tuple_infomask_flags(t_infomask, t_infomask2);
-
--- output the decoded flag HEAP_XMIN_FROZEN instead
 SELECT t_infomask, t_infomask2, raw_flags, combined_flags
 FROM heap_page_items(get_raw_page('test1', 0)),
      LATERAL heap_tuple_infomask_flags(t_infomask, t_infomask2);
diff --git a/contrib/pg_visibility/expected/pg_visibility.out b/contrib/pg_visibility/expected/pg_visibility.out
index 0017e3415c..315633bfea 100644
--- a/contrib/pg_visibility/expected/pg_visibility.out
+++ b/contrib/pg_visibility/expected/pg_visibility.out
@@ -105,7 +105,7 @@ ERROR:  "test_foreign_table" is not a table, materialized view, or TOAST table
 create table regular_table (a int, b text);
 alter table regular_table alter column b set storage external;
 insert into regular_table values (1, repeat('one', 1000)), (2, repeat('two', 1000));
-vacuum regular_table;
+vacuum (disable_page_skipping) regular_table;
 select count(*) > 0 from pg_visibility('regular_table');
  ?column?
 ----------
@@ -132,7 +132,7 @@ select count(*) > 0 from pg_visibility((select reltoastrelid from pg_class where
 (1 row)
 
 create materialized view matview_visibility_test as select * from regular_table;
-vacuum matview_visibility_test;
+vacuum (disable_page_skipping) matview_visibility_test;
 select count(*) > 0 from pg_visibility('matview_visibility_test');
  ?column?
 ----------
@@ -149,7 +149,7 @@ select count(*) > 0 from pg_visibility('matview_visibility_test');
 
 -- regular tables which are part of a partition *do* have visibility maps
 insert into test_partition values (1);
-vacuum test_partition;
+vacuum (disable_page_skipping) test_partition;
 select count(*) > 0 from pg_visibility('test_partition', 0);
  ?column?
 ----------
diff --git a/contrib/pg_visibility/sql/pg_visibility.sql b/contrib/pg_visibility/sql/pg_visibility.sql
index ec1afd4906..ff3538f996 100644
--- a/contrib/pg_visibility/sql/pg_visibility.sql
+++ b/contrib/pg_visibility/sql/pg_visibility.sql
@@ -71,7 +71,7 @@ select pg_truncate_visibility_map('test_foreign_table');
 create table regular_table (a int, b text);
 alter table regular_table alter column b set storage external;
 insert into regular_table values (1, repeat('one', 1000)), (2, repeat('two', 1000));
-vacuum regular_table;
+vacuum (disable_page_skipping) regular_table;
 select count(*) > 0 from pg_visibility('regular_table');
 select count(*) > 0 from pg_visibility((select reltoastrelid from pg_class where relname = 'regular_table'));
 truncate regular_table;
@@ -79,7 +79,7 @@ select count(*) > 0 from pg_visibility('regular_table');
 select count(*) > 0 from pg_visibility((select reltoastrelid from pg_class where relname = 'regular_table'));
 
 create materialized view matview_visibility_test as select * from regular_table;
-vacuum matview_visibility_test;
+vacuum (disable_page_skipping) matview_visibility_test;
 select count(*) > 0 from pg_visibility('matview_visibility_test');
 insert into regular_table values (1), (2);
 refresh materialized view matview_visibility_test;
@@ -87,7 +87,7 @@ select count(*) > 0 from pg_visibility('matview_visibility_test');
 
 -- regular tables which are part of a partition *do* have visibility maps
 insert into test_partition values (1);
-vacuum test_partition;
+vacuum (disable_page_skipping) test_partition;
 select count(*) > 0 from pg_visibility('test_partition', 0);
 select count(*) > 0 from pg_visibility_map('test_partition');
 select count(*) > 0 from pg_visibility_map_summary('test_partition');
Reply | Threaded
Open this post in threaded view
|

Re: Odd, intermittent failure in contrib/pageinspect

Andres Freund
In reply to this post by Álvaro Herrera
Hi,

On 2021-01-18 19:40:05 -0300, Alvaro Herrera wrote:
> > [ thinks for a bit... ]  Does the checkpointer pin pages it's writing
> > out?  I guess it'd have to ...
>
> It does, per SyncOneBuffer(), called from BufferSync(), called from
> CheckPointBuffers().

I think you don't event need checkpointer to be involved, normal buffer
replacement would do the trick. We briefly pin the page in BufferAlloc()
even if the page is clean. Longer when it's dirty, of course.

Greetings,

Andres Freund


Reply | Threaded
Open this post in threaded view
|

Re: Odd, intermittent failure in contrib/pageinspect

Tom Lane-2
Andres Freund <[hidden email]> writes:
> I think you don't event need checkpointer to be involved, normal buffer
> replacement would do the trick. We briefly pin the page in BufferAlloc()
> even if the page is clean. Longer when it's dirty, of course.

True, but it seems unlikely that the pages in question here would be
chosen as replacement victims.  These are non-parallel tests, so
there's little competitive pressure.  I could believe that a background
autovacuum is active, but not that it's dirtied so many pages that
tables the test script just created need to get swapped out.

The checkpointer theory seems good because it requires no assumptions
at all about competing demand for buffers.  If the clock sweep gets
to the table page (which we know is recently dirtied) at just the right
time, we'll see a failure.

                        regards, tom lane


Reply | Threaded
Open this post in threaded view
|

Re: Odd, intermittent failure in contrib/pageinspect

Michael Paquier-2
In reply to this post by Tom Lane-2
On Tue, Jan 19, 2021 at 05:03:49PM -0500, Tom Lane wrote:
> It looks to me like heap_surgery ought to be okay, because it's operating
> on a temp table; if there are any page access conflicts on that, we've
> got BIG trouble ;-)

Bah, of course.  I managed to miss this part.

> Poking around, I found a few other places where it looked like a skipped
> page could produce diffs in the expected output:
> contrib/amcheck/t/001_verify_heapam.pl
> contrib/pg_visibility/sql/pg_visibility.sql
>
> There are lots of other vacuums of course, but they don't look like
> a missed page would have any effect on the visible results, so I think
> we should leave them alone.

Yeah, I got to wonder a bit about check_btree.sql on a second look,
but that's no big deal to leave it alone either.

> In short I propose the attached patch, which also gets rid of
> that duplicate query.

Agreed, +1.
--
Michael

signature.asc (849 bytes) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Odd, intermittent failure in contrib/pageinspect

Tom Lane-2
Michael Paquier <[hidden email]> writes:
> On Tue, Jan 19, 2021 at 05:03:49PM -0500, Tom Lane wrote:
>> In short I propose the attached patch, which also gets rid of
>> that duplicate query.

> Agreed, +1.

Pushed, thanks for looking at it.

                        regards, tom lane