Subscription test 013_partition.pl fails under CLOBBER_CACHE_ALWAYS


Subscription test 013_partition.pl fails under CLOBBER_CACHE_ALWAYS

Tom Lane-2
In connection with a nearby thread, I tried to run the subscription
test suite in a CLOBBER_CACHE_ALWAYS build.  I soon found that I had
to increase wal_receiver_timeout, but after doing this:

diff --git a/src/test/perl/PostgresNode.pm b/src/test/perl/PostgresNode.pm
index 1488bff..5fe6810 100644
*** a/src/test/perl/PostgresNode.pm
--- b/src/test/perl/PostgresNode.pm
*************** sub init
*** 447,452 ****
--- 447,453 ----
    print $conf "log_statement = all\n";
    print $conf "log_replication_commands = on\n";
    print $conf "wal_retrieve_retry_interval = '500ms'\n";
+   print $conf "wal_receiver_timeout = '10min'\n";
 
    # If a setting tends to affect whether tests pass or fail, print it after
    # TEMP_CONFIG.  Otherwise, print it before TEMP_CONFIG, thereby permitting

I let it run overnight, and came back to find that it was stuck at

[03:02:15] t/013_partition.pl ................. 19/51

and had been for circa eight hours, where extrapolation from other tests
said it shouldn't take much over half an hour.  Investigation found that
the subscriber was repeatedly failing like this:

2020-09-14 11:05:26.483 EDT [1030506] LOG:  logical replication apply worker for subscription "sub1" has started
2020-09-14 11:05:27.139 EDT [1030506] ERROR:  cache lookup failed for relation 0
2020-09-14 11:05:27.140 EDT [947156] LOG:  background worker "logical replication worker" (PID 1030506) exited with exit code 1
2020-09-14 11:05:27.571 EDT [1030509] LOG:  logical replication apply worker for subscription "sub1" has started
2020-09-14 11:05:28.227 EDT [1030509] ERROR:  cache lookup failed for relation 0
2020-09-14 11:05:28.228 EDT [947156] LOG:  background worker "logical replication worker" (PID 1030509) exited with exit code 1

The publisher's log shows no sign of distress:

2020-09-14 11:06:09.380 EDT [1030619] sub1 LOG:  statement: SELECT pg_catalog.set_config('search_path', '', false);
2020-09-14 11:06:09.446 EDT [1030619] sub1 LOG:  received replication command: IDENTIFY_SYSTEM
2020-09-14 11:06:09.446 EDT [1030619] sub1 LOG:  received replication command: START_REPLICATION SLOT "sub1" LOGICAL 0/163CF08 (proto_version '2', publication_names '"pub1"')
2020-09-14 11:06:09.447 EDT [1030619] sub1 LOG:  starting logical decoding for slot "sub1"
2020-09-14 11:06:09.447 EDT [1030619] sub1 DETAIL:  Streaming transactions committing after 0/163D848, reading WAL from 0/163CF08.
2020-09-14 11:06:09.447 EDT [1030619] sub1 LOG:  logical decoding found consistent point at 0/163CF08
2020-09-14 11:06:09.447 EDT [1030619] sub1 DETAIL:  There are no running transactions.
2020-09-14 11:06:10.468 EDT [1030621] sub1 LOG:  statement: SELECT pg_catalog.set_config('search_path', '', false);
2020-09-14 11:06:10.533 EDT [1030621] sub1 LOG:  received replication command: IDENTIFY_SYSTEM
2020-09-14 11:06:10.534 EDT [1030621] sub1 LOG:  received replication command: START_REPLICATION SLOT "sub1" LOGICAL 0/163CF08 (proto_version '2', publication_names '"pub1"')
2020-09-14 11:06:10.534 EDT [1030621] sub1 LOG:  starting logical decoding for slot "sub1"
2020-09-14 11:06:10.534 EDT [1030621] sub1 DETAIL:  Streaming transactions committing after 0/163D848, reading WAL from 0/163CF08.
2020-09-14 11:06:10.534 EDT [1030621] sub1 LOG:  logical decoding found consistent point at 0/163CF08
2020-09-14 11:06:10.534 EDT [1030621] sub1 DETAIL:  There are no running transactions.

I do not have time today to chase this further, but somebody should.

More generally, this seems like good evidence that we really oughta have a
buildfarm member that's running *all* the tests under CLOBBER_CACHE_ALWAYS
not just the core tests.  That seems impossibly expensive, but I realized
while watching these tests that a ridiculous fraction of the runtime is
being spent in repeated initdb calls.  On my machine, initdb takes about
12 minutes under CCA, so doing it separately for publisher and subscriber
means 24 minutes, which compares not at all favorably to the
circa-half-an-hour total runtime of each of the subscription test scripts.
We're surely not learning anything after the first CCA run of initdb, so
if we could find a way to skip that overhead for later runs, it'd make a
huge difference in the practicality of running these tests under CCA.

I recall having worked on a patch to make the regression tests run
initdb just once, creating a template directory tree, and then "cp -a"
that into place for each test.  I did not finish it, because it wasn't
showing a lot of advantage in a normal test run, but maybe the idea
could be resurrected for CCA and other slow builds.
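
The template idea could be sketched along these lines (a minimal sketch only; the paths and helper names are illustrative, not the unfinished patch):

```shell
#!/bin/sh
# Sketch: pay the expensive initdb cost once, then clone the resulting
# data directory for each test node with "cp -a".  Under CCA the initdb
# step is the ~12-minute part; the copy is cheap by comparison.

TEMPLATE=/tmp/pg_template_datadir

make_template() {
    # Run initdb only if the template doesn't exist yet.
    [ -d "$TEMPLATE" ] || initdb -D "$TEMPLATE" >/dev/null
}

clone_datadir() {
    # "cp -a" preserves permissions and timestamps, so the clone is
    # usable as a data directory immediately.
    dest=$1
    cp -a "$TEMPLATE" "$dest"
}
```

Per-node configuration differences would still have to be applied after cloning, which is presumably part of what made the original patch fiddly.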

Another idea is to make CCA a little more dynamic, say allow it to be
suppressed through an environment variable setting, and then use that
to speed up per-test initdbs.
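
Usage might look something like this (purely hypothetical: no such environment variable exists today, and the name is invented for illustration):

```shell
# Hypothetical: run initdb with cache-clobbering suppressed, so it takes
# seconds instead of minutes, while the test servers themselves keep the
# full CLOBBER_CACHE_ALWAYS behavior.
PG_SUPPRESS_CLOBBER_CACHE=1 initdb -D "$PGDATA"
pg_ctl -D "$PGDATA" -l logfile start   # backends still clobber caches
```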

                        regards, tom lane



Re: Subscription test 013_partition.pl fails under CLOBBER_CACHE_ALWAYS

Neha Sharma
Hi Tom,

I have tested the subscription test 013_partition.pl with CCA enabled on HEAD and PG13,
and I am able to reproduce the issue on both versions.

Logs:
[centos@clobber-cache subscription]$ git branch
* REL_13_STABLE
  master
[centos@clobber-cache-db93 subscription]$ tail -f tmp_check/log/013_partition_subscriber1.log
2020-09-15 08:42:19.763 UTC [27866] LOG:  logical replication apply worker for subscription "sub1" has started
2020-09-15 08:42:20.395 UTC [27866] ERROR:  cache lookup failed for relation 0
2020-09-15 08:42:20.436 UTC [26427] LOG:  background worker "logical replication worker" (PID 27866) exited with exit code 1
2020-09-15 08:42:20.835 UTC [27868] LOG:  logical replication apply worker for subscription "sub1" has started
2020-09-15 08:42:21.462 UTC [27868] ERROR:  cache lookup failed for relation 0
2020-09-15 08:42:21.508 UTC [26427] LOG:  background worker "logical replication worker" (PID 27868) exited with exit code 1
2020-09-15 08:42:21.921 UTC [27870] LOG:  logical replication apply worker for subscription "sub1" has started
2020-09-15 08:42:22.551 UTC [27870] ERROR:  cache lookup failed for relation 0


Thanks.
--
Regards,
Neha Sharma


On Mon, Sep 14, 2020 at 8:50 PM Tom Lane <[hidden email]> wrote:
> [ quoted message trimmed ]



Re: Subscription test 013_partition.pl fails under CLOBBER_CACHE_ALWAYS

Tom Lane-2
In reply to this post by Tom Lane-2
I wrote:
> [ $subject ]

I found some time to trace this down, and what it turns out to be is
that apply_handle_truncate() is making use of a LogicalRepRelMapEntry's
localreloid field without any consideration for the possibility that
that's been set to zero as a result of a cache flush.  The visible
symptom of "cache lookup failed for relation 0" comes from trying
to invoke find_all_inheritors with a zero OID.

Now, study of apply_handle_truncate doesn't immediately reveal where
a cache flush could have occurred, but I realized that it's actually
possible that the LogicalRepRelMapEntry is *already* marked invalid
when logicalrep_rel_open() returns!  That's because for some reason
it does GetSubscriptionRelState last, after it's already marked the
entry valid, and that function does plenty o' catalog accesses.

It's not really clear to me why setting localreloid to zero is a sane
way to represent "this entry needs to be revalidated".  I think a
separate flag would be more appropriate.  Once we have lock on the
target relation, it seems to me that no interesting changes should
be possible as long as we have lock; so there's no very good reason
to destroy useful state to remind ourselves that we should recheck
it next time.

                        regards, tom lane



Re: Subscription test 013_partition.pl fails under CLOBBER_CACHE_ALWAYS

Tom Lane-2
I wrote:
> It's not really clear to me why setting localreloid to zero is a sane
> way to represent "this entry needs to be revalidated".  I think a
> separate flag would be more appropriate.  Once we have lock on the
> target relation, it seems to me that no interesting changes should
> be possible as long as we have lock; so there's no very good reason
> to destroy useful state to remind ourselves that we should recheck
> it next time.

Here's a patch that changes that, and also cleans up some sloppy
thinking about how to re-acquire lock on the replication target
relation.  (Just because the OID was valid last you heard does
not mean that table_open is guaranteed to succeed.)

With this, we get through 013_partition.pl under CCA.  I plan to
try to run all of subscription/ and recovery/ before concluding
there's nothing else to fix, though.

                        regards, tom lane


diff --git a/src/backend/replication/logical/relation.c b/src/backend/replication/logical/relation.c
index 3d2d56295b..9ee70a2563 100644
--- a/src/backend/replication/logical/relation.c
+++ b/src/backend/replication/logical/relation.c
@@ -77,7 +77,7 @@ logicalrep_relmap_invalidate_cb(Datum arg, Oid reloid)
  {
  if (entry->localreloid == reloid)
  {
- entry->localreloid = InvalidOid;
+ entry->localrelvalid = false;
  hash_seq_term(&status);
  break;
  }
@@ -91,7 +91,7 @@ logicalrep_relmap_invalidate_cb(Datum arg, Oid reloid)
  hash_seq_init(&status, LogicalRepRelMap);
 
  while ((entry = (LogicalRepRelMapEntry *) hash_seq_search(&status)) != NULL)
- entry->localreloid = InvalidOid;
+ entry->localrelvalid = false;
  }
 }
 
@@ -230,15 +230,13 @@ logicalrep_rel_att_by_name(LogicalRepRelation *remoterel, const char *attname)
 /*
  * Open the local relation associated with the remote one.
  *
- * Optionally rebuilds the Relcache mapping if it was invalidated
- * by local DDL.
+ * Rebuilds the Relcache mapping if it was invalidated by local DDL.
  */
 LogicalRepRelMapEntry *
 logicalrep_rel_open(LogicalRepRelId remoteid, LOCKMODE lockmode)
 {
  LogicalRepRelMapEntry *entry;
  bool found;
- Oid relid = InvalidOid;
  LogicalRepRelation *remoterel;
 
  if (LogicalRepRelMap == NULL)
@@ -254,14 +252,45 @@ logicalrep_rel_open(LogicalRepRelId remoteid, LOCKMODE lockmode)
 
  remoterel = &entry->remoterel;
 
+ /* Ensure we don't leak a relcache refcount. */
+ if (entry->localrel)
+ elog(ERROR, "remote relation ID %u is already open", remoteid);
+
  /*
  * When opening and locking a relation, pending invalidation messages are
- * processed which can invalidate the relation.  We need to update the
- * local cache both when we are first time accessing the relation and when
- * the relation is invalidated (aka entry->localreloid is set InvalidOid).
+ * processed which can invalidate the relation.  Hence, if the entry is
+ * currently considered valid, try to open the local relation by OID and
+ * see if invalidation ensues.
+ */
+ if (entry->localrelvalid)
+ {
+ entry->localrel = try_table_open(entry->localreloid, lockmode);
+ if (!entry->localrel)
+ {
+ /* Table was renamed or dropped. */
+ entry->localrelvalid = false;
+ }
+ else if (!entry->localrelvalid)
+ {
+ /* Note we release the no-longer-useful lock here. */
+ table_close(entry->localrel, lockmode);
+ entry->localrel = NULL;
+ }
+ }
+
+ /*
+ * If the entry has been marked invalid since we last had lock on it,
+ * re-open the local relation by name and rebuild all derived data.
  */
- if (!OidIsValid(entry->localreloid))
+ if (!entry->localrelvalid)
  {
+ Oid relid;
+ int found;
+ Bitmapset  *idkey;
+ TupleDesc desc;
+ MemoryContext oldctx;
+ int i;
+
  /* Try to find and lock the relation by name. */
  relid = RangeVarGetRelid(makeRangeVar(remoterel->nspname,
   remoterel->relname, -1),
@@ -272,21 +301,7 @@ logicalrep_rel_open(LogicalRepRelId remoteid, LOCKMODE lockmode)
  errmsg("logical replication target relation \"%s.%s\" does not exist",
  remoterel->nspname, remoterel->relname)));
  entry->localrel = table_open(relid, NoLock);
-
- }
- else
- {
- relid = entry->localreloid;
- entry->localrel = table_open(entry->localreloid, lockmode);
- }
-
- if (!OidIsValid(entry->localreloid))
- {
- int found;
- Bitmapset  *idkey;
- TupleDesc desc;
- MemoryContext oldctx;
- int i;
+ entry->localreloid = relid;
 
  /* Check for supported relkind. */
  CheckSubscriptionRelkind(entry->localrel->rd_rel->relkind,
@@ -380,7 +395,7 @@ logicalrep_rel_open(LogicalRepRelId remoteid, LOCKMODE lockmode)
  }
  }
 
- entry->localreloid = relid;
+ entry->localrelvalid = true;
  }
 
  if (entry->state != SUBREL_STATE_READY)
@@ -523,7 +538,7 @@ logicalrep_partmap_invalidate_cb(Datum arg, Oid reloid)
  {
  if (entry->localreloid == reloid)
  {
- entry->localreloid = InvalidOid;
+ entry->localrelvalid = false;
  hash_seq_term(&status);
  break;
  }
@@ -537,7 +552,7 @@ logicalrep_partmap_invalidate_cb(Datum arg, Oid reloid)
  hash_seq_init(&status, LogicalRepPartMap);
 
  while ((entry = (LogicalRepRelMapEntry *) hash_seq_search(&status)) != NULL)
- entry->localreloid = InvalidOid;
+ entry->localrelvalid = false;
  }
 }
 
@@ -656,6 +671,8 @@ logicalrep_partition_open(LogicalRepRelMapEntry *root,
 
  entry->updatable = root->updatable;
 
+ entry->localrelvalid = true;
+
  /* state and statelsn are left set to 0. */
  MemoryContextSwitchTo(oldctx);
 
diff --git a/src/include/replication/logicalrelation.h b/src/include/replication/logicalrelation.h
index a6b44b12bd..62ddd3c7a2 100644
--- a/src/include/replication/logicalrelation.h
+++ b/src/include/replication/logicalrelation.h
@@ -19,9 +19,16 @@ typedef struct LogicalRepRelMapEntry
 {
  LogicalRepRelation remoterel; /* key is remoterel.remoteid */
 
- /* Mapping to local relation, filled as needed. */
+ /*
+ * Validity flag -- when false, revalidate all derived info at next
+ * logicalrep_rel_open.  (While the localrel is open, we assume our lock
+ * on that rel ensures the info remains good.)
+ */
+ bool localrelvalid;
+
+ /* Mapping to local relation. */
  Oid localreloid; /* local relation id */
- Relation localrel; /* relcache entry */
+ Relation localrel; /* relcache entry (NULL when closed) */
  AttrMap    *attrmap; /* map of local attributes to remote ones */
  bool updatable; /* Can apply updates/deletes? */
 

Re: Subscription test 013_partition.pl fails under CLOBBER_CACHE_ALWAYS

Tom Lane-2
I wrote:
> With this, we get through 013_partition.pl under CCA.  I plan to
> try to run all of subscription/ and recovery/ before concluding
> there's nothing else to fix, though.

Looks like the rest passes.  FTR, it was possible to get through
subscription/ in about 2 hours on my workstation, and recovery/ in about
30 minutes, after hacking things so that initdb used a non-CCA backend.
So that's definitely a promising thing to look at if anyone wants
to try to do this on a regular basis.

Aside from increasing wal_receiver_timeout as previously mentioned,
I found I had to set a higher pg_ctl start timeout using PGCTLTIMEOUT
in order to get through the recovery tests.
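
Putting the knobs from this thread together, a CCA test run needs roughly the following environment (the specific values are illustrative; tune them to your machine — TEMP_CONFIG is the mechanism PostgresNode.pm already honors for extra settings):

```shell
# Give pg_ctl plenty of time to start a CLOBBER_CACHE_ALWAYS postmaster.
export PGCTLTIMEOUT=900

# Raise wal_receiver_timeout for every test node via TEMP_CONFIG rather
# than patching PostgresNode.pm.
export TEMP_CONFIG=/tmp/cca_extra.conf
cat > "$TEMP_CONFIG" <<'EOF'
wal_receiver_timeout = '10min'
EOF
```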

                        regards, tom lane



Re: Subscription test 013_partition.pl fails under CLOBBER_CACHE_ALWAYS

akapila
In reply to this post by Tom Lane-2
On Wed, Sep 16, 2020 at 1:16 AM Tom Lane <[hidden email]> wrote:

>
> I wrote:
> It's not really clear to me why setting localreloid to zero is a sane
> way to represent "this entry needs to be revalidated".  I think a
> separate flag would be more appropriate.  Once we have lock on the
> target relation, it seems to me that no interesting changes should
> be possible as long as we have lock; so there's no very good reason
> to destroy useful state to remind ourselves that we should recheck
> it next time.
>

So, can we assume that the current code can only cause the problem in
CCA builds but not in any practical scenario, because after we have a
lock on the relation there shouldn't be any invalidation that leads to
this problem?

--
With Regards,
Amit Kapila.



Re: Subscription test 013_partition.pl fails under CLOBBER_CACHE_ALWAYS

Tom Lane-2
Amit Kapila <[hidden email]> writes:
> So, can we assume that the current code can only cause the problem in
> CCA builds but not in any practical scenario, because after we have a
> lock on the relation there shouldn't be any invalidation that leads to
> this problem?

No.  The reason we expend so much time and effort on CCA testing is
that cache flushes are unpredictable, and they can happen even when
you have a lock on the object(s) in question.  In particular, the
easiest real-world case that could cause the described problem is an
sinval queue overrun that we detect during the GetSubscriptionRelState
call at the bottom of logicalrep_rel_open.  In that case we'll
come back to the caller with the LogicalRepRelMapEntry marked as
already needing revalidation, because the cache inval callback
will have marked *all* of them that way.  That's actually a harmless
condition, because we have lock on the rel so nothing really
changed ... but if we blew away localreloid and the caller needs
to use that, kaboom.

We could imagine marking the entry valid at the very bottom of
logicalrep_rel_open, but that just moves the problem somewhere else.
Any caller that does *any* catalog access while holding open a
LogicalRepRelMapEntry would not be able to rely on its localreloid
staying valid.  That's a recipe for irreproducible bugs, and it's
unnecessary.  In practice the entry is good as long as
we continue to hold a lock on the local relation.  So we should
mark LogicalRepRelMapEntries as potentially-needing-revalidation
in a way that doesn't interfere with active users of the entry.

In short: the value of CCA testing is to model sinval overruns
happening at any point where they could happen.  The real-world
odds of one happening at any given instant are low, but they're
never zero.

                        regards, tom lane



Re: Subscription test 013_partition.pl fails under CLOBBER_CACHE_ALWAYS

akapila
On Wed, Sep 16, 2020 at 9:41 AM Tom Lane <[hidden email]> wrote:

>
> Amit Kapila <[hidden email]> writes:
> > So, can we assume that the current code can only cause the problem in
> > CCA builds bot not in any practical scenario because after having a
> > lock on relation probably there shouldn't be any invalidation which
> > leads to this problem?
>
>
> In short: the value of CCA testing is to model sinval overruns
> happening at any point where they could happen.  The real-world
> odds of one happening at any given instant are low, but they're
> never zero.
>

Thanks for the explanation. I have read your patch and it looks good to me.

--
With Regards,
Amit Kapila.



Re: Subscription test 013_partition.pl fails under CLOBBER_CACHE_ALWAYS

Tom Lane-2
Amit Kapila <[hidden email]> writes:
> Thanks for the explanation. I have read your patch and it looks good to me.

Pushed, thanks for checking the patch.

                        regards, tom lane