ERROR: subtransaction logged without previous top-level txn record

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
39 messages Options
12
Reply | Threaded
Open this post in threaded view
|

ERROR: subtransaction logged without previous top-level txn record

Hsu, John
Hello,

I saw the previous thread but it wasn't in my inbox at the time, so I’m creating a new one sorry about that.
https://www.postgresql.org/message-id/20190516170434.masck6ehwg2kvbi2@...

I’ve managed to reproduce the issue pretty consistently on REL9_6_STABLE on commit 959792087a10baf7f1b58408d28411109bcedb7a

OS version:
[ec2-user@ ... ~]$ uname -a
... 4.14.77-80.57.amzn2.x86_64 #1 SMP Tue Nov 6 21:18:57 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

Postgres version:
postgres=# SELECT version();
                                                 version                                                  
----------------------------------------------------------------------------------------------------------
 PostgreSQL 9.6.13 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 7.3.1 20180303 (Red Hat 7.3.1-5), 64-bit
(1 row)

I’m on an EC2 m5.4xlarge if that matters.

Repro steps:

1. Create the following script
[ec2-user@ip-172-31-18-48 ~]$ cat ~/subbench.txt
\set aid random(1, 100000 * :scale)
\set bid random(1, 1 * :scale)
\set tid random(1, 10 * :scale)
\set delta random(-5000, 5000)
\set subcnt random(1, 800)
select * from pgbench(:aid, :bid, :tid, :delta, :subcnt);

2. Create the following function:

CREATE OR REPLACE FUNCTION pgbench(pAid int, pBid int, pTid int, delta int, subcnt int) returns int as $$
DECLARE
    abal int;
BEGIN
    FOR i in 1 .. subcnt LOOP
        BEGIN
            UPDATE pgbench_accounts SET abalance = abalance + delta WHERE aid = pAid;
            --subcnt := subcnt;
        EXCEPTION
        WHEN division_by_zero THEN
            subcnt := subcnt;
        END;
    END LOOP;
    abal := abalance FROM pgbench_accounts WHERE aid = pAid;
    return abal;
END; $$LANGUAGE 'plpgsql';

3. Create a few logical slots in the database

select pg_create_logical_replication_slot('test_slot_1', 'test_decoding');
select pg_create_logical_replication_slot('test_slot_2', 'test_decoding');
select pg_create_logical_replication_slot('test_slot_3', 'test_decoding');
...

4. Initialize pgbench
pgbench -i -d postgres

5. Load the data
pgbench -f subbench.txt -c 64 -j 64 -T 600 -P 1 -d postgres

6. Run pg_recvlogical with a timeout, it usually takes a few iterations (~7-9) before the error occurs

var=0
while true; do
timeout 30 pg_recvlogical -d postgres --start --slot test_slot_1 -f /dev/null;
var=$((var+1))
echo "Sleeping 5s Time: $var";
sleep 5;
done

pg_recvlogical -d postgres --start --slot test_slot_1 -f -
pg_recvlogical: unexpected termination of replication stream: ERROR:  subtransaction logged without previous top-level txn record

pg_recvlogical -d postgres --start --slot test_slot_2 -f -
pg_recvlogical: unexpected termination of replication stream: ERROR:  subtransaction logged without previous top-level txn record

pg_recvlogical -d postgres --start --slot test_slot_3 -f -
pg_recvlogical: unexpected termination of replication stream: ERROR:  subtransaction logged without previous top-level txn record
pg_recvlogical: disconnected; waiting 5 seconds to try again

What's interesting is that the confirmed_flush_lsn are all different from test_slot_1 --> test_slot_3

postgres=# select * from pg_replication_slots;
  slot_name   |    plugin     | slot_type | datoid | database | active | active_pid | xmin | catalog_xmin | restart_lsn | confirmed_flush_lsn
--------------+---------------+-----------+--------+----------+--------+------------+------+--------------+-------------+---------------------
 test_slot_1  | test_decoding | logical   |  13382 | postgres | f      |            |      |         1848 | 0/1C5BC5A0  | 0/5488E468
 test_slot_2  | test_decoding | logical   |  13382 | postgres | f      |            |      |         1848 | 0/1C5BC5A0  | 0/40E45EA0
 test_slot_3  | test_decoding | logical   |  13382 | postgres | f      |            |      |         1848 | 0/3F4B6AF8  | 0/6BB3A990


Let me know if you require more info to repro.

Thanks!

John H







Reply | Threaded
Open this post in threaded view
|

Re: ERROR: subtransaction logged without previous top-level txn record

Arseny Sher-2
Hi,

Our customer also encountered this issue and I've looked into it. The problem is
reproduced well enough using the instructions in the previous message.

The check leading to this ERROR is too strict, it forbids legit behaviours. Say
we have in WAL

[ <xl_xact_assignment_1> <restart_lsn> <subxact_change> <xl_xact_assignment_1> <commit> confirmed_flush_lsn> ]

- First xl_xact_assignment record is beyond reading, i.e. earlier
  restart_lsn, where ready snapshot will be taken from disk.
- After restart_lsn there is some change of a subxact.
- After that, there is second xl_xact_assignment (for another subxact)
  revealing relationship between top and first subxact, where this ERROR fires.

Such transaction won't be streamed because we hadn't seen it in full. It must be
finished before streaming will start, i.e. before confirmed_flush_lsn.

Of course, the easiest fix is to just throw the check out. However, supposing
that someone would probably want to relax it instead, I considered ways to
accomplish this. Something like 'if we are still in SNAPSHOT_FULL and xid is
before SnapBuildNextPhaseAt, just ignore xl_xact_assignment record, we haven't
seen such xact in full and definitely won't stream it.' That led to discovery of
another bug in the place which I had found suspicious long before.

Snapbuilder enters into SNAPBUILD_CONSISTENT immediately after deserializing the
snapshot. Generally this is incorrect because SNAPBUILD_CONSISTENT means not
just complete snapshot (snapshot by itself in FULL state is just good as in
CONSISTENT), but also reorderbuffer filled with all currently running
xacts. This is painless for decoding sessions with existing slots because they
won't stream anything before confirmed_flush_lsn is reached anyway, at which
point all transactions which hadn't got into reorderbuffer would definitely
finish. However, new slots might be created too early, thus losing (not
decoding) parts of transactions committed after freshly created
confirmed_flush_lsn. This can happen under the following extremely unlucky
circumstances:
  - New slot creation reserves point in WAL since which it would read it
    (GetXLogInsertRecPtr);
  - It logs xl_running_xacts to start assembling a snapshot;
  - Running decoding session with another slot quickly reads this
    xl_running_xacts and serializes its snapshot;
  - New slot reads xl_running_xacts and picks this snapshot up, saying that it
    is ready to stream henceforth, though its reorderbuffer is empty.

Exact reproducing steps:

-- session 1
create table t (i int);
select pg_create_logical_replication_slot('slot_1', 'test_decoding');

-- session 2
begin;
insert into t values (1);

-- session 3, start slot creation
select pg_create_logical_replication_slot('slot_2', 'test_decoding');
-- stop (with gdb or something) it at DecodingContextFindStartpoint(ctx);

-- session 1
-- xl_running_xacts is dumped by ReplicationSlotReserveWal in previous command, no
-- need to sleep; our snap will be immediately serialized there
SELECT data FROM pg_logical_slot_get_changes('slot_1', NULL, NULL, 'include-xids', '1', 'skip-empty-xacts', '0');

-- continue slot_2 creation

-- session 2: insert some more and commit
insert into t values (1);
commit;

-- now this would find second insert, but not the first one
SELECT data FROM pg_logical_slot_get_changes('slot_2', NULL, NULL, 'include-xids', '1', 'skip-empty-xacts', '0');


What we can do here? Initially I was like, ok, then let's get into FULL_SNAPSHOT
upon deserializing the snap and wait for all xacts finish as usual. However, to
my surprise I've found this impossible. That is, snapbuilder has no way to
enforce that we go into CONSISTENT only when we have seen all running xacts
completely without risk of skipping legit transactions. Specifically, after
deserializing FULL snapshot snapbuilder must iterate over WAL further until all
running xacts finish, as we must see with correct snapshots all changes of every
transaction we are going to stream. However, snapbuilder can't *immediately*
notice this point, because
 - Snapbuilder updates xmin (first running xact) by taking it from xl_running_xacts
   (c.f. SnapBuildProcessRunningXacts). Even if we guarantee that, for
   each possible WAL reading starting position, there is always an an
   xl_running_xacts records logged right before the earliest possible
   streaming point -- IOW, after all xacts which we can't stream had
   finished (which is currently true btw, as slot's advancement is
   considered only at xl_running_xacts) -- that would not be enough due
   to races around xl_running_xacts, i.e with WAL like
   [ <T1> <restart_lsn> <T1 commit> <confirmed_flush_lsn, xrx> <T2 commit> ]
   T2 might be skipped if T1 is shown as running in xl_running_xacts.
 - Tracking xmin manually by recoding commits is not only inefficient,
   it just not feasible because serialized snapshot is not full: it
   contains only committed catalog-modifying xacts. Thus, we can't
   distinguish non-catalog-modifying xact committed before serialized
   snapshot from not yet committed one.

Which means only code external to snapbuilder knows the earliest point suitable
for streaming; slot advancement machinery ensures that <restart_lsn,
confirmed_flush_lsn> pair is always good. So possible fix is the following: if
snapbuilder's user knows exact LSN since which streaming is safe (existing slot,
essentially), trust him and and switch into CONSISTENT state after deserializing
snapshot as before. OTOH, if he doesn't know it (new slot creation), go via
usual FULL -> CONSISTENT procedure; we might transition into CONSISTENT a bit
later than it became possible, but there is nothing bad about that.

First attached patch implements this. I don't particularly like it, but the only
alternative which I see is to rework slots advancement logic to make
<restart_lsn, confirmed_flush_lsn> pair such that there is always
xl_running_xacts before confirmed_flush_lsn which confirms all xacts running as
of restart_lsn have finished. This unnecessary complexity looks much worse.


As for the check in the topic, I nonetheless propose to remove it completely, as
in second attached patch. Saying for sure whether xact of some record
encountered after snapshot was deserialized can be streamed or not requires to
know nextXid (first not yet running xid) as of this snapshot's lsn -- all xids <
nextXid possibly hadn't been seen in full and are not subject to
decoding. However, generally we don't know nextXid which is taken from
xl_running_xacts; in particular snapshot can be serizalized/deserialized at
XLOG_END_OF_RECOVERY. Changing that for the sake of the check in question is not
worthwhile, so just throw it out instead of trying to relax.

In fact, I don't see what is so important about seeing the top xact first at
all. During COMMIT decoding we'll know all subxids anyway, so why care?


P.S. While digging this, I have noted that return values of
SnapBuildFindSnapshot seem pretty random to me. Basically returning 'true'
performs immediately 4 things:
 - update xmin
 - purge old xip entries
 - advance xmin of the slot
 - if CONSISTENT, advance lsn (earliest serialized snap)

The latter two make sense only after slot created or confirmed_flush_lsn
reached. The first two make sense even immediately after deserializing the
snapshot (because it is serialized *before* updating xmin and xip); generally,
always when committed xids are tracked. Then why cleanup is done when xmin
horizon is too low? Why it is not performed after restoring serialized snapshot?
On the whole, I find this not very important as all these operations are pretty
cheap and harmless if executed too early -- it would be simpler just do them
always.


--
Arseny Sher
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



From 86a03671c3bdbaee3dabe02a9a1f8e2e23098652 Mon Sep 17 00:00:00 2001
From: Arseny Sher <[hidden email]>
Date: Tue, 22 Oct 2019 19:02:14 +0300
Subject: [PATCH 1/2] Fix serialized snapshot usage for new logical slots.

Previously, snapbuilder entered into SNAPBUILD_CONSISTENT immediately after
deserializing the snapshot. Generally this is incorrect because
SNAPBUILD_CONSISTENT means not just complete snapshot, but also reorderbuffer
filled with all currently running xacts. This is painless for decoding sessions
with existing slots because they won't stream anything before
confirmed_flush_lsn is reached anyway, at which point all transactions which
hadn't got into reorderbuffer would definitely finish. However, new slots might
be created too early, thus losing (not decoding) parts of transactions committed
after freshly created confirmed_flush_lsn. This can happen under the following
extremely unlucky circumstances:
 - New slot creation reserves point in WAL since which it would read it
   (GetXLogInsertRecPtr);
 - It logs xl_running_xacts to start assembling a snapshot;
 - Running decoding session with another slot quickly reads this
   xl_running_xacts and serializes its snapshot;
 - New slot reads xl_running_xacts and picks this snapshot up, saying that it
   is ready to stream henceforth, though its reorderbuffer is empty.

It turns out, as comment to AllocateSnapshotBuilder explains, that snapbuilder
can't locate *earliest* possible point for streaming on its own. Thus, if
snapbuild users know it (decoding session for existing slot), we trust them and
switch after snapshot deserialization into SNAPBUILD_CONSISTENT as
previously. However, if it is not known (new slot creation), switch into
SNAPBUILD_FULL_SNAPSHOT and wait for all running xacts to finish as usual.
---
 src/backend/replication/logical/logical.c   |   2 +-
 src/backend/replication/logical/snapbuild.c | 141 ++++++++++++++++++++++------
 2 files changed, 115 insertions(+), 28 deletions(-)

diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index da265f5294..67ed52e56f 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -327,7 +327,7 @@ CreateInitDecodingContext(char *plugin,
  ReplicationSlotMarkDirty();
  ReplicationSlotSave();
 
- ctx = StartupDecodingContext(NIL, restart_lsn, xmin_horizon,
+ ctx = StartupDecodingContext(NIL, InvalidXLogRecPtr, xmin_horizon,
  need_full_snapshot, false,
  read_page, prepare_write, do_write,
  update_progress);
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 0bd1d0f954..3ed178a4f8 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -165,7 +165,7 @@ struct SnapBuild
 
  /*
  * Don't replay commits from an LSN < this LSN. This can be set externally
- * but it will also be advanced (never retreat) from within snapbuild.c.
+ * or established by snapbuild.c once consistent snapshot is assembled.
  */
  XLogRecPtr start_decoding_at;
 
@@ -275,7 +275,7 @@ static void SnapBuildWaitSnapshot(xl_running_xacts *running, TransactionId cutof
 
 /* serialization functions */
 static void SnapBuildSerialize(SnapBuild *builder, XLogRecPtr lsn);
-static bool SnapBuildRestore(SnapBuild *builder, XLogRecPtr lsn);
+static bool SnapBuildRestore(SnapBuild *builder, XLogRecPtr lsn, TransactionId nextXid);
 
 /*
  * Return TransactionId after which the next phase of initial snapshot
@@ -309,7 +309,42 @@ SnapBuildStartNextPhaseAt(SnapBuild *builder, TransactionId at)
  * Allocate a new snapshot builder.
  *
  * xmin_horizon is the xid >= which we can be sure no catalog rows have been
- * removed, start_lsn is the LSN >= we want to replay commits.
+ * removed.
+ * start_lsn is InvalidXLogRecPtr or LSN >= we want to replay commits.
+ *  InvalidXLogRecPtr is used during slot creation; snapbuild will assemble
+ *  consistent snapshot and (if continue decoding -- no core code does that
+ *  currently) stream all further commits.
+ *  If normal lsn is passed, caller *must* be sure that WAL reading starts
+ *  early enough to build the snapshot and pick up the first change of
+ *  earliest xact to stream. Slot creation and advancement machinery
+ *  guarantees that slot's <restart_lsn, confirmed_flush_lsn> pair always
+ *  satisfies this.
+ *
+ * The condition is the caller's responsibility because in some cases
+ * snapbuilder has no way to enforce this rule without risk of skipping legit
+ * transactions. Specifically, after constructing (usually deserializing) FULL
+ * snapshot snapbuilder must iterate over WAL further until all running xacts
+ * from the snap finish, as we must see with correct snapshots all changes of
+ * every transaction we are going to stream. However, snapbuild can't
+ *  *immediately* notice this point, because
+ *    - Snapbuilder updates xmin by taking it from xl_running_xacts
+ *      (c.f. SnapBuildProcessRunningXacts). Even if we guarantee that, for
+ *      each possible WAL reading starting position, there is always an an
+ *      xl_running_xacts records logged right before the earliest possible
+ *      streaming point -- IOW, after all xacts which we can't stream had
+ *      finished (which is currently true btw, as slot's advancement is
+ *      considered only at xl_running_xacts) -- that would not be enough due
+ *      to races around xl_running_xacts, i.e with WAL like
+ *      [ <T1> <restart_lsn> <T1 commit> <confirmed_flush_lsn, xrx> <T2 commit> ]
+ *      T2 might be skipped if T1 is shown as running in xl_running_xacts.
+ *  - Tracking xmin manually by recoding commits is not only inefficient,
+ *      it just not feasible because serialized snapshot is not full: it
+ *      contains only committed catalog-modifying xacts. Thus, we can't
+ *      distinguish non-catalog-modifying xact committed before serialized
+ *      snapshot from not yet committed one.
+ *  So, trust the caller: once given start_lsn is reached, it means we must
+ *  have reached the point where all further xacts can be streamed, FULL ->
+ *  CONSISTENT transition.
  */
 SnapBuild *
 AllocateSnapshotBuilder(ReorderBuffer *reorder,
@@ -408,7 +443,8 @@ SnapBuildCurrentState(SnapBuild *builder)
 bool
 SnapBuildXactNeedsSkip(SnapBuild *builder, XLogRecPtr ptr)
 {
- return ptr < builder->start_decoding_at;
+ return XLogRecPtrIsInvalid(builder->start_decoding_at) ||
+ ptr < builder->start_decoding_at;
 }
 
 /*
@@ -945,16 +981,16 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
  TransactionIdPrecedes(xid, SnapBuildNextPhaseAt(builder))))
  {
  /* ensure that only commits after this are getting replayed */
- if (builder->start_decoding_at <= lsn)
- builder->start_decoding_at = lsn + 1;
+ Assert(XLogRecPtrIsInvalid(builder->start_decoding_at) ||
+   builder->start_decoding_at > lsn);
  return;
  }
 
  if (builder->state < SNAPBUILD_CONSISTENT)
  {
  /* ensure that only commits after this are getting replayed */
- if (builder->start_decoding_at <= lsn)
- builder->start_decoding_at = lsn + 1;
+ Assert(XLogRecPtrIsInvalid(builder->start_decoding_at) ||
+   builder->start_decoding_at > lsn);
 
  /*
  * If building an exportable snapshot, force xid to be tracked, even
@@ -966,6 +1002,16 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
  }
  }
 
+ if (!XLogRecPtrIsInvalid(builder->start_decoding_at) &&
+ builder->start_decoding_at <= lsn)
+ {
+ /*
+ * We are going to stream this xact, so must already have fine
+ * snapshot.
+ */
+ Assert(builder->state == SNAPBUILD_CONSISTENT);
+ }
+
  for (nxact = 0; nxact < nsubxacts; nxact++)
  {
  TransactionId subxid = subxacts[nxact];
@@ -1250,10 +1296,10 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
  */
  if (running->oldestRunningXid == running->nextXid)
  {
- if (builder->start_decoding_at == InvalidXLogRecPtr ||
- builder->start_decoding_at <= lsn)
+ if (XLogRecPtrIsInvalid(builder->start_decoding_at))
  /* can decode everything after this */
  builder->start_decoding_at = lsn + 1;
+ Assert(builder->start_decoding_at >= lsn);
 
  /* As no transactions were running xmin/xmax can be trivially set. */
  builder->xmin = running->nextXid; /* < are finished */
@@ -1275,9 +1321,9 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
  }
  /* b) valid on disk state and not building full snapshot */
  else if (!builder->building_full_snapshot &&
- SnapBuildRestore(builder, lsn))
+ SnapBuildRestore(builder, lsn, running->nextXid))
  {
- /* there won't be any state to cleanup */
+ /* there won't be much state to cleanup */
  return false;
  }
 
@@ -1358,6 +1404,10 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
  {
  builder->state = SNAPBUILD_CONSISTENT;
  SnapBuildStartNextPhaseAt(builder, InvalidTransactionId);
+ if (XLogRecPtrIsInvalid(builder->start_decoding_at))
+ /* can decode everything after this */
+ builder->start_decoding_at = lsn + 1;
+ Assert(builder->start_decoding_at >= lsn);
 
  ereport(LOG,
  (errmsg("logical decoding found consistent point at %X/%X",
@@ -1471,8 +1521,8 @@ typedef struct SnapBuildOnDisk
 void
 SnapBuildSerializationPoint(SnapBuild *builder, XLogRecPtr lsn)
 {
- if (builder->state < SNAPBUILD_CONSISTENT)
- SnapBuildRestore(builder, lsn);
+ if (builder->state < SNAPBUILD_FULL_SNAPSHOT)
+ SnapBuildRestore(builder, lsn, InvalidTransactionId);
  else
  SnapBuildSerialize(builder, lsn);
 }
@@ -1499,10 +1549,12 @@ SnapBuildSerialize(SnapBuild *builder, XLogRecPtr lsn)
    builder->last_serialized_snapshot <= lsn);
 
  /*
- * no point in serializing if we cannot continue to work immediately after
- * restoring the snapshot
+ * No point in serializing if the snapshot is not complete.
+ * However, FULL snapshot is just as good as CONSISTENT; difference
+ * between these states is not snapshot property, but whether we have
+ * filled reorderbuffer with all currently running xacts.
  */
- if (builder->state < SNAPBUILD_CONSISTENT)
+ if (builder->state < SNAPBUILD_FULL_SNAPSHOT)
  return;
 
  /*
@@ -1688,10 +1740,15 @@ out:
 
 /*
  * Restore a snapshot into 'builder' if previously one has been stored at the
- * location indicated by 'lsn'. Returns true if successful, false otherwise.
+ * location indicated by 'lsn'.
+ * nextXid is first not yet running xid as of this lsn or InvalidTransactionId;
+ *   unless external code knows where it is safe to start streaming, we can
+ *   use serialized snapshot only if we are aware which xids finish we must
+ *   wait to be able to stream all further commits.
+ *  Returns true if successful, false otherwise.
  */
 static bool
-SnapBuildRestore(SnapBuild *builder, XLogRecPtr lsn)
+SnapBuildRestore(SnapBuild *builder, XLogRecPtr lsn, TransactionId nextXid)
 {
  SnapBuildOnDisk ondisk;
  int fd;
@@ -1701,7 +1758,7 @@ SnapBuildRestore(SnapBuild *builder, XLogRecPtr lsn)
  pg_crc32c checksum;
 
  /* no point in loading a snapshot if we're already there */
- if (builder->state == SNAPBUILD_CONSISTENT)
+ if (builder->state >= SNAPBUILD_FULL_SNAPSHOT)
  return false;
 
  sprintf(path, "pg_logical/snapshots/%X-%X.snap",
@@ -1884,11 +1941,47 @@ SnapBuildRestore(SnapBuild *builder, XLogRecPtr lsn)
  if (TransactionIdPrecedes(ondisk.builder.xmin, builder->initial_xmin_horizon))
  goto snapshot_not_interesting;
 
+ /*
+ * Don't use snapshot if external code doesn't know where it is safe to
+ * start streaming and we have no idea which xids to wait for.
+ */
+ if (XLogRecPtrIsInvalid(builder->start_decoding_at) &&
+ !TransactionIdIsValid(nextXid))
+ goto snapshot_not_interesting;
 
  /* ok, we think the snapshot is sensible, copy over everything important */
  builder->xmin = ondisk.builder.xmin;
  builder->xmax = ondisk.builder.xmax;
- builder->state = ondisk.builder.state;
+ Assert(ondisk.builder.state >= SNAPBUILD_FULL_SNAPSHOT);
+
+ if (XLogRecPtrIsInvalid(builder->start_decoding_at))
+ {
+ /*
+ * Snapshot is fine, now we need to wait till we see all further
+ * commits since the xact's first record.
+ */
+ builder->state = SNAPBUILD_FULL_SNAPSHOT;
+ SnapBuildStartNextPhaseAt(builder, nextXid);
+ }
+ else
+ {
+ /*
+ * If external code (c.f. AllocateSnapshotBuilder) knows we would pick
+ * up all xacts in full before start_decoding_at, just go directly
+ * into CONSISTENT. Though we probably can't stream right now (as we
+ * haven't seen beginnings of some xacts), no xact will be streamed
+ * before start_decoding_at, and we can't be sure to switch into
+ * CONSISTENT later in time anyway.
+ */
+ builder->state = SNAPBUILD_CONSISTENT;
+ SnapBuildStartNextPhaseAt(builder, InvalidTransactionId);
+
+ ereport(LOG,
+ (errmsg("logical decoding found consistent point at %X/%X",
+ (uint32) (lsn >> 32), (uint32) lsn),
+ errdetail("Logical decoding will begin using saved snapshot.")));
+ }
+
 
  builder->committed.xcnt = ondisk.builder.committed.xcnt;
  /* We only allocated/stored xcnt, not xcnt_space xids ! */
@@ -1911,12 +2004,6 @@ SnapBuildRestore(SnapBuild *builder, XLogRecPtr lsn)
 
  ReorderBufferSetRestartPoint(builder->reorder, lsn);
 
- Assert(builder->state == SNAPBUILD_CONSISTENT);
-
- ereport(LOG,
- (errmsg("logical decoding found consistent point at %X/%X",
- (uint32) (lsn >> 32), (uint32) lsn),
- errdetail("Logical decoding will begin using saved snapshot.")));
  return true;
 
 snapshot_not_interesting:
--
2.11.0


From ab798a087d99ea8d7d0df4c296fee5787cf5394e Mon Sep 17 00:00:00 2001
From: Arseny Sher <[hidden email]>
Date: Wed, 23 Oct 2019 15:56:46 +0300
Subject: [PATCH 2/2] Stop demanding that top xact must be seen before subxact
 in decoding.

Manifested as
ERROR:  subtransaction logged without previous top-level txn record
, this check forbids legit behaviours like
 - First xl_xact_assignment record is beyond reading, i.e. earlier
   restart_lsn.
 - After restart_lsn there is some change of a subxact.
 - After that, there is second xl_xact_assignment (for another subxact)
   revealing relationship between top and first subxact.

Such transaction won't be streamed anyway because we hadn't seen it in full;
confirmed_flush_lsn must be past all these records. Saying for sure whether xact
of some record encountered after snapshot was deserialized can be streamed or
not requires to know nextXid (first not yet running xid) as of this snapshot's
lsn -- all xids < nextXid possibly hadn't been seen in full and are not subject
to decoding. However, generally we don't know nextXid which is taken from
xl_running_xacts; in particular snapshot can be serizalized/deserialized at
XLOG_END_OF_RECOVERY. Changing that for the sake of the check in question is not
worthwhile, so just throw it out instead of trying to relax.
---
 src/backend/replication/logical/reorderbuffer.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 8ce28ad629..6faba6077e 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -778,9 +778,6 @@ ReorderBufferAssignChild(ReorderBuffer *rb, TransactionId xid,
  txn = ReorderBufferTXNByXid(rb, xid, true, &new_top, lsn, true);
  subtxn = ReorderBufferTXNByXid(rb, subxid, true, &new_sub, lsn, false);
 
- if (new_top && !new_sub)
- elog(ERROR, "subtransaction logged without previous top-level txn record");
-
  if (!new_sub)
  {
  if (subtxn->is_known_as_subxact)
--
2.11.0

Reply | Threaded
Open this post in threaded view
|

Re: ERROR: subtransaction logged without previous top-level txn record

Andres Freund
Hi,

On 2019-10-24 12:59:30 +0300, Arseny Sher wrote:
> Our customer also encountered this issue and I've looked into it. The problem is
> reproduced well enough using the instructions in the previous message.

Is this with

commit bac2fae05c7737530a6fe8276cd27d210d25c6ac
Author: Alvaro Herrera <[hidden email]>
Date:   2019-09-13 16:36:28 -0300

    logical decoding: process ASSIGNMENT during snapshot build
   
    Most WAL records are ignored in early SnapBuild snapshot build phases.
    But it's critical to process some of them, so that later messages have
    the correct transaction state after the snapshot is completely built; in
    particular, XLOG_XACT_ASSIGNMENT messages are critical in order for
    sub-transactions to be correctly assigned to their parent transactions,
    or at least one assert misbehaves, as reported by Ildar Musin.
   
    Diagnosed-by: Masahiko Sawada
    Author: Masahiko Sawada
    Discussion: https://postgr.es/m/CAONYFtOv+Er1p3WAuwUsy1zsCFrSYvpHLhapC_fMD-zNaRWxYg@...

applied?

> The check leading to this ERROR is too strict, it forbids legit behaviours. Say
> we have in WAL
>
> [ <xl_xact_assignment_1> <restart_lsn> <subxact_change> <xl_xact_assignment_1> <commit> confirmed_flush_lsn> ]
>
> - First xl_xact_assignment record is beyond reading, i.e. earlier
>   restart_lsn, where ready snapshot will be taken from disk.
> - After restart_lsn there is some change of a subxact.
> - After that, there is second xl_xact_assignment (for another subxact)
>   revealing relationship between top and first subxact, where this ERROR fires.
>
> Such transaction won't be streamed because we hadn't seen it in full. It must be
> finished before streaming will start, i.e. before confirmed_flush_lsn.
>
> Of course, the easiest fix is to just throw the check out.

I don't think that'd actually be a fix, and just hiding a bug.


> Snapbuilder enters into SNAPBUILD_CONSISTENT immediately after deserializing the
> snapshot. Generally this is incorrect because SNAPBUILD_CONSISTENT means not
> just complete snapshot (snapshot by itself in FULL state is just good as in
> CONSISTENT), but also reorderbuffer filled with all currently running
> xacts. This is painless for decoding sessions with existing slots because they
> won't stream anything before confirmed_flush_lsn is reached anyway, at which
> point all transactions which hadn't got into reorderbuffer would definitely
> finish. However, new slots might be created too early, thus losing (not
> decoding) parts of transactions committed after freshly created
> confirmed_flush_lsn. This can happen under the following extremely unlucky
> circumstances:
>   - New slot creation reserves point in WAL since which it would read it
>     (GetXLogInsertRecPtr);
>   - It logs xl_running_xacts to start assembling a snapshot;
>   - Running decoding session with another slot quickly reads this
>     xl_running_xacts and serializes its snapshot;
>   - New slot reads xl_running_xacts and picks this snapshot up, saying that it
>     is ready to stream henceforth, though its reorderbuffer is empty.

Yea, that's a problem :(


> Exact reproducing steps:
>
> -- session 1
> create table t (i int);
> select pg_create_logical_replication_slot('slot_1', 'test_decoding');
>
> -- session 2
> begin;
> insert into t values (1);
>
> -- session 3, start slot creation
> select pg_create_logical_replication_slot('slot_2', 'test_decoding');
> -- stop (with gdb or something) it at DecodingContextFindStartpoint(ctx);
>
> -- session 1
> -- xl_running_xacts is dumped by ReplicationSlotReserveWal in previous command, no
> -- need to sleep; our snap will be immediately serialized there
> SELECT data FROM pg_logical_slot_get_changes('slot_1', NULL, NULL, 'include-xids', '1', 'skip-empty-xacts', '0');
>
> -- continue slot_2 creation
>
> -- session 2: insert some more and commit
> insert into t values (1);
> commit;
>
> -- now this would find second insert, but not the first one
> SELECT data FROM pg_logical_slot_get_changes('slot_2', NULL, NULL, 'include-xids', '1', 'skip-empty-xacts', '0');

It'd be good to make this into an isolation test.


> What we can do here?

I think the easiest fix might actually be to just ignore serialized
snapshots when building the initial snapshot.


Greetings,

Andres Freund


Reply | Threaded
Open this post in threaded view
|

Re: ERROR: subtransaction logged without previous top-level txn record

Arseny Sher-2

Andres Freund <[hidden email]> writes:

> Hi,
>
> On 2019-10-24 12:59:30 +0300, Arseny Sher wrote:
>> Our customer also encountered this issue and I've looked into it. The problem is
>> reproduced well enough using the instructions in the previous message.
>
> Is this with
>
> commit bac2fae05c7737530a6fe8276cd27d210d25c6ac
> Author: Alvaro Herrera <[hidden email]>
> Date:   2019-09-13 16:36:28 -0300
>
>     logical decoding: process ASSIGNMENT during snapshot build
>
>     Most WAL records are ignored in early SnapBuild snapshot build phases.
>     But it's critical to process some of them, so that later messages have
>     the correct transaction state after the snapshot is completely built; in
>     particular, XLOG_XACT_ASSIGNMENT messages are critical in order for
>     sub-transactions to be correctly assigned to their parent transactions,
>     or at least one assert misbehaves, as reported by Ildar Musin.
>
>     Diagnosed-by: Masahiko Sawada
>     Author: Masahiko Sawada
>     Discussion: https://postgr.es/m/CAONYFtOv+Er1p3WAuwUsy1zsCFrSYvpHLhapC_fMD-zNaRWxYg@...
>
> applied?

Yeah, I tried fresh master. See below: skipped xl_xact_assignment is
beyond restart_lsn at all (and thus not read), so this doesn't matter.


>> The check leading to this ERROR is too strict, it forbids legit behaviours. Say
>> we have in WAL
>>
>> [ <xl_xact_assignment_1> <restart_lsn> <subxact_change> <xl_xact_assignment_1> <commit> confirmed_flush_lsn> ]
>>
>> - First xl_xact_assignment record is beyond reading, i.e. earlier
>>   restart_lsn, where ready snapshot will be taken from disk.
>> - After restart_lsn there is some change of a subxact.
>> - After that, there is second xl_xact_assignment (for another subxact)
>>   revealing relationship between top and first subxact, where this ERROR fires.
>>
>> Such transaction won't be streamed because we hadn't seen it in full. It must be
>> finished before streaming will start, i.e. before confirmed_flush_lsn.
>>
>> Of course, the easiest fix is to just throw the check out.
>
> I don't think that'd actually be a fix, and just hiding a bug.

I don't see a bug here. At least in reproduced scenario I see false
alert, as explained above: transaction with skipped xl_xact_assignment
won't be streamed as it finishes before confirmed_flush_lsn. And I am
pretty sure people encountered in the field the same issue.

In the end of my mail I have offered a way to relax this check instead
of removing it to avoid false triggers: serialize/deserialize a snapshot
only at xl_running_xacts to know nextXid, add function to snapbuilder to
check whether xact can be streamed or not by looking at its xid, etc,
somehow deal with existing serialized snaps which may be logged at
END_OF_RECOVERY without nextXid. I don't believe this check is worth
these complexities. If you think it does, I can do that though.

>
>> Snapbuilder enters into SNAPBUILD_CONSISTENT immediately after deserializing the
>> snapshot. Generally this is incorrect because SNAPBUILD_CONSISTENT means not
>> just complete snapshot (snapshot by itself in FULL state is just good as in
>> CONSISTENT), but also reorderbuffer filled with all currently running
>> xacts. This is painless for decoding sessions with existing slots because they
>> won't stream anything before confirmed_flush_lsn is reached anyway, at which
>> point all transactions which hadn't got into reorderbuffer would definitely
>> finish. However, new slots might be created too early, thus losing (not
>> decoding) parts of transactions committed after freshly created
>> confirmed_flush_lsn. This can happen under the following extremely unlucky
>> circumstances:
>>   - New slot creation reserves point in WAL since which it would read it
>>     (GetXLogInsertRecPtr);
>>   - It logs xl_running_xacts to start assembling a snapshot;
>>   - Running decoding session with another slot quickly reads this
>>     xl_running_xacts and serializes its snapshot;
>>   - New slot reads xl_running_xacts and picks this snapshot up, saying that it
>>     is ready to stream henceforth, though its reorderbuffer is empty.
>
> Yea, that's a problem :(
>
>
>> Exact reproducing steps:
>>
>> -- session 1
>> create table t (i int);
>> select pg_create_logical_replication_slot('slot_1', 'test_decoding');
>>
>> -- session 2
>> begin;
>> insert into t values (1);
>>
>> -- session 3, start slot creation
>> select pg_create_logical_replication_slot('slot_2', 'test_decoding');
>> -- stop (with gdb or something) it at DecodingContextFindStartpoint(ctx);
>>
>> -- session 1
>> -- xl_running_xacts is dumped by ReplicationSlotReserveWal in previous command, no
>> -- need to sleep; our snap will be immediately serialized there
>> SELECT data FROM pg_logical_slot_get_changes('slot_1', NULL, NULL, 'include-xids', '1', 'skip-empty-xacts', '0');
>>
>> -- continue slot_2 creation
>>
>> -- session 2: insert some more and commit
>> insert into t values (1);
>> commit;
>>
>> -- now this would find second insert, but not the first one
>> SELECT data FROM pg_logical_slot_get_changes('slot_2', NULL, NULL, 'include-xids', '1', 'skip-empty-xacts', '0');
>
> It'd be good to make this into an isolation test.

Yeah, but to get real chance of firing this requires kinda sleep/break
in the middle of pg_create_logical_replication_slot execution, so I have
no idea how to do that =(

>
>> What we can do here?
>
> I think the easiest fix might actually be to just ignore serialized
> snapshots when building the initial snapshot.

That's an option. However, that anyway requires the distinction between
new slot creation and streaming from existing slot at snapbuilder level,
which currently doesn't exist and which constitutes most part of my
first patch. If we add that, changing between using and not using
serialized snapshots in new slot creation is easy (my patch uses it), I
think this is not principal.


--
Arseny Sher


Reply | Threaded
Open this post in threaded view
|

Re: ERROR: subtransaction logged without previous top-level txn record

Arseny Sher-2
Andres, Álvaro, could you please have a look at this?

--
Arseny Sher


Reply | Threaded
Open this post in threaded view
|

Re: ERROR: subtransaction logged without previous top-level txn record

Dan Katz
Is the resolution of the issue in this thread being tracked elsewhere, either in a commit fest or other stream of work?

On Tue, Dec 17, 2019 at 9:47 AM Arseny Sher <[hidden email]> wrote:
Andres, Álvaro, could you please have a look at this?

--
Arseny Sher




Reply | Threaded
Open this post in threaded view
|

Re: ERROR: subtransaction logged without previous top-level txn record

Arseny Sher-2

Dan Katz <[hidden email]> writes:

> Is the resolution of the issue in this thread being tracked elsewhere,
> either in a commit fest or other stream of work?

Ok, I've created a cf entry:
https://commitfest.postgresql.org/26/2383/

I believe it is the most important to commit at least

0002-Stop-demanding-that-top-xact-must-be-seen-before-sub.patch

from my mail above -- as we see, this issue creates problems in the
field. Moreover, the patch is trivial and hopefully I've shown that
ERROR triggers were spurious and there is no easy way to relax the
check.

--
Arseny Sher
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Reply | Threaded
Open this post in threaded view
|

Re: ERROR: subtransaction logged without previous top-level txn record

Dan Katz
Arseny,

I was hoping you could give me some insights about how this bug might appear with multiple replications slots. For example if I have two replication slots would you expect both slots to see the same error, even if they were started, consumed or the LSN was confirmed-flushed at different times?

Dan

On Tue, Dec 17, 2019 at 10:15 AM Arseny Sher <[hidden email]> wrote:

Dan Katz <[hidden email]> writes:

> Is the resolution of the issue in this thread being tracked elsewhere,
> either in a commit fest or other stream of work?

Ok, I've created a cf entry:
https://commitfest.postgresql.org/26/2383/

I believe it is the most important to commit at least

0002-Stop-demanding-that-top-xact-must-be-seen-before-sub.patch

from my mail above -- as we see, this issue creates problems in the
field. Moreover, the patch is trivial and hopefully I've shown that
ERROR triggers were spurious and there is no easy way to relax the
check.

--
Arseny Sher
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Reply | Threaded
Open this post in threaded view
|

Re: ERROR: subtransaction logged without previous top-level txn record

Arseny Sher-2
Hi,

Dan Katz <[hidden email]> writes:

> Arseny,
>
> I was hoping you could give me some insights about how this bug might
> appear with multiple replications slots. For example if I have two
> replication slots would you expect both slots to see the same error, even
> if they were started, consumed or the LSN was confirmed-flushed at
> different times?

Well, to encounter this you must happen to interrupt decoding session
(e.g. shutdown server) when restart_lsn (LSN since WAL will be read next
time) is at unfortunate position, as described in
https://www.postgresql.org/message-id/87ftjifoql.fsf%40ars-thinkpad

Generally each slot has its own restart_lsn, so if one decoding session
stucked on this issue, another one won't necessarily fail at the same
time. However, restart_lsn can be advanced only to certain points,
mainly xl_running_xacts records, which is logged every 15 seconds. So if
all consumers acknowledge changes fast enough, it is quite likely that
during shutdown restart_lsn will be the same for all slots -- which
means either all of them will stuck on further decoding or all of them
won't. If not, different slots might have different restart_lsn and
probably won't fail at the same time; but encountering this issue even
once suggests that your workload makes possibility of such problematic
restart_lsn perceptible (i.e. many subtransactions). And each
restart_lsn probably has approximately the same chance to be 'bad'
(provided the workload is even).


We need a committer familiar with this code to look here...


--
Arseny Sher
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Reply | Threaded
Open this post in threaded view
|

Re: ERROR: subtransaction logged without previous top-level txn record

akapila
In reply to this post by Arseny Sher-2
On Fri, Oct 25, 2019 at 12:26 PM Arseny Sher <[hidden email]> wrote:

>
>
> Andres Freund <[hidden email]> writes:
>
> > Hi,
> >
> > On 2019-10-24 12:59:30 +0300, Arseny Sher wrote:
> >> Our customer also encountered this issue and I've looked into it. The problem is
> >> reproduced well enough using the instructions in the previous message.
> >
> > Is this with
> >
> > commit bac2fae05c7737530a6fe8276cd27d210d25c6ac
> > Author: Alvaro Herrera <[hidden email]>
> > Date:   2019-09-13 16:36:28 -0300
> >
> >     logical decoding: process ASSIGNMENT during snapshot build
> >
> >     Most WAL records are ignored in early SnapBuild snapshot build phases.
> >     But it's critical to process some of them, so that later messages have
> >     the correct transaction state after the snapshot is completely built; in
> >     particular, XLOG_XACT_ASSIGNMENT messages are critical in order for
> >     sub-transactions to be correctly assigned to their parent transactions,
> >     or at least one assert misbehaves, as reported by Ildar Musin.
> >
> >     Diagnosed-by: Masahiko Sawada
> >     Author: Masahiko Sawada
> >     Discussion: https://postgr.es/m/CAONYFtOv+Er1p3WAuwUsy1zsCFrSYvpHLhapC_fMD-zNaRWxYg@...
> >
> > applied?
>
> Yeah, I tried fresh master. See below: skipped xl_xact_assignment is
> beyond restart_lsn at all (and thus not read), so this doesn't matter.
>
>
> >> The check leading to this ERROR is too strict, it forbids legit behaviours. Say
> >> we have in WAL
> >>
> >> [ <xl_xact_assignment_1> <restart_lsn> <subxact_change> <xl_xact_assignment_1> <commit> confirmed_flush_lsn> ]
> >>
> >> - First xl_xact_assignment record is beyond reading, i.e. earlier
> >>   restart_lsn, where ready snapshot will be taken from disk.
> >> - After restart_lsn there is some change of a subxact.
> >> - After that, there is second xl_xact_assignment (for another subxact)
> >>   revealing relationship between top and first subxact, where this ERROR fires.
> >>
> >> Such transaction won't be streamed because we hadn't seen it in full. It must be
> >> finished before streaming will start, i.e. before confirmed_flush_lsn.
> >>
> >> Of course, the easiest fix is to just throw the check out.
> >
> > I don't think that'd actually be a fix, and just hiding a bug.
>
> I don't see a bug here. At least in reproduced scenario I see false
> alert, as explained above: transaction with skipped xl_xact_assignment
> won't be streamed as it finishes before confirmed_flush_lsn.
>

Does this guarantee come from the fact that we need to wait for such a
transaction before reaching a consistent snapshot state?  If not, can
you explain a bit more what makes you say so?

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Reply | Threaded
Open this post in threaded view
|

Re: ERROR: subtransaction logged without previous top-level txn record

Arseny Sher-2

Amit Kapila <[hidden email]> writes:

>> I don't see a bug here. At least in reproduced scenario I see false
>> alert, as explained above: transaction with skipped xl_xact_assignment
>> won't be streamed as it finishes before confirmed_flush_lsn.
>>
>
> Does this guarantee come from the fact that we need to wait for such a
> transaction before reaching a consistent snapshot state?  If not, can
> you explain a bit more what makes you say so?

Right, see FULL_SNAPSHOT -> SNAPBUILD_CONSISTENT transition -- it exists
exactly for this purpose: once we have good snapshot, we need to wait
for all running xacts to finish to see all xacts we are promising to
stream in full. This ensures <restart_lsn, confirmed_flush_lsn> pair is
good (reading WAL since the former is enough to see all xacts committing
after the latter in full) initially, and slot advancement arrangements
ensure it stays good forever (see
LogicalIncreaseRestartDecodingForSlot).

Well, almost. This is true as long initial snapshot construction process
goes the long way of building the snapshot by itself. If it happens to
pick up from disk ready snapshot pickled there by another decoding
session, it fast path'es to SNAPBUILD_CONSISTENT, which is technically a
bug as described in
https://www.postgresql.org/message-id/87ftjifoql.fsf%40ars-thinkpad

In theory, this bug could indeed lead to 'subtransaction logged without
previous top-level txn record' error. In practice, I think its
possibility is disappearingly small -- process of slot creation must be
intervened in a very short gap by another decoder who serializes its
snapshot (see the exact sequence of steps in the mail above). What is
much more probable (doesn't involve new slot creation and relatively
easily reproducible without sleeps) is false alert triggered by unlucky
position of restart_lsn.


Surely we still must fix it. I just mean
  - People definitely encountered false alert, not this bug
    (at least because nobody said this was immediately after slot
    creation).
  - I've no bright ideas how to relax the check to make it proper
    without additional complications and I'm pretty sure this is
    impossible (again, see above for details), so I'd remove it.


--
Arseny Sher
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Reply | Threaded
Open this post in threaded view
|

Re: ERROR: subtransaction logged without previous top-level txn record

akapila
On Mon, Feb 3, 2020 at 2:50 PM Arseny Sher <[hidden email]> wrote:

>
>
> Amit Kapila <[hidden email]> writes:
>
> >> I don't see a bug here. At least in reproduced scenario I see false
> >> alert, as explained above: transaction with skipped xl_xact_assignment
> >> won't be streamed as it finishes before confirmed_flush_lsn.
> >>
> >
> > Does this guarantee come from the fact that we need to wait for such a
> > transaction before reaching a consistent snapshot state?  If not, can
> > you explain a bit more what makes you say so?
>
> Right, see FULL_SNAPSHOT -> SNAPBUILD_CONSISTENT transition -- it exists
> exactly for this purpose: once we have good snapshot, we need to wait
> for all running xacts to finish to see all xacts we are promising to
> stream in full.
>

So, doesn't this mean that it started occurring after the fix done in
commit 96b5033e11 [1]?  Because before that fix we wouldn't have
allowed processing XLOG_XACT_ASSIGNMENT records unless we are in
SNAPBUILD_FULL_SNAPSHOT state.  I am not telling the fix in that
commit is wrong, but just trying to understand the situation here.

>
> Well, almost. This is true as long initial snapshot construction process
> goes the long way of building the snapshot by itself. If it happens to
> pick up from disk ready snapshot pickled there by another decoding
> session, it fast path'es to SNAPBUILD_CONSISTENT, which is technically a
> bug as described in
> https://www.postgresql.org/message-id/87ftjifoql.fsf%40ars-thinkpad
>

Can't we deal with this separately?  If so, I think let's not mix the
discussions for both as the root cause of both seems different.


[1] -
 commit bac2fae05c7737530a6fe8276cd27d210d25c6ac
 Author: Alvaro Herrera <[hidden email]>
Date:   2019-09-13 16:36:28 -0300

     logical decoding: process ASSIGNMENT during snapshot build



--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Reply | Threaded
Open this post in threaded view
|

Re: ERROR: subtransaction logged without previous top-level txn record

Arseny Sher-2

Amit Kapila <[hidden email]> writes:

> So, doesn't this mean that it started occurring after the fix done in
> commit 96b5033e11 [1]?  Because before that fix we wouldn't have
> allowed processing XLOG_XACT_ASSIGNMENT records unless we are in
> SNAPBUILD_FULL_SNAPSHOT state.  I am not telling the fix in that
> commit is wrong, but just trying to understand the situation here.

Nope. Consider again example of WAL above triggering the error:

[ <xl_xact_assignment_1> <restart_lsn> <subxact_change> <xl_xact_assignment_2> <commit> <confirmed_flush_lsn> ]

Decoder starting reading WAL at <restart_lsn> where he immediately reads
from disk snapshot serialized earlier, which makes it jump to
SNAPBUILD_CONSISTENT right away. It doesn't read xl_xact_assignment_1,
but it reads xl_xact_assignment_2 already in SNAPBUILD_CONSISTENT state,
so catches the error regardless of this commit.

>> Well, almost. This is true as long initial snapshot construction process
>> goes the long way of building the snapshot by itself. If it happens to
>> pick up from disk ready snapshot pickled there by another decoding
>> session, it fast path'es to SNAPBUILD_CONSISTENT, which is technically a
>> bug as described in
>> https://www.postgresql.org/message-id/87ftjifoql.fsf%40ars-thinkpad
>>
>
> Can't we deal with this separately?  If so, I think let's not mix the
> discussions for both as the root cause of both seems different.

These issues are related: before removing the check it would be nice to
ensure that there is no bugs it might protect us from (and it turns out
there actually is, though it won't always protect, and though this bug
has very small probability). Moreover, they are about more or less
subject -- avoiding partially decoded xacts -- and once you dived deep
enough to deal with one, it is reasonable to deal with another instead
of doing that twice. But as a practical matter, removing the check is
simple one-liner, and its presence causes people troubles -- so I'd
suggest doing that first and then deal with the rest. I don't think
starting new thread is worthwhile here, but if you think it does, I can
create it.


--
Arseny Sher
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Reply | Threaded
Open this post in threaded view
|

Re: ERROR: subtransaction logged without previous top-level txn record

akapila
On Mon, Feb 3, 2020 at 7:16 PM Arseny Sher <[hidden email]> wrote:

> Amit Kapila <[hidden email]> writes:
>
> > So, doesn't this mean that it started occurring after the fix done in
> > commit 96b5033e11 [1]?  Because before that fix we wouldn't have
> > allowed processing XLOG_XACT_ASSIGNMENT records unless we are in
> > SNAPBUILD_FULL_SNAPSHOT state.  I am not telling the fix in that
> > commit is wrong, but just trying to understand the situation here.
>
> Nope. Consider again example of WAL above triggering the error:
>
> [ <xl_xact_assignment_1> <restart_lsn> <subxact_change> <xl_xact_assignment_2> <commit> <confirmed_flush_lsn> ]
>
> Decoder starting reading WAL at <restart_lsn> where he immediately reads
> from disk snapshot serialized earlier, which makes it jump to
> SNAPBUILD_CONSISTENT right away.
>

Sure, I understand that if we get serialized snapshot from disk, this
problem can occur and probably we can fix by ignoring serialized
snapshots during slot creation or something on those lines.  However,
what I am trying to understand is whether this can occur from another
path as well.  I think it might occur without using serialized
snapshots as well because we allow decoding xl_xact_assignment record
even when the snapshot state is not full.  Say in your above example,
even if the snapshot state is not SNAPBUILD_CONSISTENT as we haven't
used the serialized snapshot, then also, it can lead to the above
problem due to decoding of xl_xact_assignment.  I have not tried to
generate a test case for this, so I could easily be wrong here.

What I am trying to get at is if the problem can only occur by using
serialized snapshots and we fix it by somehow not using them initial
slot creation, then ideally we don't need to remove the error in
ReorderBufferAssignChild, but if that is not the case, then we need to
discuss other cases as well.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Reply | Threaded
Open this post in threaded view
|

Re: ERROR: subtransaction logged without previous top-level txn record

Arseny Sher-2

Amit Kapila <[hidden email]> writes:

> On Mon, Feb 3, 2020 at 7:16 PM Arseny Sher <[hidden email]> wrote:
>> Amit Kapila <[hidden email]> writes:
>>
>> > So, doesn't this mean that it started occurring after the fix done in
>> > commit 96b5033e11 [1]?  Because before that fix we wouldn't have
>> > allowed processing XLOG_XACT_ASSIGNMENT records unless we are in
>> > SNAPBUILD_FULL_SNAPSHOT state.  I am not telling the fix in that
>> > commit is wrong, but just trying to understand the situation here.
>>
>> Nope. Consider again example of WAL above triggering the error:
>>
>> [ <xl_xact_assignment_1> <restart_lsn> <subxact_change> <xl_xact_assignment_2> <commit> <confirmed_flush_lsn> ]
>>
>> Decoder starting reading WAL at <restart_lsn> where he immediately reads
>> from disk snapshot serialized earlier, which makes it jump to
>> SNAPBUILD_CONSISTENT right away.
>>
>
> Sure, I understand that if we get serialized snapshot from disk, this
> problem can occur and probably we can fix by ignoring serialized
> snapshots during slot creation or something on those lines.

There is some confusion. I'll try to reword what we have here:

1) Decoding from existing slot (*not* initial snapshot construction)
starts up, immediately picks up snapshot at restart_lsn (getting into
SNAPBUILD_CONSISTENT) and in some xl_xact_assignment learns that it
hadn't seen in full (no toplevel records) transaction which it is not
even going to stream -- but still dies with "subtransation logged
without...". That's my example above, and that's what people are
complaining about. Here, usage of serialized snapshot and jump to
SNAPBUILD_CONSISTENT is not just legit, it is essential: or order to be
able to stream data since confirmed_flush_lsn, we must pick it up as we
might not be able to assemble it from scratch in time. I mean, what is
wrong here is not serialized snapshot usage but the check.

(Lengthy comment to AllocateSnapshotBuilder in my
0001-Fix-serialized-snapshot-usage-for-new-logical-slots.patch explains
why snapbuilder is not able to do FULL -> CONSISTENT transition on its
own early enough for decoding from existing slot, so the jump on
snapshot pickup is performed to CONSISTENT directly.)

This happens with or without bac2fae05c.

2) *New* slot creationg process picks up serialized snapshot and jumps
to CONSISTENT without waiting for all running xacts to finish. This is
wrong and is a bug (of very low probability), as we risk promising to
decode xacts which we might not seen in full. Sometimes it could be
arrested by "subtransation logged without..." check, but not necessarily
-- e.g. there could be no subxacts at all.


> However,
> what I am trying to understand is whether this can occur from another
> path as well.  I think it might occur without using serialized
> snapshots as well because we allow decoding xl_xact_assignment record
> even when the snapshot state is not full.  Say in your above example,
> even if the snapshot state is not SNAPBUILD_CONSISTENT as we haven't
> used the serialized snapshot, then also, it can lead to the above
> problem due to decoding of xl_xact_assignment.  I have not tried to
> generate a test case for this, so I could easily be wrong here.

What you are suggesting here is 3), which is, well, sort of form of 1),
meaning "subxact logged..." error also pointlessly triggered, but for
new slot creation. With bac2fae0, decoder might miss first
xl_xact_assignment because it is beyond start of reading WAL but
encounter second xl_xact_assignment and die on it due to this check
before even getting FULL.

But now that I'm thinking about it, I suspect that similar could happen
even before bac2fae0. Imagine

<start_of_reading_wal> <xl_xact_assignment_1> <SNAPBUILD_FULL> <subxact_change> <xl_xact_assignment_2> <commit> ... <SNAPBUILD_CONSISTENT>

Before bac2fae0, xl_xact_assignment_1 was ignored, so
xl_xact_assignment_1 would trigger the error.

> What I am trying to get at is if the problem can only occur by using
> serialized snapshots and we fix it by somehow not using them initial
> slot creation, then ideally we don't need to remove the error in
> ReorderBufferAssignChild, but if that is not the case, then we need to
> discuss other cases as well.

So, 1) and 3) mean this is not the case.


--
Arseny Sher
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Reply | Threaded
Open this post in threaded view
|

Re: ERROR: subtransaction logged without previous top-level txn record

akapila
On Wed, Feb 5, 2020 at 2:34 PM Arseny Sher <[hidden email]> wrote:

>
> Amit Kapila <[hidden email]> writes:
>
> > On Mon, Feb 3, 2020 at 7:16 PM Arseny Sher <[hidden email]> wrote:
> >> Amit Kapila <[hidden email]> writes:
> >>
> >> > So, doesn't this mean that it started occurring after the fix done in
> >> > commit 96b5033e11 [1]?  Because before that fix we wouldn't have
> >> > allowed processing XLOG_XACT_ASSIGNMENT records unless we are in
> >> > SNAPBUILD_FULL_SNAPSHOT state.  I am not telling the fix in that
> >> > commit is wrong, but just trying to understand the situation here.
> >>
> >> Nope. Consider again example of WAL above triggering the error:
> >>
> >> [ <xl_xact_assignment_1> <restart_lsn> <subxact_change> <xl_xact_assignment_2> <commit> <confirmed_flush_lsn> ]
> >>
> >> Decoder starting reading WAL at <restart_lsn> where he immediately reads
> >> from disk snapshot serialized earlier, which makes it jump to
> >> SNAPBUILD_CONSISTENT right away.
> >>
> >
> > Sure, I understand that if we get serialized snapshot from disk, this
> > problem can occur and probably we can fix by ignoring serialized
> > snapshots during slot creation or something on those lines.
>
> There is some confusion. I'll try to reword what we have here:
>
> 1) Decoding from existing slot (*not* initial snapshot construction)
> starts up, immediately picks up snapshot at restart_lsn (getting into
> SNAPBUILD_CONSISTENT) and in some xl_xact_assignment learns that it
> hadn't seen in full (no toplevel records) transaction which it is not
> even going to stream -- but still dies with "subtransation logged
> without...". That's my example above, and that's what people are
> complaining about. Here, usage of serialized snapshot and jump to
> SNAPBUILD_CONSISTENT is not just legit, it is essential: or order to be
> able to stream data since confirmed_flush_lsn, we must pick it up as we
> might not be able to assemble it from scratch in time. I mean, what is
> wrong here is not serialized snapshot usage but the check.
>

I was thinking if we have some way to skip processing of
xl_xact_assignment for such cases, then it might be better.  Say,
along with restart_lsn, if have some way to find corresponding nextXid
(below which we don't need to process records).  Say, if, during
SnapBuildProcessRunningXacts, we record the xid of txn we got by
ReorderBufferGetOldestTXN in slot, then can't we use it to skip such
records.

> (Lengthy comment to AllocateSnapshotBuilder in my
> 0001-Fix-serialized-snapshot-usage-for-new-logical-slots.patch explains
> why snapbuilder is not able to do FULL -> CONSISTENT transition on its
> own early enough for decoding from existing slot, so the jump on
> snapshot pickup is performed to CONSISTENT directly.)
>
> This happens with or without bac2fae05c.
>
> 2) *New* slot creationg process picks up serialized snapshot and jumps
> to CONSISTENT without waiting for all running xacts to finish. This is
> wrong and is a bug (of very low probability), as we risk promising to
> decode xacts which we might not seen in full. Sometimes it could be
> arrested by "subtransation logged without..." check, but not necessarily
> -- e.g. there could be no subxacts at all.
>
>
> > However,
> > what I am trying to understand is whether this can occur from another
> > path as well.  I think it might occur without using serialized
> > snapshots as well because we allow decoding xl_xact_assignment record
> > even when the snapshot state is not full.  Say in your above example,
> > even if the snapshot state is not SNAPBUILD_CONSISTENT as we haven't
> > used the serialized snapshot, then also, it can lead to the above
> > problem due to decoding of xl_xact_assignment.  I have not tried to
> > generate a test case for this, so I could easily be wrong here.
>
> What you are suggesting here is 3), which is, well, sort of form of 1),
> meaning "subxact logged..." error also pointlessly triggered, but for
> new slot creation. With bac2fae0, decoder might miss first
> xl_xact_assignment because it is beyond start of reading WAL but
> encounter second xl_xact_assignment and die on it due to this check
> before even getting FULL.
>
> But now that I'm thinking about it, I suspect that similar could happen
> even before bac2fae0. Imagine
>
> <start_of_reading_wal> <xl_xact_assignment_1> <SNAPBUILD_FULL> <subxact_change> <xl_xact_assignment_2> <commit> ... <SNAPBUILD_CONSISTENT>
>
> Before bac2fae0, xl_xact_assignment_1 was ignored, so
> xl_xact_assignment_1 would trigger the error.
>

'xl_xact_assignment_1 would trigger the error', I think in this part
of sentence you mean to say xl_xact_assignment_2 because we won't try
to decode xl_xact_assignment_1 before bac2fae0.  If so, won't we wait
for such a transaction to finish while changing the snapshot state
from SNAPBUILD_BUILDING_SNAPSHOT to SNAPBUILD_FULL_SNAPSHOT?  And if
the transaction is finished, ideally, we should not try to decode its
WAL and or create its ReorderBufferTxn.

> > What I am trying to get at is if the problem can only occur by using
> > serialized snapshots and we fix it by somehow not using them initial
> > slot creation, then ideally we don't need to remove the error in
> > ReorderBufferAssignChild, but if that is not the case, then we need to
> > discuss other cases as well.
>
> So, 1) and 3) mean this is not the case.
>

Right, I am thinking that if we can find some way to skip the xact
assignment for (1) and (3), then that might be a better fix.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Reply | Threaded
Open this post in threaded view
|

Re: ERROR: subtransaction logged without previous top-level txn record

akapila
On Fri, Feb 7, 2020 at 4:29 PM Amit Kapila <[hidden email]> wrote:

>
> On Wed, Feb 5, 2020 at 2:34 PM Arseny Sher <[hidden email]> wrote:
> >
>
> > > What I am trying to get at is if the problem can only occur by using
> > > serialized snapshots and we fix it by somehow not using them initial
> > > slot creation, then ideally we don't need to remove the error in
> > > ReorderBufferAssignChild, but if that is not the case, then we need to
> > > discuss other cases as well.
> >
> > So, 1) and 3) mean this is not the case.
> >
>
> Right, I am thinking that if we can find some way to skip the xact
> assignment for (1) and (3), then that might be a better fix.
>

Just to be clear, I am just brainstorming the ideas to see if we can
find some better solutions to the problems (1) and (3) described by
Arseny in the above email [1].  At this stage, it is not clear to me
that we have a fix simple enough to backpatch apart from what Arseny
posted in his fist email [2] (which is to stop demanding that top xact
must be seen before subxact in decoding.).

[1] - https://www.postgresql.org/message-id/87zhdx76d5.fsf%40ars-thinkpad
[2] - https://www.postgresql.org/message-id/87ftjifoql.fsf%40ars-thinkpad

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Reply | Threaded
Open this post in threaded view
|

Re: ERROR: subtransaction logged without previous top-level txn record

Arseny Sher-2

Amit Kapila <[hidden email]> writes:

>> 1) Decoding from existing slot (*not* initial snapshot construction)
>> starts up, immediately picks up snapshot at restart_lsn (getting into
>> SNAPBUILD_CONSISTENT) and in some xl_xact_assignment learns that it
>> hadn't seen in full (no toplevel records) transaction which it is not
>> even going to stream -- but still dies with "subtransation logged
>> without...". That's my example above, and that's what people are
>> complaining about. Here, usage of serialized snapshot and jump to
>> SNAPBUILD_CONSISTENT is not just legit, it is essential: or order to be
>> able to stream data since confirmed_flush_lsn, we must pick it up as we
>> might not be able to assemble it from scratch in time. I mean, what is
>> wrong here is not serialized snapshot usage but the check.
>>
>
> I was thinking if we have some way to skip processing of
> xl_xact_assignment for such cases, then it might be better.  Say,
> along with restart_lsn, if have some way to find corresponding nextXid
> (below which we don't need to process records).
I don't believe you can that without persisting additional
data. Basically, what we need is list of transactions who are running at
the point of snapshot serialization *and* already wrote something before
it -- those we hadn't seen in full and can't decode. We have no such
data currently. The closest thing we have is xl_running_xacts->nextXid,
but

  1) issued xid doesn't necessarily means xact actually wrote something,
     so we can't just skip xl_xact_assignment for xid < nextXid, it might
     still be decoded
  2) snapshot might be serialized not at xl_running_xacts anyway

Surely this thing doesn't deserve changing persisted data format.


Somehow I hadn't realized this earlier, so my comments/commit messages
in patches above were not accurate here; I've edited them. Also in the
first patch serialized snapshots are not no longer used for new slot
creation at all, as Andres suggested above. This is not principal, as I
said, but arguably makes things simpler a bit.

I've also found a couple of issues with slot copying feature, will post
in separate thread on them.



From 8952bc955585d12a06cdd6d339bcbabafc772768 Mon Sep 17 00:00:00 2001
From: Arseny Sher <[hidden email]>
Date: Tue, 22 Oct 2019 19:02:14 +0300
Subject: [PATCH 1/2] Don't use serialized snapshots during logical slot
 creation.

snapbuild.c enters SNAPBUILD_CONSISTENT immediately after deserializing the
snapshot. Generally this is incorrect because SNAPBUILD_CONSISTENT means not
just complete snapshot, but also reorderbuffer filled with all currently running
xacts. This is fine for decoding sessions with existing slots though as they
won't stream anything before confirmed_flush_lsn is reached anyway, at which
point all transactions which hadn't got into reorderbuffer would definitely
finish. However, new slots might be created too early, thus losing (not
decoding) parts of transactions committed after freshly created
confirmed_flush_lsn. This can happen under the following extremely unlucky
circumstances:
 - New slot creation reserves point in WAL since which it would read it
   (GetXLogInsertRecPtr);
 - It logs xl_running_xacts to start assembling a snapshot;
 - Running decoding session with another slot quickly reads this
   xl_running_xacts and serializes its snapshot;
 - New slot reads xl_running_xacts and picks this snapshot up, saying that it
   is ready to stream henceforth, though its reorderbuffer is empty.

Fix by forbidding to use serialized snapshots during slot creation. Actually we
could use snapshot if it was picked at xl_running_xacts where we could perform
usual FULL -> CONSISTENT transition conservatively via nextXid, but that seems
like a complication for a too small benefit.
---
 src/backend/replication/logical/logical.c   |  2 +-
 src/backend/replication/logical/snapbuild.c | 85 +++++++++++++++++++++++------
 2 files changed, 70 insertions(+), 17 deletions(-)

diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index da265f5294..67ed52e56f 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -327,7 +327,7 @@ CreateInitDecodingContext(char *plugin,
  ReplicationSlotMarkDirty();
  ReplicationSlotSave();
 
- ctx = StartupDecodingContext(NIL, restart_lsn, xmin_horizon,
+ ctx = StartupDecodingContext(NIL, InvalidXLogRecPtr, xmin_horizon,
  need_full_snapshot, false,
  read_page, prepare_write, do_write,
  update_progress);
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 0bd1d0f954..6f503a93aa 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -165,7 +165,7 @@ struct SnapBuild
 
  /*
  * Don't replay commits from an LSN < this LSN. This can be set externally
- * but it will also be advanced (never retreat) from within snapbuild.c.
+ * or established by snapbuild.c once consistent snapshot is assembled.
  */
  XLogRecPtr start_decoding_at;
 
@@ -309,7 +309,16 @@ SnapBuildStartNextPhaseAt(SnapBuild *builder, TransactionId at)
  * Allocate a new snapshot builder.
  *
  * xmin_horizon is the xid >= which we can be sure no catalog rows have been
- * removed, start_lsn is the LSN >= we want to replay commits.
+ * removed.
+ * start_lsn is InvalidXLogRecPtr or LSN >= we want to replay commits.
+ *  InvalidXLogRecPtr is used during slot creation; snapbuild will assemble
+ *  consistent snapshot and (if continue decoding -- no core code does that
+ *  currently) stream all further commits.
+ *  If normal LSN is passed, caller *must* be sure that WAL reading starts
+ *  early enough to get the snapshot and pick up the first change of
+ *  earliest xact to stream. Slot creation and advancement machinery
+ *  guarantees that slot's <restart_lsn, confirmed_flush_lsn> pair always
+ *  satisfies this.
  */
 SnapBuild *
 AllocateSnapshotBuilder(ReorderBuffer *reorder,
@@ -408,7 +417,8 @@ SnapBuildCurrentState(SnapBuild *builder)
 bool
 SnapBuildXactNeedsSkip(SnapBuild *builder, XLogRecPtr ptr)
 {
- return ptr < builder->start_decoding_at;
+ return XLogRecPtrIsInvalid(builder->start_decoding_at) ||
+ ptr < builder->start_decoding_at;
 }
 
 /*
@@ -945,16 +955,16 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
  TransactionIdPrecedes(xid, SnapBuildNextPhaseAt(builder))))
  {
  /* ensure that only commits after this are getting replayed */
- if (builder->start_decoding_at <= lsn)
- builder->start_decoding_at = lsn + 1;
+ Assert(XLogRecPtrIsInvalid(builder->start_decoding_at) ||
+   builder->start_decoding_at > lsn);
  return;
  }
 
  if (builder->state < SNAPBUILD_CONSISTENT)
  {
  /* ensure that only commits after this are getting replayed */
- if (builder->start_decoding_at <= lsn)
- builder->start_decoding_at = lsn + 1;
+ Assert(XLogRecPtrIsInvalid(builder->start_decoding_at) ||
+   builder->start_decoding_at > lsn);
 
  /*
  * If building an exportable snapshot, force xid to be tracked, even
@@ -966,6 +976,16 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
  }
  }
 
+ if (!XLogRecPtrIsInvalid(builder->start_decoding_at) &&
+ builder->start_decoding_at <= lsn)
+ {
+ /*
+ * We are going to stream this xact, so must already have fine
+ * snapshot.
+ */
+ Assert(builder->state == SNAPBUILD_CONSISTENT);
+ }
+
  for (nxact = 0; nxact < nsubxacts; nxact++)
  {
  TransactionId subxid = subxacts[nxact];
@@ -1250,10 +1270,10 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
  */
  if (running->oldestRunningXid == running->nextXid)
  {
- if (builder->start_decoding_at == InvalidXLogRecPtr ||
- builder->start_decoding_at <= lsn)
+ if (XLogRecPtrIsInvalid(builder->start_decoding_at))
  /* can decode everything after this */
  builder->start_decoding_at = lsn + 1;
+ Assert(builder->start_decoding_at >= lsn);
 
  /* As no transactions were running xmin/xmax can be trivially set. */
  builder->xmin = running->nextXid; /* < are finished */
@@ -1277,7 +1297,7 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
  else if (!builder->building_full_snapshot &&
  SnapBuildRestore(builder, lsn))
  {
- /* there won't be any state to cleanup */
+ /* there won't be much state to cleanup */
  return false;
  }
 
@@ -1358,6 +1378,10 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
  {
  builder->state = SNAPBUILD_CONSISTENT;
  SnapBuildStartNextPhaseAt(builder, InvalidTransactionId);
+ if (XLogRecPtrIsInvalid(builder->start_decoding_at))
+ /* can decode everything after this */
+ builder->start_decoding_at = lsn + 1;
+ Assert(builder->start_decoding_at >= lsn);
 
  ereport(LOG,
  (errmsg("logical decoding found consistent point at %X/%X",
@@ -1499,8 +1523,12 @@ SnapBuildSerialize(SnapBuild *builder, XLogRecPtr lsn)
    builder->last_serialized_snapshot <= lsn);
 
  /*
- * no point in serializing if we cannot continue to work immediately after
- * restoring the snapshot
+ * No point in serializing if the snapshot is not complete (< FULL).
+ * FULL snapshot per se is just as good as CONSISTENT; difference between
+ * these states is not a snapshot property, but whether we have filled
+ * reorderbuffer with all currently running xacts. However, there isn't
+ * much sense in serializing it as serialized snaps are used only for
+ * decoding from existing slots, which had already reached CONSISTENT.
  */
  if (builder->state < SNAPBUILD_CONSISTENT)
  return;
@@ -1688,7 +1716,8 @@ out:
 
 /*
  * Restore a snapshot into 'builder' if previously one has been stored at the
- * location indicated by 'lsn'. Returns true if successful, false otherwise.
+ * location indicated by 'lsn'.
+ * Returns true if successful, false otherwise.
  */
 static bool
 SnapBuildRestore(SnapBuild *builder, XLogRecPtr lsn)
@@ -1704,6 +1733,16 @@ SnapBuildRestore(SnapBuild *builder, XLogRecPtr lsn)
  if (builder->state == SNAPBUILD_CONSISTENT)
  return false;
 
+ /*
+ * Don't use serialized snapshot if we are not sure where all
+ * currently running xacts will finish (new slot creation).
+ * (Actually, if we came here through xl_running_xacts, we could perform
+ * SNAPBUILD_FULL_SNAPSHOT -> SNAPBUILD_CONSISTENT transition properly,
+ * but added lines of code would hardly worth the benefit.)
+ */
+ if (builder->start_decoding_at == InvalidXLogRecPtr)
+ return false;
+
  sprintf(path, "pg_logical/snapshots/%X-%X.snap",
  (uint32) (lsn >> 32), (uint32) lsn);
 
@@ -1888,7 +1927,6 @@ SnapBuildRestore(SnapBuild *builder, XLogRecPtr lsn)
  /* ok, we think the snapshot is sensible, copy over everything important */
  builder->xmin = ondisk.builder.xmin;
  builder->xmax = ondisk.builder.xmax;
- builder->state = ondisk.builder.state;
 
  builder->committed.xcnt = ondisk.builder.committed.xcnt;
  /* We only allocated/stored xcnt, not xcnt_space xids ! */
@@ -1901,6 +1939,23 @@ SnapBuildRestore(SnapBuild *builder, XLogRecPtr lsn)
  }
  ondisk.builder.committed.xip = NULL;
 
+ /*
+ * We probably don't have in reorderbuffer parts of currently running
+ * xacts right now. However, snapbuild can't determine *earliest* spot
+ * where decoding is safe (we'd see all xacts in full) on its own, because
+ * this requires to know which xids already wrote something, and such
+ * writes are beyond our reading of WAL. xl_running_xacts->nextXid
+ * doesn't help us here, because 1) issued xid doesn't necessarily means
+ * xact wrote something, so there would be a risk of arriving at
+ * CONSISTENT a bit later than possible which is ok for new slot creation,
+ * but unacceptable otherwise; 2) snapshot might be serialized at some
+ * other point.
+ *
+ * Slot must have provided as with safe start_decoding_at though, so just
+ * skip FULL -> CONSISTENT transition and set CONSISTENT directly.
+ */
+ builder->state = SNAPBUILD_CONSISTENT;
+
  /* our snapshot is not interesting anymore, build a new one */
  if (builder->snapshot != NULL)
  {
@@ -1911,8 +1966,6 @@ SnapBuildRestore(SnapBuild *builder, XLogRecPtr lsn)
 
  ReorderBufferSetRestartPoint(builder->reorder, lsn);
 
- Assert(builder->state == SNAPBUILD_CONSISTENT);
-
  ereport(LOG,
  (errmsg("logical decoding found consistent point at %X/%X",
  (uint32) (lsn >> 32), (uint32) lsn),
--
2.11.0


From 1449c8aacd2feadfa104a1ea4b73030e37cbba14 Mon Sep 17 00:00:00 2001
From: Arseny Sher <[hidden email]>
Date: Wed, 23 Oct 2019 15:56:46 +0300
Subject: [PATCH 2/2] Stop demanding that top xact must be seen before subxact
 in decoding.

Manifested as

ERROR:  subtransaction logged without previous top-level txn record

this check forbids legit behaviours like
 - First xl_xact_assignment record is beyond reading, i.e. earlier
   restart_lsn.
 - After restart_lsn there is some change of a subxact.
 - After that, there is second xl_xact_assignment (for another subxact)
   revealing relationship between top and first subxact.

Such transaction won't be streamed anyway because we hadn't seen it in full;
confirmed_flush_lsn must be past all these records. Saying for sure whether xact
of some record encountered after snapshot was deserialized can be streamed or
not requires to know whether it wrote something before deserialization point --
if yes, it hasn't been seen in full and can't be decoded. Snapshot doesn't have
such info, so there is no easy way to relax the check.
---
 src/backend/replication/logical/reorderbuffer.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 8ce28ad629..6faba6077e 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -778,9 +778,6 @@ ReorderBufferAssignChild(ReorderBuffer *rb, TransactionId xid,
  txn = ReorderBufferTXNByXid(rb, xid, true, &new_top, lsn, true);
  subtxn = ReorderBufferTXNByXid(rb, subxid, true, &new_sub, lsn, false);
 
- if (new_top && !new_sub)
- elog(ERROR, "subtransaction logged without previous top-level txn record");
-
  if (!new_sub)
  {
  if (subtxn->is_known_as_subxact)
--
2.11.0




--
Arseny Sher
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Reply | Threaded
Open this post in threaded view
|

Re: ERROR: subtransaction logged without previous top-level txn record

akapila
On Sun, Feb 9, 2020 at 9:37 PM Arseny Sher <[hidden email]> wrote:

>
> Amit Kapila <[hidden email]> writes:
>
> >> 1) Decoding from existing slot (*not* initial snapshot construction)
> >> starts up, immediately picks up snapshot at restart_lsn (getting into
> >> SNAPBUILD_CONSISTENT) and in some xl_xact_assignment learns that it
> >> hadn't seen in full (no toplevel records) transaction which it is not
> >> even going to stream -- but still dies with "subtransation logged
> >> without...". That's my example above, and that's what people are
> >> complaining about. Here, usage of serialized snapshot and jump to
> >> SNAPBUILD_CONSISTENT is not just legit, it is essential: or order to be
> >> able to stream data since confirmed_flush_lsn, we must pick it up as we
> >> might not be able to assemble it from scratch in time. I mean, what is
> >> wrong here is not serialized snapshot usage but the check.
> >>
> >
> > I was thinking if we have some way to skip processing of
> > xl_xact_assignment for such cases, then it might be better.  Say,
> > along with restart_lsn, if have some way to find corresponding nextXid
> > (below which we don't need to process records).
>
> I don't believe you can that without persisting additional
> data. Basically, what we need is list of transactions who are running at
> the point of snapshot serialization *and* already wrote something before
> it -- those we hadn't seen in full and can't decode. We have no such
> data currently. The closest thing we have is xl_running_xacts->nextXid,
> but
>
>   1) issued xid doesn't necessarily means xact actually wrote something,
>      so we can't just skip xl_xact_assignment for xid < nextXid, it might
>      still be decoded
>   2) snapshot might be serialized not at xl_running_xacts anyway
>
> Surely this thing doesn't deserve changing persisted data format.
>

I agree that it won't be a good idea to change the persisted data
format, especially in back-branches.  I don't see any fix which can
avoid this without doing major changes in the code.  Apart from this,
we have to come up with a solution for point (3) discussed in the
above email [1] which again could be change in design.  I think we can
first try to proceed with the patch
0002-Stop-demanding-that-top-xact-must-be-seen-before--v2 and then we
can discuss the other patch.  I can't see a way to write a test case
for this, can you think of any way?

Andres, anyone else, if you have a better idea other than changing the
code (removing the expected error) as in
0002-Stop-demanding-that-top-xact-must-be-seen-before--v2, then
please, let us know. You can read the points (1) and (3) in the email
above [1] where the below error check will hit for valid cases.  We
have discussed this in detail, but couldn't come up with anything
better than to remove this check.

@@ -778,9 +778,6 @@ ReorderBufferAssignChild(ReorderBuffer *rb,
TransactionId xid,
  txn = ReorderBufferTXNByXid(rb, xid, true, &new_top, lsn, true);
  subtxn = ReorderBufferTXNByXid(rb, subxid, true, &new_sub, lsn, false);

- if (new_top && !new_sub)
- elog(ERROR, "subtransaction logged without previous top-level txn record");
-

[1] - https://www.postgresql.org/message-id/87zhdx76d5.fsf%40ars-thinkpad

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Reply | Threaded
Open this post in threaded view
|

Re: ERROR: subtransaction logged without previous top-level txn record

Arseny Sher-2

Amit Kapila <[hidden email]> writes:

>> I don't believe you can that without persisting additional
>> data. Basically, what we need is list of transactions who are running at
>> the point of snapshot serialization *and* already wrote something before
>> it -- those we hadn't seen in full and can't decode. We have no such
>> data currently. The closest thing we have is xl_running_xacts->nextXid,
>> but
>>
>>   1) issued xid doesn't necessarily means xact actually wrote something,
>>      so we can't just skip xl_xact_assignment for xid < nextXid, it might
>>      still be decoded
>>   2) snapshot might be serialized not at xl_running_xacts anyway
>>
>> Surely this thing doesn't deserve changing persisted data format.
>>
>
> I agree that it won't be a good idea to change the persisted data
> format, especially in back-branches.  I don't see any fix which can
> avoid this without doing major changes in the code.  Apart from this,
> we have to come up with a solution for point (3) discussed in the
> above email [1] which again could be change in design.  I think we can
> first try to proceed with the patch
> 0002-Stop-demanding-that-top-xact-must-be-seen-before--v2 and then we
> can discuss the other patch.  I can't see a way to write a test case
> for this, can you think of any way?
Yeah, let's finally get it.

Attached is raw version of isolation test triggering false
'subtransaction logged without...' (case (1)). However, frankly I don't
see much value in it, so I'm dubious whether it should be included in
the patch.


diff --git a/contrib/test_decoding/specs/subxact_logged_without_top.spec b/contrib/test_decoding/specs/subxact_logged_without_top.spec
new file mode 100644
index 0000000000..55b51357a9
--- /dev/null
+++ b/contrib/test_decoding/specs/subxact_logged_without_top.spec
@@ -0,0 +1,51 @@
+
+setup
+{
+    SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'test_decoding'); -- must be first write in xact
+    CREATE TABLE harvest(apples integer);
+    CREATE OR REPLACE FUNCTION subxacts() returns void as $$
+    BEGIN
+      FOR i in 1 .. 65 LOOP
+        BEGIN
+          INSERT INTO harvest VALUES (42);
+        EXCEPTION
+        WHEN OTHERS THEN
+  RAISE;
+        END;
+    END LOOP;
+    END; $$LANGUAGE 'plpgsql';
+}
+
+teardown
+{
+    DROP TABLE IF EXISTS harvest;
+    SELECT 'stop' FROM pg_drop_replication_slot('isolation_slot');
+}
+
+session "s0"
+setup { SET synchronous_commit=on; }
+step "s0_begin" { BEGIN; }
+step "s0_first_subxact" {
+    DO LANGUAGE plpgsql $$
+      BEGIN
+        BEGIN
+          INSERT INTO harvest VALUES (41);
+ EXCEPTION WHEN OTHERS THEN RAISE;
+ END;
+      END $$;
+}
+step "s0_many_subxacts" { select subxacts(); }
+step "s0_commit" { COMMIT; }
+
+session "s1"
+setup { SET synchronous_commit=on; }
+step "s1_begin" { BEGIN; }
+step "s1_dml" { INSERT INTO harvest VALUES (43); }
+step "s1_commit" { COMMIT; }
+
+session "s2"
+setup { SET synchronous_commit=on; }
+step "s2_checkpoint" { CHECKPOINT; }
+step "s2_get_changes" { SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1'); }
+
+permutation "s0_begin" "s0_first_subxact" "s2_checkpoint" "s1_begin" "s1_dml" "s0_many_subxacts" "s0_commit" "s2_checkpoint" "s2_get_changes" "s1_commit" "s2_get_changes"



-- cheers, Arseny
12