My top-level query is: I'm using logical replication under pg 9.6 to do a kind of change data capture and I'm seeing occasional extended periods of significant lag. I'm not sure what conceptual model I'm missing in order to understand why this happens.
I'm running Postgres 9.6.19 from the postgres debian apt repos & the wal2json extension.
I have a custom client application which essentially executes pg_logical_slot_get_changes() for some manually-created logical replication slot on a loop.
I'm monitoring replication lag defined as pg_current_xlog_location() - confirmed_flush_lsn for that slot.
What I'm observing is - very occasionally - an extended period (hours long) wherein:
* The normal database write load continues or slightly increases
* calls to pg_logical_slot_get_changes() return no rows and confirmed_flush_lsn doesn't move
* the duration of a call to pg_logical_slot_get_changes() rises linearly over time
I understand from the docs and research that this is usually caused by a long-running write transaction, but I notice I'm still confused.
* I'm not 100% sure - I'm still confirming - but I'm fairly confident that I don't have any egregiously long write transactions (at least on that scale of hours). Are there any other common scenarios that can result in a similar 'blockage'? e.g some categories of long read-only transactions, or advisory locks, or other kinds of database activity like a vacuum?
* Conversely, from experimenting, it seems as if not all long-running write transactions cause pg_logical_slot_get_changes() to be unable to advance. In fact, I'm not able so far to produce a minimal set of simple queries which show that behaviour.
Given the following sequence of queries I see changes emitted:
-- session 1
insert into foo(bar,baz) values (1, 1);
-- session 2
insert into foo(bar,baz) values (2,2);
-- session 3
select data from pg_logical_slot_get_changes('example-slot', NULL, NULL, 'format-version', '2');
Session 3 is able to return the row from session 2 despite session 1's ongoing transaction starting first and not yet committing. Can you help me understand (or better yet point me to a resource which explains) the underlying logic defining how logical decoding does in fact get blocked by in-flight transactions?