Improving connection scalability: GetSnapshotData()


Re: Improving connection scalability: GetSnapshotData()

Andres Freund
Hi,

These benchmarks are on my workstation. The larger VM I used in the last
round wasn't currently available.

HW:
2 x Intel(R) Xeon(R) Gold 5215 (each 10 cores / 20 threads)
192GB Ram.
data directory is on a Samsung SSD 970 PRO 1TB

A bunch of terminals, emacs, mutt are open while the benchmark is
running. No browser.

Unless mentioned otherwise, relevant configuration options are:
max_connections=1200
shared_buffers=8GB
max_prepared_transactions=1000
synchronous_commit=local
huge_pages=on
fsync=off # to make it more likely to see scalability bottlenecks


Independent of the effects of this patch (i.e. including master) I had a
fairly hard time getting reproducible numbers for *low* client cases. I
found the numbers to be more reproducible if I pinned server/pgbench
onto the same core :(.  I chose to do that for the -c1 cases, to
benchmark the optimal behaviour, as that seemed to have the biggest
potential for regressions.

All numbers are best of three. Each test starts in a freshly created
cluster.


On 2020-03-30 17:04:00 +0300, Alexander Korotkov wrote:
> Following pgbench scripts comes first to my mind:
> 1) SELECT txid_current(); (artificial but good for checking corner case)

-M prepared -T 180
(did a few longer runs, but doesn't seem to matter much)

clients     tps master      tps pgxact
1           46118           46027
16          377357          440233
40          373304          410142
198         103912          105579

btw, there's some pretty horrible cacheline bouncing in txid_current(),
because backends first call ReadNextFullTransactionId() (acquires
XidGenLock in shared mode, reads ShmemVariableCache->nextFullXid), and
then separately GetNewTransactionId() (acquires XidGenLock exclusively,
reads & writes nextFullXid).

With fsync=off (and also with synchronous_commit=off) the numbers at
lower client counts are severely depressed and variable, due to the
walwriter going completely nuts (using more CPU than the backend doing
the queries). Because WAL writes are so fast on my storage, individual
XLogBackgroundFlush() calls are very quick. This leads to a *lot* of
kill()s from the backend, from within XLogSetAsyncXactLSN().  There's
got to be a bug here.  But it's unrelated.

> 2) Single insert statement (as example of very short transaction)

CREATE TABLE testinsert(c1 int not null, c2 int not null, c3 int not null, c4 int not null);
INSERT INTO testinsert VALUES(1, 2, 3, 4);

-M prepared -T 360

fsync on:
clients     tps master      tps pgxact
1           653             658
16          5687            5668
40          14212           14229
198         60483           62420

fsync off:
clients     tps master      tps pgxact
1           59356           59891
16          290626          299991
40          348210          355669
198         289182          291529

clients     tps master      tps pgxact
1024        47586           52135

-M simple
fsync off:
clients     tps master      tps pgxact
40          289077          326699
198         286011          299928




> 3) Plain pgbench read-write (you already did it for sure)

-s 100 -M prepared -T 700

autovacuum=off, fsync on:
clients     tps master      tps pgxact
1           474             479
16          4356            4476
40          8591            9309
198         20045           20261
1024        17986           18545

autovacuum=off, fsync off:
clients     tps master      tps pgxact
1           7828            7719
16          49069           50482
40          68241           73081
198         73464           77801
1024        25621           28410

I chose autovacuum off because otherwise the results vary much more
widely, and autovacuum isn't really needed for the workload.



> 4) pgbench read-write script with increased amount of SELECTs.  Repeat
> select from pgbench_accounts say 10 times with different aids.

I interspersed each server-side statement in the script with two selects
of other pgbench_accounts rows.
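A schematic version of such a script (a hypothetical sketch; the actual script isn't included in the mail, and the extra aid draws are my guess at the "different aids" Alexander suggested) could look like:

```
\set aid random(1, 100000 * :scale)
\set aid2 random(1, 100000 * :scale)
\set aid3 random(1, 100000 * :scale)
\set tid random(1, 10 * :scale)
\set bid random(1, 1 * :scale)
\set delta random(-5000, 5000)
BEGIN;
UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;
SELECT abalance FROM pgbench_accounts WHERE aid = :aid2;
SELECT abalance FROM pgbench_accounts WHERE aid = :aid3;
UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;
SELECT abalance FROM pgbench_accounts WHERE aid = :aid2;
SELECT abalance FROM pgbench_accounts WHERE aid = :aid3;
-- ... same pattern for the remaining tpcb-like statements ...
END;
```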

-s 100 -M prepared -T 700
autovacuum=off, fsync on:
clients     tps master      tps pgxact
1           365             367
198         20065           21391

-s 1000 -M prepared -T 700
autovacuum=off, fsync on:
clients     tps master      tps pgxact
16          2757            2880
40          4734            4996
198         16950           19998
1024        22423           24935


> 5) 10% pgbench read-write, 90% of pgbench read-only

-s 100 -M prepared -T 100 -bselect-only@9 -btpcb-like@1

autovacuum=off, fsync on:
clients     tps master      tps pgxact
16          37289           38656
40          81284           81260
198         189002          189357
1024        143986          164762


> > That definitely needs to be measured, due to the locking changes around procarrayadd/remove.
> >
> > I don't think regressions besides perhaps 2pc are likely - there's nothing really getting more expensive but procarray add/remove.
>
> I agree that ProcArrayAdd()/Remove() should be first subject of
> investigation, but other cases should be checked as well IMHO.

I'm not sure I really see the point. If a simple prepared transaction
doesn't show up as a negative difference, a more complex one won't
either, since the ProcArrayAdd()/Remove() related bottlenecks will play
a smaller and smaller role.


> Regarding 2pc, the following scenarios come to my mind:
> 1) pgbench read-write modified so that every transaction is prepared
> first, then commit prepared.

The numbers here are -M simple, because I wanted to use
PREPARE TRANSACTION 'ptx_:client_id';
COMMIT PREPARED 'ptx_:client_id';
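The script file itself isn't attached; assuming it's the standard tpcb-like transaction with the COMMIT replaced by a two-phase commit (my assumption, not shown in the mail), it would look roughly like:

```
\set aid random(1, 100000 * :scale)
\set bid random(1, 1 * :scale)
\set tid random(1, 10 * :scale)
\set delta random(-5000, 5000)
BEGIN;
UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;
SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;
UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;
INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);
PREPARE TRANSACTION 'ptx_:client_id';
COMMIT PREPARED 'ptx_:client_id';
```

Note that max_prepared_transactions has to be large enough for one prepared transaction per client (it's set to 1000 in the configuration above).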

-s 100 -M prepared -T 700 -f ~/tmp/pgbench-write-2pc.sql
autovacuum=off, fsync on:
clients     tps master      tps pgxact
1           251             249
16          2134            2174
40          3984            4089
198         6677            7522
1024        3641            3617


> 2) 10% of 2pc pgbench read-write, 90% normal pgbench read-write

-s 100 -M prepared -T 100 -f ~/tmp/pgbench-write-2pc.sql@1 -btpcb-like@9

clients     tps master      tps pgxact
198         18625           18906

> 3) 10% of 2pc pgbench read-write, 90% normal pgbench read-only

-s 100 -M prepared -T 100 -f ~/tmp/pgbench-write-2pc.sql@1 -bselect-only@9

clients     tps master      tps pgxact
198         84817           84350


I also benchmarked connection overhead, by using pgbench with -C
executing SELECT 1.

-T 10
clients     tps master      tps pgxact
1           572             587
16          2109            2140
40          2127            2136
198         2097            2129
1024        2101            2118



These numbers seem pretty decent to me. The regressions seem mostly
within noise. The one possible exception to that is plain pgbench
read/write with fsync=off and only a single session. I'll run more
benchmarks around that tomorrow (but now it's 6am :().

Greetings,

Andres Freund



Re: Improving connection scalability: GetSnapshotData()

Andres Freund
Hi,

On 2020-04-06 06:39:59 -0700, Andres Freund wrote:
> These benchmarks are on my workstation. The larger VM I used in the last
> round wasn't currently available.

One way to reproduce the problem at smaller connection counts / on
smaller machines is to take more snapshots. That doesn't fully reproduce
the problem, because resetting ->xmin without xact overhead is part of
the problem, but it's helpful.

I use a volatile function that loops over a trivial statement. There's
probably an easier / more extreme way to reproduce the problem. But it's
good enough.

-- setup
CREATE OR REPLACE FUNCTION snapme(p_ret int, p_loop int) RETURNS int
VOLATILE LANGUAGE plpgsql AS $$
BEGIN
    FOR x IN 1..p_loop LOOP
        EXECUTE 'SELECT 1';
    END LOOP;
    RETURN p_ret;
END;
$$;
-- statement executed in parallel
SELECT snapme(17, 10000);
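The parallel driver isn't shown; one way (hypothetical invocation, file name snapme.sql assumed) is to put the statement in a pgbench script and run it from many sessions:

```
-- snapme.sql; run with e.g.: pgbench -n -M prepared -c 16 -j 16 -T 60 -f snapme.sql
SELECT snapme(17, 10000);
```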

before (all above 1.5%):
+   37.82%  postgres  postgres          [.] GetSnapshotData
+    6.26%  postgres  postgres          [.] AllocSetAlloc
+    3.77%  postgres  postgres          [.] base_yyparse
+    3.04%  postgres  postgres          [.] core_yylex
+    1.94%  postgres  postgres          [.] grouping_planner
+    1.83%  postgres  libc-2.30.so      [.] __strncpy_avx2
+    1.80%  postgres  postgres          [.] palloc
+    1.73%  postgres  libc-2.30.so      [.] __memset_avx2_unaligned_erms

after:
+    5.75%  postgres  postgres          [.] base_yyparse
+    4.37%  postgres  postgres          [.] palloc
+    4.29%  postgres  postgres          [.] AllocSetAlloc
+    3.75%  postgres  postgres          [.] expression_tree_walker.part.0
+    3.14%  postgres  postgres          [.] core_yylex
+    2.51%  postgres  postgres          [.] subquery_planner
+    2.48%  postgres  postgres          [.] CheckExprStillValid
+    2.45%  postgres  postgres          [.] check_stack_depth
+    2.42%  postgres  plpgsql.so        [.] exec_stmt
+    1.92%  postgres  libc-2.30.so      [.] __memset_avx2_unaligned_erms
+    1.91%  postgres  postgres          [.] query_tree_walker
+    1.88%  postgres  libc-2.30.so      [.] __GI_____strtoll_l_internal
+    1.86%  postgres  postgres          [.] _SPI_execute_plan
+    1.85%  postgres  postgres          [.] assign_query_collations_walker
+    1.84%  postgres  postgres          [.] remove_useless_results_recurse
+    1.83%  postgres  postgres          [.] grouping_planner
+    1.50%  postgres  postgres          [.] set_plan_refs


If I change the workload to be
BEGIN;
SELECT txid_current();
SELECT snapme(17, 1000);
COMMIT;


the difference shrinks (because GetSnapshotData() only needs to look at
procs with xids, and xids are assigned for much longer), but it is still
significant:

before (all above 1.5%):
+   35.89%  postgres  postgres            [.] GetSnapshotData
+    7.94%  postgres  postgres            [.] AllocSetAlloc
+    4.42%  postgres  postgres            [.] base_yyparse
+    3.62%  postgres  libc-2.30.so        [.] __memset_avx2_unaligned_erms
+    2.87%  postgres  postgres            [.] LWLockAcquire
+    2.76%  postgres  postgres            [.] core_yylex
+    2.30%  postgres  postgres            [.] expression_tree_walker.part.0
+    1.81%  postgres  postgres            [.] MemoryContextAllocZeroAligned
+    1.80%  postgres  postgres            [.] transformStmt
+    1.66%  postgres  postgres            [.] grouping_planner
+    1.64%  postgres  postgres            [.] subquery_planner

after:
+   24.59%  postgres  postgres          [.] GetSnapshotData
+    4.89%  postgres  postgres          [.] base_yyparse
+    4.59%  postgres  postgres          [.] AllocSetAlloc
+    3.00%  postgres  postgres          [.] LWLockAcquire
+    2.76%  postgres  postgres          [.] palloc
+    2.27%  postgres  postgres          [.] MemoryContextAllocZeroAligned
+    2.26%  postgres  postgres          [.] check_stack_depth
+    1.77%  postgres  postgres          [.] core_yylex

Greetings,

Andres Freund



Re: Improving connection scalability: GetSnapshotData()

Andres Freund
In reply to this post by Andres Freund
Hi,

On 2020-04-06 06:39:59 -0700, Andres Freund wrote:

> > 3) Plain pgbench read-write (you already did it for sure)
>
> -s 100 -M prepared -T 700
>
> autovacuum=off, fsync on:
> clients     tps master      tps pgxact
> 1           474             479
> 16          4356            4476
> 40          8591            9309
> 198         20045           20261
> 1024        17986           18545
>
> autovacuum=off, fsync off:
> clients     tps master      tps pgxact
> 1           7828            7719
> 16          49069           50482
> 40          68241           73081
> 198         73464           77801
> 1024        25621           28410
>
> I chose autovacuum off because otherwise the results vary much more
> widely, and autovacuum isn't really needed for the workload.

> These numbers seem pretty decent to me. The regressions seem mostly
> within noise. The one possible exception to that is plain pgbench
> read/write with fsync=off and only a single session. I'll run more
> benchmarks around that tomorrow (but now it's 6am :().

The "one possible exception" turned out to be a "real" regression, but
one that was dead easy to fix: it was a DEBUG1 elog I had left in. The
overhead seems to have been solely the increased code size + the overhead
of errstart().  After that there's no difference in the single client
case anymore (I'd not expect a benefit).

Greetings,

Andres Freund



Re: Improving connection scalability: GetSnapshotData()

Andres Freund
Hi,

SEE BELOW: What, and what not, to do for v13.


Attached is a substantially polished version of my patches. Note that
the first three patches, as well as the last, are not intended to be
committed at this time / in this form - they're there to make testing
easier.

There is a lot of polish, but also a few substantial changes:

- To be compatible with old_snapshot_threshold I've revised the way
  heap_page_prune_opt() deals with old_snapshot_threshold. Now
  old_snapshot_threshold is only applied when we otherwise would have
  been unable to prune (both at the time of the pd_prune_xid check, and
  on individual tuples). This makes old_snapshot_threshold considerably
  cheaper and causes fewer conflicts.

  This required adding a version of HeapTupleSatisfiesVacuum that
  returns the horizon, rather than doing the horizon test itself; that
  way we can first test a tuple's horizon against the normal approximate
  threshold (making it an accurate threshold if needed) and only if that
  fails fall back to old_snapshot_threshold.

  The main reason here was not to improve old_snapshot_threshold, but to
  avoid a regression when it's being used. Because we need a horizon to
  pass to old_snapshot_threshold, we'd have to fall back to computing an
  accurate horizon too often.


- Previous versions of the patch had a TODO about computing horizons not
  just for one of shared / catalog / data tables, but for all of them at
  once, to avoid potentially needing to determine xmin horizons multiple
  times within one transaction.  For that I've renamed GetOldestXmin()
  to ComputeTransactionHorizons() and added wrapper functions instead of
  the different flag combinations we previously had for GetOldestXmin().

  This allows us to get rid of the PROCARRAY_* flags, and PROC_RESERVED.


- To address Thomas' review comment about not accessing nextFullXid
  without xidGenLock, I made latestCompletedXid a FullTransactionId (a
  fxid is needed to be able to infer 64bit xids for the horizons -
  otherwise there is some danger they could wrap).


- While improving the comment around the snapshot caching, I decided that
  the justification for correctness around not taking ProcArrayLock is
  too complicated (in particular around setting MyProc->xmin). While
  avoiding ProcArrayLock altogether is a substantial gain, the caching
  itself already helps a lot. It seems best to leave that for a later
  step.

  This means that the numbers for the very high connection counts aren't
  quite as good.


- Plenty of small changes to address issues I found while
  benchmarking. The only one of real note is that I had released
  XidGenLock after ProcArrayLock in ProcArrayAdd/Remove. For 2pc that
  causes noticeable unnecessary contention, because we'll wait for
  XidGenLock while holding ProcArrayLock...


I think this is pretty close to being committable.


But: This patch came in very late for v13, and it took me much longer to
polish it up than I had hoped (partially distraction due to various bugs
I found (in particular snapshot_too_old), partially covid19, partially
"hell if I know"). The patchset touches core parts of the system. While
both Thomas and David have done some review, they haven't for the latest
version (mea culpa).

In many other instances I would say that the above suggests slipping to
v14, given the timing.

The main reason I am considering pushing is that I think this patchset
addresses one of the most common critiques of postgres, as well as very
common, hard to fix, real-world production issues. GetSnapshotData() has
been a major bottleneck for about as long as I have been using postgres,
and this addresses that to a significant degree.

A second reason I am considering it is that, in my opinion, the changes
are not all that complicated and not even that large. At least not for a
change to a problem that we've long tried to improve.


Obviously we all have a tendency to think our own work is important, and
that we deserve a bit more leeway than others. So take the above with a
grain of salt.


Comments?


Greetings,

Andres Freund

Attachments:
v1-0001-TMP-work-around-missing-snapshot-registrations.patch (8K)
v1-0002-Improve-and-extend-asserts-for-a-snapshot-being-s.patch (6K)
v7-0001-TMP-work-around-missing-snapshot-registrations.patch (8K)
v7-0002-Improve-and-extend-asserts-for-a-snapshot-being-s.patch (6K)
v7-0003-Fix-xlogreader-fd-leak-encountered-with-twophase-.patch (1K)
v7-0004-Move-delayChkpt-from-PGXACT-to-PGPROC-it-s-rarely.patch (9K)
v7-0005-Change-the-way-backends-perform-tuple-is-invisibl.patch (130K)
v7-0006-snapshot-scalability-Move-PGXACT-xmin-back-to-PGP.patch (22K)
v7-0007-snapshot-scalability-Move-in-progress-xids-to-Pro.patch (49K)
v7-0008-snapshot-scalability-Move-PGXACT-vacuumFlags-to-P.patch (18K)
v7-0009-snapshot-scalability-Move-subxact-info-from-PGXAC.patch (19K)
v7-0010-Remove-now-unused-PGXACT.patch (9K)
v7-0011-snapshot-scalability-cache-snapshots-using-a-xact.patch (9K)

Re: Improving connection scalability: GetSnapshotData()

Andres Freund
On 2020-04-07 05:15:03 -0700, Andres Freund wrote:
> Attached is a substantially polished version of my patches. Note that
> the first three patches, as well as the last, are not intended to be
> committed at this time / in this form - they're there to make testing
> easier.

I didn't actually attach that last not-to-be-committed patch... It's
just the pgbench patch that I had attached before (and started a thread
about). Here it is again.

Attachment: v7-0012-WIP-pgbench.patch (5K)

Re: Improving connection scalability: GetSnapshotData()

Jonathan S. Katz-3
In reply to this post by Andres Freund
On 4/7/20 8:15 AM, Andres Freund wrote:

> I think this is pretty close to being committable.
>
>
> But: This patch came in very late for v13, and it took me much longer to
> polish it up than I had hoped (partially distraction due to various bugs
> I found (in particular snapshot_too_old), partially covid19, partially
> "hell if I know"). The patchset touches core parts of the system. While
> both Thomas and David have done some review, they haven't for the latest
> version (mea culpa).
>
> In many other instances I would say that the above suggests slipping to
> v14, given the timing.
>
> The main reason I am considering pushing is that I think this patchset
> addresses one of the most common critiques of postgres, as well as very
> common, hard to fix, real-world production issues. GetSnapshotData() has
> been a major bottleneck for about as long as I have been using postgres,
> and this addresses that to a significant degree.
>
> A second reason I am considering it is that, in my opinion, the changes
> are not all that complicated and not even that large. At least not for a
> change to a problem that we've long tried to improve.
Even as recently as earlier this week there was a blog post making the
rounds about the pain points of running PostgreSQL with many simultaneous
connections. Anything to help with that would go a long way, and looking
at the benchmarks you ran (at least with a quick, non-thorough glance),
this could and should be very positively impactful for a *lot* of
PostgreSQL users.

I can't comment on the "close to committable" aspect (at least not with
an informed, confident opinion), but if it is indeed close to committable,
and you can put in the work to finish polishing (read: "bug fixing" :-),
and we have a plan both for testing and, if need be, for reverting, I
would be okay with including it, for whatever my vote is worth. Is the
timing / situation ideal? No, but the way you describe it, it sounds like
there is enough that can be done to ensure it's ready for Beta 1.

From an RMT standpoint, perhaps this is one of the "Recheck at Mid-Beta"
items, as well.

Thanks,

Jonathan



Re: Improving connection scalability: GetSnapshotData()

Robert Haas
In reply to this post by Andres Freund
Comments:

In 0002, the comments in SnapshotSet() are virtually incomprehensible.
There's no commit message so the reasons for the changes are unclear.
But mostly looks unproblematic.

0003 looks like a fairly unrelated bug fix that deserves to be
discussed on the thread related to the original patch. Probably should
be an open item.

0004 looks fine.

Regarding 0005:

There's sort of a mix of terminology here: are we pruning tuples or
removing tuples or testing whether things are invisible? It would be
better to be more consistent.

+ * State for testing whether tuple versions may be removed. To improve
+ * GetSnapshotData() performance we don't compute an accurate value whenever
+ * acquiring a snapshot. Instead we compute boundaries above/below which we
+ * know that row versions are [not] needed anymore.  If at test time values
+ * falls in between the two, the boundaries can be recomputed (unless that
+ * just happened).

I don't like the wording here much. Maybe: State for testing whether
an XID is invisible to all current snapshots. If an XID precedes
maybe_needed_bound, it's definitely not visible to any current
snapshot. If it equals or follows definitely_needed_bound, that XID
isn't necessarily invisible to all snapshots. If it falls in between,
we're not sure. If, when testing a particular tuple, we see an XID
somewhere in the middle, we can try recomputing the boundaries to get
a more accurate answer (unless we've just done that). This is cheaper
than maintaining an accurate value all the time.

There's also the problem that this sorta contradicts the comment for
definitely_needed_bound. There it says intermediate values needed to
be tested against the ProcArray, whereas here it says we need to
recompute the bounds. That's kinda confusing.

ComputedHorizons seems like a fairly generic name. I think there's
some relationship between InvisibleToEveryoneState and
ComputedHorizons that should be brought out more clearly by the naming
and the comments.

+ /*
+ * The value of ShmemVariableCache->latestCompletedFullXid when
+ * ComputeTransactionHorizons() held ProcArrayLock.
+ */
+ FullTransactionId latest_completed;
+
+ /*
+ * The same for procArray->replication_slot_xmin and.
+ * procArray->replication_slot_catalog_xmin.
+ */
+ TransactionId slot_xmin;
+ TransactionId slot_catalog_xmin;

Department of randomly inconsistent names. In general I think it's
quite hard to grasp the relationship between the different fields in
ComputedHorizons.

+static inline bool OldSnapshotThresholdActive(void)
+{
+ return old_snapshot_threshold >= 0;
+}

Formatting.

+
+bool
+GinPageIsRecyclable(Page page)

Needs a comment. Or more than one.

- /*
- * If a transaction wrote a commit record in the gap between taking and
- * logging the snapshot then latestCompletedXid may already be higher than
- * the value from the snapshot, so check before we use the incoming value.
- */
- if (TransactionIdPrecedes(ShmemVariableCache->latestCompletedXid,
-   running->latestCompletedXid))
- ShmemVariableCache->latestCompletedXid = running->latestCompletedXid;
-
- Assert(TransactionIdIsNormal(ShmemVariableCache->latestCompletedXid));
-
- LWLockRelease(ProcArrayLock);

This code got relocated so that the lock is released later, but you
didn't add any comments explaining why. Somebody will move it back and
then you'll yell at them for doing it wrong. :-)

+ /*
+ * Must have called GetOldestVisibleTransactionId() if using SnapshotAny.
+ * Shouldn't have for an MVCC snapshot. (It's especially worth checking
+ * this for parallel builds, since ambuild routines that support parallel
+ * builds must work these details out for themselves.)
+ */
+ Assert(snapshot == SnapshotAny || IsMVCCSnapshot(snapshot));
+ Assert(snapshot == SnapshotAny ? TransactionIdIsValid(OldestXmin) :
+    !TransactionIdIsValid(OldestXmin));
+ Assert(snapshot == SnapshotAny || !anyvisible);

This looks like a gratuitous code relocation.

+HeapTupleSatisfiesVacuumHorizon(HeapTuple htup, Buffer buffer,
TransactionId *dead_after)

I don't much like the name dead_after, but I don't have a better
suggestion, either.

- * Deleter committed, but perhaps it was recent enough that some open
- * transactions could still see the tuple.
+ * Deleter committed, allow caller to check if it was recent enough that
+ * some open transactions could still see the tuple.

I think you could drop this change.

+ /*
+ * State related to determining whether a dead tuple is still needed.
+ */
+ InvisibleToEveryoneState *vistest;
+ TimestampTz limited_oldest_ts;
+ TransactionId limited_oldest_xmin;
+ /* have we made removal decision based on old_snapshot_threshold */
+ bool limited_oldest_committed;

Would benefit from more comments.

+ * accuring to prstate->vistest, but that can be removed based on

Typo.

Generally, heap_prune_satisfies_vacuum looks pretty good. The
limited_oldest_committed naming is confusing, but the comments make it
a lot clearer.

+ * If oldest btpo.xact in the deleted pages is invisible, then at

I'd say "invisible to everyone" here for clarity.

-latestCompletedXid variable.  This allows GetSnapshotData to use
-latestCompletedXid + 1 as xmax for its snapshot: there can be no
+latestCompletedFullXid variable.  This allows GetSnapshotData to use
+latestCompletedFullXid + 1 as xmax for its snapshot: there can be no

Is this fixing a preexisting README defect?

It might be useful if this README expanded on the new machinery a bit
instead of just updating the wording to account for it, but I'm not
sure exactly what that would look like or whether it would be too
duplicative of other things.

+void AssertTransactionIdMayBeOnDisk(TransactionId xid)

Formatting.

+ * Assert that xid is one that we could actually see on disk.

I don't know what this means. The whole purpose of this routine is
very unclear to me.

  * the secondary effect that it sets RecentGlobalXmin.  (This is critical
  * for anything that reads heap pages, because HOT may decide to prune
  * them even if the process doesn't attempt to modify any tuples.)
+ *
+ * FIXME: This comment is inaccurate / the code buggy. A snapshot that is
+ * not pushed/active does not reliably prevent HOT pruning (->xmin could
+ * e.g. be cleared when cache invalidations are processed).

Something needs to be done here... and in the other similar case.

Is this kind of review helpful?

...Robert



Re: Improving connection scalability: GetSnapshotData()

Andres Freund
Hi,

Thanks for the review!


On 2020-04-07 12:41:07 -0400, Robert Haas wrote:
> In 0002, the comments in SnapshotSet() are virtually incomprehensible.
> There's no commit message so the reasons for the changes are unclear.
> But mostly looks unproblematic.

I was planning to drop that patch pre-commit, at least for now.  I think
there are a few live bugs here, but they're all older. I did send a few
emails about this class of problem; unfortunately it's been a fairly
one-sided conversation so far ;)

https://www.postgresql.org/message-id/20200407072418.ccvnyjbrktyi3rzc%40alap3.anarazel.de


> 0003 looks like a fairly unrelated bug fix that deserves to be
> discussed on the thread related to the original patch. Probably should
> be an open item.

There was some discussion in a separate thread:
https://www.postgresql.org/message-id/20200406025651.fpzdb5yyb7qyhqko%40alap3.anarazel.de
The only reason for including it in this patch stack is that I can't
really exercise the patchset without the fix (it's a bit sad that this
issue went unnoticed for months before I found it as part of the
development of this patch).

Think I'll push a minimal version now, and add an open item.


>
> Regarding 0005:
>
> There's sort of a mix of terminology here: are we pruning tuples or
> removing tuples or testing whether things are invisible? It would be
> better to be more consistent.
>
> + * State for testing whether tuple versions may be removed. To improve
> + * GetSnapshotData() performance we don't compute an accurate value whenever
> + * acquiring a snapshot. Instead we compute boundaries above/below which we
> + * know that row versions are [not] needed anymore.  If at test time values
> + * falls in between the two, the boundaries can be recomputed (unless that
> + * just happened).
>
> I don't like the wording here much. Maybe: State for testing whether
> an XID is invisible to all current snapshots. If an XID precedes
> maybe_needed_bound, it's definitely not visible to any current
> snapshot. If it equals or follows definitely_needed_bound, that XID
> isn't necessarily invisible to all snapshots. If it falls in between,
> we're not sure. If, when testing a particular tuple, we see an XID
> somewhere in the middle, we can try recomputing the boundaries to get
> a more accurate answer (unless we've just done that). This is cheaper
> than maintaining an accurate value all the time.

I'll incorporate that, thanks.


> There's also the problem that this sorta contradicts the comment for
> definitely_needed_bound. There it says intermediate values needed to
> be tested against the ProcArray, whereas here it says we need to
> recompute the bounds. That's kinda confusing.

For me those are the same. Computing an accurate bound means visiting the
procarray. But I'll rephrase.


> ComputedHorizons seems like a fairly generic name. I think there's
> some relationship between InvisibleToEveryoneState and
> ComputedHorizons that should be brought out more clearly by the naming
> and the comments.

I don't like the naming of ComputedHorizons, ComputeTransactionHorizons
much... But I find it hard to come up with something that's meaningfully
better.

I'll add a comment.


> + /*
> + * The value of ShmemVariableCache->latestCompletedFullXid when
> + * ComputeTransactionHorizons() held ProcArrayLock.
> + */
> + FullTransactionId latest_completed;
> +
> + /*
> + * The same for procArray->replication_slot_xmin and.
> + * procArray->replication_slot_catalog_xmin.
> + */
> + TransactionId slot_xmin;
> + TransactionId slot_catalog_xmin;
>
> Department of randomly inconsistent names. In general I think it's
> quite hard to grasp the relationship between the different fields in
> ComputedHorizons.

What's the inconsistency? The dropped replication_ prefix vs. the
dropped FullXid suffix?



> +
> +bool
> +GinPageIsRecyclable(Page page)
>
> Needs a comment. Or more than one.

Well, I started to write one a couple of times. But it's really just
moving the pre-existing code from the macro into a function, and there
weren't any comments around *any* of it before. All my comment attempts
basically just restated the code in so many words, or would have
required more work than seemed justified in the context of just moving
code.


> - /*
> - * If a transaction wrote a commit record in the gap between taking and
> - * logging the snapshot then latestCompletedXid may already be higher than
> - * the value from the snapshot, so check before we use the incoming value.
> - */
> - if (TransactionIdPrecedes(ShmemVariableCache->latestCompletedXid,
> -   running->latestCompletedXid))
> - ShmemVariableCache->latestCompletedXid = running->latestCompletedXid;
> -
> - Assert(TransactionIdIsNormal(ShmemVariableCache->latestCompletedXid));
> -
> - LWLockRelease(ProcArrayLock);
>
> This code got relocated so that the lock is released later, but you
> didn't add any comments explaining why. Somebody will move it back and
> then you'll yell at them for doing it wrong. :-)

I just moved it because the code now references ->nextFullXid, which was
previously maintained after latestCompletedXid.


> + /*
> + * Must have called GetOldestVisibleTransactionId() if using SnapshotAny.
> + * Shouldn't have for an MVCC snapshot. (It's especially worth checking
> + * this for parallel builds, since ambuild routines that support parallel
> + * builds must work these details out for themselves.)
> + */
> + Assert(snapshot == SnapshotAny || IsMVCCSnapshot(snapshot));
> + Assert(snapshot == SnapshotAny ? TransactionIdIsValid(OldestXmin) :
> +    !TransactionIdIsValid(OldestXmin));
> + Assert(snapshot == SnapshotAny || !anyvisible);
>
> This looks like a gratuitous code relocation.

I found it hard to understand the comments because the Asserts were done
further away from where the relevant decisions were made. And I think I
have history to back me up: it looks to me like that is because
ab0dfc961b6a821f23d9c40c723d11380ce195a6 just put the progress-related
code between the if (!scan) and the Asserts.


> +HeapTupleSatisfiesVacuumHorizon(HeapTuple htup, Buffer buffer,
> TransactionId *dead_after)
>
> I don't much like the name dead_after, but I don't have a better
> suggestion, either.
>
> - * Deleter committed, but perhaps it was recent enough that some open
> - * transactions could still see the tuple.
> + * Deleter committed, allow caller to check if it was recent enough that
> + * some open transactions could still see the tuple.
>
> I think you could drop this change.

Ok. Wasn't quite sure what to do with that comment.


> Generally, heap_prune_satisfies_vacuum looks pretty good. The
> limited_oldest_committed naming is confusing, but the comments make it
> a lot clearer.

I didn't like _committed much either. But couldn't come up with
something short. _relied_upon?


> + * If oldest btpo.xact in the deleted pages is invisible, then at
>
> I'd say "invisible to everyone" here for clarity.
>
> -latestCompletedXid variable.  This allows GetSnapshotData to use
> -latestCompletedXid + 1 as xmax for its snapshot: there can be no
> +latestCompletedFullXid variable.  This allows GetSnapshotData to use
> +latestCompletedFullXid + 1 as xmax for its snapshot: there can be no
>
> Is this fixing a preexisting README defect?

It's just adjusting for the changed name of latestCompletedXid to
latestCompletedFullXid, as part of widening it to 64bits.  I'm not
really a fan of adding that to the variable name, but surrounding code
already did it (cf VariableCache->nextFullXid), so I thought I'd follow
suit.


> It might be useful if this README expanded on the new machinery a bit
> instead of just updating the wording to account for it, but I'm not
> sure exactly what that would look like or whether it would be too
> duplicative of other things.



> +void AssertTransactionIdMayBeOnDisk(TransactionId xid)
>
> Formatting.
>
> + * Assert that xid is one that we could actually see on disk.
>
> I don't know what this means. The whole purpose of this routine is
> very unclear to me.

It's intended to be a double check against


>   * the secondary effect that it sets RecentGlobalXmin.  (This is critical
>   * for anything that reads heap pages, because HOT may decide to prune
>   * them even if the process doesn't attempt to modify any tuples.)
> + *
> + * FIXME: This comment is inaccurate / the code buggy. A snapshot that is
> + * not pushed/active does not reliably prevent HOT pruning (->xmin could
> + * e.g. be cleared when cache invalidations are processed).
>
> Something needs to be done here... and in the other similar case.

Indeed. I wrote a separate email about it yesterday:
https://www.postgresql.org/message-id/20200407072418.ccvnyjbrktyi3rzc%40alap3.anarazel.de



> Is this kind of review helpful?

Yes!

Greetings,

Andres Freund



Re: Improving connection scalability: GetSnapshotData()

Robert Haas
More review, since it sounds like you like it:

0006 - Boring. But I'd probably make this move both xmin and xid back,
with related comment changes; see also next comment.

0007 -

+ TransactionId xidCopy; /* this backend's xid, a copy of this proc's
+    ProcGlobal->xids[] entry. */

Can we please NOT put Copy into the name like that? Pretty please?

+ int pgxactoff; /* offset into various ProcGlobal-> arrays
+ * NB: can change any time unless locks held!
+ */

I'm going to add the helpful comment "NB: can change any time unless
locks held!" to every data structure in the backend that is in shared
memory and not immutable. No need, of course, to mention WHICH
locks...

On a related note, PROC_HDR really, really, really needs comments
explaining the locking regimen for the new xids field.

+ ProcGlobal->xids[pgxactoff] = InvalidTransactionId;

Apparently this array is not dense in the sense that it excludes
unused slots, but comments elsewhere don't seem to entirely agree.
Maybe the comments discussing how it is "dense" need to be a little
more precise about this.

+ for (int i = 0; i < nxids; i++)

I miss my C89. Yeah, it's just me.

- if (!suboverflowed)
+ if (suboverflowed)
+ continue;
+

Do we really need to do this kind of diddling in this patch? I mean
yes to the idea, but no to things that are going to make it harder to
understand what happened if this blows up.

+ uint32 TotalProcs = MaxBackends + NUM_AUXILIARY_PROCS + max_prepared_xacts;

  /* ProcGlobal */
  size = add_size(size, sizeof(PROC_HDR));
- /* MyProcs, including autovacuum workers and launcher */
- size = add_size(size, mul_size(MaxBackends, sizeof(PGPROC)));
- /* AuxiliaryProcs */
- size = add_size(size, mul_size(NUM_AUXILIARY_PROCS, sizeof(PGPROC)));
- /* Prepared xacts */
- size = add_size(size, mul_size(max_prepared_xacts, sizeof(PGPROC)));
- /* ProcStructLock */
+ size = add_size(size, mul_size(TotalProcs, sizeof(PGPROC)));

This seems like a bad idea. If we establish a precedent that it's OK
to have sizing routines that don't use add_size() and mul_size(),
people are going to cargo cult that into places where there is more
risk of overflow than there is here.

You've got a bunch of different places that talk about the new PGXACT
array and they are somewhat redundant yet without saying exactly the
same thing every time either. I think that needs cleanup.

One thing I didn't see is any clear discussion of what happens if the
two copies of the XID information don't agree with each other. That
should be added someplace, either in an appropriate code comment or in
a README or something. I *think* both are protected by the same locks,
but there's also some unlocked access to those structure members, so
it's not entirely a slam dunk.

...Robert



Re: Improving connection scalability: GetSnapshotData()

Andres Freund
In reply to this post by Andres Freund
Hi,

On 2020-04-07 10:51:12 -0700, Andres Freund wrote:

> > +void AssertTransactionIdMayBeOnDisk(TransactionId xid)
> >
> > Formatting.
> >
> > + * Assert that xid is one that we could actually see on disk.
> >
> > I don't know what this means. The whole purpose of this routine is
> > very unclear to me.
>
> It's intended to be a double check against

forgetting things...? Err:

It is intended to make it easier to detect cases where the passed
TransactionId is not safe against wraparound. If there is protection
against wraparound, then the xid

a) may never be older than ShmemVariableCache->oldestXid (since
   otherwise the rel/datfrozenxid could not have advanced past the xid,
   and because oldestXid is what prevents ->nextFullXid from
   advancing far enough to cause a wraparound)

b) cannot be >= ShmemVariableCache->nextFullXid. If it is, it cannot
   recently have come from GetNewTransactionId(), and thus there is no
   anti-wraparound protection either.

As full wraparounds are painful to exercise in testing,
AssertTransactionIdMayBeOnDisk() is intended to make it easier to detect
potential hazards.

The reason for the *OnDisk naming is that [oldestXid, nextFullXid) is
the appropriate check for values actually stored in tables. There could,
and probably should, be a narrower assertion ensuring that a xid is
protected against being pruned away (i.e. a PGPROC's xmin covering it).

The reason for being concerned enough in the new code to add the new
assertion helper (as well as a major motivating reason for making the
horizons 64 bit xids) is that it's much harder to ensure that "global
xmin" style horizons don't wrap around. By definition they include other
backend's ->xmin, and those can be released without a lock at any
time. As a lot of wraparound issues are triggered by very longrunning
transactions, it is not even unlikely to hit such problems: At some
point somebody is going to kill that old backend and ->oldestXid will
advance very quickly.

There is a lot of code that is pretty unsafe around wraparounds... They
are getting easier and easier to hit on a regular schedule in production
(plenty of databases hit wraparounds multiple times a week). And I don't
think we as PG developers quite take that into account.


Does that make some sense? Do you have a better suggestion for a name?

Greetings,

Andres Freund



Re: Improving connection scalability: GetSnapshotData()

Peter Geoghegan-4
On Tue, Apr 7, 2020 at 11:28 AM Andres Freund <[hidden email]> wrote:
> There is a lot of code that is pretty unsafe around wraparounds... They
> are getting easier and easier to hit on a regular schedule in production
> (plenty of databases hit wraparounds multiple times a week). And I don't
> think we as PG developers quite take that into account.

It would be nice if there was high level documentation on wraparound
hazards. Maybe even a dedicated README file.

--
Peter Geoghegan



Re: Improving connection scalability: GetSnapshotData()

Robert Haas
In reply to this post by Andres Freund
On Tue, Apr 7, 2020 at 2:28 PM Andres Freund <[hidden email]> wrote:
> Does that make some sense? Do you have a better suggestion for a name?

I think it makes sense. I have two basic problems with the name. The
first is that "on disk" doesn't seem to be a very clear way of
describing what you're actually checking here, and it definitely
doesn't refer to an existing concept which sophisticated hackers can
be expected to understand. The second is that "may" is ambiguous in
English: it can either mean that something is permissible ("Johnny,
you may go to the bathroom") or that we do not have certain knowledge
of it ("Johnny may be in the bathroom"). When it is followed by "be",
it usually has the latter sense, although there are exceptions (e.g.
"She may be discharged from the hospital today if she wishes, but we
recommend that she stay for another day"). Consequently, I found that
use of "may be" in this context wicked confusing. What came to mind
was:

bool
RobertMayBeAGiraffe(void)
{
    return true; // i mean, i haven't seen him since last week, so who knows?
}

So I suggest a name with "Is" or no verb, rather than one with
"MayBe." And I suggest something else instead of "OnDisk," e.g.
AssertTransactionIdIsInUsableRange() or
TransactionIdIsInAllowableRange() or
AssertTransactionIdWraparoundProtected(). I kind of like that last
one, but YMMV.

I wish to clarify that in sending these review emails I am taking no
position on whether or not it is prudent to commit any or all of them.
I do not think we can rule out the possibility that they will Break
Things, but neither do I wish to be seen as That Guy Who Stands In The
Way of Important Improvements. Time does not permit me a detailed
review anyway. So, these comments are provided in the hope that they
may be useful but without endorsement or acrimony. If other people
want to endorse or, uh, acrimoniate, based on my comments, that is up
to them.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Improving connection scalability: GetSnapshotData()

Robert Haas
In reply to this post by Andres Freund
On Tue, Apr 7, 2020 at 1:51 PM Andres Freund <[hidden email]> wrote:
> > ComputedHorizons seems like a fairly generic name. I think there's
> > some relationship between InvisibleToEveryoneState and
> > ComputedHorizons that should be brought out more clearly by the naming
> > and the comments.
>
> I don't like the naming of ComputedHorizons, ComputeTransactionHorizons
> much... But I find it hard to come up with something that's meaningfully
> better.

It would help to stick XID in there, like ComputedXIDHorizons. What I
find really baffling is that you seem to have two structures in the
same file that have essentially the same purpose, but the second one
(ComputedHorizons) has a lot more stuff in it. I can't understand why.

> What's the inconsistency? The dropped replication_ vs dropped FullXid
> postfix?

Yeah, just having the member names be randomly different between the
structs. Really harms greppability.

> > Generally, heap_prune_satisfies_vacuum looks pretty good. The
> > limited_oldest_committed naming is confusing, but the comments make it
> > a lot clearer.
>
> I didn't like _committed much either. But couldn't come up with
> something short. _relied_upon?

oldSnapshotLimitUsed or old_snapshot_limit_used, like currentCommandIdUsed?

> It's just adjusting for the changed name of latestCompletedXid to
> latestCompletedFullXid, as part of widening it to 64bits.  I'm not
> really a fan of adding that to the variable name, but surrounding code
> already did it (cf VariableCache->nextFullXid), so I thought I'd follow
> suit.

Oops, that was me misreading the diff. Sorry for the noise.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Improving connection scalability: GetSnapshotData()

Andres Freund
In reply to this post by Robert Haas
Hi,

On 2020-04-07 14:28:09 -0400, Robert Haas wrote:

> More review, since it sounds like you like it:
>
> 0006 - Boring. But I'd probably make this move both xmin and xid back,
> with related comment changes; see also next comment.
>
> 0007 -
>
> + TransactionId xidCopy; /* this backend's xid, a copy of this proc's
> +    ProcGlobal->xids[] entry. */
>
> Can we please NOT put Copy into the name like that? Pretty please?

Do you have a suggested naming scheme? Something indicating that it's
not the only place that needs to be updated?


> + int pgxactoff; /* offset into various ProcGlobal-> arrays
> + * NB: can change any time unless locks held!
> + */
>
> I'm going to add the helpful comment "NB: can change any time unless
> locks held!" to every data structure in the backend that is in shared
> memory and not immutable. No need, of course, to mention WHICH
> locks...

I think it's more on-point here, because we need to hold one of the
locks* even for changes to a backend's own status that one could
reasonably expect to be safe to at least inspect. E.g. looking at
ProcGlobal->xids[MyProc->pgxactoff]
doesn't look suspicious, but could very well return another backends
xid, if neither ProcArrayLock nor XidGenLock is held (because a
ProcArrayRemove() could have changed pgxactoff if a previous entry was
removed).

*see comment at PROC_HDR:

 *
 * Adding/Removing an entry into the procarray requires holding *both*
 * ProcArrayLock and XidGenLock in exclusive mode (in that order). Both are
 * needed because the dense arrays (see below) are accessed from
 * GetNewTransactionId() and GetSnapshotData(), and we don't want to add
 * further contention by both using one lock. Adding/Removing a procarray
 * entry is much less frequent.
 */
typedef struct PROC_HDR
{
        /* Array of PGPROC structures (not including dummies for prepared txns) */
        PGPROC   *allProcs;


        /*
         * Arrays with per-backend information that is hotly accessed, indexed by
         * PGPROC->pgxactoff. These are in separate arrays for three reasons:
         * First, to allow for as tight loops accessing the data as
         * possible. Second, to prevent updates of frequently changing data from
         * invalidating cachelines shared with less frequently changing
         * data. Third to condense frequently accessed data into as few cachelines
         * as possible.
         *
         * When entering a PGPROC for 2PC transactions with ProcArrayAdd(), those
         * copies are used to provide the contents of the dense data, and will be
         * transferred by ProcArrayAdd() while it already holds ProcArrayLock.
         */

there's also

 * The various *Copy fields are copies of the data in ProcGlobal arrays that
 * can be accessed without holding ProcArrayLock / XidGenLock (see PROC_HDR
 * comments).


I had a more explicit warning/explanation about the dangers of accessing
the arrays without locks, but apparently went to far when reducing
duplicated comments.


> On a related note, PROC_HDR really, really, really needs comments
> explaining the locking regimen for the new xids field.


I'll expand the above, in particular highlighting the danger of
pgxactoff changing.


> + ProcGlobal->xids[pgxactoff] = InvalidTransactionId;
>
> Apparently this array is not dense in the sense that it excludes
> unused slots, but comments elsewhere don't seem to entirely agree.

What do you mean with "unused slots"? Backends that committed?

Dense is intended to describe that the arrays only contain currently
"live" entries. I.e. that the first procArray->numProcs entries in each
array have the data for all procs (including prepared xacts) that are
"connected".  This is extending the concept that already existed for
procArray->pgprocnos.

Whereas the PGPROC/PGXACT arrays have "unused" entries interspersed.

This is what previously led to the slow loop in GetSnapshotData(),
where we had to iterate over PGXACTs via an indirection through
procArray->pgprocnos. I.e. to only look at in-use PGXACTs we had to go
through allProcs[arrayP->pgprocnos[i]], which is, uh, suboptimal for
a performance-critical routine holding a central lock.

I'll try to expand the comments around "dense", but let me know if you
have a better descriptor.


> Maybe the comments discussing how it is "dense" need to be a little
> more precise about this.
>
> + for (int i = 0; i < nxids; i++)
>
> I miss my C89. Yeah, it's just me.

Oh, dear god. I hate declaring variables like 'i' at function scope. The
bug that haunted me the longest in the development of this patch was in
XidCacheRemoveRunningXids, where there are both i and j, and a macro
XidCacheRemove(i), but the macro gets passed j as i.


> - if (!suboverflowed)
> + if (suboverflowed)
> + continue;
> +
>
> Do we really need to do this kind of diddling in this patch? I mean
> yes to the idea, but no to things that are going to make it harder to
> understand what happened if this blows up.

I can try to reduce those differences. Given the rest of the changes it
didn't seem likely to matter. I found it hard to keep the nesting of the
branches in my head when seeing:
                                        }
                                }
                        }
                }
        }



> + uint32 TotalProcs = MaxBackends + NUM_AUXILIARY_PROCS + max_prepared_xacts;
>
>   /* ProcGlobal */
>   size = add_size(size, sizeof(PROC_HDR));
> - /* MyProcs, including autovacuum workers and launcher */
> - size = add_size(size, mul_size(MaxBackends, sizeof(PGPROC)));
> - /* AuxiliaryProcs */
> - size = add_size(size, mul_size(NUM_AUXILIARY_PROCS, sizeof(PGPROC)));
> - /* Prepared xacts */
> - size = add_size(size, mul_size(max_prepared_xacts, sizeof(PGPROC)));
> - /* ProcStructLock */
> + size = add_size(size, mul_size(TotalProcs, sizeof(PGPROC)));
>
> This seems like a bad idea. If we establish a precedent that it's OK
> to have sizing routines that don't use add_size() and mul_size(),
> people are going to cargo cult that into places where there is more
> risk of overflow than there is here.

Hm. I'm not sure I see the problem. Are you concerned that TotalProcs
would overflow due to a too-large MaxBackends or max_prepared_xacts? The
multiplication itself is still protected by mul_size(). It didn't seem
correct to use add_size for the TotalProcs addition, since that's not
really a size. And since the limit for procs is much lower than
UINT32_MAX...

It seems worse to add a separate add_size calculation for each type of
proc entry, for each of the individual arrays. We'd end up with

        size = add_size(size, sizeof(PROC_HDR));
        size = add_size(size, mul_size(MaxBackends, sizeof(PGPROC)));
        size = add_size(size, mul_size(NUM_AUXILIARY_PROCS, sizeof(PGPROC)));
        size = add_size(size, mul_size(max_prepared_xacts, sizeof(PGPROC)));
        size = add_size(size, sizeof(slock_t));

        size = add_size(size, mul_size(MaxBackends, sizeof(*ProcGlobal->xids)));
        size = add_size(size, mul_size(NUM_AUXILIARY_PROCS, sizeof(*ProcGlobal->xids)));
        size = add_size(size, mul_size(max_prepared_xacts, sizeof(*ProcGlobal->xids)));
        size = add_size(size, mul_size(MaxBackends, sizeof(*ProcGlobal->subxidStates)));
        size = add_size(size, mul_size(NUM_AUXILIARY_PROCS, sizeof(*ProcGlobal->subxidStates)));
        size = add_size(size, mul_size(max_prepared_xacts, sizeof(*ProcGlobal->subxidStates)));
        size = add_size(size, mul_size(MaxBackends, sizeof(*ProcGlobal->vacuumFlags)));
        size = add_size(size, mul_size(NUM_AUXILIARY_PROCS, sizeof(*ProcGlobal->vacuumFlags)));
        size = add_size(size, mul_size(max_prepared_xacts, sizeof(*ProcGlobal->vacuumFlags)));

instead of

        size = add_size(size, sizeof(PROC_HDR));
        size = add_size(size, mul_size(TotalProcs, sizeof(PGPROC)));
        size = add_size(size, sizeof(slock_t));

        size = add_size(size, mul_size(TotalProcs, sizeof(*ProcGlobal->xids)));
        size = add_size(size, mul_size(TotalProcs, sizeof(*ProcGlobal->subxidStates)));
        size = add_size(size, mul_size(TotalProcs, sizeof(*ProcGlobal->vacuumFlags)));

which seems clearly worse.


> You've got a bunch of different places that talk about the new PGXACT
> array and they are somewhat redundant yet without saying exactly the
> same thing every time either. I think that needs cleanup.

Could you point out a few of those comments? I'm not entirely sure which
ones you're talking about.


> One thing I didn't see is any clear discussion of what happens if the
> two copies of the XID information don't agree with each other.

It should never happen. There are asserts that try to ensure that. For
the xid-less case:

ProcArrayEndTransaction(PGPROC *proc, TransactionId latestXid)
...
                Assert(!TransactionIdIsValid(proc->xidCopy));
                Assert(proc->subxidStatusCopy.count == 0);
and for the case of having an xid:

ProcArrayEndTransactionInternal(PGPROC *proc, TransactionId latestXid)
...
        Assert(ProcGlobal->xids[pgxactoff] == proc->xidCopy);
...
        Assert(ProcGlobal->subxidStates[pgxactoff].count == proc->subxidStatusCopy.count &&
                   ProcGlobal->subxidStates[pgxactoff].overflowed == proc->subxidStatusCopy.overflowed);


> That should be added someplace, either in an appropriate code comment
> or in a README or something. I *think* both are protected by the same
> locks, but there's also some unlocked access to those structure
> members, so it's not entirely a slam dunk.

Hm. I considered who is allowed to modify those, and when, to really be
covered by the existing comments in transam/README. In particular in the
"Interlocking Transaction Begin, Transaction End, and Snapshots"
section.

Do you think that a comment explaining that the *Copy version has to be
kept up2date at all times (except when not yet added with ProcArrayAdd)
would ameliorate that concern?

Greetings,

Andres Freund



Re: Improving connection scalability: GetSnapshotData()

Robert Haas
In reply to this post by Robert Haas
0008 -

Here again, I greatly dislike putting Copy in the name. It makes
little sense to pretend that either is the original and the other is
the copy. You just have the same data in two places. If one of them is
more authoritative, the place to explain that is in the comments, not
by elongating the structure member name and supposing anyone will be
able to make something of that.

+ *
+ * XXX: That's why this is using vacuumFlagsCopy.

I am not sure there's any problem with the code that needs fixing
here, so I might think about getting rid of this XXX. But this gets
back to my complaint about the locking regime being unclear. What I
think you need to do here is rephrase the previous paragraph so that
it explains the reason for using this copy a bit better. Like "We read
the copy of vacuumFlags from PGPROC rather than visiting the copy
attached to ProcGlobal because we can do that without taking a lock.
See fuller explanation in <place>." Or whatever.

0009, 0010 -

I think you've got this whole series of things divided up too finely.
Like, 0005 feels like the meat of it, and that has a bunch of things
in it that could plausibly be separated out as separate commits. 0007
also seems to do more than one kind of thing (see my comment regarding
moving some of that into 0006). But whacking everything around like a
crazy man in 0005 and a little more in 0007 and then doing the
following cleanup in these little tiny steps seems pretty lame.
Separating 0009 from 0010 is maybe the clearest example of that, but
IMHO it's pretty unclear why both of these shouldn't be merged with
0008.

To be clear, I exaggerate for effect. 0005 is not whacking everything
around like a crazy man. But it is a non-minimal patch, whereas I
consider 0009 and 0010 to be sub-minimal.

My comments on the Copy naming apply here as well. I am also starting
to wonder why exactly we need two copies of all this stuff. Perhaps
I've just failed to absorb the idea from having read the patch too
briefly, but I think that we need to make sure that it's super-clear
why we're doing that. If we just needed it for one field because
$REASONS, that would be one thing, but if we need it for all of them
then there must be some underlying principle here that needs a good
explanation in an easy-to-find and centrally located place.

0011 -

+ * Number of top-level transactions that completed in some form since the
+ * start of the server. This currently is solely used to check whether
+ * GetSnapshotData() needs to recompute the contents of the snapshot, or
+ * not. There are likely other users of this.  Always above 1.

Does it only count XID-bearing transactions? If so, best mention that.

+ * transactions completed since the last GetSnapshotData()..

Too many periods.

+ /* Same with CSN */
+ ShmemVariableCache->xactCompletionCount++;

If I didn't know that CSN stood for commit sequence number from
reading years of mailing list traffic, I'd be lost here. So I think
this comment shouldn't use that term.

+GetSnapshotDataFillTooOld(Snapshot snapshot)

Uh... no clue what's going on here. Granted the code had no comments
in the old place either, so I guess it's not worse, but even the name
of the new function is pretty incomprehensible.

+ * Helper function for GetSnapshotData() that check if the bulk of the

checks

+ * the fields that need to change and returns true. false is returned
+ * otherwise.

Otherwise, it returns false.

+ * It is safe to re-enter the snapshot's xmin. This can't cause xmin to go

I know what it means to re-enter a building, but I don't know what it
means to re-enter the snapshot's xmin.

This whole comment seems a bit murky.

...Robert



Re: Improving connection scalability: GetSnapshotData()

Andres Freund
In reply to this post by Robert Haas
Hi,

On 2020-04-07 14:51:52 -0400, Robert Haas wrote:

> On Tue, Apr 7, 2020 at 2:28 PM Andres Freund <[hidden email]> wrote:
> > Does that make some sense? Do you have a better suggestion for a name?
>
> I think it makes sense. I have two basic problems with the name. The
> first is that "on disk" doesn't seem to be a very clear way of
> describing what you're actually checking here, and it definitely
> doesn't refer to an existing concept which sophisticated hackers can
> be expected to understand. The second is that "may" is ambiguous in
> English: it can either mean that something is permissible ("Johnny,
> you may go to the bathroom") or that we do not have certain knowledge
> of it ("Johnny may be in the bathroom"). When it is followed by "be",
> it usually has the latter sense, although there are exceptions (e.g.
> "She may be discharged from the hospital today if she wishes, but we
> recommend that she stay for another day"). Consequently, I found that
> use of "may be" in this context wicked confusing.

Well, it *is* only a vague test :). It shouldn't ever have a false
positive, but there's plenty of chance for false negatives (if wrapped
around far enough).


> So I suggest a name with "Is" or no verb, rather than one with
> "MayBe." And I suggest something else instead of "OnDisk," e.g.
> AssertTransactionIdIsInUsableRange() or
> TransactionIdIsInAllowableRange() or
> AssertTransactionIdWraparoundProtected(). I kind of like that last
> one, but YMMV.

Makes sense - but they all seem to express a bit more certainty than I
think the test actually provides.

I explicitly did not want (and added a comment to that effect) something
like TransactionIdIsInAllowableRange(), because there can never be a
safe use of its return value, as far as I can tell.

The "OnDisk" was intended to clarify that the range it verifies is
whether it'd be ok for the xid to have been found in a relation.

Greetings,

Andres Freund



Re: Improving connection scalability: GetSnapshotData()

Andres Freund
In reply to this post by Robert Haas
Hi,

On 2020-04-07 15:03:46 -0400, Robert Haas wrote:

> On Tue, Apr 7, 2020 at 1:51 PM Andres Freund <[hidden email]> wrote:
> > > ComputedHorizons seems like a fairly generic name. I think there's
> > > some relationship between InvisibleToEveryoneState and
> > > ComputedHorizons that should be brought out more clearly by the naming
> > > and the comments.
> >
> > I don't like the naming of ComputedHorizons, ComputeTransactionHorizons
> > much... But I find it hard to come up with something that's meaningfully
> > better.
>
> It would help to stick XID in there, like ComputedXIDHorizons. What I
> find really baffling is that you seem to have two structures in the
> same file that have essentially the same purpose, but the second one
> (ComputedHorizons) has a lot more stuff in it. I can't understand why.

ComputedHorizons are the various "accurate" horizons computed by
ComputeTransactionHorizons(). That's used to determine a horizon for
vacuuming (via GetOldestVisibleTransactionId()) and other similar use
cases.

The various InvisibleToEveryoneState variables contain the boundary
based horizons, and are updated / initially filled by
GetSnapshotData(). When a tested value falls between the boundaries,
we update the approximate boundaries using
ComputeTransactionHorizons(). That briefly makes the boundaries in
the InvisibleToEveryoneState accurate - but future GetSnapshotData()
calls will increase the definitely_needed_bound (if transactions
committed since).

The ComputedHorizons fields could instead just be pointer based
arguments to ComputeTransactionHorizons(), but that seems clearly
worse.

I'll change ComputedHorizons's comment to say that it's the result of
ComputeTransactionHorizons(), not the "state".


> > What's the inconsistency? The dropped replication_ vs dropped FullXid
> > postfix?
>
> Yeah, just having the member names be randomly different between the
> structs. Really harms greppability.

The long names make it hard to keep line lengths in control, in
particular when also involving the long macro names for TransactionId /
FullTransactionId comparators...


> > > Generally, heap_prune_satisfies_vacuum looks pretty good. The
> > > limited_oldest_committed naming is confusing, but the comments make it
> > > a lot clearer.
> >
> > I didn't like _committed much either. But couldn't come up with
> > something short. _relied_upon?
>
> oldSnapshotLimitUsed or old_snapshot_limit_used, like currentCommandIdUsed?

Will go for old_snapshot_limit_used, and rename the other variables to
old_snapshot_limit_ts, old_snapshot_limit_xmin.

Greetings,

Andres Freund



Re: Improving connection scalability: GetSnapshotData()

Robert Haas
In reply to this post by Andres Freund
On Tue, Apr 7, 2020 at 3:31 PM Andres Freund <[hidden email]> wrote:
> Well, it *is* only a vague test :). It shouldn't ever have a false
> positive, but there's plenty chance for false negatives (if wrapped
> around far enough).

Sure, but I think you get my point. Asserting that something "might
be" true isn't much of an assertion. Saying that it's in the correct
range is not to say there can't be a problem - but we're saying that
it IS in the expected range, not that it may or may not be.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Improving connection scalability: GetSnapshotData()

Robert Haas
In reply to this post by Andres Freund
On Tue, Apr 7, 2020 at 3:24 PM Andres Freund <[hidden email]> wrote:
> > 0007 -
> >
> > + TransactionId xidCopy; /* this backend's xid, a copy of this proc's
> > +    ProcGlobal->xids[] entry. */
> >
> > Can we please NOT put Copy into the name like that? Pretty please?
>
> Do you have a suggested naming scheme? Something indicating that it's
> not the only place that needs to be updated?

I don't think trying to indicate that in the structure member names is
a useful idea. I think you should give them the same names, maybe with
an "s" to pluralize the copy hanging off of ProcGlobal, and put a
comment that says something like:

We keep two copies of each of the following three fields. One copy is
here in the PGPROC, and the other is in a more densely-packed array
hanging off of PGXACT. Both copies of the value must always be updated
at the same time and under the same locks, so that it is always the
case that MyProc->xid == ProcGlobal->xids[MyProc->pgprocno] and
similarly for vacuumFlags and WHATEVER. Note, however, that the arrays
attached to ProcGlobal only contain entries for PGPROC structures that
are currently part of the ProcArray (i.e. there is currently a backend
for that PGPROC). We use those arrays when STUFF and the copies in the
individual PGPROC when THINGS.

> I think it's more on-point here, because we need to hold either of the
> locks* even, for changes to a backend's own status that one reasonably
> could expect would be safe to at least inspect.

It's just too brief and obscure to be useful.

> > + ProcGlobal->xids[pgxactoff] = InvalidTransactionId;
> >
> > Apparently this array is not dense in the sense that it excludes
> > unused slots, but comments elsewhere don't seem to entirely agree.
>
> What do you mean with "unused slots"? Backends that committed?

Backends that have no XID. You mean, I guess, that it is "dense" in
the sense that only live backends are in there, not "dense" in the
sense that only active write transactions are in there. It would be
nice to nail that down better; the wording I suggested above might
help.

> > + uint32 TotalProcs = MaxBackends + NUM_AUXILIARY_PROCS + max_prepared_xacts;
> >
> >   /* ProcGlobal */
> >   size = add_size(size, sizeof(PROC_HDR));
> > - /* MyProcs, including autovacuum workers and launcher */
> > - size = add_size(size, mul_size(MaxBackends, sizeof(PGPROC)));
> > - /* AuxiliaryProcs */
> > - size = add_size(size, mul_size(NUM_AUXILIARY_PROCS, sizeof(PGPROC)));
> > - /* Prepared xacts */
> > - size = add_size(size, mul_size(max_prepared_xacts, sizeof(PGPROC)));
> > - /* ProcStructLock */
> > + size = add_size(size, mul_size(TotalProcs, sizeof(PGPROC)));
> >
> > This seems like a bad idea. If we establish a precedent that it's OK
> > to have sizing routines that don't use add_size() and mul_size(),
> > people are going to cargo cult that into places where there is more
> > risk of overflow than there is here.
>
> Hm. I'm not sure I see the problem. Are you concerned that TotalProcs
> would overflow due to too big MaxBackends or max_prepared_xacts? The
> multiplication itself is still protected by add_size(). It didn't seem
> correct to use add_size for the TotalProcs addition, since that's not
> really a size. And since the limit for procs is much lower than
> UINT32_MAX...

I'm concerned that there are 0 uses of add_size in any shared-memory
sizing function, and I think it's best to keep it that way. If you
initialize TotalProcs = add_size(MaxBackends,
add_size(NUM_AUXILIARY_PROCS, max_prepared_xacts)) then I'm happy. I
think it's a desperately bad idea to imagine that we can dispense with
overflow checks here and be safe. It's just too easy for that to
become false due to future code changes, or get copied to other places
where it's unsafe now.

> > You've got a bunch of different places that talk about the new PGXACT
> > array and they are somewhat redundant yet without saying exactly the
> > same thing every time either. I think that needs cleanup.
>
> Could you point out a few of those comments, I'm not entirely sure which
> you're talking about?

+ /*
+ * Also allocate a separate arrays for data that is frequently (e.g. by
+ * GetSnapshotData()) accessed from outside a backend.  There is one entry
+ * in each for every *live* PGPROC entry, and they are densely packed so
+ * that the first procArray->numProc entries are all valid.  The entries
+ * for a PGPROC in those arrays are at PGPROC->pgxactoff.
+ *
+ * Note that they may not be accessed without ProcArrayLock held! Upon
+ * ProcArrayRemove() later entries will be moved.
+ *
+ * These are separate from the main PGPROC array so that the most heavily
+ * accessed data is stored contiguously in memory in as few cache lines as
+ * possible. This provides significant performance benefits, especially on
+ * a multiprocessor system.
+ */

+ * Arrays with per-backend information that is hotly accessed, indexed by
+ * PGPROC->pgxactoff. These are in separate arrays for three reasons:
+ * First, to allow for as tight loops accessing the data as
+ * possible. Second, to prevent updates of frequently changing data from
+ * invalidating cachelines shared with less frequently changing
+ * data. Third to condense frequently accessed data into as few cachelines
+ * as possible.

+ *
+ * The various *Copy fields are copies of the data in ProcGlobal arrays that
+ * can be accessed without holding ProcArrayLock / XidGenLock (see PROC_HDR
+ * comments).

+ * Adding/Removing an entry into the procarray requires holding *both*
+ * ProcArrayLock and XidGenLock in exclusive mode (in that order). Both are
+ * needed because the dense arrays (see below) are accessed from
+ * GetNewTransactionId() and GetSnapshotData(), and we don't want to add
+ * further contention by both using one lock. Adding/Removing a procarray
+ * entry is much less frequent.

I'm not saying these are all entirely redundant with each other;
that's not so. But I don't think it gives a terribly clear grasp of
the overall picture either, even taking all of them together.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Improving connection scalability: GetSnapshotData()

Andres Freund
In reply to this post by Robert Haas
Hi,

On 2020-04-07 15:26:36 -0400, Robert Haas wrote:
> 0008 -
>
> Here again, I greatly dislike putting Copy in the name. It makes
> little sense to pretend that either is the original and the other is
> the copy. You just have the same data in two places. If one of them is
> more authoritative, the place to explain that is in the comments, not
> by elongating the structure member name and supposing anyone will be
> able to make something of that.

Ok.



> 0009, 0010 -
>
> I think you've got this whole series of things divided up too finely.
> Like, 0005 feels like the meat of it, and that has a bunch of things
> in it that could plausible be separated out as separate commits. 0007
> also seems to do more than one kind of thing (see my comment regarding
> moving some of that into 0006). But whacking everything around like a
> crazy man in 0005 and a little more in 0007 and then doing the
> following cleanup in these little tiny steps seems pretty lame.
> Separating 0009 from 0010 is maybe the clearest example of that, but
> IMHO it's pretty unclear why both of these shouldn't be merged with
> 0008.

I found it a *lot* easier to review / evolve them this way. I e.g. had
an earlier version in which the subxid part of the change worked
substantially differently (it tried to elide the overflowed bool, by
defining -1 as the indicator for overflows), and it'd have been way
harder to change that if I didn't have a patch with *just* the subxid changes.

I'd not push them separated by time, but I do think it'd make sense to
push them as separate commits. I think it's easier to review them in
case of a bug in a separate area.


> My comments on the Copy naming apply here as well. I am also starting
> to wonder why exactly we need two copies of all this stuff. Perhaps
> I've just failed to absorb the idea for having read the patch too
> briefly, but I think that we need to make sure that it's super-clear
> why we're doing that. If we just needed it for one field because
> $REASONS, that would be one thing, but if we need it for all of them
> then there must be some underlying principle here that needs a good
> explanation in an easy-to-find and centrally located place.

The main reason is that we want to be able to cheaply check the current
state of the variables (mostly when checking a backend's own state). We
can't access the "dense" ones without holding a lock, but we e.g. don't
want to make ProcArrayEndTransactionInternal() take a lock just to check
if vacuumFlags is set.

It turns out to also be good for performance to have the copy for
another reason: The "dense" arrays share cachelines with other
backends. That's worth it because it allows GetSnapshotData(), by far
the most frequent operation, to touch fewer cache lines. But it also
means that it's more likely that a backend's "dense" array entry isn't
in a local cpu cache (it'll be pulled out of there when modified in
another backend). In many cases we don't need the shared entry at commit
etc time though, we just need to check if it is set - and most of the
time it won't be.  The local entry allows us to do that cheaply.

Basically it makes sense to access the PGPROC variable when checking a
single backend's data, especially when we have to look at the PGPROC for
other reasons already.  It makes sense to look at the "dense" arrays if
we need to look at many / most entries, because we then benefit from the
reduced indirection and better cross-process cacheability.


> 0011 -
>
> + * Number of top-level transactions that completed in some form since the
> + * start of the server. This currently is solely used to check whether
> + * GetSnapshotData() needs to recompute the contents of the snapshot, or
> + * not. There are likely other users of this.  Always above 1.
>
> Does it only count XID-bearing transactions? If so, best mention that.

Oh, good point.


> +GetSnapshotDataFillTooOld(Snapshot snapshot)
>
> Uh... no clue what's going on here. Granted the code had no comments
> in the old place either, so I guess it's not worse, but even the name
> of the new function is pretty incomprehensible.

It fills the old_snapshot_threshold fields of a Snapshot.


> + * It is safe to re-enter the snapshot's xmin. This can't cause xmin to go
>
> I know what it means to re-enter a building, but I don't know what it
> means to re-enter the snapshot's xmin.

Re-entering it into the procarray, thereby preventing rows that the
snapshot could see from being removed.

> This whole comment seems a bit murky.

How about:
        /*
         * If the current xactCompletionCount is still the same as it was at the
         * time the snapshot was built, we can be sure that rebuilding the
         * contents of the snapshot the hard way would result in the same snapshot
         * contents:
         *
         * As explained in transam/README, the set of xids considered running by
         * GetSnapshotData() cannot change while ProcArrayLock is held. Snapshot
         * contents only depend on transactions with xids and xactCompletionCount
         * is incremented whenever a transaction with an xid finishes (while
         * holding ProcArrayLock) exclusively). Thus the xactCompletionCount check
         * ensures we would detect if the snapshot would have changed.
         *
         * As the snapshot contents are the same as they were before, it is safe
         * to re-enter the snapshot's xmin into the PGPROC array. None of the rows
         * visible under the snapshot could already have been removed (that'd
         * require the set of running transactions to change) and it fulfills the
         * requirement that concurrent GetSnapshotData() calls yield the same
         * xmin.
         */

Greetings,

Andres Freund

