Improving connection scalability: GetSnapshotData()

Improving connection scalability: GetSnapshotData()

Andres Freund
Hi,

I think postgres' issues with scaling to larger numbers of connections
are a serious problem in the field. While poolers can address some of
that, given the issues around prepared statements, transaction state,
etc., I don't think that's sufficient in many cases. It also adds
latency.

Nor do I think the argument that one shouldn't have more than a few
dozen connections holds much water. As clients have think time, and
database results have to be sent/received (most clients don't use
pipelining), and as many applications have many application servers
with individual connection pools, it's very common to need more
connections than postgres can easily deal with.


The largest reason for that is GetSnapshotData(). It scales poorly to
larger connection counts. Part of that is obviously its O(connections)
nature, but I always thought there had to be more to it.  I've seen
production workloads spending > 98% of the cpu time in GetSnapshotData().


After a lot of analysis and experimentation I figured out that the
primary reason for this is PGXACT->xmin. Even the simplest transaction
modifies MyPgXact->xmin several times during its lifetime (IIRC twice
(snapshot & release) for exec_bind_message(), the same for
exec_execute_message(), then again as part of EOXact processing). Which
means that a backend doing GetSnapshotData() on a system with a number
of other active connections is very likely to hit PGXACT cachelines
that are owned by another cpu / set of cpus / socket. The larger the
system is, the worse the consequences of this are.

This problem is most prominent (and harder to fix) for xmin, but it also
exists for the other fields in PGXACT. We rarely have xid, nxids,
overflow, or vacuumFlags set to interesting values, yet we constantly
write them, leading to cross-node traffic.

The second biggest problem is that the indirection through pgprocnos
that GetSnapshotData() has to follow to get to each backend's xmin is
very unfriendly for a pipelined CPU (i.e. every CPU postgres runs on).
There's basically a stall at the end of every loop iteration - which is
exacerbated by there being so many cache misses.
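
To illustrate, this is roughly the shape of the loop on master (a
simplified sketch, not the actual code; locking, subtransactions and
several checks omitted):

    for (index = 0; index < arrayP->numProcs; index++)
    {
        /* chase the indirection: pgprocnos[] -> sparse PGXACT array */
        int         pgprocno = arrayP->pgprocnos[index];
        PGXACT     *pgxact = &allPgXact[pgprocno];
        TransactionId xmin = pgxact->xmin;  /* dependent load, likely a cache miss */
        TransactionId xid = pgxact->xid;

        /* track the oldest xmin, for RecentGlobalXmin */
        if (TransactionIdIsNormal(xmin) &&
            TransactionIdPrecedes(xmin, globalxmin))
            globalxmin = xmin;

        /* collect running xids for the snapshot */
        if (TransactionIdIsNormal(xid) &&
            TransactionIdPrecedes(xid, xmax))
            snapshot->xip[count++] = xid;
    }

Each iteration does dependent loads into memory that other backends
frequently dirty, which is where both the stalls and the cacheline
transfers come from.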


It's fairly easy to avoid unnecessarily dirtying cachelines for all the
PGXACT fields except xmin, because xmin actually needs to be visible to
other backends.


While it sounds almost trivial in hindsight, it took me a long while to
grasp a solution to a big part of this problem: We don't actually need
to look at PGXACT->xmin to compute a snapshot. The only reason
GetSnapshotData() does so is that it also computes
RecentGlobal[Data]Xmin.

But we don't actually need them all that frequently. They're primarily
used as horizons for heap_page_prune_opt() etc. For one, while pruning
is really important, it doesn't happen *all* the time. More importantly,
a RecentGlobalXmin from an earlier transaction is actually sufficient
for most pruning requests, especially when there is a larger percentage
of reading than updating transactions (very common).

By having GetSnapshotData() compute an accurate upper bound after which
we are certain not to be able to prune (basically the transaction's
xmin, slot horizons, etc.), and a conservative lower bound below which
we are definitely able to prune, we can allow some pruning actions to
happen. If a pruning request (or something similar) encounters an xid
between those bounds, an accurate lower bound can be computed.

That allows us to avoid looking at PGXACT->xmin when computing snapshots.
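
In rough pseudo-C, such a visibility test looks something like this (a
sketch of the shape only - the field names and the helper are
illustrative, not the patch's actual implementation; the function name
is the one introduced in patch 0007 below):

    /*
     * Sketch only.  'definitely_needed' is the accurate upper bound (xids at
     * or above it may still be needed by someone), 'maybe_needed' is the
     * conservative lower bound (xids below it are certainly prunable).
     */
    static bool
    InvisibleToEveryoneTestXid(InvisibleToEveryoneState *state, TransactionId xid)
    {
        if (TransactionIdPrecedes(xid, state->maybe_needed))
            return true;        /* below the cheap horizon: prunable */

        if (TransactionIdFollowsOrEquals(xid, state->definitely_needed))
            return false;       /* certainly still needed */

        /* in between: compute an accurate horizon (hypothetical helper), retest */
        ComputeAccurateHorizons(state);
        return TransactionIdPrecedes(xid, state->maybe_needed);
    }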


To address the second big problem (the indirection), we can instead pack
the contents of PGXACT tightly, just like we do for pgprocnos. In the
attached series, I introduced separate arrays for xids, vacuumFlags,
nsubxids.

The reason for splitting them is that they change at different rates
and have different sizes.  In a read-mostly workload, most backends are
not going to have an xid, therefore making the xids array almost
constant. As long as all xids are unassigned, GetSnapshotData() doesn't
need to look at anything else, which makes it sensible to check the
xid first.
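
For illustration, the snapshot xid collection can then look roughly like
this (sketch only, not the patch's exact code; memory ordering details
omitted):

    TransactionId *other_xids = ProcGlobal->xids;  /* dense, one slot per connected backend */

    for (int i = 0; i < numProcs; i++)
    {
        TransactionId xid = other_xids[i];

        /* In a read-mostly workload most slots are InvalidTransactionId. */
        if (xid == InvalidTransactionId)
            continue;

        if (TransactionIdFollowsOrEquals(xid, xmax))
            continue;

        /* only now do we need to look at any other per-backend state */
        snapshot->xip[count++] = xid;
    }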


Here are some numbers for the submitted patch series. I had to cull some
further improvements to make it more manageable, but I think the numbers
are still quite convincing.

The workload is a read-only pgbench, run as pgbench -M prepared -c $conns
-j $conns -S -n for each client count.  This is on a machine with 2
Intel(R) Xeon(R) Platinum 8168 CPUs, but virtualized.

conns   tps master              tps pgxact-split

1       26842.492845            26524.194821
10      246923.158682           249224.782661
50      695956.539704           709833.746374
100     1054727.043139          1903616.306028
200     964795.282957           1949200.338012
300     906029.377539           1927881.231478
400     845696.690912           1911065.369776
500     812295.222497           1926237.255856
600     888030.104213           1903047.236273
700     866896.532490           1886537.202142
800     863407.341506           1883768.592610
900     871386.608563           1874638.012128
1000    887668.277133           1876402.391502
1500    860051.361395           1815103.564241
2000    890900.098657           1775435.271018
3000    874184.980039           1653953.817997
4000    845023.080703           1582582.316043
5000    817100.195728           1512260.802371

I think these are pretty nice results.


Note that the patchset currently does not implement snapshot_too_old;
the rest of the regression tests do pass.


One further cool consequence of the fact that GetSnapshotData()'s
results can be made to depend only on the set of xids in progress is
that caching the results of GetSnapshotData() becomes almost trivial at
that point: We only need to recompute snapshots when a toplevel
transaction commits/aborts.

So we can avoid rebuilding a snapshot when no commit has happened since
it was last built. That amounts to assigning a current 'commit sequence
number' to the snapshot, and checking it against the current number
at the time of the next GetSnapshotData() call. Well, it turns out
there's this "LSN" thing we assign to commits (there are some small
issues with that though).  I've experimented with that, and it
considerably further improves the numbers above: peak throughput is
higher, but more importantly it almost entirely removes the throughput
regression from 2000 connections onwards.
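
A sketch of what that reuse check looks like (this is modeled on the
experimental csn.diff attached to a later message in this thread, not
its final form):

    /*
     * The counter is only bumped on toplevel commit/abort, so an unchanged
     * value means the set of in-progress xids is unchanged and the old
     * snapshot is still valid; the xidCopy check conservatively skips the
     * optimization when this backend has an xid of its own.
     */
    if (snapshot->csn != 0 &&
        MyProc->xidCopy == InvalidTransactionId &&
        UINT64_ACCESS_ONCE(ShmemVariableCache->csn) == snapshot->csn)
        return snapshot;

    /* ... otherwise rebuild the snapshot under ProcArrayLock and remember
     * the counter value it was built at in snapshot->csn ... */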

I'm still working on cleaning that part of the patch up, I'll post it in
a bit.


The series currently consists of:

0001-0005: Fixes and assert improvements that are independent of the patch, but
           are hit by the new code (see also separate thread).

0006: Move delayChkpt from PGXACT to PGPROC; it's rarely checked & frequently modified

0007: WIP: Introduce abstraction layer for "is tuple invisible" tests.

      This is the most crucial piece. Instead of code directly using
      RecentOldestXmin, there's a new set of functions for testing
      whether an xid is visible (InvisibleToEveryoneTestXid() et al).

      Those functions use new horizon boundaries computed as part of
      GetSnapshotData(), and recompute an accurate boundary when the
      tested xid falls in between.

      There's a bit more infrastructure needed - we need to limit how
      often an accurate snapshot is computed. Probably to once per
      snapshot? Or once per transaction?


      To avoid issues with the lower boundary getting too old and
      presenting a wraparound danger, I made all the xids be
      FullTransactionIds. That imo is a good thing?


      This patch currently breaks old_snapshot_threshold, as I've not
      yet integrated it with the new functions. I think we can make the
      old snapshot stuff a lot more precise with this - instead of
      always triggering conflicts when a RecentGlobalXmin is too old, we
      can do so only in the cases we actually remove a row. I ran out of
      energy threading that through the heap_page_prune and
      HeapTupleSatisfiesVacuum.


0008: Move PGXACT->xmin back to PGPROC.

      Now that GetSnapshotData() doesn't access xmin anymore, we can
      make it a normal field in PGPROC again.


0009: Improve GetSnapshotData() performance by avoiding indirection for xid access.
0010: Improve GetSnapshotData() performance by avoiding indirection for vacuumFlags
0011: Improve GetSnapshotData() performance by avoiding indirection for nsubxids access

      These successively move the remaining PGXACT fields into separate
      arrays in ProcGlobal, and adjust GetSnapshotData() to take
      advantage.  Those arrays are dense in the sense that they only
      contain data for PGPROCs that are in use (i.e. when a backend
      disconnects, the array entries are moved around).

      I think xid and vacuumFlags are pretty reasonable, but need
      cleanup, obviously:
      - The biggest cleanup would be to add a few helper functions for
        accessing the values, rather than open coding that.
      - Perhaps we should call the ProcGlobal ones 'cached', and name
        the PGPROC ones as the one true source of truth?

      For subxid I thought it'd be nice to have nxids and overflow be
      only one number. But that probably was the wrong call? Now
      TransactionIdInProgress() cannot look at the subxids that did
      fit in PGPROC.subxid. I'm not sure that's important, given the
      likelihood of misses?  But I'd probably still have the subxid
      array be one of {uint8 nsubxids; bool overflowed} instead.


      To keep the arrays dense they copy the logic for pgprocnos. Which
      means that ProcArrayAdd/Remove move things around. Unfortunately
      that requires holding both ProcArrayLock and XidGenLock currently
      (to avoid GetNewTransactionId() having to hold ProcArrayLock). But
      that doesn't seem too bad?
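
      For illustration, the removal side then does something along these
      lines (a sketch only; the actual patches differ in details):

          /*
           * Sketch of keeping the dense arrays in sync when a backend leaves.
           * Both locks are held exclusively so that GetNewTransactionId(),
           * which only holds XidGenLock while setting its xid slot, can't
           * race with the entries being moved.
           */
          Assert(LWLockHeldByMeInMode(ProcArrayLock, LW_EXCLUSIVE));
          Assert(LWLockHeldByMeInMode(XidGenLock, LW_EXCLUSIVE));

          for (index = 0; index < arrayP->numProcs; index++)
          {
              if (arrayP->pgprocnos[index] == proc->pgprocno)
              {
                  int nmove = arrayP->numProcs - index - 1;

                  memmove(&arrayP->pgprocnos[index], &arrayP->pgprocnos[index + 1],
                          nmove * sizeof(int));
                  memmove(&ProcGlobal->xids[index], &ProcGlobal->xids[index + 1],
                          nmove * sizeof(TransactionId));
                  /* ... same for the vacuumFlags and subxid state arrays ... */

                  arrayP->numProcs--;
                  break;
              }
          }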


0012: Remove now unused PGXACT.

      There's no reason to have it anymore.

The patch series is also available at
https://github.com/anarazel/postgres/tree/pgxact-split

Greetings,

Andres Freund

Attachments: attachment0 (1K), v1-0002-TMP-work-around-missing-snapshot-registrations.patch.gz (3K), attachment2 (688 bytes), attachment3 (3K), attachment4 (1K), attachment5 (3K), attachment6 (12K), v1-0008-Move-PGXACT-xmin-back-to-PGPROC.patch.gz (5K), attachment8 (12K), attachment9 (6K), attachment10 (6K), v1-0012-Remove-now-unused-PGXACT.patch.gz (4K)

Re: Improving connection scalability: GetSnapshotData()

Andres Freund
Hi,

On 2020-03-01 00:36:01 -0800, Andres Freund wrote:

> conns   tps master tps pgxact-split
>
> 1       26842.492845            26524.194821
> 10      246923.158682           249224.782661
> 50      695956.539704           709833.746374
> 100     1054727.043139          1903616.306028
> 200     964795.282957           1949200.338012
> 300     906029.377539           1927881.231478
> 400     845696.690912           1911065.369776
> 500     812295.222497           1926237.255856
> 600     888030.104213           1903047.236273
> 700     866896.532490           1886537.202142
> 800     863407.341506           1883768.592610
> 900     871386.608563           1874638.012128
> 1000    887668.277133           1876402.391502
> 1500    860051.361395           1815103.564241
> 2000    890900.098657           1775435.271018
> 3000    874184.980039           1653953.817997
> 4000    845023.080703           1582582.316043
> 5000    817100.195728           1512260.802371
>
> I think these are pretty nice results.
Attached as a graph as well.

Attachment: connection-scalability-improvements.png (32K)

Re: Improving connection scalability: GetSnapshotData()

akapila
On Sun, Mar 1, 2020 at 2:17 PM Andres Freund <[hidden email]> wrote:

>
> Hi,
>
> On 2020-03-01 00:36:01 -0800, Andres Freund wrote:
> > conns   tps master            tps pgxact-split
> >
> > 1       26842.492845            26524.194821
> > 10      246923.158682           249224.782661
> > 50      695956.539704           709833.746374
> > 100     1054727.043139          1903616.306028
> > 200     964795.282957           1949200.338012
> > 300     906029.377539           1927881.231478
> > 400     845696.690912           1911065.369776
> > 500     812295.222497           1926237.255856
> > 600     888030.104213           1903047.236273
> > 700     866896.532490           1886537.202142
> > 800     863407.341506           1883768.592610
> > 900     871386.608563           1874638.012128
> > 1000    887668.277133           1876402.391502
> > 1500    860051.361395           1815103.564241
> > 2000    890900.098657           1775435.271018
> > 3000    874184.980039           1653953.817997
> > 4000    845023.080703           1582582.316043
> > 5000    817100.195728           1512260.802371
> >
> > I think these are pretty nice results.
>

Nice improvement.  +1 for improving the scalability for higher connection counts.


--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Improving connection scalability: GetSnapshotData()

Andres Freund
In reply to this post by Andres Freund
Hi,

On 2020-03-01 00:36:01 -0800, Andres Freund wrote:

> Here are some numbers for the submitted patch series. I'd to cull some
> further improvements to make it more manageable, but I think the numbers
> still are quite convincing.
>
> The workload is a pgbench readonly, with pgbench -M prepared -c $conns
> -j $conns -S -n for each client count.  This is on a machine with 2
> Intel(R) Xeon(R) Platinum 8168, but virtualized.
>
> conns   tps master tps pgxact-split
>
> 1       26842.492845            26524.194821
> 10      246923.158682           249224.782661
> 50      695956.539704           709833.746374
> 100     1054727.043139          1903616.306028
> 200     964795.282957           1949200.338012
> 300     906029.377539           1927881.231478
> 400     845696.690912           1911065.369776
> 500     812295.222497           1926237.255856
> 600     888030.104213           1903047.236273
> 700     866896.532490           1886537.202142
> 800     863407.341506           1883768.592610
> 900     871386.608563           1874638.012128
> 1000    887668.277133           1876402.391502
> 1500    860051.361395           1815103.564241
> 2000    890900.098657           1775435.271018
> 3000    874184.980039           1653953.817997
> 4000    845023.080703           1582582.316043
> 5000    817100.195728           1512260.802371
>
> I think these are pretty nice results.

> One further cool recognition of the fact that GetSnapshotData()'s
> results can be made to only depend on the set of xids in progress, is
> that caching the results of GetSnapshotData() is almost trivial at that
> point: We only need to recompute snapshots when a toplevel transaction
> commits/aborts.
>
> So we can avoid rebuilding snapshots when no commt has happened since it
> was last built. Which amounts to assigning a current 'commit sequence
> number' to the snapshot, and checking that against the current number
> at the time of the next GetSnapshotData() call. Well, turns out there's
> this "LSN" thing we assign to commits (there are some small issues with
> that though).  I've experimented with that, and it considerably further
> improves the numbers above. Both with a higher peak throughput, but more
> importantly it almost entirely removes the throughput regression from
> 2000 connections onwards.
>
> I'm still working on cleaning that part of the patch up, I'll post it in
> a bit.
I triggered a longer run on the same hardware, which also includes
numbers for the caching patch.

nclients master pgxact-split pgxact-split-cache
1 29742.805074 29086.874404 28120.709885
2 58653.005921 56610.432919 57343.937924
3 116580.383993 115102.94057 117512.656103
4 150821.023662 154130.354635 152053.714824
5 186679.754357 189585.156519 191095.841847
6 219013.756252 223053.409306 224480.026711
7 256861.673892 256709.57311 262427.179555
8 291495.547691 294311.524297 296245.219028
9 332835.641015 333223.666809 335460.280487
10 367883.74842 373562.206447 375682.894433
15 561008.204553 578601.577916 587542.061911
20 748000.911053 794048.140682 810964.700467
25 904581.660543 1037279.089703 1043615.577083
30 999231.007768 1251113.123461 1288276.726489
35 1001274.289847 1438640.653822 1438508.432425
40 991672.445199 1518100.079695 1573310.171868
45 994427.395069 1575758.31948 1649264.339117
50 1017561.371878 1654776.716703 1715762.303282
60 993943.210188 1720318.989894 1789698.632656
70 971379.995255 1729836.303817 1819477.25356
80 966276.137538 1744019.347399 1842248.57152
90 901175.211649 1768907.069263 1847823.970726
100 803175.74326 1784636.397822 1865795.782943
125 664438.039582 1806275.514545 1870983.64688
150 623562.201749 1796229.009658 1876529.428419
175 680683.150597 1809321.487338 1910694.40987
200 668413.988251 1833457.942035 1878391.674828
225 682786.299485 1816577.462613 1884587.77743
250 727308.562076 1825796.324814 1864692.025853
275 676295.999761 1843098.107926 1908698.584573
300 698831.398432 1832068.168744 1892735.290045
400 661534.639489 1859641.983234 1898606.247281
500 645149.788352 1851124.475202 1888589.134422
600 740636.323211 1875152.669115 1880653.747185
700 858645.363292 1833527.505826 1874627.969414
800 858287.957814 1841914.668668 1892106.319085
900 882204.933544 1850998.221969 1868260.041595
1000 910988.551206 1836336.091652 1862945.18557
1500 917727.92827 1808822.338465 1864150.00307
2000 982137.053108 1813070.209217 1877104.342864
3000 1013514.639108 1753026.733843 1870416.924248
4000 1025476.80688 1600598.543635 1859908.314496
5000 1019889.160511 1534501.389169 1870132.571895
7500 968558.864242 1352137.828569 1853825.376742
10000 887558.112017 1198321.352461 1867384.381886
15000 687766.593628 950788.434914 1710509.977169

The odd dip for master between 90 and 700 connections looks like it's
not directly related to GetSnapshotData(). It looks like it's related to
the linux scheduler and virtualization. When a pgbench thread and a
postgres backend need to swap who gets executed, and both are on
different CPUs, the wakeup is more expensive when the target CPU is idle
or isn't going to reschedule soon. In the expensive path an
inter-processor interrupt (IPI) gets triggered, which requires exiting
the VM (which is really expensive on azure, apparently).  I can
trigger similar behaviour for the other runs by renicing, albeit on a
slightly smaller scale.

I'll try to find a larger system that's not virtualized :/.

Greetings,

Andres Freund

Attachments: connection-scalability-improvements-2.png (24K), csn.diff (5K)

Re: Improving connection scalability: GetSnapshotData()

David Steele
In reply to this post by Andres Freund
On 3/1/20 3:36 AM, Andres Freund wrote:
>
> I think these are pretty nice results.

Indeed they are.

Is the target version PG13 or PG14?  It seems like a pretty big patch to
go in the last commitfest for PG13.

Regards,
--
-David
[hidden email]



Re: Improving connection scalability: GetSnapshotData()

David Rowley
In reply to this post by Andres Freund
Hi,

Nice performance gains.

On Sun, 1 Mar 2020 at 21:36, Andres Freund <[hidden email]> wrote:

> The series currently consists out of:
>
> 0001-0005: Fixes and assert improvements that are independent of the patch, but
>            are hit by the new code (see also separate thread).
>
> 0006: Move delayChkpt from PGXACT to PGPROC it's rarely checked & frequently modified
>
> 0007: WIP: Introduce abstraction layer for "is tuple invisible" tests.
>
>       This is the most crucial piece. Instead of code directly using
>       RecentOldestXmin, there's a new set of functions for testing
>       whether an xid is visible (InvisibleToEveryoneTestXid() et al).
>
>       Those function use new horizon boundaries computed as part of
>       GetSnapshotData(), and recompute an accurate boundary when the
>       tested xid falls inbetween.
>
>       There's a bit more infrastructure needed - we need to limit how
>       often an accurate snapshot is computed. Probably to once per
>       snapshot? Or once per transaction?
>
>
>       To avoid issues with the lower boundary getting too old and
>       presenting a wraparound danger, I made all the xids be
>       FullTransactionIds. That imo is a good thing?
>
>
>       This patch currently breaks old_snapshot_threshold, as I've not
>       yet integrated it with the new functions. I think we can make the
>       old snapshot stuff a lot more precise with this - instead of
>       always triggering conflicts when a RecentGlobalXmin is too old, we
>       can do so only in the cases we actually remove a row. I ran out of
>       energy threading that through the heap_page_prune and
>       HeapTupleSatisfiesVacuum.
>
>
> 0008: Move PGXACT->xmin back to PGPROC.
>
>       Now that GetSnapshotData() doesn't access xmin anymore, we can
>       make it a normal field in PGPROC again.
>
>
> 0009: Improve GetSnapshotData() performance by avoiding indirection for xid access.

I've only looked at 0001-0009 so far. I'm not quite the expert in this
area, so the review feels a bit superficial.  Here's what I noted down
during my pass.

0001

1. cant't -> can't

* snapshot cant't change in the midst of a relcache build, so there's no

0002

2. I don't quite understand your change in
UpdateSubscriptionRelState(). snap seems unused. Drilling down into
SearchSysCacheCopy2, in SearchCatCacheMiss() the systable_beginscan()
passes a NULL snapshot.

the whole patch does this. I guess I don't understand why 0002 does this.

0004

3. This comment seems to have the line order swapped in bt_check_every_level

/*
* RecentGlobalXmin/B-Tree page deletion.
* This assertion matches the one in index_getnext_tid().  See note on
*/
Assert(SnapshotSet());

0006

4. Did you consider the location of 'delayChkpt' in PGPROC? Couldn't
you slot it in somewhere it would fit in existing padding?

0007

5. GinPageIsRecyclable() has no comments at all. I know that
ginvacuum.c is not exactly the model citizen for function header
comments, but likely this patch is no good reason to continue the
trend.

6. The comment rearrangement in bt_check_every_level should be in the
0004 patch.

7. struct InvisibleToEveryoneState could do with some comments
explaining the fields.

8. The header comment in GetOldestXminInt needs to be updated. It
talks about "if rel = NULL and there are no transactions", but there's
no parameter by that name now. Maybe the whole comment should be moved
down to the external implementation of the function.

9. I get the idea you don't intend to keep the debug message in
InvisibleToEveryoneTestFullXid(), but if you do, then shouldn't it be
using UINT64_FORMAT?

10. teh -> the

* which is based on teh value computed when getting the current snapshot.

11. InvisibleToEveryoneCheckXid and InvisibleToEveryoneCheckFullXid
seem to have their extern modifiers in the .c file.

0009

12. iare -> are

* These iare separate from the main PGPROC array so that the most heavily

13. is -> are

* accessed data is stored contiguously in memory in as few cache lines as

14. It doesn't seem to quite make sense to talk about "this proc" in:

/*
* TransactionId of top-level transaction currently being executed by this
* proc, if running and XID is assigned; else InvalidTransactionId.
*
* Each PGPROC has a copy of its value in PGPROC.xidCopy.
*/
TransactionId *xids;

maybe "this" can be replaced with "each"

I will try to continue with the remaining patches soon. However, it
would be good to get a more complete patchset. I feel there are quite
a few XXX comments remaining for things you need to think about later,
and ... it's getting late.



Re: Improving connection scalability: GetSnapshotData()

Thomas Munro-5
In reply to this post by Andres Freund
On Sun, Mar 1, 2020 at 9:36 PM Andres Freund <[hidden email]> wrote:

> conns   tps master              tps pgxact-split
>
> 1       26842.492845            26524.194821
> 10      246923.158682           249224.782661
> 50      695956.539704           709833.746374
> 100     1054727.043139          1903616.306028
> 200     964795.282957           1949200.338012
> 300     906029.377539           1927881.231478
> 400     845696.690912           1911065.369776
> 500     812295.222497           1926237.255856
> 600     888030.104213           1903047.236273
> 700     866896.532490           1886537.202142
> 800     863407.341506           1883768.592610
> 900     871386.608563           1874638.012128
> 1000    887668.277133           1876402.391502
> 1500    860051.361395           1815103.564241
> 2000    890900.098657           1775435.271018
> 3000    874184.980039           1653953.817997
> 4000    845023.080703           1582582.316043
> 5000    817100.195728           1512260.802371

This will clearly be really big news for lots of PostgreSQL users.

> One further cool recognition of the fact that GetSnapshotData()'s
> results can be made to only depend on the set of xids in progress, is
> that caching the results of GetSnapshotData() is almost trivial at that
> point: We only need to recompute snapshots when a toplevel transaction
> commits/aborts.
>
> So we can avoid rebuilding snapshots when no commt has happened since it
> was last built. Which amounts to assigning a current 'commit sequence
> number' to the snapshot, and checking that against the current number
> at the time of the next GetSnapshotData() call. Well, turns out there's
> this "LSN" thing we assign to commits (there are some small issues with
> that though).  I've experimented with that, and it considerably further
> improves the numbers above. Both with a higher peak throughput, but more
> importantly it almost entirely removes the throughput regression from
> 2000 connections onwards.
>
> I'm still working on cleaning that part of the patch up, I'll post it in
> a bit.

I looked at that part on your public pgxact-split branch.  In that
version you used "CSN" rather than something based on LSNs, which I
assume avoids complications relating to WAL locking or something like
that.  We should probably be careful to avoid confusion with the
pre-existing use of the term "commit sequence number" (CommitSeqNo,
CSN) that appears in predicate.c.  This also calls to mind the
2013-2016 work by Ants Aasma and others[1] on CSN-based snapshots,
which is obviously a much more radical change, but really means what
it says (commits).  The CSN in your patch set is used purely as a
level-change for snapshot cache invalidation IIUC, and it advances
also for aborts -- so maybe it should be called something like
completed_xact_count, using existing terminology from procarray.c.

+       if (snapshot->csn != 0 && MyProc->xidCopy == InvalidTransactionId &&
+               UINT64_ACCESS_ONCE(ShmemVariableCache->csn) == snapshot->csn)

Why is it OK to read ShmemVariableCache->csn without at least a read
barrier?  I suppose this allows a cached snapshot to be used very soon
after a transaction commits and should be visible to you, but ...
hmmmrkwjherkjhg... I guess it must be really hard to observe any
anomaly.  Let's see... maybe it's possible on a relaxed memory system
like POWER or ARM, if you use a shm flag to say "hey I just committed
a transaction", and the other guy sees the flag but can't yet see the
new CSN, so an SPI query can't see the transaction?

Another theoretical problem is the non-atomic read of a uint64 on some
32 bit platforms.

> 0007: WIP: Introduce abstraction layer for "is tuple invisible" tests.
>
>       This is the most crucial piece. Instead of code directly using
>       RecentOldestXmin, there's a new set of functions for testing
>       whether an xid is visible (InvisibleToEveryoneTestXid() et al).
>
>       Those function use new horizon boundaries computed as part of
>       GetSnapshotData(), and recompute an accurate boundary when the
>       tested xid falls inbetween.
>
>       There's a bit more infrastructure needed - we need to limit how
>       often an accurate snapshot is computed. Probably to once per
>       snapshot? Or once per transaction?
>
>
>       To avoid issues with the lower boundary getting too old and
>       presenting a wraparound danger, I made all the xids be
>       FullTransactionIds. That imo is a good thing?

+1, as long as we don't just move the wraparound danger to the places
where we convert xids to fxids!

+/*
+ * Be very careful about when to use this function. It can only safely be used
+ * when there is a guarantee that, at the time of the call, xid is within 2
+ * billion xids of rel. That e.g. can be guaranteed if the the caller assures
+ * a snapshot is held by the backend, and xid is from a table (where
+ * vacuum/freezing ensures the xid has to be within that range).
+ */
+static inline FullTransactionId
+FullXidViaRelative(FullTransactionId rel, TransactionId xid)
+{
+       uint32 rel_epoch = EpochFromFullTransactionId(rel);
+       TransactionId rel_xid = XidFromFullTransactionId(rel);
+       uint32 epoch;
+
+       /*
+        * TODO: A function to easily write an assertion ensuring that xid is
+        * between [oldestXid, nextFullXid) woudl be useful here, and in plenty
+        * other places.
+        */
+
+       if (xid > rel_xid)
+               epoch = rel_epoch - 1;
+       else
+               epoch = rel_epoch;
+
+       return FullTransactionIdFromEpochAndXid(epoch, xid);
+}

I hate it, but I don't have a better concrete suggestion right now.
Whatever I come up with amounts to the same thing on some level,
though I feel like it might be better to use an infrequently updated
oldestFxid as the lower bound in a conversion.  An upper bound would
also seem better, though requires much trickier interlocking.  What
you have means "it's near here!"... isn't that too prone to bugs that
are hidden because of the ambient fuzziness?  A lower bound seems like
it could move extremely infrequently and therefore it'd be OK for it
to be protected by both proc array and xid gen locks (ie it'd be
recomputed when nextFxid needs to move too far ahead of it, so every
~2 billion xacts).  I haven't looked at this long enough to have a
strong opinion, though.

On a more constructive note:

GetOldestXminInt() does:

        LWLockAcquire(ProcArrayLock, LW_SHARED);

+       nextfxid = ShmemVariableCache->nextFullXid;
+
...
        LWLockRelease(ProcArrayLock);
...
+       return FullXidViaRelative(nextfxid, result);

But nextFullXid is protected by XidGenLock; maybe that's OK from a
data freshness point of view (I'm not sure), but from an atomicity
point of view, you can't do that can you?

>       This patch currently breaks old_snapshot_threshold, as I've not
>       yet integrated it with the new functions. I think we can make the
>       old snapshot stuff a lot more precise with this - instead of
>       always triggering conflicts when a RecentGlobalXmin is too old, we
>       can do so only in the cases we actually remove a row. I ran out of
>       energy threading that through the heap_page_prune and
>       HeapTupleSatisfiesVacuum.

CCing Kevin as an FYI.

> 0008: Move PGXACT->xmin back to PGPROC.
>
>       Now that GetSnapshotData() doesn't access xmin anymore, we can
>       make it a normal field in PGPROC again.
>
>
> 0009: Improve GetSnapshotData() performance by avoiding indirection for xid access.
> 0010: Improve GetSnapshotData() performance by avoiding indirection for vacuumFlags
> 0011: Improve GetSnapshotData() performance by avoiding indirection for nsubxids access
>
>       These successively move the remaining PGXACT fields into separate
>       arrays in ProcGlobal, and adjust GetSnapshotData() to take
>       advantage.  Those arrays are dense in the sense that they only
>       contain data for PGPROCs that are in use (i.e. when disconnecting,
>       the array is moved around)..
>
>       I think xid, and vacuumFlags are pretty reasonable. But need
>       cleanup, obviously:
>       - The biggest cleanup would be to add a few helper functions for
>         accessing the values, rather than open coding that.
>       - Perhaps we should call the ProcGlobal ones 'cached', and name
>         the PGPROC ones as the one true source of truth?
>
>       For subxid I thought it'd be nice to have nxids and overflow be
>       only one number. But that probably was the wrong call? Now
>       TransactionIdInProgress() cannot look at at the subxids that did
>       fit in PGPROC.subxid. I'm not sure that's important, given the
>       likelihood of misses?  But I'd probably still have the subxid
>       array be one of {uint8 nsubxids; bool overflowed} instead.
>
>
>       To keep the arrays dense they copy the logic for pgprocnos. Which
>       means that ProcArrayAdd/Remove move things around. Unfortunately
>       that requires holding both ProcArrayLock and XidGenLock currently
>       (to avoid GetNewTransactionId() having to hold ProcArrayLock). But
>       that doesn't seem too bad?

In the places where you now acquire both, I guess you also need to
release both in the error path?

[1] https://www.postgresql.org/message-id/flat/CA%2BCSw_tEpJ%3Dmd1zgxPkjH6CWDnTDft4gBi%3D%2BP9SnoC%2BWy3pKdA%40mail.gmail.com



Re: Improving connection scalability: GetSnapshotData()

Andres Freund
Hi,

Thanks for looking!

On 2020-03-20 18:23:03 +1300, Thomas Munro wrote:
> On Sun, Mar 1, 2020 at 9:36 PM Andres Freund <[hidden email]> wrote:
> > I'm still working on cleaning that part of the patch up, I'll post it in
> > a bit.
>
> I looked at that part on your public pgxact-split branch.  In that
> version you used "CSN" rather than something based on LSNs, which I
> assume avoids complications relating to WAL locking or something like
> that.

Right, I first tried to use LSNs, but after further tinkering found that
it's too hard to address the difference between visibility order and LSN
order. I don't think there's an easy way to address the difference.


> We should probably be careful to avoid confusion with the
> pre-existing use of the term "commit sequence number" (CommitSeqNo,
> CSN) that appears in predicate.c.

I looked at that after you mentioned it on IM. But I find it hard to
grok how it's precisely defined. There are basically no comments
explaining what it's really supposed to do, and I find the relevant code
far from easy to grok :(.


> This also calls to mind the 2013-2016 work by Ants Aasma and others[1]
> on CSN-based snapshots, which is obviously a much more radical change,
> but really means what it says (commits).

Well, I think you could actually build some form of more dense snapshots
on top of "my" CSN, with a bit of effort (and a lot of handwaving). I
don't think they're that different concepts.


> The CSN in your patch set is used purely as a level-change for
> snapshot cache invalidation IIUC, and it advances also for aborts --
> so maybe it should be called something like completed_xact_count,
> using existing terminology from procarray.c.

I expect it to be used outside of snapshots too, in the future, FWIW.

completed_xact_count sounds good to me.


> +       if (snapshot->csn != 0 && MyProc->xidCopy == InvalidTransactionId &&
> +               UINT64_ACCESS_ONCE(ShmemVariableCache->csn) == snapshot->csn)
>
> Why is it OK to read ShmemVariableCache->csn without at least a read
> barrier?  I suppose this allows a cached snapshot to be used very soon
> after a transaction commits and should be visible to you, but ...
> hmmmrkwjherkjhg... I guess it must be really hard to observe any
> anomaly.  Let's see... maybe it's possible on a relaxed memory system
> like POWER or ARM, if you use a shm flag to say "hey I just committed
> a transaction", and the other guy sees the flag but can't yet see the
> new CSN, so an SPI query can't see the transaction?

Yea, it does need more thought / comments. I can't really see an actual
correctness violation though. As far as I can tell you'd never be able
to get an "older" ShmemVariableCache->csn than one from *after* the
last lock acquired/released by the current backend - which then also
means a different "ordering" would have been possible, allowing the
current backend to take the snapshot earlier.


> Another theoretical problem is the non-atomic read of a uint64 on some
> 32 bit platforms.

Yea, it probably should be a pg_atomic_uint64 to address that. I don't
think it really would cause problems, because a torn read would just end
up causing an unnecessary snapshot build. But there's no need to go
there.
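
For illustration, the pg_atomic_uint64 version would look something like
this (sketch only; the field name and the xidCopy check mirror the
experimental branch, everything else is illustrative):

    /* in VariableCacheData (shared memory): */
    pg_atomic_uint64 csn;       /* bumped on each toplevel commit/abort */

    /* writer side, at toplevel commit/abort: */
    pg_atomic_fetch_add_u64(&ShmemVariableCache->csn, 1);

    /* reader side, in GetSnapshotData(); atomic read avoids torn reads on
     * 32 bit platforms: */
    if (snapshot->csn != 0 &&
        MyProc->xidCopy == InvalidTransactionId &&
        pg_atomic_read_u64(&ShmemVariableCache->csn) == snapshot->csn)
        return snapshot;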


> > 0007: WIP: Introduce abstraction layer for "is tuple invisible" tests.
> >
> >       This is the most crucial piece. Instead of code directly using
> >       RecentOldestXmin, there's a new set of functions for testing
> >       whether an xid is visible (InvisibleToEveryoneTestXid() et al).
> >
> >       Those function use new horizon boundaries computed as part of
> >       GetSnapshotData(), and recompute an accurate boundary when the
> >       tested xid falls inbetween.
> >
> >       There's a bit more infrastructure needed - we need to limit how
> >       often an accurate snapshot is computed. Probably to once per
> >       snapshot? Or once per transaction?
> >
> >
> >       To avoid issues with the lower boundary getting too old and
> >       presenting a wraparound danger, I made all the xids be
> >       FullTransactionIds. That imo is a good thing?
>
> +1, as long as we don't just move the wraparound danger to the places
> where we convert xids to fxids!
>
> +/*
> + * Be very careful about when to use this function. It can only safely be used
> + * when there is a guarantee that, at the time of the call, xid is within 2
> + * billion xids of rel. That e.g. can be guaranteed if the the caller assures
> + * a snapshot is held by the backend, and xid is from a table (where
> + * vacuum/freezing ensures the xid has to be within that range).
> + */
> +static inline FullTransactionId
> +FullXidViaRelative(FullTransactionId rel, TransactionId xid)
> +{
> +       uint32 rel_epoch = EpochFromFullTransactionId(rel);
> +       TransactionId rel_xid = XidFromFullTransactionId(rel);
> +       uint32 epoch;
> +
> +       /*
> +        * TODO: A function to easily write an assertion ensuring that xid is
> +        * between [oldestXid, nextFullXid) woudl be useful here, and in plenty
> +        * other places.
> +        */
> +
> +       if (xid > rel_xid)
> +               epoch = rel_epoch - 1;
> +       else
> +               epoch = rel_epoch;
> +
> +       return FullTransactionIdFromEpochAndXid(epoch, xid);
> +}
>
> I hate it, but I don't have a better concrete suggestion right now.
> Whatever I come up with amounts to the same thing on some level,
> though I feel like it might be better to used an infrequently updated
> oldestFxid as the lower bound in a conversion.

I am not sure it's as clearly correct to use oldestFxid in as many
cases. Normally PGPROC->xmin (PGXACT->xmin currently) should prevent the
"system wide" xid horizon from moving too far relative to that, but I
think there are more plausible problems with the "oldest" xid horizon
moving concurrently with a backend inspecting values.

It shouldn't be a problem here since the values are taken under a lock
preventing both from being moved I think, and since we're only comparing
those two values without taking anything else into account, the "global"
horizon changing concurrently wouldn't matter.

But it seems easier to understand the correctness when comparing to
nextXid?

What's the benefit of looking at an "infrequently updated" value
instead? I guess you can argue that it'd be more likely to be in cache,
but since all of this lives in a single cacheline...


> An upper bound would also seem better, though requires much trickier
> interlocking.  What you have means "it's near here!"... isn't that too
> prone to bugs that are hidden because of the ambient fuzziness?

I can't follow the last sentence. Could you expand?


> On a more constructive note:
>
> GetOldestXminInt() does:
>
>         LWLockAcquire(ProcArrayLock, LW_SHARED);
>
> +       nextfxid = ShmemVariableCache->nextFullXid;
> +
> ...
>         LWLockRelease(ProcArrayLock);
> ...
> +       return FullXidViaRelative(nextfxid, result);
>
> But nextFullXid is protected by XidGenLock; maybe that's OK from a
> data freshness point of view (I'm not sure), but from an atomicity
> point of view, you can't do that can you?

Hm. Yea, I think it's not safe against torn 64bit reads, you're right.


> >       This patch currently breaks old_snapshot_threshold, as I've not
> >       yet integrated it with the new functions. I think we can make the
> >       old snapshot stuff a lot more precise with this - instead of
> >       always triggering conflicts when a RecentGlobalXmin is too old, we
> >       can do so only in the cases we actually remove a row. I ran out of
> >       energy threading that through the heap_page_prune and
> >       HeapTupleSatisfiesVacuum.
>
> CCing Kevin as an FYI.

If anybody has an opinion on this sketch I'd be interested. I've started
to implement it, so ...


> > 0008: Move PGXACT->xmin back to PGPROC.
> >
> >       Now that GetSnapshotData() doesn't access xmin anymore, we can
> >       make it a normal field in PGPROC again.
> >
> >
> > 0009: Improve GetSnapshotData() performance by avoiding indirection for xid access.
> > 0010: Improve GetSnapshotData() performance by avoiding indirection for vacuumFlags
> > 0011: Improve GetSnapshotData() performance by avoiding indirection for nsubxids access
> >
> >       These successively move the remaining PGXACT fields into separate
> >       arrays in ProcGlobal, and adjust GetSnapshotData() to take
> >       advantage.  Those arrays are dense in the sense that they only
> >       contain data for PGPROCs that are in use (i.e. when disconnecting,
> >       the array is moved around)..
> >
> >       I think xid, and vacuumFlags are pretty reasonable. But need
> >       cleanup, obviously:
> >       - The biggest cleanup would be to add a few helper functions for
> >         accessing the values, rather than open coding that.
> >       - Perhaps we should call the ProcGlobal ones 'cached', and name
> >         the PGPROC ones as the one true source of truth?
> >
> >       For subxid I thought it'd be nice to have nxids and overflow be
> >       only one number. But that probably was the wrong call? Now
> >       TransactionIdInProgress() cannot look at at the subxids that did
> >       fit in PGPROC.subxid. I'm not sure that's important, given the
> >       likelihood of misses?  But I'd probably still have the subxid
> >       array be one of {uint8 nsubxids; bool overflowed} instead.
> >
> >
> >       To keep the arrays dense they copy the logic for pgprocnos. Which
> >       means that ProcArrayAdd/Remove move things around. Unfortunately
> >       that requires holding both ProcArrayLock and XidGenLock currently
> >       (to avoid GetNewTransactionId() having to hold ProcArrayLock). But
> >       that doesn't seem too bad?
>
> In the places where you now acquire both, I guess you also need to
> release both in the error path?

Hm. I guess you mean:

        if (arrayP->numProcs >= arrayP->maxProcs)
        {
                /*
                 * Oops, no room.  (This really shouldn't happen, since there is a
                 * fixed supply of PGPROC structs too, and so we should have failed
                 * earlier.)
                 */
                LWLockRelease(ProcArrayLock);
                ereport(FATAL,
                                (errcode(ERRCODE_TOO_MANY_CONNECTIONS),
                                 errmsg("sorry, too many clients already")));
        }

I think we should just remove the LWLockRelease? At this point we
already have set up ProcKill(), which would release all lwlocks after
the error was thrown?

Greetings,

Andres Freund



Re: Improving connection scalability: GetSnapshotData()

Andres Freund
In reply to this post by David Rowley
Hi,

On 2020-03-17 23:59:14 +1300, David Rowley wrote:
> Nice performance gains.

Thanks.


> On Sun, 1 Mar 2020 at 21:36, Andres Freund <[hidden email]> wrote:
> 2. I don't quite understand your change in
> UpdateSubscriptionRelState(). snap seems unused. Drilling down into
> SearchSysCacheCopy2, in SearchCatCacheMiss() the systable_beginscan()
> passes a NULL snapshot.
>
> the whole patch does this. I guess I don't understand why 0002 does this.

See the thread at https://postgr.es/m/20200229052459.wzhqnbhrriezg4v2%40alap3.anarazel.de

Basically, the way catalog snapshots are handled right now, it's not
correct to do much without a snapshot held. Any concurrent invalidation
can cause the catalog snapshot to be released, which can reset the
backend's xmin. Which in turn can allow pruning etc. to remove required
data.

This is part of this series only because I felt I needed to add stronger
asserts to be confident in what's happening. And they started to trigger
all over :( - and weren't related to the patchset :(.


> 4. Did you consider the location of 'delayChkpt' in PGPROC. Couldn't
> you slot it in somewhere it would fit in existing padding?
>
> 0007

Hm, maybe. I'm not sure what the best thing to do here is - there's some
arguments to be made that we should keep the fields moved from PGXACT
together on their own cacheline. Compared to some of the other stuff in
PGPROC they're still accessed from other backends fairly frequently.


> 5. GinPageIsRecyclable() has no comments at all. I know that
> ginvacuum.c is not exactly the modal citizen for function header
> comments, but likely this patch is no good reason to continue the
> trend.

Well, I basically just moved the code from the macro of the same
name... I'll add something.


> 9. I get the idea you don't intend to keep the debug message in
> InvisibleToEveryoneTestFullXid(), but if you do, then shouldn't it be
> using UINT64_FORMAT?

Yea, I don't intend to keep them - they're way too verbose, even for
DEBUG*. Note that there's some advantage in the long long cast approach
- it's easier to deal with for translations IIRC.

> 13. is -> are
>
> * accessed data is stored contiguously in memory in as few cache lines as

Oh? 'data are stored' sounds wrong to me, somehow.

Greetings,

Andres Freund



Re: Improving connection scalability: GetSnapshotData()

Thomas Munro-5
On Sun, Mar 29, 2020 at 1:49 PM Andres Freund <[hidden email]> wrote:
> > 13. is -> are
> >
> > * accessed data is stored contiguously in memory in as few cache lines as
>
> Oh? 'data are stored' sounds wrong to me, somehow.

In computer contexts it seems pretty well established that we treat
"data" as an uncountable noun (like "air"), so I think "is" is right
here.  In maths or science contexts it's usually treated as a plural
following Latin, which admittedly sounds cleverer, but it also has a
slightly different meaning, not bits and bytes but something more like
samples or (wince) datums.



Re: Improving connection scalability: GetSnapshotData()

Peter Geoghegan-4
In reply to this post by Andres Freund
On Sun, Mar 1, 2020 at 12:36 AM Andres Freund <[hidden email]> wrote:

> The workload is a pgbench readonly, with pgbench -M prepared -c $conns
> -j $conns -S -n for each client count.  This is on a machine with 2
> Intel(R) Xeon(R) Platinum 8168, but virtualized.
>
> conns   tps master              tps pgxact-split
>
> 1       26842.492845            26524.194821
> 10      246923.158682           249224.782661
> 50      695956.539704           709833.746374
> 100     1054727.043139          1903616.306028
> 200     964795.282957           1949200.338012
> 300     906029.377539           1927881.231478
> 400     845696.690912           1911065.369776
> 500     812295.222497           1926237.255856
> 600     888030.104213           1903047.236273
> 700     866896.532490           1886537.202142
> 800     863407.341506           1883768.592610
> 900     871386.608563           1874638.012128
> 1000    887668.277133           1876402.391502
> 1500    860051.361395           1815103.564241
> 2000    890900.098657           1775435.271018
> 3000    874184.980039           1653953.817997
> 4000    845023.080703           1582582.316043
> 5000    817100.195728           1512260.802371
>
> I think these are pretty nice results.

This scalability improvement is clearly very significant. There is
little question that this is a strategically important enhancement for
the Postgres project in general. I hope that you will ultimately be
able to commit the patchset before feature freeze.

I have heard quite a few complaints about the scalability of snapshot
acquisition in Postgres. Generally from very large users that are not
well represented on the mailing lists, for a variety of reasons. The
GetSnapshotData() bottleneck is a *huge* problem for us. (As problems
for Postgres users go, I would probably rank it second behind issues
with VACUUM.)

--
Peter Geoghegan



Re: Improving connection scalability: GetSnapshotData()

Bruce Momjian
On Sat, Mar 28, 2020 at 06:39:32PM -0700, Peter Geoghegan wrote:

> On Sun, Mar 1, 2020 at 12:36 AM Andres Freund <[hidden email]> wrote:
> > The workload is a pgbench readonly, with pgbench -M prepared -c $conns
> > -j $conns -S -n for each client count.  This is on a machine with 2
> > Intel(R) Xeon(R) Platinum 8168, but virtualized.
> >
> > conns   tps master              tps pgxact-split
> >
> > 1       26842.492845            26524.194821
> > 10      246923.158682           249224.782661
> > 50      695956.539704           709833.746374
> > 100     1054727.043139          1903616.306028
> > 200     964795.282957           1949200.338012
> > 300     906029.377539           1927881.231478
> > 400     845696.690912           1911065.369776
> > 500     812295.222497           1926237.255856
> > 600     888030.104213           1903047.236273
> > 700     866896.532490           1886537.202142
> > 800     863407.341506           1883768.592610
> > 900     871386.608563           1874638.012128
> > 1000    887668.277133           1876402.391502
> > 1500    860051.361395           1815103.564241
> > 2000    890900.098657           1775435.271018
> > 3000    874184.980039           1653953.817997
> > 4000    845023.080703           1582582.316043
> > 5000    817100.195728           1512260.802371
> >
> > I think these are pretty nice results.
>
> This scalability improvement is clearly very significant. There is
> little question that this is a strategically important enhancement for
> the Postgres project in general. I hope that you will ultimately be
> able to commit the patchset before feature freeze.

+1

> I have heard quite a few complaints about the scalability of snapshot
> acquisition in Postgres. Generally from very large users that are not
> well represented on the mailing lists, for a variety of reasons. The
> GetSnapshotData() bottleneck is a *huge* problem for us. (As problems
> for Postgres users go, I would probably rank it second behind issues
> with VACUUM.)

+1

--
  Bruce Momjian  <[hidden email]>        https://momjian.us
  EnterpriseDB                             https://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +



Re: Improving connection scalability: GetSnapshotData()

Andres Freund
In reply to this post by Peter Geoghegan-4
Hi,

On 2020-03-28 18:39:32 -0700, Peter Geoghegan wrote:
> I have heard quite a few complaints about the scalability of snapshot
> acquisition in Postgres. Generally from very large users that are not
> well represented on the mailing lists, for a variety of reasons. The
> GetSnapshotData() bottleneck is a *huge* problem for us. (As problems
> for Postgres users go, I would probably rank it second behind issues
> with VACUUM.)

Yea, I see it similarly. For busy databases, my experience is that
vacuum is the big problem for write heavy workloads (or the write
portion), and snapshot scalability the big problem for read heavy oltp
workloads.


> This scalability improvement is clearly very significant. There is
> little question that this is a strategically important enhancement for
> the Postgres project in general. I hope that you will ultimately be
> able to commit the patchset before feature freeze.

I've done a fair bit of cleanup, but I'm still fighting with how to
implement old_snapshot_threshold in a good way. It's not hard to get it
back to kind of working, but it requires some changes that go in the
wrong direction.

The problem basically is that the current old_snapshot_threshold
implementation just reduces OldestXmin to whatever is indicated by
old_snapshot_threshold, even if not necessary for pruning to do the
specific cleanup that's about to be done. If OldestXmin < threshold,
it'll set shared state that fails all older accesses.  But that doesn't
really work well with the patch's approach of using a lower/upper
boundary for potentially valid xmin horizons.

I think the right approach would be to split
TransactionIdLimitedForOldSnapshots() into separate parts: one that
determines the most aggressive horizon that old_snapshot_threshold
allows, and a separate part that increases the threshold after which
accesses need to error out
(i.e. SetOldSnapshotThresholdTimestamp()). Then we can call
SetOldSnapshotThresholdTimestamp() only for exactly the xids that are
removed, not for the most aggressive interpretation.
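
To sketch the shape (the first function and its signature are invented
here purely for illustration; only SetOldSnapshotThresholdTimestamp()
exists as-is):

    /*
     * Part 1 (hypothetical): compute the most aggressive horizon that
     * old_snapshot_threshold would permit, without any side effects.
     */
    TransactionId GetOldSnapshotThresholdLimit(Relation relation,
                                               TransactionId recentXmin);

    /*
     * Part 2: only once we actually remove a row that an old snapshot could
     * still see do we raise the threshold after which accesses error out.
     */
    void SetOldSnapshotThresholdTimestamp(TimestampTz ts, TransactionId xlimit);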

Unfortunately I think that basically requires changing
HeapTupleSatisfiesVacuum's signature, to take a more complex parameter
than OldestXmin (to take InvisibleToEveryoneState *), which quickly
increases the size of the patch.


I'm currently doing that and seeing how the result makes me feel about
the patch.

Alternatively we can also just be less efficient and call
GetOldestXmin() more aggressively when old_snapshot_threshold is
set. That'd be easier to implement - but seems like an ugly gotcha.

Greetings,

Andres Freund



Re: Improving connection scalability: GetSnapshotData()

Alexander Korotkov
In reply to this post by Peter Geoghegan-4
On Sun, Mar 29, 2020 at 4:40 AM Peter Geoghegan <[hidden email]> wrote:

> On Sun, Mar 1, 2020 at 12:36 AM Andres Freund <[hidden email]> wrote:
> > The workload is a pgbench readonly, with pgbench -M prepared -c $conns
> > -j $conns -S -n for each client count.  This is on a machine with 2
> > Intel(R) Xeon(R) Platinum 8168, but virtualized.
> >
> > conns   tps master              tps pgxact-split
> >
> > 1       26842.492845            26524.194821
> > 10      246923.158682           249224.782661
> > 50      695956.539704           709833.746374
> > 100     1054727.043139          1903616.306028
> > 200     964795.282957           1949200.338012
> > 300     906029.377539           1927881.231478
> > 400     845696.690912           1911065.369776
> > 500     812295.222497           1926237.255856
> > 600     888030.104213           1903047.236273
> > 700     866896.532490           1886537.202142
> > 800     863407.341506           1883768.592610
> > 900     871386.608563           1874638.012128
> > 1000    887668.277133           1876402.391502
> > 1500    860051.361395           1815103.564241
> > 2000    890900.098657           1775435.271018
> > 3000    874184.980039           1653953.817997
> > 4000    845023.080703           1582582.316043
> > 5000    817100.195728           1512260.802371
> >
> > I think these are pretty nice results.
>
> This scalability improvement is clearly very significant. There is
> little question that this is a strategically important enhancement for
> the Postgres project in general. I hope that you will ultimately be
> able to commit the patchset before feature freeze.

+1, these are really very cool results.

Despite this patchset is expected to be clearly a big win on majority
of workloads, I think we still need to investigate different workloads
on different hardware to ensure there is no regression.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: Improving connection scalability: GetSnapshotData()

Andres Freund
Hi,

On March 29, 2020 11:24:32 AM PDT, Alexander Korotkov <[hidden email]> wrote:
> clearly a big win on majority
>of workloads, I think we still need to investigate different workloads
>on different hardware to ensure there is no regression.

Definitely. Which workloads are you thinking of? I can think of these affected facets: snapshot speed, commit speed with writes, connection establishment, prepared transaction speed. All in both the small and large connection count cases.

I did measurements on all of those but prepared xacts, fwiw. That definitely needs to be measured, due to the locking changes around ProcArrayAdd()/Remove().

I don't think regressions besides perhaps 2pc are likely - nothing really gets more expensive except procarray add/remove.


Regards,

Andres


--
Sent from my Android device with K-9 Mail. Please excuse my brevity.



Re: Improving connection scalability: GetSnapshotData()

Tomas Vondra-4
On Sun, Mar 29, 2020 at 11:50:10AM -0700, Andres Freund wrote:

>Hi,
>
>On March 29, 2020 11:24:32 AM PDT, Alexander Korotkov
><[hidden email]> wrote:
>> clearly a big win on majority
>>of workloads, I think we still need to investigate different workloads
>>on different hardware to ensure there is no regression.
>
>Definitely. Which workloads are you thinking of? I can think of those
>affected facets: snapshot speed, commit speed with writes, connection
>establishment, prepared transaction speed. All in the small and large
>connection count cases.
>
>I did measurements on all of those but prepared xacts, fwiw. That
>definitely needs to be measured, due to the locking changes around
>procarrayaddd/remove.
>
>I don't think regressions besides perhaps 2pc are likely - there's
>nothing really getting more expensive but procarray add/remove.
>

If I get some instructions on what tests to do, I can run a bunch of tests
on my machines (not the largest boxes, but at least something). I don't
have the bandwidth to come up with tests on my own at the moment.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Improving connection scalability: GetSnapshotData()

Alexander Korotkov
In reply to this post by Andres Freund
On Sun, Mar 29, 2020 at 9:50 PM Andres Freund <[hidden email]> wrote:
> On March 29, 2020 11:24:32 AM PDT, Alexander Korotkov <[hidden email]> wrote:
> > clearly a big win on majority
> >of workloads, I think we still need to investigate different workloads
> >on different hardware to ensure there is no regression.
>
> Definitely. Which workloads are you thinking of? I can think of those affected facets: snapshot speed, commit speed with writes, connection establishment, prepared transaction speed. All in the small and large connection count cases.

The following pgbench scripts come to my mind first:
1) SELECT txid_current(); (artificial, but good for checking a corner case)
2) Single insert statement (as an example of a very short transaction)
3) Plain pgbench read-write (you already did it for sure)
4) pgbench read-write script with an increased number of SELECTs.  Repeat
the select from pgbench_accounts, say, 10 times with different aids.
5) 10% pgbench read-write, 90% pgbench read-only

> I did measurements on all of those but prepared xacts, fwiw

Great, it would be nice to see the results in the thread.

> That definitely needs to be measured, due to the locking changes around procarrayaddd/remove.
>
> I don't think regressions besides perhaps 2pc are likely - there's nothing really getting more expensive but procarray add/remove.

I agree that ProcArrayAdd()/Remove() should be the first subject of
investigation, but other cases should be checked as well IMHO.
Regarding 2pc, the following scenarios come to my mind:
1) pgbench read-write modified so that every transaction is prepared
first, then commit prepared.
2) 10% of 2pc pgbench read-write, 90% normal pgbench read-write
3) 10% of 2pc pgbench read-write, 90% normal pgbench read-only

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: Improving connection scalability: GetSnapshotData()

Andres Freund
In reply to this post by Andres Freund
Hi,

I'm still fighting with snapshot_too_old. The feature is just badly
undertested, underdocumented, and there are lots of other oddities. I've
now spent about as much time on that feature as on the whole rest of
the patchset.

As an example of the under-documentation, here's a definitely non-trivial
block of code without a single comment explaining what it's doing.

                                if (oldSnapshotControl->count_used > 0 &&
                                        ts >= oldSnapshotControl->head_timestamp)
                                {
                                        int offset;

                                        offset = ((ts - oldSnapshotControl->head_timestamp)
                                                          / USECS_PER_MINUTE);
                                        if (offset > oldSnapshotControl->count_used - 1)
                                                offset = oldSnapshotControl->count_used - 1;
                                        offset = (oldSnapshotControl->head_offset + offset)
                                                % OLD_SNAPSHOT_TIME_MAP_ENTRIES;
                                        xlimit = oldSnapshotControl->xid_by_minute[offset];

                                        if (NormalTransactionIdFollows(xlimit, recentXmin))
                                                SetOldSnapshotThresholdTimestamp(ts, xlimit);
                                }

                                LWLockRelease(OldSnapshotTimeMapLock);

Also, SetOldSnapshotThresholdTimestamp() acquires a separate spinlock -
not great to call that with the lwlock held.
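
One way to avoid that - just a sketch of the idea, not something the
current code or the patch does - would be to remember the values while
holding the lwlock and only make the spinlock-taking call after
releasing it:

TimestampTz     threshold_ts = 0;
TransactionId   threshold_xlimit = InvalidTransactionId;

LWLockAcquire(OldSnapshotTimeMapLock, LW_SHARED);

if (oldSnapshotControl->count_used > 0 &&
    ts >= oldSnapshotControl->head_timestamp)
{
    TransactionId xlimit = InvalidTransactionId;

    /* ... the existing offset / xid_by_minute[] lookup fills xlimit ... */

    if (NormalTransactionIdFollows(xlimit, recentXmin))
    {
        threshold_ts = ts;
        threshold_xlimit = xlimit;
    }
}

LWLockRelease(OldSnapshotTimeMapLock);

/* the spinlock inside is no longer acquired while the lwlock is held */
if (TransactionIdIsValid(threshold_xlimit))
    SetOldSnapshotThresholdTimestamp(threshold_ts, threshold_xlimit);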


Then there's this comment:

                /*
                 * Failsafe protection against vacuuming work of active transaction.
                 *
                 * This is not an assertion because we avoid the spinlock for
                 * performance, leaving open the possibility that xlimit could advance
                 * and be more current; but it seems prudent to apply this limit.  It
                 * might make pruning a tiny bit less aggressive than it could be, but
                 * protects against data loss bugs.
                 */
                if (TransactionIdIsNormal(latest_xmin)
                        && TransactionIdPrecedes(latest_xmin, xlimit))
                        xlimit = latest_xmin;

                if (NormalTransactionIdFollows(xlimit, recentXmin))
                        return xlimit;

So this is not using a lock, so the values aren't accurate, but it avoids
data loss bugs? I also don't know which spinlock is avoided on the path
here, as mentioned - the acquisition is unconditional.

But more importantly - if this is about avoiding data loss bugs, how on
earth is it ok that we don't go through these checks in the
old_snapshot_threshold == 0 path?

                /*
                 * Zero threshold always overrides to latest xmin, if valid.  Without
                 * some heuristic it will find its own snapshot too old on, for
                 * example, a simple UPDATE -- which would make it useless for most
                 * testing, but there is no principled way to ensure that it doesn't
                 * fail in this way.  Use a five-second delay to try to get useful
                 * testing behavior, but this may need adjustment.
                 */
                if (old_snapshot_threshold == 0)
                {
                        if (TransactionIdPrecedes(latest_xmin, MyProc->xmin)
                                && TransactionIdFollows(latest_xmin, xlimit))
                                xlimit = latest_xmin;

                        ts -= 5 * USECS_PER_SEC;
                        SetOldSnapshotThresholdTimestamp(ts, xlimit);

                        return xlimit;
                }


This feature looks like it was put together by applying force until
something gave, and then stopping just there.


Greetings,

Andres Freund



Re: Improving connection scalability: GetSnapshotData()

Andres Freund
Hi,

On 2020-03-31 13:04:38 -0700, Andres Freund wrote:
> I'm still fighting with snapshot_too_old. The feature is just badly
> undertested, underdocumented, and there's lots of other oddities. I've
> now spent about as much time on that feature than on the whole rest of
> the patchset.

To expand on this being under-tested: The whole time mapping
infrastructure is not tested, because all of that is bypassed when
old_snapshot_threshold = 0.  And old_snapshot_threshold = 0 basically
only exists for testing.  The largest part of the complexity of this
feature is in TransactionIdLimitedForOldSnapshots() and
MaintainOldSnapshotTimeMapping() - and none of that complexity is tested,
because the tests run with old_snapshot_threshold = 0.

So we have test-only infrastructure that doesn't actually allow testing
the feature.


And the tests that we do have don't have a single comment explaining
what the expected results are. Except for the newer
sto_using_hash_index.spec, they just run all permutations. I don't know
how those tests actually help, since it's not clear why any of the
results are the way they are, which of them are just the results of
bugs, or which aren't affected by s_t_o at all.

Greetings,

Andres Freund



Re: Improving connection scalability: GetSnapshotData()

David Rowley
In reply to this post by Andres Freund
On Sun, 1 Mar 2020 at 21:47, Andres Freund <[hidden email]> wrote:

> On 2020-03-01 00:36:01 -0800, Andres Freund wrote:
> > conns   tps master            tps pgxact-split
> >
> > 1       26842.492845            26524.194821
> > 10      246923.158682           249224.782661
> > 50      695956.539704           709833.746374
> > 100     1054727.043139          1903616.306028
> > 200     964795.282957           1949200.338012
> > 300     906029.377539           1927881.231478
> > 400     845696.690912           1911065.369776
> > 500     812295.222497           1926237.255856
> > 600     888030.104213           1903047.236273
> > 700     866896.532490           1886537.202142
> > 800     863407.341506           1883768.592610
> > 900     871386.608563           1874638.012128
> > 1000    887668.277133           1876402.391502
> > 1500    860051.361395           1815103.564241
> > 2000    890900.098657           1775435.271018
> > 3000    874184.980039           1653953.817997
> > 4000    845023.080703           1582582.316043
> > 5000    817100.195728           1512260.802371
> >
> > I think these are pretty nice results.

FWIW, I took this for a spin on an AMD 3990x:

# setup
pgbench -i postgres

# benchmark
#!/bin/bash

for i in 1 10 50 100 200 300 400 500 600 700 800 900 1000 1500 2000 3000 4000 5000;
do
    echo Testing with $i connections >> bench.log
    pgbench2 -M prepared -c $i -j $i -S -n -T 60 postgres >> bench.log
done

pgbench2 is your patched version of pgbench. I got some pretty strange
results with the unpatched version: up to about 50 million tps reported
for "excluding connections establishing", which seems pretty far-fetched.

connections     Unpatched           Patched
1               49062.24413         49834.64983
10              428673.1027         453290.5985
50              1552413.084         1849233.821
100             2039675.027         2261437.1
200             3139648.845         3632008.991
300             3091248.316         3597748.942
400             3056453.5           3567888.293
500             3019571.47          3574009.053
600             2991052.393         3537518.903
700             2952484.763         3553252.603
800             2910976.875         3539404.865
900             2873929.989         3514353.776
1000            2846859.499         3490006.026
1500            2540003.038         3370093.934
2000            2361799.107         3197556.738
3000            2056973.778         2949740.692
4000            1751418.117         2627174.81
5000            1464786.461         2334586.042

> Attached as a graph as well.

Likewise.

David

excluding_connections_est_3990x.png (64K) Download Attachment