Deleting older versions in unique indexes to avoid page splits

Re: Deleting older versions in unique indexes to avoid page splits

Peter Geoghegan-4
On Sat, Oct 24, 2020 at 8:01 AM Peter Geoghegan <[hidden email]> wrote:
> There probably are some workloads with worse latency and throughput,
> but generally only with high contention/small indexes. I'll try to
> fine tune those, but some amount of it is probably inevitable. On
> average query latency is quite a lot lower with the patch (where it is
> affected at all - the mechanism is only used with non-hot updates).

Attached is v5, which has changes that are focused on two important
high level goals:

1. Keeping costs down generally, especially in high contention cases
where costs are most noticeable.

2. Making indexes on low cardinality columns (the kind that naturally
benefit from merge deduplication in Postgres 13) receive largely the
same benefits that you've already seen with high cardinality indexes
(at least outside of the extreme cases where it isn't sensible to
try).

CPU costs (especially from sorting temp work arrays) seem to be the
big cost overall. It turns out that the cost of accessing heap pages
within the new mechanism is not really noticeable. This isn't all that
surprising, though. The heap pages are accessed in a way that
naturally exploits locality across would-be page splits in different
indexes.

To some degree the two goals that I describe conflict with each other.
If merge deduplication increases the number of logical rows that "fit
on each leaf page" (by increasing the initial number of TIDs on each
leaf page by over 3x when the index is in the pristine CREATE INDEX
state), then naturally the average amount of work required to maintain
indexes in their pristine state is increased. We cannot expect to pay
nothing to avoid "unnecessary page splits" -- we can only expect to
come out ahead over time.

The main new thing that allowed me to more or less accomplish the
second goal is granular deletion in posting list tuples. That is, v5
teaches the delete mechanism to do VACUUM-style granular TID deletion
within posting lists. This factor turned out to matter a lot.
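
To make "granular" concrete, here is a small standalone illustration in
C of the general idea (simplified stand-in types and names, not the
patch's actual nbtree code):

#include <stdbool.h>
#include <stdint.h>

typedef struct TidLite
{
    uint32_t    blkno;      /* heap block number */
    uint16_t    offnum;     /* heap item offset */
} TidLite;

/*
 * Copy only the surviving TIDs of a posting list into "out".  If every
 * TID survives, the index tuple is left alone; if none survive, the
 * whole tuple can be deleted; otherwise the posting list shrinks,
 * freeing roughly 6 bytes per dead TID without needing a page split.
 */
static int
posting_list_prune(const TidLite *posting, int nposting,
                   bool (*tid_is_dead) (const TidLite *), TidLite *out)
{
    int         nlive = 0;

    for (int i = 0; i < nposting; i++)
    {
        if (!tid_is_dead(&posting[i]))
            out[nlive++] = posting[i];
    }
    return nlive;
}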

Here is an extreme benchmark that I ran this morning for this patch,
which shows both strengths and weaknesses:

pgbench scale 1000 (with customizations to indexing that I go into
below), 16 + 32 clients, 30 minutes per run (so 2 hours total
excluding initial data loading). Same queries as last time:

\set aid1 random_gaussian(1, 100000 * :scale, 4.0)
\set aid2 random_gaussian(1, 100000 * :scale, 4.5)
\set aid3 random_gaussian(1, 100000 * :scale, 4.2)
\set bid random(1, 1 * :scale)
\set tid random(1, 10 * :scale)
\set delta random(-5000, 5000)
BEGIN;
UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid1;
SELECT abalance FROM pgbench_accounts WHERE aid = :aid2;
SELECT abalance FROM pgbench_accounts WHERE aid = :aid3;
END;

And just like last time, we replace the usual pgbench_accounts index
with an INCLUDE index to artificially make every UPDATE a non-HOT
UPDATE:

create unique index aid_pkey_include_abalance on pgbench_accounts
(aid) include(abalance);

Unlike last time we have a variety of useless-to-us *low cardinality*
indexes. These are much smaller than aid_pkey_include_abalance due to
merge deduplication, but are still quite similar to it:

create index fiver on pgbench_accounts ((aid - (aid%5)));
create index tenner on pgbench_accounts ((aid - (aid%10)));
create index score on pgbench_accounts ((aid - (aid%20)));

The silly indexes this time around are designed to have the same skew
as the PK, but with low cardinality data. So, for example, "score" has
twenty logical rows for each distinct value in the index, which is
pretty extreme as far as the delete mechanism is concerned. That's why
this is more of a mixed picture compared to the earlier benchmark. I'm
trying to really put the patch through its paces, not make it look
good.

First the good news. The patch held up perfectly in one important way
-- the size of the indexes didn't change at all compared to the
original pristine size. That looked like this at the start for both
patch + master:

aid_pkey_include_abalance: 274,194 pages/2142 MB
fiver: 142,348 pages/1112 MB
tenner: 115,352 pages/901 MB
score: 94,677 pages/740 MB

By the end, master looked like this:

aid_pkey_include_abalance: 428,759 pages (~1.56x original size)
fiver: 220,114 pages (~1.54x original size)
tenner: 176,983 pages (~1.53x original size)
score: 178,165 pages (~1.88x original size)

(As I said, no change in the size of indexes with the patch -- not
even one single page split.)

Now for the not-so-good news. The TPS numbers looked like this
(results in original chronological order of the runs, which I've
interleaved):

patch_1_run_16.out: "tps = 30452.530518 (including connections establishing)"
master_1_run_16.out: "tps = 35101.867559 (including connections establishing)"
patch_1_run_32.out: "tps = 26000.991486 (including connections establishing)"
master_1_run_32.out: "tps = 32582.129545 (including connections establishing)"

The latency numbers aren't great for the patch, either. Take the 16 client case:

number of transactions actually processed: 54814992
latency average = 0.525 ms
latency stddev = 0.326 ms
tps = 30452.530518 (including connections establishing)
tps = 30452.612796 (excluding connections establishing)
statement latencies in milliseconds:
         0.001  \set aid1 random_gaussian(1, 100000 * :scale, 4.0)
         0.000  \set aid2 random_gaussian(1, 100000 * :scale, 4.5)
         0.000  \set aid3 random_gaussian(1, 100000 * :scale, 4.2)
         0.000  \set bid random(1, 1 * :scale)
         0.000  \set tid random(1, 10 * :scale)
         0.000  \set delta random(-5000, 5000)
         0.046  BEGIN;
         0.159  UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid1;
         0.153  SELECT abalance FROM pgbench_accounts WHERE aid = :aid2;
         0.091  SELECT abalance FROM pgbench_accounts WHERE aid = :aid3;
         0.075  END;

vs master's 16 client case:

number of transactions actually processed: 63183870
latency average = 0.455 ms
latency stddev = 0.307 ms
tps = 35101.867559 (including connections establishing)
tps = 35101.914573 (excluding connections establishing)
statement latencies in milliseconds:
         0.001  \set aid1 random_gaussian(1, 100000 * :scale, 4.0)
         0.000  \set aid2 random_gaussian(1, 100000 * :scale, 4.5)
         0.000  \set aid3 random_gaussian(1, 100000 * :scale, 4.2)
         0.000  \set bid random(1, 1 * :scale)
         0.000  \set tid random(1, 10 * :scale)
         0.000  \set delta random(-5000, 5000)
         0.049  BEGIN;
         0.117  UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid1;
         0.120  SELECT abalance FROM pgbench_accounts WHERE aid = :aid2;
         0.091  SELECT abalance FROM pgbench_accounts WHERE aid = :aid3;
         0.077  END;

Earlier variations of this same workload that were run with a 10k TPS
rate limit had SELECT statements that had somewhat lower latency with
the patch, while the UPDATE statements were slower by roughly the same
amount as you see here. Here we see that at least the aid3 SELECT is
just as fast with the patch. (Don't have a proper example of this
rate-limited phenomenon close at hand now, since I saw it happen back
when the patch was less well optimized.)

This benchmark is designed to be extreme, and to really stress the
patch. For one thing, we have absolutely no index scans on all but one
of the four indexes, so controlling the bloat there isn't going to
give you the main expected benefit, which is that index scans don't
even have to visit heap pages from old versions, since those versions
aren't in the index for very long (unless there aren't very many
versions for the logical rows in question, which isn't really a
problem for us). For another, the new mechanism is constantly needed,
which just isn't very realistic. It seems as if the cost is mostly
paid by non-HOT updaters, which seems exactly right to me.

I probably could have cheated by making the aid_pkey_include_abalance
index a non-unique index, denying the master branch the benefit of
LP_DEAD setting within _bt_check_unique(). Or I could have cheated
(perhaps I should just say "gone for something a bit more
sympathetic") by not having skew on all the indexes (say by hashing on
aid in the indexes that use merge deduplication). I also think that
the TPS gap would have been smaller if I'd spent more time on each
run, but I didn't have time for that today. In the past I've seen it
take a couple of hours or more for the advantages of the patch to come
through (it takes that long for reasons that should be obvious).

Even the overhead we see here is pretty tolerable IMV. I believe that
it will be far more common for the new mechanism to hardly get used at
all, and yet have a pretty outsized effect on index bloat. To give you
a simple example of how that can happen, consider that if this did
happen in a real workload it would probably be caused by a surge in
demand -- now we don't have to live with the bloat consequences of an
isolated event forever (or until the next REINDEX). I can make more
sophisticated arguments than that one, but it doesn't seem useful
right now so I'll refrain.

The patch adds a backstop. It seems to me that that's really what we
need here. Predictability over time and under a large variety of
different conditions. Real workloads constantly fluctuate.

Even if people end up not buying my argument that it's worth it for
workloads like this, there are various options. And, I bet I can
further improve the high contention cases without losing the valuable
part -- there are a number of ways in which I can get the CPU costs
down further that haven't been fully explored (yes, it really does
seem to be CPU costs, especially due to TID sorting). Again, this
patch is all about extreme pathological workloads, system stability,
and system efficiency over time -- it is not simply about increasing
system throughput. There are some aspects of this design (that come up
with extreme workloads) that may in the end come down to value
judgments. I'm not going to tell somebody that they're wrong for
prioritizing different things (within reason, at least). In my opinion
almost all of the problems we have with VACUUM are ultimately
stability problems, not performance problems per se. And, I suspect
that we do very well with stupid benchmarks like this compared to
other DB systems precisely because we currently allow non-HOT updaters
to "live beyond their means" (which could in theory be great if you
frame it a certain way that seems pretty absurd to me). This suggests
we can "afford" to go a bit slower here as far as the competitive
pressures determine what we should do (notice that this is distinct
from my favorite argument, which is that we cannot afford to *not* go
a bit slower in certain extreme cases).

I welcome debate about this.

--
Peter Geoghegan

v5-0001-Add-sort_template.h-for-making-fast-sort-function.patch (18K)
v5-0002-Add-delete-deduplication-to-nbtree.patch (117K)

Re: Deleting older versions in unique indexes to avoid page splits

Simon Riggs
On Mon, 26 Oct 2020 at 21:15, Peter Geoghegan <[hidden email]> wrote:

> Now for the not-so-good news. The TPS numbers looked like this
> (results in original chronological order of the runs, which I've
> interleaved):

While it is important that we investigate the worst cases, I don't see
that this is necessarily bad.

HOT was difficult to measure, but on a 2+ hour run on a larger table,
the latency graph was what showed it was a winner. Short runs and
in-memory data masked the benefits in our early analyses.

So I suggest not looking at the totals and averages but on the peaks
and the long term trend. Showing that in graphical form is best.

> The patch adds a backstop. It seems to me that that's really what we
> need here. Predictability over time and under a large variety of
> different conditions. Real workloads constantly fluctuate.

Yeh, agreed. This is looking like a winner now, but let's check.

> Even if people end up not buying my argument that it's worth it for
> workloads like this, there are various options. And, I bet I can
> further improve the high contention cases without losing the valuable
> part -- there are a number of ways in which I can get the CPU costs
> down further that haven't been fully explored (yes, it really does
> seem to be CPU costs, especially due to TID sorting). Again, this
> patch is all about extreme pathological workloads, system stability,
> and system efficiency over time -- it is not simply about increasing
> system throughput. There are some aspects of this design (that come up
> with extreme workloads) that may in the end come down to value
> judgments. I'm not going to tell somebody that they're wrong for
> prioritizing different things (within reason, at least). In my opinion
> almost all of the problems we have with VACUUM are ultimately
> stability problems, not performance problems per se. And, I suspect
> that we do very well with stupid benchmarks like this compared to
> other DB systems precisely because we currently allow non-HOT updaters
> to "live beyond their means" (which could in theory be great if you
> frame it a certain way that seems pretty absurd to me). This suggests
> we can "afford" to go a bit slower here as far as the competitive
> pressures determine what we should do (notice that this is a distinct
> argument to my favorite argument, which is that we cannot afford to
> *not* go a bit slower in certain extreme cases).
>
> I welcome debate about this.

Agreed, we can trade initial speed for long term consistency. I guess
there are some heuristics there on that tradeoff.

--
Simon Riggs                http://www.EnterpriseDB.com/



Re: Deleting older versions in unique indexes to avoid page splits

Peter Geoghegan-4
On Tue, Oct 27, 2020 at 2:44 AM Simon Riggs <[hidden email]> wrote:
> While it is important we investigate the worst cases, I don't see this
> is necessarily bad.

I looked at "perf top" a few times when the test from yesterday ran. I
saw that the proposed delete mechanism was the top consumer of CPU
cycles. It seemed as if the mechanism was very expensive. However,
that's definitely the wrong conclusion about what happens in the
general case, or even in slightly less extreme cases. It at least
needs to be put in context.

I reran exactly the same benchmark overnight, but added a 10k TPS rate
limit this time (so about a third of the TPS that's possible without a
limit). I also ran it for longer, and saw much improved latency. (More
on the latency aspect below, for now I want to talk about "perf top").

The picture with "perf top" changed significantly with a 10k TPS rate
limit, even though the workload itself is very similar. Certainly the
new mechanism/function is still quite close to the top consumer of CPU
cycles. But it no longer uses more cycles than the familiar super hot
functions that you expect to see right at the top with pgbench (e.g.
_bt_compare(), hash_search_with_hash_value()). It's now something like
the 4th or 5th hottest function (I think that that means that the cost
in cycles is more than an order of magnitude lower, but I didn't
check). Just adding this 10k TPS rate limit makes the number of CPU
cycles consumed by the new mechanism seem quite reasonable. The
benefits that the patch brings are not diminished at all compared to
the original no-rate-limit variant -- the master branch now only takes
slightly longer to completely bloat all its indexes with this 10k TPS
limit (while the patch avoids even a single page split -- no change
there).

Again, this is because the mechanism is a backstop. It only works as
hard as needed to avoid unnecessary page splits. When the system is
working as hard as possible to add version churn to indexes (which is
what the original/unthrottled test involved), then the mechanism also
works quite hard. In this artificial and contrived scenario, any
cycles we can save from cleaning up bloat (by micro optimizing the
code in the patch) go towards adding even more bloat instead...which
necessitates doing more cleanup. This is why optimizing the code in
the patch with this unrealistic scenario in mind is subject to sharp
diminishing returns. It's also why you can get a big benefit from the
patch when the new mechanism is barely ever used. I imagine that if I
ran the same test again but with a 1k TPS limit I would hardly see the
new mechanism in "perf top" at all....but in the end the bloat
situation would be very similar.

I think that you could model this probabilistically if you had the
inclination. Yes, the more you throttle throughput (by lowering the
pgbench rate limit further), the less chance you have of any given
leaf page splitting moment to moment (for the master branch). But in
the long run every original leaf page splits at least once anyway,
because each leaf page still only has to be unlucky once. It is still
inevitable that they'll all get split eventually (and probably not
before too long), unless and until you *drastically* throttle pgbench.
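
Roughly: if each time a given leaf page fills up there is some fixed
probability p > 0 that opportunistic cleanup fails to free enough space
(no LP_DEAD bits were set in time, say), then after n such "page full"
events the page has split with probability

    P(split) = 1 - (1 - p)^n  ->  1  as n grows

Throttling only slows the rate at which n grows for each page; for any
fixed p > 0 it doesn't change the limit.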

I believe that things like opportunistic HOT chain truncation (heap
pruning) and index tuple LP_DEAD bit setting are very effective much
of the time. The problem is that it's easy to utterly rely on them
without even realizing it, which creates hidden risk that may or may
not result in big blow ups down the line. There is nothing inherently
wrong with being lazy or opportunistic about cleaning up garbage
tuples -- I think that there are significant advantages, in fact. But
only if it isn't allowed to create systemic risk. More concretely,
bloat cannot be allowed to become concentrated in any one place -- no
individual query should have to deal with more than 2 or 3 versions
for any given logical row. If we're focussed on the *average* number
of physical versions per logical row then we may reach dramatically
wrong conclusions about what to do (which is a problem in a world
where autovacuum is supposed to do most garbage collection...unless
your app happens to look like standard pgbench!).

And now back to latency with this 10k TPS limited variant I ran last
night. After 16 hours we have performed 8 runs, each lasting 2 hours.
In the original chronological order, these runs are:

patch_1_run_16.out: "tps = 10000.095914 (including connections establishing)"
master_1_run_16.out: "tps = 10000.171945 (including connections establishing)"
patch_1_run_32.out: "tps = 10000.082975 (including connections establishing)"
master_1_run_32.out: "tps = 10000.533370 (including connections establishing)"
patch_2_run_16.out: "tps = 10000.076865 (including connections establishing)"
master_2_run_16.out: "tps = 9997.933215 (including connections establishing)"
patch_2_run_32.out: "tps = 9999.911988 (including connections establishing)"
master_2_run_32.out: "tps = 10000.864031 (including connections establishing)"

Here is what I see at the end of "patch_2_run_32.out" (i.e. at the end
of the final 2 hour run for the patch):

number of transactions actually processed: 71999872
latency average = 0.265 ms
latency stddev = 0.110 ms
rate limit schedule lag: avg 0.046 (max 30.274) ms
tps = 9999.911988 (including connections establishing)
tps = 9999.915766 (excluding connections establishing)
statement latencies in milliseconds:
         0.001  \set aid1 random_gaussian(1, 100000 * :scale, 4.0)
         0.000  \set aid2 random_gaussian(1, 100000 * :scale, 4.5)
         0.000  \set aid3 random_gaussian(1, 100000 * :scale, 4.2)
         0.000  \set bid random(1, 1 * :scale)
         0.000  \set tid random(1, 10 * :scale)
         0.000  \set delta random(-5000, 5000)
         0.023  BEGIN;
         0.099  UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid1;
         0.036  SELECT abalance FROM pgbench_accounts WHERE aid = :aid2;
         0.034  SELECT abalance FROM pgbench_accounts WHERE aid = :aid3;
         0.025  END;

Here is what I see at the end of "master_2_run_32.out" (i.e. at the
end of the final run for master):

number of transactions actually processed: 72006803
latency average = 0.266 ms
latency stddev = 2.722 ms
rate limit schedule lag: avg 0.074 (max 396.853) ms
tps = 10000.864031 (including connections establishing)
tps = 10000.868545 (excluding connections establishing)
statement latencies in milliseconds:
         0.001  \set aid1 random_gaussian(1, 100000 * :scale, 4.0)
         0.000  \set aid2 random_gaussian(1, 100000 * :scale, 4.5)
         0.000  \set aid3 random_gaussian(1, 100000 * :scale, 4.2)
         0.000  \set bid random(1, 1 * :scale)
         0.000  \set tid random(1, 10 * :scale)
         0.000  \set delta random(-5000, 5000)
         0.022  BEGIN;
         0.073  UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid1;
         0.036  SELECT abalance FROM pgbench_accounts WHERE aid = :aid2;
         0.034  SELECT abalance FROM pgbench_accounts WHERE aid = :aid3;
         0.025  END;

Notice the following:

1. The overall "latency average" for the patch is very slightly lower.
2. The overall "latency stddev" for the patch is far far lower -- over
20x lower, in fact.
3. The patch's latency for the UPDATE statement is still somewhat
higher, but it's not so bad. We're still visibly paying a price in
some sense, but at least we're imposing the new costs squarely on the
query that is responsible for all of our problems.
4. The patch's latency for the SELECT statements is exactly the same
as on master. We're not imposing any new cost on "innocent" SELECT
statements that didn't create the problem, even if they didn't quite
manage to benefit here. (Without LP_DEAD setting in _bt_check_unique()
I'm sure that the SELECT latency would be significantly lower for the
patch.)

The full results (with lots of details pulled from standard system
views after each run) can be downloaded as a .tar.gz archive from:

https://drive.google.com/file/d/1Dn8rSifZqT7pOOIgyKstl-tdACWH-hqO/view?usp=sharing

(It's probably not that interesting to drill down any further, but I
make the full set of results available just in case. There are loads
of things included just because I automatically capture them when
benchmarking anything at all.)

> HOT was difficult to measure, but on a 2+ hour run on a larger table,
> the latency graph was what showed it was a winner. Short runs and
> in-memory data masked the benefits in our early analyses.

Yeah, that's what was nice about working on sorting -- almost instant
feedback. Who wants to spend at least 2 hours to test out every little
theory?  :-)

> So I suggest not looking at the totals and averages but on the peaks
> and the long term trend. Showing that in graphical form is best.

I think that you're right that a graphical representation with an
X-axis that shows how much time has passed would be very useful. I'll
try to find a way of incorporating that into my benchmarking workflow.

This is especially likely to help when modelling how cases with a long
running xact/snapshot behave. That isn't a specific goal of mine here,
but I expect that it'll help with that a lot too. For now I'm just
focussing on downsides and not upsides, for the usual reasons.

> Agreed, we can trade initial speed for long term consistency. I guess
> there are some heuristics there on that tradeoff.

Right. Another way of looking at it is this: it should be possible for
reasonably good DBAs to develop good intuitions about how the system
will hold up over time, based on past experience and common sense --
no chaos theory required. Whatever the cost of the mechanism is, at
least it's only something that gets shaved off the top minute to
minute. It seems almost impossible for the cost to cause sudden
surprises (except maybe once, after an initial upgrade to Postgres 14,
though I doubt it). Whereas it seems very likely to prevent many large
and unpleasant surprises caused by hidden, latent risk.

I believe that approximately 100% of DBAs would gladly take that
trade-off, even if the total cost in cycles was higher. It happens to
also be true that they're very likely to use far fewer cycles over
time, but that's really just a bonus.

--
Peter Geoghegan



Re: Deleting older versions in unique indexes to avoid page splits

Victor Yegorov
In reply to this post by Peter Geoghegan-4
Mon, 26 Oct 2020 at 22:15, Peter Geoghegan <[hidden email]>:
Attached is v5, which has changes that are focused on two important
high level goals:

I've reviewed v5 of the patch and did some testing.

First things first, the niceties must be observed:

  Patch applies, compiles and passes checks without any issues.
  It has a good amount of comments that describe the changes very well.

Now to its contents.

I now see what you mean by saying that this patch is a natural and logical
extension of the deduplication v13 work. I agree with this.

Basically, 2 major deduplication strategies exist now:
- merging duplicates into a posting list; this suits non-unique indexes better,
  because actual duplicates come from logically different tuples. This is
  existing functionality.
- deleting dead tuples and reducing the need for deduplication at all; this
  suits unique indexes mostly. This is the subject of this patch, and it (to
  some extent) undoes the v13 functionality around unique indexes, making it
  better.

Some comments on the patch.

1. In the following comment:

+ * table_index_batch_check() is a variant that is specialized to garbage
+ * collection of dead tuples in index access methods.  Duplicates are
+ * commonly caused by MVCC version churn when an optimization like
+ * heapam's HOT cannot be applied.  It can make sense to opportunistically
+ * guess that many index tuples are dead versions, particularly in unique
+ * indexes.

I don't quite like the last sentence. Once this code is committed,
I would rather make it:

  … cannot be applied. Therefore we opportunistically check for dead tuples
    and reuse the space, delaying leaf page splits.

I understand that "we" shouldn't be used here, but I fail to think of a
proper way to express this.

2. in _bt_dedup_delete_pass() and heap_index_batch_check() you're using some
constants, like:
- expected score of 25
- nblocksaccessed checks for 1, 2 and 3 blocks
- maybe more, but the ones above caught my attention.

Perhaps, it'd be better to use #define-s here instead?

3. Do we really need to touch another heap page, if all conditions are met?

+           if (uniqueindex && nblocksaccessed == 1 && score == 0)
+               break;
+           if (!uniqueindex && nblocksaccessed == 2 && score == 0)
+               break;
+           if (nblocksaccessed == 3)
+               break;

I was really wondering why we look into 2 heap pages. By not doing it straight away,
we just delay the work until the next occasion that works on the same page we're
processing. I've modified this piece and included it in my tests (see below): I reduced
the 2nd condition to just 1 block and limited the 3rd case to 2 blocks (just a quick hack).
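
Roughly, the restricted variant changes those checks to (a sketch of the
quick hack, not the exact diff):

    if (uniqueindex && nblocksaccessed == 1 && score == 0)
        break;
    if (!uniqueindex && nblocksaccessed == 1 && score == 0)
        break;
    if (nblocksaccessed == 2)
        break;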

Now for the tests.

I used an i3en.6xlarge EC2 instance with EBS disks attached (24 cores, 192GB RAM).
I've employed the same tests Peter described on Oct 16 (right after v2 of the patch).
There were some config changes (attached), mostly to produce more logs and enable
proper query monitoring with pg_stat_statements.

This server is also used for other tests, therefore I was not able to utilize all cores/RAM.
I'm interested in doing so, though; that's a subject for the next run of tests.

I've used scale factor 10 000 and the adjusted indexes (resulting in a 189GB database),
and ran the following pgbench:

    pgbench -f testcase.pgbench -r -c32 -j8 -T 3600 bench


Results (see also attachment):

/* 1, master */
latency average = 16.482 ms
tps = 1941.513487 (excluding connections establishing)
statement latencies in milliseconds:
         4.476  UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid1;
         2.084  SELECT abalance FROM pgbench_accounts WHERE aid = :aid2;
         2.090  SELECT abalance FROM pgbench_accounts WHERE aid = :aid3;
/* 2, v5-patch */
latency average = 12.509 ms
tps = 2558.119453 (excluding connections establishing)
statement latencies in milliseconds:
         2.009  UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid1;
         0.868  SELECT abalance FROM pgbench_accounts WHERE aid = :aid2;
         0.893  SELECT abalance FROM pgbench_accounts WHERE aid = :aid3;
/* 3, v5-restricted */
latency average = 12.338 ms
tps = 2593.519240 (excluding connections establishing)
statement latencies in milliseconds:
         1.955  UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid1;
         0.846  SELECT abalance FROM pgbench_accounts WHERE aid = :aid2;
         0.866  SELECT abalance FROM pgbench_accounts WHERE aid = :aid3;

I can see a clear benefit from this patch *under the specified conditions, YMMV*
- 32% increase in TPS
- 24% drop in average latency
- most important — stable index size!

Looking at the attached graphs (including the statement-specific ones):
- CPU activity, disk reads (reads, not hits) and transaction throughput are very
  stable for the patched version
- CPU "iowait" is stable and reduced for the patched version (expected)
- CPU "user" peaks out when master starts to split leaf pages; there are no such
  peaks for the patched version
- there's an expected increase in the amount of disk reads for the patched versions,
  although master starts at pretty much the same level and seems to climb up on
  reads by the end of the test
- on master, UPDATEs spend 2x more time on average and read 3x more
  pages than on the patched versions
- in fact, the "Average query time" and "Query stages" graphs show a very nice
  caching effect for patched UPDATEs; it's less pronounced for SELECTs, but
  still visible

Comparing the original and restricted patch versions:
- there's no visible difference in the amount of disk reads
- on the restricted version UPDATEs behave more gradually; I like this pattern
  more, as it feels more stable and predictable

In my opinion, the patch provides clear benefits from an IO reduction and index
size control perspective. I really like the stability of operations on the
patched version. I would rather stick to the "restricted" version of the patch,
though.

Hope this helps. I'm open to do more tests if necessary.

P.S. I am using automated monitoring for the graphs, so I don't have the raw metrics around, sorry.

--
Victor Yegorov

20201028-v5-results.txt (4K)
20201028-v5-cpu.png (257K)
20201028-v5-dr.png (292K)
20201028-v5-txn.png (202K)
20201028-v5-tqt.png (301K)
20201028-v5-update-1-master.png (299K)
20201028-v5-update-2-patch.png (293K)
20201028-v5-update-3-restricted.png (287K)
20201028-v5-select-1-master.png (299K)
20201028-v5-select-2-patch.png (299K)
20201028-v5-select-3-restricted.png (299K)
testcase.pgbench (614 bytes)
postgresql.auto.conf (656 bytes)

Re: Deleting older versions in unique indexes to avoid page splits

Victor Yegorov
In reply to this post by Peter Geoghegan-4
Mon, 26 Oct 2020 at 22:15, Peter Geoghegan <[hidden email]>:
Attached is v5, which has changes that are focused on two important
high level goals:

And some more comments after another round of reading the patch.

1. Looks like UNIQUE_CHECK_NO_WITH_UNCHANGED is used for HOT updates;
   should we use UNIQUE_CHECK_NO_HOT here? It would be better understood like this.

2. You're modifying the table_tuple_update() function on line 1311 of include/access/tableam.h,
   adding modified_attrs_hint. There's a large comment right before it describing the parameters;
   I think there should be a note about the modified_attrs_hint parameter in that comment, because
   it is referenced from other places in tableam.h and also from backend/access/heap/heapam.c

3. Can you elaborate on the scoring model you're using?
   Why do we expect a score of 25, what's the rationale behind this number?
   And should it be #define-d ?

4. heap_compute_xid_horizon_for_tuples contains duplicate logic. Is it possible to avoid this?

5. In this comment

+ * heap_index_batch_check() helper function.  Sorts deltids array in the
+ * order needed for useful processing.

   perhaps it is better to replace "useful" with more details? Or point to the place
   where "useful processing" is described.

6. In this snippet in _bt_dedup_delete_pass()

+       else if (_bt_keep_natts_fast(rel, state->base, itup) > nkeyatts &&
+                _bt_dedup_save_htid(state, itup))
+       {
+
+       }

   I would rather add a comment, explaining that the empty body of the clause is actually expected.

7. In _bt_dedup_delete_finish_pending() you're setting ispromising to false for both
   posting and non-posting tuples. This contradicts the comments before the function.
 



--
Victor Yegorov

Re: Deleting older versions in unique indexes to avoid page splits

Peter Geoghegan-4
In reply to this post by Victor Yegorov
On Wed, Oct 28, 2020 at 4:05 PM Victor Yegorov <[hidden email]> wrote:
> I've reviewed v5 of the patch and did some testing.

Thanks!

> I now see what you mean by saying that this patch is a natural and logical
> extension of the deduplication v13 work. I agree with this.

I tried the patch out with a long running transaction yesterday. I
think that the synergy with the v13 deduplication work really helped.
It took a really long time for an old snapshot to lead to pgbench page
splits (relative to the master branch, running a benchmark like the
one I talked about recently -- the fiver, tenner, score, etc index
benchmark). When the page splits finally started, they seemed much
more gradual -- I don't think that I saw the familiar pattern of
distinct waves of page splits that are clearly all correlated. I think
that the indexes grew at a low steady rate, which looked like the rate
that heap relations usually grow at.

We see a kind of "tick tock" pattern with this new mechanism + v13
deduplication: even when we don't delete very many TIDs, we still free
a few, and then merge the remaining TIDs to buy more time. Very
possibly enough time that a long running transaction goes away by the
time the question of splitting the page comes up again. Maybe there is
another long running transaction by then, but deleting just a few of
the TIDs the last time around is enough to not waste time on that
block this time around, and therefore to actually succeed despite the
second, newer long running transaction (we can delete older TIDs, just
not the now-oldest TIDs that the newer long running xact might still
need).

If this scenario sounds unlikely, bear in mind that "unnecessary" page
splits (which are all we really care about here) are usually only
barely necessary today, if you think about it in a localized/page
level way. What the master branch shows is that most individual
"unnecessary" page splits are in a sense *barely* unnecessary (which
of course doesn't make the consequences any better). We could buy many
hours until the next time the question of splitting a page comes up by
just freeing a small number of tuples -- even on a very busy database.

I found that the "fiver" and "tenner" indexes in particular took a
very long time to have even one page split with a long running
transaction.  Another interesting effect was that all page splits
suddenly stopped when my one hour 30 minute long transaction/snapshot
finally went away -- the indexes stopped growing instantly when I
killed the psql session. But on the master branch the cascading
version driven page splits took at least several minutes to stop when
I killed the psql session/snapshot at that same point of the benchmark
(maybe longer). With the master branch, we can get behind on LP_DEAD
index tuple bit setting, and then have no chance of catching up.
Whereas the patch gives us a second chance for each page.

(I really have only started to think about long running transactions
this week, so my understanding is still very incomplete, and based on
guesses and intuitions.)

> I don't quite like the last sentence. Given that this code is committed,
> I would rather make it:
>
>   … cannot be applied. Therefore we opportunistically check for dead tuples
>     and reuse the space, delaying leaf page splits.
>
> I understand that "we" shouldn't be used here, but I fail to think of a
> proper way to express this.

Makes sense.

> 2. in _bt_dedup_delete_pass() and heap_index_batch_check() you're using some
> constants, like:
> - expected score of 25
> - nblocksaccessed checks for 1, 2 and 3 blocks
> - maybe more, but the ones above caught my attention.
>
> Perhaps, it'd be better to use #define-s here instead?

Yeah. It's still evolving, which is why it's still rough.

It's not easy to come up with a good interface here. Not because it's
very important and subtle. It's actually very *unimportant*, in a way.
nbtree cannot really expect too much from heapam here (though it might
get much more than expected too, when it happens to be easy for
heapam). The important thing is always what happens to be possible at
the local/page level -- the exact preferences of nbtree are not so
important. Beggars cannot be choosers.

It only makes sense to have a "score" like this because sometimes the
situation is so favorable (i.e. there are so many TIDs that can be
killed) that we want to avoid vastly exceeding what is likely to be
useful to nbtree. Actually, this situation isn't that rare (which
maybe means I was wrong to say the score thing was unimportant, but
hopefully you get the idea).

Easily hitting our target score of 25 on the first heap page probably
happens almost all the time when certain kinds of unique indexes use
the mechanism, for example. And when that happens it is nice to only
have to visit one heap block. We're almost sure that it isn't worth
visiting a second, regardless of how many TIDs we're likely to find
there.

> 3. Do we really need to touch another heap page, if all conditions are met?
>
> +           if (uniqueindex && nblocksaccessed == 1 && score == 0)
> +               break;
> +           if (!uniqueindex && nblocksaccessed == 2 && score == 0)
> +               break;
> +           if (nblocksaccessed == 3)
> +               break;
>
> I was really wondering why to look into 2 heap pages. By not doing it straight away,
> we just delay the work for the next occasion that'll work on the same page we're
> processing. I've modified this piece and included it in my tests (see below), I reduced
> 2nd condition to just 1 block and limited the 3rd case to 2 blocks (just a quick hack).

The benchmark that you ran involved indexes that are on a column whose
values are already unique, pgbench_accounts.aid (the extra indexes are
not actually unique indexes, but they could work as unique indexes).
If you actually made them unique indexes then you would have seen the
same behavior anyway.

The 2 heap pages thing is useful with low cardinality indexes. Maybe
that could be better targeted - not sure. Things are still moving
quite fast, and I'm still working on the patch by solving the biggest
problem I see on the horizon. So I will probably change this and then
change it again in the next week anyway.

I've had further success microoptimizing the sorts in heapam.c in the
past couple of days. I think that the regression that I reported can
be further shrunk. To recap, we saw a ~15% loss of throughput/TPS with
16 clients, extreme contention (no rate limiting), several low
cardinality indexes, with everything still fitting in shared_buffers.
It now looks like I can get that down to ~7%, which seems acceptable
to me given the extreme nature of the workload (and given the fact
that we still win on efficiency here -- no index growth).

> I've used scale factor 10 000, adjusted indexes (resulting in a 189GB size database)
> and run the following pgbench:
>
>     pgbench -f testcase.pgbench -r -c32 -j8 -T 3600 bench

> I can see a clear benefit from this patch *under specified conditions, YMMW*
> - 32% increase in TPS
> - 24% drop in average latency
> - most important — stable index size!

Nice. When I did a similar test on October 16th it was on a much
smaller database. I think that I saw a bigger improvement because the
initial DB size was close to shared_buffers. So not going over
shared_buffers makes a much bigger difference. Whereas here the DB
size is several times larger, so there is no question about
significantly exceeding shared_buffers -- it's going to happen for the
master branch as well as the patch. (This is kind of obvious, but
pointing it out just in case.)

> In my opinion, patch provides clear benefits from IO reduction and index size
> control perspective. I really like the stability of operations on patched
> version. I would rather stick to the "restricted" version of the patch though.

You're using EBS here, which probably has much higher latency than
what I have here (an NVME SSD). What you have is probably more
relevant to the real world, though.

> Hope this helps. I'm open to do more tests if necessary.

It's great, thanks!

--
Peter Geoghegan



Re: Deleting older versions in unique indexes to avoid page splits

Peter Geoghegan-4
In reply to this post by Victor Yegorov
On Thu, Oct 29, 2020 at 3:05 PM Victor Yegorov <[hidden email]> wrote:
> And some more comments after another round of reading the patch.
>
> 1. Looks like UNIQUE_CHECK_NO_WITH_UNCHANGED is used for HOT updates,
>    should we use UNIQUE_CHECK_NO_HOT here? It is better understood like this.

This would probably get me arrested by the tableam police, though.

FWIW the way that that works is still kind of a hack. I think that I
actually need a new boolean flag, rather than overloading the enum
like this.

> 2. You're modifying the table_tuple_update() function on 1311 line of include/access/tableam.h,
>    adding modified_attrs_hint. There's a large comment right before it describing parameters,
>    I think there should be a note about modified_attrs_hint parameter in that comment, 'cos
>    it is referenced from other places in tableam.h and also from backend/access/heap/heapam.c

Okay, makes sense.

> 3. Can you elaborate on the scoring model you're using?
>    Why do we expect a score of 25, what's the rationale behind this number?
>    And should it be #define-d ?

See my remarks on this from the earlier e-mail.

> 4. heap_compute_xid_horizon_for_tuples contains duplicate logic. Is it possible to avoid this?

Maybe? I think that duplicating code is sometimes the lesser evil.
Like in tuplesort.c, for example. I'm not sure if that's true here,
but it certainly can be true. This is the kind of thing that I usually
only make my mind up about at the last minute. It's a matter of taste.

> 5. In this comment
>
> + * heap_index_batch_check() helper function.  Sorts deltids array in the
> + * order needed for useful processing.
>
>    perhaps it is better to replace "useful" with more details? Or point to the place
>    where "useful processing" is described.

Okay.

> +       else if (_bt_keep_natts_fast(rel, state->base, itup) > nkeyatts &&
> +                _bt_dedup_save_htid(state, itup))
> +       {
> +
> +       }
>
>    I would rather add a comment, explaining that the empty body of the clause is actually expected.

Okay. Makes sense.

> 7. In the _bt_dedup_delete_finish_pending() you're setting ispromising to false for both
>    posting and non-posting tuples. This contradicts comments before function.

The idea is that we can have plain tuples (non-posting list tuples)
that are non-promising when they're duplicates. Because why not?
Somebody might have deleted them (rather than updating them). It is
fine to have an open mind about this possibility despite the fact that
it is close to zero (in the general case). Including these TIDs
doesn't increase the amount of work we do in heapam. Even when we
don't succeed in finding any of the non-dup TIDs as dead (which is
very much the common case), telling heapam about their existence could
help indirectly (which is somewhat more common). This factor alone
could influence which heap pages heapam visits when there is no
concentration of promising tuples on heap pages (since the total
number of TIDs on each block is the tie-breaker condition when
comparing heap blocks with an equal number of promising tuples during
the block group sort in heapam.c). I believe that this approach tends
to result in heapam going after older TIDs when it wouldn't otherwise,
at least in some cases.

You're right, though -- this is still unclear. Actually, I think that
I should move the handling of promising/duplicate tuples into
_bt_dedup_delete_finish_pending(), too (move it from
_bt_dedup_delete_pass()). That would allow me to talk about all of the
TIDs that get added to the deltids array (promising and non-promising)
in one central function. I'll do it that way soon.

--
Peter Geoghegan



Re: Deleting older versions in unique indexes to avoid page splits

Peter Geoghegan-4
In reply to this post by Peter Geoghegan-4
On Thu, Oct 29, 2020 at 4:30 PM Peter Geoghegan <[hidden email]> wrote:

> I found that the "fiver" and "tenner" indexes in particular took a
> very long time to have even one page split with a long running
> transaction.  Another interesting effect was that all page splits
> suddenly stopped when my one hour 30 minute long transaction/snapshot
> finally went away -- the indexes stopped growing instantly when I
> killed the psql session. But on the master branch the cascading
> version driven page splits took at least several minutes to stop when
> I killed the psql session/snapshot at that same point of the benchmark
> (maybe longer). With the master branch, we can get behind on LP_DEAD
> index tuple bit setting, and then have no chance of catching up.
> Whereas the patch gives us a second chance for each page.

I forgot to say that this long running xact/snapshot test I ran
yesterday was standard pgbench (more or less) -- no custom indexes.
Unlike my other testing, the only possible source of non-HOT updates
here was not being able to fit a heap tuple on the same heap page
(typically because we couldn't truncate HOT chains in time due to a
long running xact holding back cleanup).

The normal state (without a long running xact/snapshot) is no page
splits, both with the patch and with master. But when you introduce a
long running xact, both master and patch will get page splits. The
difference with the patch is that it'll take much longer to start up
compared to master, the page splits are more gradual and smoother with
the patch, and the patch will stop having page splits just as soon as
the xact goes away -- the same second. With the master branch we're
reliant on LP_DEAD bit setting, and if that gets temporarily held back
by a long snapshot then we have little chance of catching up after the
snapshot goes away but before some pages have unnecessary
version-driven page splits.

--
Peter Geoghegan



Re: Deleting older versions in unique indexes to avoid page splits

Peter Geoghegan-4
In reply to this post by Peter Geoghegan-4
On Mon, Oct 26, 2020 at 2:15 PM Peter Geoghegan <[hidden email]> wrote:
> Now for the not-so-good news.

> The latency numbers aren't great for the patch, either. Take the 16 client case:

Attached is v6, which more or less totally fixes the problem we saw
with this silly "lots of low cardinality indexes" benchmark.

I wasn't really expecting this to happen -- the benchmark in question
is extreme and rather unrealistic -- but I had some new insights that
made it possible (more details on the code changes towards the end of
this email). Now I don't have to convince anyone that the performance
hit for extreme cases is worth it in order to realize big benefits for
other workloads. There pretty much isn't a performance hit to speak of
now (or so it would seem). I've managed to take this small loss of
performance and turn it into a small gain. And without having to make
any compromises on the core goal of the patch ("no unnecessary page
splits caused by version churn").

With the same indexes ("score", "tenner", "fiver", etc) as before on
pgbench_accounts, and same pgbench-variant queries, we once again see
lots of index bloat with master, but no index bloat at all with v6 of
the patch (no change there). The raw latency numbers are where we see
new improvements for v6. Summary:

2020-11-02 17:35:29 -0800 - Start of initial data load for run
"patch.r1c16" (DB is also used by later runs)
2020-11-02 17:40:21 -0800 - End of initial data load for run "patch.r1c16"
2020-11-02 17:40:21 -0800 - Start of pgbench run "patch.r1c16"
2020-11-02 19:40:31 -0800 - End of pgbench run "patch.r1c16":
patch.r1c16.bench.out: "tps = 9998.129224 (including connections
establishing)" "latency average = 0.243 ms" "latency stddev = 0.088
ms"
2020-11-02 19:40:46 -0800 - Start of initial data load for run
"master.r1c16" (DB is also used by later runs)
2020-11-02 19:45:42 -0800 - End of initial data load for run "master.r1c16"
2020-11-02 19:45:42 -0800 - Start of pgbench run "master.r1c16"
2020-11-02 21:45:52 -0800 - End of pgbench run "master.r1c16":
master.r1c16.bench.out: "tps = 9998.674505 (including connections
establishing)" "latency average = 0.231 ms" "latency stddev = 0.717
ms"
2020-11-02 21:46:10 -0800 - Start of pgbench run "patch.r1c32"
2020-11-02 23:46:23 -0800 - End of pgbench run "patch.r1c32":
patch.r1c32.bench.out: "tps = 9999.968794 (including connections
establishing)" "latency average = 0.256 ms" "latency stddev = 0.104
ms"
2020-11-02 23:46:39 -0800 - Start of pgbench run "master.r1c32"
2020-11-03 01:46:54 -0800 - End of pgbench run "master.r1c32":
master.r1c32.bench.out: "tps = 10001.097045 (including connections
establishing)" "latency average = 0.250 ms" "latency stddev = 1.858
ms"
2020-11-03 01:47:32 -0800 - Start of pgbench run "patch.r2c16"
2020-11-03 03:47:45 -0800 - End of pgbench run "patch.r2c16":
patch.r2c16.bench.out: "tps = 9999.290688 (including connections
establishing)" "latency average = 0.247 ms" "latency stddev = 0.103
ms"
2020-11-03 03:48:04 -0800 - Start of pgbench run "master.r2c16"
2020-11-03 05:48:18 -0800 - End of pgbench run "master.r2c16":
master.r2c16.bench.out: "tps = 10000.424117 (including connections
establishing)" "latency average = 0.241 ms" "latency stddev = 1.587
ms"
2020-11-03 05:48:39 -0800 - Start of pgbench run "patch.r2c32"
2020-11-03 07:48:52 -0800 - End of pgbench run "patch.r2c32":
patch.r2c32.bench.out: "tps = 9999.539730 (including connections
establishing)" "latency average = 0.258 ms" "latency stddev = 0.125
ms"
2020-11-03 07:49:11 -0800 - Start of pgbench run "master.r2c32"
2020-11-03 09:49:26 -0800 - End of pgbench run "master.r2c32":
master.r2c32.bench.out: "tps = 10000.833754 (including connections
establishing)" "latency average = 0.250 ms" "latency stddev = 0.997
ms"

These are 2 hour runs, 16 and 32 clients -- same as last time (though
note the 10k TPS limit). So 4 pairs of runs (each pair of runs is a
pair of patch/master runs) making 8 runs total, lasting 16 hours total
(not including initial data loading).

Notice that the final pair of runs shows that the master branch
eventually gets to the point of having far higher latency stddev. The
stddev starts high and gets higher as time goes on. Here is the
latency breakdown for the final pair of runs:

patch.r2c32.bench.out:

scaling factor: 1000
query mode: prepared
number of clients: 32
number of threads: 8
duration: 7200 s
number of transactions actually processed: 71997119
latency average = 0.258 ms
latency stddev = 0.125 ms
rate limit schedule lag: avg 0.046 (max 39.151) ms
tps = 9999.539730 (including connections establishing)
tps = 9999.544743 (excluding connections establishing)
statement latencies in milliseconds:
         0.001  \set aid1 random_gaussian(1, 100000 * :scale, 4.0)
         0.000  \set aid2 random_gaussian(1, 100000 * :scale, 4.5)
         0.000  \set aid3 random_gaussian(1, 100000 * :scale, 4.2)
         0.000  \set bid random(1, 1 * :scale)
         0.000  \set tid random(1, 10 * :scale)
         0.000  \set delta random(-5000, 5000)
         0.022  BEGIN;
         0.091  UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid1;
         0.036  SELECT abalance FROM pgbench_accounts WHERE aid = :aid2;
         0.034  SELECT abalance FROM pgbench_accounts WHERE aid = :aid3;
         0.025  END;

master.r2c32.bench.out:

query mode: prepared
number of clients: 32
number of threads: 8
duration: 7200 s
number of transactions actually processed: 72006667
latency average = 0.250 ms
latency stddev = 0.997 ms
rate limit schedule lag: avg 0.053 (max 233.045) ms
tps = 10000.833754 (including connections establishing)
tps = 10000.839935 (excluding connections establishing)
statement latencies in milliseconds:
         0.001  \set aid1 random_gaussian(1, 100000 * :scale, 4.0)
         0.000  \set aid2 random_gaussian(1, 100000 * :scale, 4.5)
         0.000  \set aid3 random_gaussian(1, 100000 * :scale, 4.2)
         0.000  \set bid random(1, 1 * :scale)
         0.000  \set tid random(1, 10 * :scale)
         0.000  \set delta random(-5000, 5000)
         0.023  BEGIN;
         0.075  UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid1;
         0.037  SELECT abalance FROM pgbench_accounts WHERE aid = :aid2;
         0.035  SELECT abalance FROM pgbench_accounts WHERE aid = :aid3;
         0.026  END;

There is only one aspect of this that the master branch still wins on
by the end -- the latency on the UPDATE is still a little lower on
master. This happens for the obvious reason: the UPDATE doesn't clean
up after itself on the master branch (to the extent that the average
xact latency is still a tiny bit higher with the patch). But the
SELECT queries are never actually slower with the patch, even during
earlier runs -- they're just as fast as the master branch (and faster
than the master branch by the end). Only the UPDATEs can ever be made
slower, so AFAICT any new cost is only experienced by the queries that
create the problem for the system as a whole. We're imposing new costs
fairly.

Performance with the patch is consistently much more stable. We don't
get overwhelmed by FPIs in the way we see with the master branch,
which causes sudden gluts that are occasionally very severe (notice
that master.r2c16.bench.out was a big stddev outlier). Maybe the most
notable thing is the max rate limit schedule lag. In the last run we
see max 39.151 ms for the patch vs max 233.045 ms for master.

Now for a breakdown of the code enhancements behind the improved
benchmark numbers. They are:

* The most notable improvement is to the sort order of the heap block
groups within heapam.c. We now round up to the next power of two when
sorting candidate heap blocks (this is the sort used to decide which
heap blocks to visit, and in what order). The "number of promising
TIDs" field (as well as the "number of TIDs total" tiebreaker) is
rounded up so that we ignore relatively small differences. We now tend
to process related groups of contiguous pages in relatively big
batches -- though only where appropriate.

* A closely related optimization was also added to heapam.c:
"favorable blocks". That is, we recognize groups of related heap
blocks that are contiguous. When we encounter these blocks we make
heapam.c effectively increase its per-call batch size, so that it
processes more blocks in the short term but does less absolute work in
the long run.

The key insight behind both of these enhancements is that physically
close heap blocks are generally also close together in time, and
generally share similar characteristics (mostly LP_DEAD items vs
mostly live items, already in shared_buffers, etc). So we're focussing
more on heap locality when the hint we get from nbtree isn't giving us
a very strong signal about what to do (we need to be very judicious
because we are still only willing to access a small number of heap
blocks at a time). While in general the best information heapam has to
go on comes from nbtree, heapam should not care about noise-level
variations in that information -- better to focus on heap locality
(IOW heapam.c needs to have a sophisticated understanding of the
limitations of the hints it receives from nbtree). As a result of
these two changes, heapam.c tends to process earlier blocks/records
first, in order, in a way that is correlated across time and across
indexes -- with more sequential I/O and more confidence in a
successful outcome when we undertake work. (Victor should note that
heapam.c no longer has any patience when it encounters even a single
heap block with no dead TIDs -- he was right about that. The new
heapam.c stuff works well enough that there is no possible upside to
"being patient", even with indexes on low cardinality data that
experience lots of version churn, a workload that my recent
benchmarking exemplifies.)

* nbtdedup.c will now give heapam.c an accurate idea about its
requirements -- it now expresses those in terms of space freed, which
heapam.c now cares about directly. This matters a lot with low
cardinality data, where freeing an entire index tuple is a lot more
valuable to nbtree than freeing just one TID in a posting list
(because the former case frees up 20 bytes at a minimum, while the
latter case usually only frees up 6 bytes).

I said something to Victor about nbtree's wishes not being important
here -- heapam.c is in charge. But that now seems like the wrong
mental model. After all, how can nbtree not be important if it is
entitled to call heapam.c as many times as it feels like, without
caring about the cost of thrashing (as we saw with low cardinality
data prior to v6)? With v6 of the patch I took my own advice about not
thinking of each operation as an isolated event. So now heapam.c has a
more nuanced understanding of the requirements of nbtree, and can be
either more eager or more lazy according to 1) nbtree's requirements,
and 2) conditions local to heapam.c.

* Improved handling of low cardinality data in nbtdedup.c -- we now
always merge together items with low cardinality data, regardless of
how well we do with deleting TIDs. This buys more time before the next
delete attempt for the same page.

* Specialized shellsort implementations for heapam.c.

Shellsort is sometimes used as a lightweight system qsort in embedded
systems. It has many of the same useful properties as a well optimized
quicksort for smaller datasets, and also has the merit of compiling to
far fewer instructions when specialized like this. Specializing on
qsort (as in v5) results in machine code that seems rather heavyweight
given the heapam.c requirements. Instruction cache matters here
(although that's easy to miss when profiling).
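
To show what I mean by the bucketed sort order and the specialized
shellsort, here is a minimal, self-contained sketch. Everything in it --
the struct, the field names, the exact sort keys, the gap sequence -- is
made up for illustration; it is not the patch's code:

#include <stddef.h>
#include <stdint.h>

/* One candidate heap block, as hinted by the index AM (illustrative only) */
typedef struct BlockGroup
{
    uint32_t    block;          /* heap block number */
    uint16_t    npromising;     /* TIDs that look deletable */
    uint16_t    ntids;          /* total TIDs pointing into this block */
} BlockGroup;

/* Round up to the next power of two, so small differences are ignored */
static uint32_t
bucketize(uint32_t n)
{
    uint32_t    p = 1;

    while (p < n)
        p <<= 1;
    return p;
}

/*
 * Most promising blocks first, then most TIDs overall -- but with both
 * counts bucketed.  Within a bucket, lower block numbers win, which is
 * what tends to produce big batches of physically contiguous heap pages.
 */
static int
block_group_cmp(const BlockGroup *a, const BlockGroup *b)
{
    if (bucketize(a->npromising) != bucketize(b->npromising))
        return bucketize(a->npromising) > bucketize(b->npromising) ? -1 : 1;
    if (bucketize(a->ntids) != bucketize(b->ntids))
        return bucketize(a->ntids) > bucketize(b->ntids) ? -1 : 1;
    if (a->block != b->block)
        return a->block < b->block ? -1 : 1;
    return 0;
}

/* Specialized shellsort: much less machine code than a general qsort() */
static void
block_group_shellsort(BlockGroup *a, size_t n)
{
    static const size_t gaps[] = {57, 23, 10, 4, 1};    /* Ciura-style gaps */

    for (size_t g = 0; g < sizeof(gaps) / sizeof(gaps[0]); g++)
    {
        size_t      gap = gaps[g];

        for (size_t i = gap; i < n; i++)
        {
            BlockGroup  tmp = a[i];
            size_t      j = i;

            for (; j >= gap && block_group_cmp(&a[j - gap], &tmp) > 0; j -= gap)
                a[j] = a[j - gap];
            a[j] = tmp;
        }
    }
}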

v6 still needs more polishing -- my focus has still been on the
algorithm itself. But I think I'm almost done with that part -- it
seems unlikely that I'll be able to make any additional significant
improvements in that area after v6. The new bucketized heap block
sorting behavior seems to be really effective, especially with low
cardinality data, and especially over time, as the heap naturally
becomes more fragmented. We're now blending together locality from
nbtree and heapam in an adaptive way.

I'm pretty sure that the performance on sympathetic cases (such as the
case Victor tested) will also be a lot better, though I didn't look
into that on v6. If Victor is in a position to run further benchmarks
on v6, that would certainly be welcome (independent validation always
helps).

I'm not aware of any remaining cases that it would be fair to describe
as being regressed by the patch -- can anybody else think of any
possible candidates?

BTW, I will probably rename the mechanism added by the patch to
"bottom-up index vacuuming", or perhaps "bottom-up index deletion" --
that seems to capture the general nature of what the patch does quite
well. Now regular index vacuuming can be thought of as a top-down
complementary mechanism that takes care of remaining diffuse spots of
garbage tuples that queries happen to miss (more or less). Also, while
it is true that there are some ways in which the patch is related to
deduplication, it doesn't seem useful to emphasize that part anymore.
Plus clarifying which kind of deduplication I'm talking about in code
comments is starting to become really annoying.

--
Peter Geoghegan

v6-0001-Add-delete-deduplication-to-nbtree.patch (134K) Download Attachment

Re: Deleting older versions in unique indexes to avoid page splits

Peter Geoghegan-4
On Tue, Nov 3, 2020 at 12:44 PM Peter Geoghegan <[hidden email]> wrote:
> v6 still needs more polishing -- my focus has still been on the
> algorithm itself. But I think I'm almost done with that part -- it
> seems unlikely that I'll be able to make any additional significant
> improvements in that area after v6.

Attached is v7, which tidies everything up. The project is now broken
up into multiple patches, which can be committed separately. Every
patch has a descriptive commit message. This should make it a lot
easier to review.

I've renamed the feature to "bottom-up index deletion" in this latest
revision. This seems like a better name than "dedup deletion". This
name suggests that the feature complements "top-down index deletion"
by VACUUM. This name is descriptive of what the new mechanism is
supposed to do at a high level.

Other changes in v7 include:

* We now fully use the tableam interface -- see the first patch.

The bottom-up index deletion API has been fully worked out. There is
now an optional callback/shim function. The bottom-up index deletion
caller (nbtree) is decoupled from the callee (heapam) by the tableam
shim. This was what allowed me to break the patch into multiple
pieces/patches.

* The executor no longer uses an IndexUniqueCheck enum constant as a
hint to nbtree. Rather, we have a new aminsert() bool argument/flag
that hints to the index AM -- see the second patch.

To recap, the hint tells nbtree that the incoming tuple is a duplicate
of an existing tuple caused by an UPDATE, without any logical changes
for the indexed columns. Bottom-up deletion is effective when there is
a local concentration of these index tuples that become garbage
quickly.

A dedicated aminsert() argument seems a lot cleaner. Though I wonder
if this approach can be generalized a bit further, so that we can
support other similar aminsert() hints in the future without adding
even more arguments. Maybe some new enum instead of a boolean? (A rough
sketch of how the executor might derive this hint appears after this
list.)

* Code cleanup for the nbtdedup.c changes. Better explanation of when
and how posting list TIDs are marked promising, and why.

* Streamlined handling of the various strategies that nbtinsert.c uses
to avoid a page split (e.g. traditional LP_DEAD deletion,
deduplication).

A new unified function in nbtinsert.c was added. This organization is
a lot cleaner -- it greatly simplifies _bt_findinsertloc(), which
became more complicated than it really needed to be due to the changes
needed for deduplication in PostgreSQL 13. This change almost seems
like an independently useful piece of work.
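
To show what the hint amounts to, here is a self-contained sketch of the
executor-side test that could drive the new aminsert() flag. The function
and parameter names are mine, chosen for illustration only -- the real
patch works with attribute-number bitmapsets inside the executor:

#include <stdbool.h>
#include <stddef.h>

/*
 * Sketch: the hint is true when the UPDATE did not modify any column used
 * by this particular index.  Plain arrays of attribute numbers are used
 * here just to keep the sketch compilable on its own.
 */
static bool
index_unchanged_by_update(const int *index_attnums, size_t nindexattrs,
                          const int *updated_attnums, size_t nupdated)
{
    for (size_t i = 0; i < nindexattrs; i++)
    {
        for (size_t j = 0; j < nupdated; j++)
        {
            if (index_attnums[i] == updated_attnums[j])
                return false;   /* an indexed column changed: no hint */
        }
    }

    /* logically unchanged as far as this index is concerned */
    return true;
}

When this returns true for a given index, nbtree can take the incoming
duplicate as a strong signal that an older, soon-to-be-dead version of the
same logical row is sitting in the same part of the key space.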

--
Peter Geoghegan

v7-0004-Teach-heapam-to-support-bottom-up-index-deletion.patch (33K) Download Attachment
v7-0002-Pass-down-logically-unchanged-index-hint.patch (35K) Download Attachment
v7-0003-Teach-nbtree-to-use-bottom-up-index-deletion.patch (92K) Download Attachment
v7-0001-Make-tableam-interface-support-bottom-up-deletion.patch (10K) Download Attachment

Re: Deleting older versions in unique indexes to avoid page splits

Victor Yegorov
пн, 9 нояб. 2020 г. в 18:21, Peter Geoghegan <[hidden email]>:
On Tue, Nov 3, 2020 at 12:44 PM Peter Geoghegan <[hidden email]> wrote:
> v6 still needs more polishing -- my focus has still been on the
> algorithm itself. But I think I'm almost done with that part -- it
> seems unlikely that I'll be able to make any additional significant
> improvements in that area after v6.

Attached is v7, which tidies everything up. The project is now broken
up into multiple patches, which can be committed separately. Every
patch has a descriptive commit message. This should make it a lot
easier to review.

I've looked at the latest (v7) patchset.
I've decided to use a quite common (in my practice) setup with an indexed mtime column over scale 1000 set:

alter table pgbench_accounts add mtime timestamp default now();
create or replace function fill_mtime() returns trigger as $$begin NEW.mtime=now(); return NEW; END;$$ language plpgsql;
create trigger t_accounts_mtime before update on pgbench_accounts for each row execute function fill_mtime();
create index accounts_mtime on pgbench_accounts (mtime, aid);
create index tenner on pgbench_accounts ((aid - (aid%10)));
ANALYZE pgbench_accounts;

For the test, I've used 3 pgbench scripts (started in parallel sessions):
1. UPDATE + single PK SELECT in a transaction
2. three PK SELECTs in a transaction
3. SELECT of all modifications for the last 15 minutes

Given the size of the data set, everything was cached, and UPDATEs were fast enough that the 3rd query had to fall back to disk-based sorting.
Some figures follow.

Master sizes
------------
 relkind |        relname        |   nrows   | blk_before | mb_before | blk_after | mb_after
---------+-----------------------+-----------+------------+-----------+-----------+----------
 r       | pgbench_accounts      | 100000000 |    1639345 |   12807.4 |   1677861 |  13182.8
 i       | accounts_mtime        | 100000000 |     385042 |    3008.1 |    424413 |   3565.6
 i       | pgbench_accounts_pkey | 100000000 |     274194 |    2142.1 |    274194 |   2142.3
 i       | tenner                | 100000000 |     115352 |     901.2 |    128513 |   1402.9
(4 rows)

Patchset v7 sizes
-----------------
 relkind |        relname        |   nrows   | blk_before | mb_before | blk_after | mb_after
---------+-----------------------+-----------+------------+-----------+-----------+----------
 r       | pgbench_accounts      | 100000000 |    1639345 |   12807.4 |   1676887 |  13170.2
 i       | accounts_mtime        | 100000000 |     385042 |    3008.1 |    424521 |   3536.4
 i       | pgbench_accounts_pkey | 100000000 |     274194 |    2142.1 |    274194 |   2142.1
 i       | tenner                | 100000000 |     115352 |     901.2 |    115352 |    901.2
(4 rows)

TPS
---
     query      | Master TPS | Patched TPS
----------------+------------+-------------
UPDATE + SELECT |       5150 |        4884
3 SELECT in txn |      23133 |       23193
15min SELECT    |       0.75 |        0.78


We can see that:
- unused index is not suffering from not-HOT updates at all, which is the point of the patch
- we have ordinary queries performing on the same level as on master
- we have 5,2% slowdown in UPDATE speed

Looking at the graphs (attached), I can see that on the patched version we're doing some IO during UPDATEs (which is expected).
We're also reading quite a lot from disk for simple SELECTs, compared to the master version.

I'm not sure if this should be counted as a regression, though, as the graphs are pretty much on par.
Still, I would like to understand why this combination of indexes and queries slows down UPDATEs.


During compilation I got one warning for make -C contrib:

blutils.c: In function ‘blhandler’:
blutils.c:133:22: warning: assignment from incompatible pointer type [-Wincompatible-pointer-types]
  amroutine->aminsert = blinsert;

I agree with the rename to "bottom-up index deletion"; using "vacuuming" generally makes users think
that the functionality is used only during VACUUM (which is misleading).
I haven't looked at the code yet.


--
Victor Yegorov

20201110-results-master.txt (3K) Download Attachment
20201110-q1-UPDATE.png (363K) Download Attachment
20201110-q2-SELECT.png (313K) Download Attachment
20201110-q3-15min-SELECT.png (331K) Download Attachment
20201110-postgresql.auto.conf (964 bytes) Download Attachment
20201110-testcase1.pgbench (380 bytes) Download Attachment
20201110-testcase2.pgbench (458 bytes) Download Attachment
20201110-testcase3.pgbench (126 bytes) Download Attachment
20201110-results-patched.txt (3K) Download Attachment

Re: Deleting older versions in unique indexes to avoid page splits

Victor Yegorov
In reply to this post by Peter Geoghegan-4
пн, 9 нояб. 2020 г. в 18:21, Peter Geoghegan <[hidden email]>:
On Tue, Nov 3, 2020 at 12:44 PM Peter Geoghegan <[hidden email]> wrote:
> v6 still needs more polishing -- my focus has still been on the
> algorithm itself. But I think I'm almost done with that part -- it
> seems unlikely that I'll be able to make any additional significant
> improvements in that area after v6.

Attached is v7, which tidies everything up. The project is now broken
up into multiple patches, which can be committed separately. Every
patch has a descriptive commit message. This should make it a lot
easier to review.

And another test session, this time with scale=2000 and shared_buffers=512MB
(vs scale=1000 and shared_buffers=16GB previously). The rest of the setup is the same:
- mtime column that tracks update time
- index on (mtime, aid)
- tenner low cardinality index from Peter's earlier e-mail
- 3 pgbench scripts run in parallel on master and on v7 patchset (scripts from the previous e-mail used here).

(I just realized that the size-after figures in my previous e-mail are off, because I failed
to ANALYZE the table after the tests.) 

Master
------
 relkind |        relname        |   nrows   | blk_before | mb_before | blk_after | mb_after |  Diff
---------+-----------------------+-----------+------------+-----------+-----------+----------+-------
 r       | pgbench_accounts      | 200000000 |    3278689 |   25614.8 |   3314951 |  25898.1 | +1.1%
 i       | accounts_mtime        | 200000000 |     770080 |    6016.3 |    811946 |   6343.3 | +5.4%
 i       | pgbench_accounts_pkey | 200000000 |     548383 |    4284.2 |    548383 |   4284.2 |    0
 i       | tenner                | 200000000 |     230701 |    1802.4 |    252346 |   1971.5 | +9.4%
(4 rows)

Patched
-------
 relkind |        relname        |   nrows   | blk_before | mb_before | blk_after | mb_after |  Diff
---------+-----------------------+-----------+------------+-----------+-----------+----------+-------
 r       | pgbench_accounts      | 200000000 |    3278689 |   25614.8 |   3330788 |  26021.8 | +1.6%
 i       | accounts_mtime        | 200000000 |     770080 |    6016.3 |    806920 |   6304.1 | +4.8%
 i       | pgbench_accounts_pkey | 200000000 |     548383 |    4284.2 |    548383 |   4284.2 |    0
 i       | tenner                | 200000000 |     230701 |    1802.4 |    230701 |   1802.4 |    0
(4 rows)

TPS
---
     query      | Master TPS | Patched TPS | Diff
----------------+------------+-------------+------
UPDATE + SELECT |       3024 |        2661 | -12%
3 SELECT in txn |      19073 |       19852 |  +4%
15min SELECT    |        2.4 |         3.9 | +60%


We can see that the patched version does far fewer disk writes during UPDATEs and simple SELECTs, and
eliminates write amplification for the uninvolved indexes. (I'm really excited to see these figures.)

On the other hand, there's quite a big drop in UPDATE throughput. For sure, the undersized shared_buffers
contributes to this drop. Still, my experience tells me that under the conditions at hand (HOT disabled due to the index
on the update-time column), tables will tend to accumulate bloat and also produce unnecessary IO from WAL.

Perhaps I need to conduct a longer test session, say 8+ hours, to make obstacles appear more like they do in real life.


--
Victor Yegorov

20201111-results-master.txt (3K) Download Attachment
20201111-transactions-and-disks.png (459K) Download Attachment
20201111-q1-UPDATE.png (442K) Download Attachment
20201111-q2-SELECT.png (402K) Download Attachment
20201111-q3-15min-SELECT.png (391K) Download Attachment
20201111-postgresql.auto.conf (964 bytes) Download Attachment
20201111-results-patched.txt (3K) Download Attachment

Re: Deleting older versions in unique indexes to avoid page splits

Peter Geoghegan-4
In reply to this post by Victor Yegorov
On Wed, Nov 11, 2020 at 6:17 AM Victor Yegorov <[hidden email]> wrote:
> I've looked at the latest (v7) patchset.
> I've decided to use a quite common (in my practice) setup with an indexed mtime column over scale 1000 set:

Thanks for testing!

> We can see that:
> - unused index is not suffering from not-HOT updates at all, which is the point of the patch
> - we have ordinary queries performing on the same level as on master
> - we have 5,2% slowdown in UPDATE speed

I think that I made a mistake with v7: I changed the way that we
detect low cardinality data during bottom-up deletion, which made us
do extra/early deduplication in more cases than we really should. I
suspect that this partially explains the slowdown in UPDATE latency
that you reported. I will fix this in v8.

I don't think that the solution is to go back to the v6 behavior in
this area, though. I now believe that this whole "proactive
deduplication for low cardinality data" thing only made sense as a way
of compensating for deficiencies in earlier versions of the patch.
Deficiencies that I've since fixed anyway. The best solution now is to
simplify. We can have generic criteria for "should we dedup the page
early after bottom-up deletion finishes without freeing up very much
space?". This seemed to work well during my latest testing. Probably
because heapam.c is now smart about the requirements from nbtree, as
well as the cost of accessing heap pages.
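
To illustrate the kind of generic criteria I mean, here is a minimal
sketch. The function name, parameters, and rule are made up for
illustration -- they are not the patch's actual code:

#include <stdbool.h>
#include <stddef.h>

/*
 * Sketch only: after a bottom-up deletion pass, decide whether it is worth
 * also attempting a deduplication pass before giving up and splitting the
 * page.
 */
static bool
should_dedup_after_bottomup_delete(size_t space_freed,
                                   size_t space_needed,
                                   bool page_has_duplicates)
{
    if (space_freed >= space_needed)
        return false;           /* deletion alone avoided the page split */

    /* Deletion didn't free enough; merging duplicates might still help */
    return page_has_duplicates;
}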

> I'm not sure if this should be counted as a regression, though, as the graphs are pretty much on par.
> Still, I would like to understand why this combination of indexes and queries slows down UPDATEs.

Another thing that I'll probably add to v8: Prefetching. This is
probably necessary just so I can have parity with the existing
heapam.c function that the new code is based on,
heap_compute_xid_horizon_for_tuples(). That will probably help here,
too.
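
For what it's worth, the idea is just to ask the kernel to start reading
the sorted candidate heap blocks a little ahead of the block we're
currently processing. A standalone illustration follows (this is not the
heapam.c code -- in the tree this would go through PrefetchBuffer(), and a
real implementation would respect the 1GB relation segment boundaries that
are ignored here):

#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>

#define BLCKSZ 8192             /* PostgreSQL's default block size */

/*
 * Hint the kernel about the next few candidate heap blocks, so that the
 * reads have (hopefully) completed by the time we get to them.
 */
static void
prefetch_candidate_blocks(int heap_fd, const uint32_t *blocks, size_t nblocks,
                          size_t current, size_t prefetch_distance)
{
#ifdef POSIX_FADV_WILLNEED
    size_t      limit = current + prefetch_distance;

    for (size_t i = current; i < nblocks && i < limit; i++)
        (void) posix_fadvise(heap_fd, (off_t) blocks[i] * BLCKSZ,
                             BLCKSZ, POSIX_FADV_WILLNEED);
#endif
}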

> During compilation I got one warning for make -C contrib:

Oops.

> I agree with the rename to "bottom-up index deletion"; using "vacuuming" generally makes users think
> that the functionality is used only during VACUUM (which is misleading).

Yeah. That's kind of a problem already, because sometimes we use the
word VACUUM when talking about the long established LP_DEAD deletion
stuff. But I see that as a problem to be fixed. Actually, I would like
to fix it very soon.

> I haven't looked at the code yet.

It would be helpful if you could take a look at the nbtree patch --
particularly the changes related to deprecating the page-level
BTP_HAS_GARBAGE flag. I would like to break those parts out into a
separate patch, and commit it in the next week or two. It's just
refactoring, really. (This commit can also make nbtree only use the
word VACUUM for things that strictly involve VACUUM. For example,
it'll rename _bt_vacuum_one_page() to _bt_delete_or_dedup_one_page().)

We almost don't care about the flag already, so there is almost no
behavioral change from deprecating BTP_HAS_GARBAGE in this way.

Indexes that use deduplication already don't rely on BTP_HAS_GARBAGE
being set ever since deduplication was added to Postgres 13 (the
deduplication code doesn't want to deal with LP_DEAD bits, and cannot
trust that no LP_DEAD bits can be set just because BTP_HAS_GARBAGE
isn't set in the special area). Trusting the BTP_HAS_GARBAGE flag can
cause us to miss out on deleting items with their LP_DEAD bits set --
we're better off "assuming that BTP_HAS_GARBAGE is always set", and
finding out if there really are LP_DEAD bits set for ourselves each
time.

Missing LP_DEAD bits like this can happen when VACUUM unsets the
page-level flag without actually deleting the items at the same time,
which is expected when the items became garbage (and actually had
their LP_DEAD bits set) after VACUUM began, but before VACUUM reached
the leaf pages. That's really wasteful, and doesn't actually have any
upside -- we're scanning all of the line pointers only when we're
about to split (or dedup) the same page anyway, so the extra work is
practically free.
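
To be concrete about what "finding out for ourselves" looks like, here is
a sketch of the check, assuming the usual bufpage.h/nbtree.h macros. This
is not the patch's code, just the shape of the check it relies on:

/*
 * Scan the leaf page's line pointers and report whether any LP_DEAD bit is
 * actually set, instead of trusting the page-level BTP_HAS_GARBAGE flag.
 */
static bool
leaf_page_has_dead_items(Page page)
{
    BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
    OffsetNumber maxoff = PageGetMaxOffsetNumber(page);
    OffsetNumber offnum;

    for (offnum = P_FIRSTDATAKEY(opaque);
         offnum <= maxoff;
         offnum = OffsetNumberNext(offnum))
    {
        ItemId      itemid = PageGetItemId(page, offnum);

        if (ItemIdIsDead(itemid))
            return true;        /* garbage to delete, flag or no flag */
    }

    return false;
}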

--
Peter Geoghegan



Re: Deleting older versions in unique indexes to avoid page splits

Peter Geoghegan-4
In reply to this post by Victor Yegorov
On Wed, Nov 11, 2020 at 12:58 PM Victor Yegorov <[hidden email]> wrote:
> On the other hand, there's quite a big drop in UPDATE throughput. For sure, the undersized shared_buffers
> contributes to this drop. Still, my experience tells me that under the conditions at hand (HOT disabled due to the index
> on the update-time column), tables will tend to accumulate bloat and also produce unnecessary IO from WAL.

I think that the big SELECT statement with an "ORDER BY mtime ... "
was a good way of demonstrating the advantages of the patch.

Attached is v8, which has the enhancements for low cardinality data
that I mentioned earlier today. It also simplifies the logic for
dealing with posting lists that we need to delete some TIDs from.
These posting list simplifications also make the code a bit more
efficient, which might be noticeable during benchmarking.
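
As a rough illustration of the granular posting list handling (not the
patch's actual code -- this just leans on the existing nbtree.h helpers,
omits WAL logging and the page modification entirely, and the caller must
handle the case where no TIDs survive by deleting the whole tuple instead):

/*
 * Build a smaller replacement posting list containing only the TIDs that
 * are not being deleted.
 */
static IndexTuple
posting_without_dead_tids(IndexTuple posting, ItemPointer deadtids, int ndead)
{
    int         nposting = BTreeTupleGetNPosting(posting);
    ItemPointer alltids = BTreeTupleGetPosting(posting);
    ItemPointer keep = palloc(nposting * sizeof(ItemPointerData));
    int         nkeep = 0;

    for (int i = 0; i < nposting; i++)
    {
        bool        dead = false;

        for (int j = 0; j < ndead; j++)
        {
            if (ItemPointerEquals(&alltids[i], &deadtids[j]))
            {
                dead = true;
                break;
            }
        }

        if (!dead)
            keep[nkeep++] = alltids[i];
    }

    /* _bt_form_posting() builds a plain tuple when only one TID remains */
    return _bt_form_posting(posting, keep, nkeep);
}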

Perhaps your "we have 5,2% slowdown in UPDATE speed" issue will be at
least somewhat fixed by the enhancements to v8?

Another consideration when testing the patch is the behavioral
difference we see between cases where system throughput is as high as
possible, versus similar cases where we have a limit in place (i.e.
pgbench --rate=? was used).  These two cases are quite different with
the patch because the patch can no longer be lazy without a limit --
we tend to see noticeably more CPU cycles spent doing bottom-up
deletion, though everything else about the profile looks similar (I
generally use "perf top" to keep an eye on these things).

It's possible to sometimes see increases in latency (regressions) when
running without a limit, at least in the short term. These increases
can go away when a rate limit is imposed that is perhaps as high as
50% of the max TPS. In general, I think that it makes sense to focus
on latency when we have some kind of limit in place. A
non-rate-limited case is less realistic.

> Perhaps I need to conduct a longer test session, say 8+ hours, to make obstacles appear more like they do in real life.

That would be ideal. It is inconvenient to run longer benchmarks, but
it's an important part of performance validation.

BTW, another related factor is that the patch takes longer to "warm
up". I notice this "warm-up" effect on the second or subsequent runs,
where we have lots of garbage in indexes even with the patch, and even
in the first 5 seconds of execution. The extra I/Os for heap page
accesses end up being buffer misses instead of buffer hits, until the
cache warms. This is not really a problem with fewer longer runs,
because there is relatively little "special warm-up time". (We rarely
experience heap page misses during ordinary execution because the
heapam.c code is smart about locality of access.)

I noticed that the pgbench_accounts_pkey index doesn't grow at all on
the master branch in 20201111-results-master.txt. But it's always just
a matter of time until that happens without the patch. The PK/unique
index takes longer to bloat because it alone benefits from LP_DEAD
setting, especially within _bt_check_unique(). But this advantage will
naturally erode over time. It'll probably take a couple of hours or
more with larger scale factors -- I'm thinking of pgbench scale
factors over 2000.

When the LP_DEAD bit setting isn't very effective (say it's 50%
effective), it's only a matter of time until every original page
splits. But that's also true when LP_DEAD setting is 99% effective.
While it is much less likely that any individual page will split when
LP_DEAD bits are almost always set, the fundamental problem remains,
even with 99% effectiveness. That problem is: each page only has to be
unlucky once. On a long enough timeline, the difference between 50%
effective and 99% effective may be very small. And "long enough
timeline" may not actually be very long, at least to a human.

Of course, the patch still benefits from LP_DEAD bits getting set by
queries -- no change there. It just doesn't *rely* on LP_DEAD bits
keeping up with transactions that create bloat on every leaf page.
Maybe the patch will behave exactly the same way as the master branch
-- it's workload dependent. Actually, it behaves in exactly the same
way for about the first 5 - 15 minutes following pgbench
initialization. This is roughly how long it takes before the master
branch has even one page split. You could say that the patch "makes
the first 5 minutes last forever".

(Not sure if any of this is obvious to you by now, just saying.)

Thanks!
--
Peter Geoghegan

v8-0001-Deprecate-nbtree-s-BTP_HAS_GARBAGE-flag.patch (29K) Download Attachment
v8-0002-Make-tableam-interface-support-bottom-up-deletion.patch (10K) Download Attachment
v8-0004-Teach-nbtree-to-use-bottom-up-index-deletion.patch (72K) Download Attachment
v8-0005-Teach-heapam-to-support-bottom-up-index-deletion.patch (33K) Download Attachment
v8-0003-Pass-down-logically-unchanged-index-hint.patch (36K) Download Attachment

Re: Deleting older versions in unique indexes to avoid page splits

Peter Geoghegan-4
On Thu, Nov 12, 2020 at 3:00 PM Peter Geoghegan <[hidden email]> wrote:
> Attached is v8, which has the enhancements for low cardinality data
> that I mentioned earlier today. It also simplifies the logic for
> dealing with posting lists that we need to delete some TIDs from.
> These posting list simplifications also make the code a bit more
> efficient, which might be noticeable during benchmarking.

One more thing: I repeated a pgbench test that was similar to my
earlier low cardinality tests -- same indexes (fiver, tenner, score,
aid_pkey_include_abalance). And same queries. But longer runs: 4 hours
each. Plus a larger DB: scale 2,500. Plus a rate-limit of 5000 TPS.

Here is the high level report, with 4 runs -- one pair with 16
clients, another pair with 32 clients:

2020-11-11 19:03:26 -0800 - Start of initial data load for run
"patch.r1c16" (DB is also used by later runs)
2020-11-11 19:18:16 -0800 - End of initial data load for run "patch.r1c16"
2020-11-11 19:18:16 -0800 - Start of pgbench run "patch.r1c16"
2020-11-11 23:18:43 -0800 - End of pgbench run "patch.r1c16":
patch.r1c16.bench.out: "tps = 4999.100006 (including connections
establishing)" "latency average = 3.355 ms" "latency stddev = 58.455
ms"
2020-11-11 23:19:12 -0800 - Start of initial data load for run
"master.r1c16" (DB is also used by later runs)
2020-11-11 23:34:33 -0800 - End of initial data load for run "master.r1c16"
2020-11-11 23:34:33 -0800 - Start of pgbench run "master.r1c16"
2020-11-12 03:35:01 -0800 - End of pgbench run "master.r1c16":
master.r1c16.bench.out: "tps = 5000.061623 (including connections
establishing)" "latency average = 8.591 ms" "latency stddev = 64.851
ms"
2020-11-12 03:35:41 -0800 - Start of pgbench run "patch.r1c32"
2020-11-12 07:36:10 -0800 - End of pgbench run "patch.r1c32":
patch.r1c32.bench.out: "tps = 5000.141420 (including connections
establishing)" "latency average = 1.253 ms" "latency stddev = 9.935
ms"
2020-11-12 07:36:40 -0800 - Start of pgbench run "master.r1c32"
2020-11-12 11:37:19 -0800 - End of pgbench run "master.r1c32":
master.r1c32.bench.out: "tps = 5000.542942 (including connections
establishing)" "latency average = 3.069 ms" "latency stddev = 24.640
ms"
2020-11-12 11:38:18 -0800 - Start of pgbench run "patch.r2c16"

We see a very significant latency advantage for the patch here. Here
is the breakdown on query latency from the final patch run,
patch.r1c32:

scaling factor: 2500
query mode: prepared
number of clients: 32
number of threads: 8
duration: 14400 s
number of transactions actually processed: 72002280
latency average = 1.253 ms
latency stddev = 9.935 ms
rate limit schedule lag: avg 0.406 (max 694.645) ms
tps = 5000.141420 (including connections establishing)
tps = 5000.142503 (excluding connections establishing)
statement latencies in milliseconds:
         0.002  \set aid1 random_gaussian(1, 100000 * :scale, 4.0)
         0.001  \set aid2 random_gaussian(1, 100000 * :scale, 4.5)
         0.001  \set aid3 random_gaussian(1, 100000 * :scale, 4.2)
         0.001  \set bid random(1, 1 * :scale)
         0.001  \set tid random(1, 10 * :scale)
         0.001  \set delta random(-5000, 5000)
         0.063  BEGIN;
         0.361  UPDATE pgbench_accounts SET abalance = abalance +
:delta WHERE aid = :aid1;
         0.171  SELECT abalance FROM pgbench_accounts WHERE aid = :aid2;
         0.172  SELECT abalance FROM pgbench_accounts WHERE aid = :aid3;
         0.074  END;

Here is the equivalent for master:

scaling factor: 2500
query mode: prepared
number of clients: 32
number of threads: 8
duration: 14400 s
number of transactions actually processed: 72008125
latency average = 3.069 ms
latency stddev = 24.640 ms
rate limit schedule lag: avg 1.695 (max 1097.628) ms
tps = 5000.542942 (including connections establishing)
tps = 5000.544213 (excluding connections establishing)
statement latencies in milliseconds:
         0.002  \set aid1 random_gaussian(1, 100000 * :scale, 4.0)
         0.001  \set aid2 random_gaussian(1, 100000 * :scale, 4.5)
         0.001  \set aid3 random_gaussian(1, 100000 * :scale, 4.2)
         0.001  \set bid random(1, 1 * :scale)
         0.001  \set tid random(1, 10 * :scale)
         0.001  \set delta random(-5000, 5000)
         0.078  BEGIN;
         0.560  UPDATE pgbench_accounts SET abalance = abalance +
:delta WHERE aid = :aid1;
         0.320  SELECT abalance FROM pgbench_accounts WHERE aid = :aid2;
         0.308  SELECT abalance FROM pgbench_accounts WHERE aid = :aid3;
         0.102  END;

So even the UPDATE is much faster here.

This is also something we see with pg_statio_tables, which looked like
this by the end for the patch:

-[ RECORD 1 ]---+-----------------
schemaname      | public
relname         | pgbench_accounts
heap_blks_read  | 117,384,599
heap_blks_hit   | 1,051,175,835
idx_blks_read   | 24,761,513
idx_blks_hit    | 4,024,776,723

And for the master branch:

-[ RECORD 1 ]---+-----------------
schemaname      | public
relname         | pgbench_accounts
heap_blks_read  | 191,947,522
heap_blks_hit   | 904,536,584
idx_blks_read   | 65,653,885
idx_blks_hit    | 4,002,061,803

Notice that heap_blks_read is down from 191,947,522 on master, to
117,384,599 with the patch -- so it's ~0.611x with the patch. A huge
reduction like this is possible with the patch because it effectively
amortizes the cost of accessing heap blocks to find garbage to clean
up ("nipping the [index bloat] problem in the bud" is much cheaper
than letting it get out of hand for many reasons, locality in
shared_buffers is one more reason). The patch accesses garbage tuples
in heap blocks close together in time for all indexes, at a point in
time when the blocks are still likely to be found in shared_buffers.

Also notice that idx_blks_read is ~0.38x with the patch. That's less
important, but still significant.

--
Peter Geoghegan



Re: Deleting older versions in unique indexes to avoid page splits

Victor Yegorov
In reply to this post by Peter Geoghegan-4
пт, 13 нояб. 2020 г. в 00:01, Peter Geoghegan <[hidden email]>:
On Wed, Nov 11, 2020 at 12:58 PM Victor Yegorov <[hidden email]> wrote:
> On the other hand, there's quite a big drop in UPDATE throughput. For sure, the undersized shared_buffers
> contributes to this drop. Still, my experience tells me that under the conditions at hand (HOT disabled due to the index
> on the update-time column), tables will tend to accumulate bloat and also produce unnecessary IO from WAL.

I think that the big SELECT statement with an "ORDER BY mtime ... "
was a good way of demonstrating the advantages of the patch.

Attached is v8, which has the enhancements for low cardinality data
that I mentioned earlier today. It also simplifies the logic for
dealing with posting lists that we need to delete some TIDs from.
These posting list simplifications also make the code a bit more
efficient, which might be noticeable during benchmarking.

Perhaps your "we have 5,2% slowdown in UPDATE speed" issue will be at
least somewhat fixed by the enhancements to v8?

Yes, v8 looks very nice!

I've done two 8-hour-long sessions with scale=2000 and shared_buffers=512MB
(the previously sent postgresql.auto.conf was used here with no changes).
The rest of the setup is the same:
- mtime column that tracks update time
- index on (mtime, aid)
- tenner low cardinality index from Peter's earlier e-mail
- 3 pgbench scripts run in parallel on master and on v8 patchset (scripts from the previous e-mail used here).

Master
------
        relname        |   nrows   | blk_before | mb_before | blk_after | mb_after |  diff
-----------------------+-----------+------------+-----------+-----------+----------+--------
 pgbench_accounts      | 300000000 |    4918033 |   38422.1 |   5066589 |  39582.7 |  +3.0%
 accounts_mtime        | 300000000 |    1155119 |    9024.4 |   1422354 |  11112.1 | +23.1%
 pgbench_accounts_pkey | 300000000 |     822573 |    6426.4 |    822573 |   6426.4 |     0
 tenner                | 300000000 |     346050 |    2703.5 |    563101 |   4399.2 | +62.7%
(4 rows)

DB size: 59.3..64.5 (+5.2GB / +8.8%)

Patched
-------
        relname        |   nrows   | blk_before | mb_before | blk_after | mb_after |  diff
-----------------------+-----------+------------+-----------+-----------+----------+--------
 pgbench_accounts      | 300000000 |    4918033 |   38422.1 |   5068092 |  39594.5 |  +3.0%
 accounts_mtime        | 300000000 |    1155119 |    9024.4 |   1428972 |  11163.8 | +23.7%
 pgbench_accounts_pkey | 300000000 |     822573 |    6426.4 |    822573 |   6426.4 |     0
 tenner                | 300000000 |     346050 |    2703.5 |    346050 |   2703.5 |     0
(4 rows)

DB size: 59.3..62.8 (+3.5GB / +5.9%)

TPS
---
     query      | Master TPS | Patched TPS |  diff
----------------+------------+-------------+-------
UPDATE + SELECT |       2413 |        2473 | +2.5%
3 SELECT in txn |      19737 |       19545 | -0.9%
15min SELECT    |       0.74 |        1.03 |  +39%


Based on the figures and also on the attached graphs, I can tell that v8 has no visible regression
in terms of TPS; the IO pattern changes slightly, but the end result is worth it.
In my view, this patch can be applied from a performance POV.

I wanted to share these before I finish the code review; I'm planning to send it tomorrow.


--
Victor Yegorov

20201114-results-master.txt (2K) Download Attachment
20201114-q1-UPDATE-master.png (558K) Download Attachment
20201114-q2-SELECT-master.png (466K) Download Attachment
20201114-q3-15min-SELECT-master.png (472K) Download Attachment
20201114-overview-master.png (720K) Download Attachment
20201114-q1-UPDATE-v8.png (522K) Download Attachment
20201114-q2-SELECT-v8.png (427K) Download Attachment
20201114-q3-15min-SELECT-v8.png (471K) Download Attachment
20201114-overview-v8.png (711K) Download Attachment
20201114-results-patched.txt (2K) Download Attachment

Re: Deleting older versions in unique indexes to avoid page splits

Peter Geoghegan-4
On Sun, Nov 15, 2020 at 2:29 PM Victor Yegorov <[hidden email]> wrote:

> TPS
> ---
>      query      | Master TPS | Patched TPS |  diff
> ----------------+------------+-------------+-------
> UPDATE + SELECT |       2413 |        2473 | +2.5%
> 3 SELECT in txn |      19737 |       19545 | -0.9%
> 15min SELECT    |       0.74 |        1.03 |  +39%
>
> Based on the figures and also on the attached graphs, I can tell that v8 has no visible regression
> in terms of TPS; the IO pattern changes slightly, but the end result is worth it.
> In my view, this patch can be applied from a performance POV.

Great, thanks for testing!

--
Peter Geoghegan



Re: Deleting older versions in unique indexes to avoid page splits

Victor Yegorov
In reply to this post by Peter Geoghegan-4
чт, 12 нояб. 2020 г. в 23:00, Peter Geoghegan <[hidden email]>:
It would be helpful if you could take a look at the nbtree patch --
particularly the changes related to deprecating the page-level
BTP_HAS_GARBAGE flag. I would like to break those parts out into a
separate patch, and commit it in the next week or two. It's just
refactoring, really. (This commit can also make nbtree only use the
word VACUUM for things that strictly involve VACUUM. For example,
it'll rename _bt_vacuum_one_page() to _bt_delete_or_dedup_one_page().)

We almost don't care about the flag already, so there is almost no
behavioral change from deprecating BTP_HAS_GARBAGE in this way.

Indexes that use deduplication already don't rely on BTP_HAS_GARBAGE
being set ever since deduplication was added to Postgres 13 (the
deduplication code doesn't want to deal with LP_DEAD bits, and cannot
trust that no LP_DEAD bits can be set just because BTP_HAS_GARBAGE
isn't set in the special area). Trusting the BTP_HAS_GARBAGE flag can
cause us to miss out on deleting items with their LP_DEAD bits set --
we're better off "assuming that BTP_HAS_GARBAGE is always set", and
finding out if there really are LP_DEAD bits set for ourselves each
time.

Missing LP_DEAD bits like this can happen when VACUUM unsets the
page-level flag without actually deleting the items at the same time,
which is expected when the items became garbage (and actually had
their LP_DEAD bits set) after VACUUM began, but before VACUUM reached
the leaf pages. That's really wasteful, and doesn't actually have any
upside -- we're scanning all of the line pointers only when we're
about to split (or dedup) the same page anyway, so the extra work is
practically free.

I've looked over the BTP_HAS_GARBAGE modifications; they look sane.
I've double-checked that heapkeyspace indexes don't use this flag (don't rely on it),
while pre-v4 ones still do.

I have a question. This flag is raised in _bt_check_unique() and in _bt_killitems().
If we're deprecating this flag, perhaps it'd be good to avoid raising it, at least in
_bt_check_unique(), as it seems to me that the conditions there deal with postings, and
therefore we are speaking of heapkeyspace indexes here.

If we raise this flag conditionally in the functions above, we can get rid of the blocks that
drop it in _bt_delitems_delete(), I think. (A rough sketch of what I have in mind follows.)
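
Roughly, something like this in _bt_killitems() -- illustration only:
heapkeyspace isn't actually at hand there today, so it would have to be
read from the metapage or cached in the scan first, and the same
treatment would apply to _bt_check_unique():

    if (killedsomething)
    {
        /* Only pre-v4 (!heapkeyspace) indexes still rely on the flag */
        if (!heapkeyspace)
            opaque->btpo_flags |= BTP_HAS_GARBAGE;
        MarkBufferDirtyHint(buf, true);
    }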

--
Victor Yegorov

Re: Deleting older versions in unique indexes to avoid page splits

Victor Yegorov
In reply to this post by Peter Geoghegan-4
чт, 12 нояб. 2020 г. в 23:00, Peter Geoghegan <[hidden email]>:
Another thing that I'll probably add to v8: Prefetching. This is
probably necessary just so I can have parity with the existing
heapam.c function that the new code is based on,
heap_compute_xid_horizon_for_tuples(). That will probably help here,
too.

I don't quite see this part. Do you mean top_block_groups_favorable() here? 


--
Victor Yegorov

Re: Deleting older versions in unique indexes to avoid page splits

Victor Yegorov
In reply to this post by Peter Geoghegan-4
пт, 13 нояб. 2020 г. в 00:01, Peter Geoghegan <[hidden email]>:
Attached is v8, which has the enhancements for low cardinality data
that I mentioned earlier today. It also simplifies the logic for
dealing with posting lists that we need to delete some TIDs from.
These posting list simplifications also make the code a bit more
efficient, which might be noticeable during benchmarking.

I've looked through the code and it looks very good from my end:
- plenty of comments, with a good description of what's going on
- I found no loose ends in terms of AM integration
- magic constants replaced with defines
The code looks good. Still, it'd be good if somebody with more experience could look at this patch.


Question: why are you using double spaces after periods in the comments?
Is this a project convention?

I am thinking of two more scenarios that require testing:
- a queue in a table, with a high rate of INSERTs+DELETEs and a long-running transaction.
  I've seen such conditions yield indexes of several GB in size while holding fewer
  than a thousand live records.
- an upgraded cluster with !heapkeyspace indexes.

--
Victor Yegorov