Deleting older versions in unique indexes to avoid page splits


Deleting older versions in unique indexes to avoid page splits

Peter Geoghegan-4
Attached is a POC patch that teaches nbtree to delete old duplicate
versions from unique indexes. The optimization targets non-HOT
duplicate version bloat. Although the patch is rather rough, it
nevertheless manages to more or less eliminate a whole class of index
bloat: Unique index bloat from non-HOT updates in workloads where no
transaction lasts for more than a few seconds. For example, it
eliminates index bloat with a custom pgbench workload that uses an
INCLUDE unique index on pgbench_accounts.aid (with abalance as the
non-key attribute), instead of the usual accounts primary key.
Similarly, with a standard pgbench_accounts primary key alongside an
extra non-unique index on abalance, the primary key will never have
any page splits with the patch applied. It's almost as if the updates
were actually HOT updates, at least if you focus on the unique index
(assuming that there are no long-running transactions).
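
Concretely, the two setups look something like this (illustrative DDL
only, not taken verbatim from the test setup; the second index name is
just a placeholder):

-- Setup 1: replace the usual accounts primary key with an INCLUDE
-- unique index that carries abalance as a non-key attribute, so every
-- update of abalance is a non-HOT update against this index:
ALTER TABLE pgbench_accounts DROP CONSTRAINT pgbench_accounts_pkey;
CREATE UNIQUE INDEX aid_pkey_include_abalance
    ON pgbench_accounts (aid) INCLUDE (abalance);

-- Setup 2: keep the standard primary key, but add an extra non-unique
-- index on abalance, which likewise makes every update non-HOT:
CREATE INDEX pgbench_accounts_abalance_idx
    ON pgbench_accounts (abalance);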

The patch repurposes the deduplication infrastructure to delete
duplicates within unique indexes, provided they're actually safe to
VACUUM. This is somewhat different from the _bt_check_unique() LP_DEAD
bit setting stuff, in that we have to access heap pages that we
probably would not have to access otherwise -- it's something that we
go out of our way to make happen at the point that the page is about
to split, not something that happens in passing at no extra cost. The
general idea is to exploit the fact that duplicates in unique indexes
are usually deadwood.

We only need to "stay one step ahead" of the bloat to avoid all page
splits in many important cases. So we usually only have to access a
couple of heap pages to avoid a page split in each case. In
traditional serial/identity column primary key indexes, any page split
that isn't a split of the current rightmost page must be
caused by version churn. It should be possible to avoid these
"unnecessary" page splits altogether (again, barring long-running
transactions).
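
(As an aside, one easy way to watch for these unnecessary splits while
testing is pgstattuple's pgstatindex() function -- purely illustrative,
the patch itself doesn't rely on it:)

CREATE EXTENSION IF NOT EXISTS pgstattuple;
-- Track leaf page count and density over time; with a serial/identity
-- primary key, leaf_pages should only grow at the rightmost edge.
SELECT tree_level, leaf_pages, avg_leaf_density
FROM pgstatindex('pgbench_accounts_pkey');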

I would like to get early feedback on high level direction. While the
patch seems quite promising, I am uncertain about my general approach,
and how it might fit into some much broader effort to control bloat in
general.

There are some clear downsides to my approach. The patch has grotty
heuristics that try to limit the extra work performed to avoid page
splits -- the cost of accessing additional heap pages while a buffer
lock is held on the leaf page needs to be kept under control. No
doubt this regresses some workloads without giving them a clear
benefit. Also, the optimization only ever gets used with unique
indexes, since they're the only case where a duplicate naturally
suggests version churn, which can be targeted fairly directly, and
without wasting too many cycles when it doesn't work out.

It's not at all clear how we could do something like this with
non-unique indexes. One related-though-distinct idea that might be
worth considering occurs to me: teach nbtree to try to set LP_DEAD
bits in non-unique indexes, in about the same way as it will in
_bt_check_unique() for unique indexes. Perhaps the executor could hint
to btinsert()/aminsert() that it's inserting a duplicate caused by a
non-HOT update, so it's worth trying to LP_DEAD nearby duplicates --
especially if they're on the same heap page as the incoming item.

There is a wholly separate question about index bloat that is of long
term strategic importance to the Postgres project: what should we do
about long running transactions? I tend to think that we can address
problems in that area by indicating that it is safe to delete
"intermediate" versions -- tuples that are not too old to be seen by
the oldest transaction, that are nevertheless not needed (they're too
new to be interesting to the old transaction's snapshot, but also too
old to be interesting to any other snapshot). Perhaps this
optimization could be pursued in a phased fashion, starting with index
AMs, where it seems less scary.

I recently read a paper that had some ideas about what we could do
here [1]. IMV it is past time that we thrashed out a "remove
useless intermediate versions" design that is compatible with the
current heapam design.

[1] https://dl.acm.org/doi/pdf/10.1145/3318464.3389714
--
Peter Geoghegan

Attachment: 0001-Non-opportunistically-delete-B-Tree-items.patch (19K)

Re: Deleting older versions in unique indexes to avoid page splits

Peter Geoghegan-4
On Tue, Jun 30, 2020 at 5:03 PM Peter Geoghegan <[hidden email]> wrote:
> Attached is a POC patch that teaches nbtree to delete old duplicate
> versions from unique indexes. The optimization targets non-HOT
> duplicate version bloat. Although the patch is rather rough, it
> nevertheless manages to more or less eliminate a whole class of index
> bloat: Unique index bloat from non-HOT updates in workloads where no
> transaction lasts for more than a few seconds.

I'm slightly surprised that this thread didn't generate more interest
back in June. After all, maintaining the pristine initial state of
(say) a primary key index even after many high throughput non-HOT
updates (i.e. avoiding "logically unnecessary" page splits entirely)
is quite appealing. It arguably goes some way towards addressing long
held criticisms of our approach to MVCC. Especially if it can be
generalized to all b-tree indexes -- the Uber blog post mentioned
tables that have several indexes, which presumably means that there
can be no HOT updates (the author of the blog post didn't seem to be
aware of HOT at all).

I've been trying to generalize my approach to work with all indexes. I
think that I can find a strategy that is largely effective at
preventing version churn page splits that take place with workloads
that have many non-HOT updates, without any serious downsides for
workloads that do not benefit. I want to get feedback on that now,
since I expect that it will be controversial. Teaching indexes about
how tuples are versioned or chaining tuples seems like a non-starter,
so the trick will be to come up with something that behaves in
approximately the same way as that in cases where it helps.

The approach I have in mind is to pass down a hint from the executor
to btinsert(), which lets nbtree know that the index tuple insert is
in fact associated with a non-HOT update. This hint is only given when
the update did not logically modify the columns that the index covers
(so presumably the majority of unchanged indexes get the hint, but not
the one or two indexes whose columns were modified by our update
statement -- or maybe the non-HOT update was caused by not being able
to fit a new version on the same heap page, in which case all the
btinsert() calls/all the indexes on the table get the hint). Of
course, this hint makes it all but certain that the index tuple is the
successor for some other index tuple located on the same leaf page. We
don't actually include information about which other existing tuple it
is, since it pretty much doesn't matter. Even if we did, we definitely
cannot opportunistically delete it, because it needs to stay in the
index at least until our transaction commits (this should be obvious).
Actually, we already try to delete it from within _bt_check_unique()
today for unique indexes -- we just never opportunistically mark it
dead at that point (as I said, it's needed until the xact commits at
the earliest).
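
To make the hint rule concrete, consider a hypothetical table with
three indexes (table and index names made up purely for illustration):

CREATE TABLE orders (id bigint, customer_id bigint, status text, note text);
CREATE UNIQUE INDEX orders_pkey ON orders (id);
CREATE INDEX orders_customer_idx ON orders (customer_id);
CREATE INDEX orders_status_idx ON orders (status);

-- This update logically modifies only "status" among the indexed
-- columns, so it cannot be HOT.  New index tuples go into all three
-- indexes, but only orders_pkey and orders_customer_idx (whose columns
-- are unchanged) receive the hint; orders_status_idx does not.  If the
-- update were non-HOT only because the new version didn't fit on the
-- same heap page, all three btinsert() calls would get the hint.
UPDATE orders SET status = 'shipped', note = 'left at door' WHERE id = 42;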

Here is the maybe-controversial part: The algorithm initially assumes
that all indexes more or less have the same properties as unique
indexes from a versioning point of view, even though that's clearly
not true. That is, it initially assumes that the only reason why there
can be duplicates on any leaf page it looks at is because some
previous transaction also did a non-HOT update that added a new,
unchanged index tuple. The new algorithm only runs when the hint is
passed down from the executor and when the only alternative is to
split the page (or have a deduplication pass), so clearly there is
some justification for this assumption -- it really is highly unlikely
that this update (which is on the verge of splitting the page) just so
happened to be the first such update that affected the page.

It's extremely likely that there will be at least a couple of other
tuples like that on the page, and probably quite a few more. And even
if we only manage to free one or two tuples, that'll still generally
be enough to fit the incoming tuple. In general that is usually quite
valuable. Even with a busy database, that might buy us minutes or
hours before the question of splitting the same leaf page arises
again. By the time that happens, longer running transactions may have
committed, VACUUM may have run, etc. Like unique index deduplication,
this isn't about freeing space -- it's about buying time.

To be blunt: It may be controversial that we're accessing multiple
heap pages while holding an exclusive lock on a leaf page, in the
hopes that we can avoid a page split, but without any certainty that
it'll work out.

Sometimes (maybe even most of the time), this assumption turns out to
be mostly correct, and we benefit in the obvious way -- no
"unnecessary" page splits for affected non-unique indexes. Other times
it won't work out, usually because the duplicates are in fact logical
duplicates from distinct logical rows. When the new deletion thing
doesn't work out, the situation works itself out in the obvious way:
we get a deduplication pass. If that doesn't work out we get a page
split. So we have three legitimate strategies for resolving the
"pressure" against a leaf page: last minute emergency duplicate checks
+ deletion (the new thing), a deduplication pass, or a page split. The
strategies are in competition with each other (though only when we
have non-HOT updates).

We're only willing to access a fixed number of heap pages (heap pages
pointed to by duplicate index tuples) to try to delete some index
tuples and avoid a split, and only in the specific context of the hint
being received at the point a leaf page fills and it looks like we
might have to split it. I think that we should probably avoid doing
anything with posting list tuples left behind by deduplication, except
maybe if there are only 2 or 3 TIDs -- just skip over them. That way,
cases with duplicates across logical rows (not version churn) tend to
get made into a posting list early (e.g. during an index build), so we
don't even bother trying to delete from them later. Non-posting-list
duplicates suggest recency, which suggests version churn -- those dups
must at least have come after the most recent deduplication pass. Plus
we have heuristics that maximize the chances of finding versions to
kill. And we tend to look at the same blocks again and again -- like
in the patch I posted, we look at the value with the most dups for
things to kill first, and so on. So repeated version deletion passes
won't end up accessing totally different heap blocks each time, unless
they're successful at killing old versions.

(I think that this new feature should be framed as extending the
deduplication infrastructure to do deletes -- so it can only be used
on indexes that use deduplication.)
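
(In Postgres 13 deduplication is already controlled per index by the
deduplicate_items storage parameter, so under that framing I'd assume
the new mechanism follows the same switch. Illustrative example only:)

-- An index created this way opts out of deduplication, and so --
-- presumably -- out of delete deduplication as well:
CREATE INDEX on_abalance ON pgbench_accounts (abalance)
    WITH (deduplicate_items = off);
ALTER INDEX on_abalance SET (deduplicate_items = on);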

Even if this new mechanism ends up slowing down non-HOT updates
noticeably -- which is *not* something that I can see with my
benchmarks now -- then that could still be okay. I think that there is
something sensible about imposing a penalty on non-HOT update queries
that can cause problems for everybody today. Why shouldn't they have
to clean up their own mess? I think that it buys us a lot to condition
cleanup on avoiding page splits, because any possible penalty is only
paid in cases where there isn't something else that keeps the bloat
under control. If something like the kill_prior_tuple mechanism mostly
keeps bloat under control already, then we'll resolve the situation
that way instead.

An important point about this technique is that it's just a back stop,
so it can run very infrequently while still preventing an index from
growing -- an index that can double in size today. If existing
mechanisms against "logically unnecessary" page splits are 99%
effective today, then they may still almost be useless to users --
your index still doubles in size. It just takes a little longer in
Postgres 13 (with deduplication) compared to Postgres 12. So there is
a really huge asymmetry that we still aren't doing enough about, even
with deduplication.

Deduplication cannot prevent the first "wave" of page splits with
primary key style indexes due to the dimensions of the index tuples on
the page. The first wave may be all that matters (deduplication does
help more when things get really bad, but I'd like to avoid "merely
bad" performance characteristics, too). Consider the simplest possible
real world example. If we start out with 366 items on a leaf page
initially (the actual number with default fillfactor + 64-bit
alignment for the pgbench indexes), we can store another 40 non-HOT
dups on the same leaf page before the page splits. We only save 4
bytes by merging a dup into one of the 366 original tuples. It's
unlikely that many of the 40 dups that go on the page will be
duplicates of *each other*, and deduplication only really starts to
save space when posting list tuples have 4 or 5 TIDs each. So
eventually all of the original leaf pages split when they really
shouldn't, despite the availability of deduplication.
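
(Roughly where the 366 and ~40 figures come from -- back-of-the-envelope
arithmetic only, so treat it as approximate:)

-- usable leaf space:   8192 - 24 (page header) - 16 (special) = 8152 bytes
-- per-item footprint:  16-byte index tuple + 4-byte line pointer = 20 bytes
-- default fillfactor:  floor(8152 * 0.90 / 20) = 366 items at CREATE INDEX
-- completely full:     floor(8152 / 20)        = 407 items
-- so roughly 407 - 366 = ~40 extra non-HOT dups fit before the page splits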

--
Peter Geoghegan



Re: Deleting older versions in unique indexes to avoid page splits

Andrey Borodin-2


> On 8 Oct 2020, at 04:48, Peter Geoghegan <[hidden email]> wrote:
>
> On Tue, Jun 30, 2020 at 5:03 PM Peter Geoghegan <[hidden email]> wrote:
>> Attached is a POC patch that teaches nbtree to delete old duplicate
>> versions from unique indexes. The optimization targets non-HOT
>> duplicate version bloat. Although the patch is rather rough, it
>> nevertheless manages to more or less eliminate a whole class of index
>> bloat: Unique index bloat from non-HOT updates in workloads where no
>> transaction lasts for more than a few seconds.
>
> I'm slightly surprised that this thread didn't generate more interest
> back in June.

The idea looks very interesting.
It resembles GiST microvacuum: GiST tries to vacuum a single page before a split.
I'm curious how the cost of page deduplication compares to the cost of a page split. Should we do deduplication if the page will still remain 99% full?

Best regards, Andrey Borodin.




Re: Deleting older versions in unique indexes to avoid page splits

Peter Geoghegan-4
On Mon, Oct 12, 2020 at 3:47 AM Andrey Borodin <[hidden email]> wrote:
> The idea looks very interesting.
> It resembles GiST microvacuum: GiST tries to vacuum a single page before a split.

AFAICT the GiST microvacuum mechanism is based on the one in nbtree,
which is based on setting LP_DEAD bits when index scans find that the
TIDs are dead-to-all. That's easy to justify, because it is easy and
cheap to save future index scans the trouble of following the TIDs
just to discover the same thing for themselves.

The difference here is that we're simply making an intelligent guess
-- there have been no index scans, but we're going to do a kind of
special index scan at the last minute to see if we can avoid a page
split. I think that this is okay because in practice we can do it in a
reasonably targeted way. We will never do it with INSERTs, for example
(except in unique indexes, though only when we know that there are
duplicates because we saw them in _bt_check_unique()). In the worst
case we make non-HOT updates a bit slower...and I'm not even sure that
that's a bad thing when we don't manage to delete anything. After all,
non-HOT updates impose a huge cost on the system. They have "negative
externalities".

Another way to look at it is that the mechanism I propose to add takes
advantage of "convexity" [1]. (Actually, several of my indexing ideas
are based on similar principles -- like the nbtsplitloc.c stuff.)

Attached is v2. It controls the cost of visiting the heap by finding
the heap page that has the most TIDs that we might be able to kill
(among those that are duplicates on a leaf page). It also adds a
hinting mechanism to the executor to avoid uselessly applying the
optimization with INSERTs.

> I'm curious how the cost of page deduplication compares to the cost of a page split. Should we do deduplication if the page will still remain 99% full?

It depends on how you measure it, but in general I would say that the
cost of traditional Postgres 13 deduplication is much lower.
Especially as measured in WAL volume. I also believe that the same is
true for this new delete deduplication mechanism. The way we determine
which heap pages to visit maximizes the chances that we'll get lucky
while minimizing the cost (currently we don't visit more than 2 heap
pages unless we get at least one kill -- I think I could be more
conservative here without losing much). A page split also has to
exclusively lock two other pages (the right sibling page and parent
page), so even the locking is perhaps better.

The attached patch can completely or almost completely avoid index
bloat in extreme cases with non-HOT updates. This can easily increase
throughput by 2x or more, depending on how extreme you make it (i.e.
how many indexes you have). It seems like the main problem caused by
non-HOT updates is in fact page splits themselves. It is not so much a
problem with dirtying of pages.

You can test this with a benchmark like the one that was used for WARM
back in 2017:

https://www.postgresql.org/message-id/flat/CABOikdMNy6yowA%2BwTGK9RVd8iw%2BCzqHeQSGpW7Yka_4RSZ_LOQ%40mail.gmail.com

I suspect that it's maybe even more effective than WARM was with this benchmark.

[1] https://fooledbyrandomness.com/ConvexityScience.pdf
--
Peter Geoghegan

Attachment: v2-0001-Add-delete-deduplication-to-nbtree.patch (76K)

Re: Deleting older versions in unique indexes to avoid page splits

Anastasia Lubennikova
In reply to this post by Peter Geoghegan-4
On 08.10.2020 02:48, Peter Geoghegan wrote:
> On Tue, Jun 30, 2020 at 5:03 PM Peter Geoghegan <[hidden email]> wrote:
>> Attached is a POC patch that teaches nbtree to delete old duplicate
>> versions from unique indexes. The optimization targets non-HOT
>> duplicate version bloat. Although the patch is rather rough, it
>> nevertheless manages to more or less eliminate a whole class of index
>> bloat: Unique index bloat from non-HOT updates in workloads where no
>> transaction lasts for more than a few seconds.
>
> I'm slightly surprised that this thread didn't generate more interest
> back in June. After all, maintaining the pristine initial state of
> (say) a primary key index even after many high throughput non-HOT
> updates (i.e. avoiding "logically unnecessary" page splits entirely)
> is quite appealing. It arguably goes some way towards addressing long
> held criticisms of our approach to MVCC. Especially if it can be
> generalized to all b-tree indexes -- the Uber blog post mentioned
> tables that have several indexes, which presumably means that there
> can be no HOT updates (the author of the blog post didn't seem to be
> aware of HOT at all).

The idea seems very promising, especially when extended to handle non-unique indexes too.

> I've been trying to generalize my approach to work with all indexes. I
> think that I can find a strategy that is largely effective at
> preventing version churn page splits that take place with workloads
> that have many non-HOT updates, without any serious downsides for
> workloads that do not benefit. I want to get feedback on that now,
> since I expect that it will be controversial. Teaching indexes about
> how tuples are versioned or chaining tuples seems like a non-starter,
> so the trick will be to come up with something that behaves in
> approximately the same way as that in cases where it helps.
>
> The approach I have in mind is to pass down a hint from the executor
> to btinsert(), which lets nbtree know that the index tuple insert is
> in fact associated with a non-HOT update. This hint is only given when
> the update did not logically modify the columns that the index covers

That's exactly what I wanted to discuss after the first letter. If we could make (non)HOT-updates index specific, I think it could improve performance a lot.

> Here is the maybe-controversial part: The algorithm initially assumes
> that all indexes more or less have the same properties as unique
> indexes from a versioning point of view, even though that's clearly
> not true. That is, it initially assumes that the only reason why there
> can be duplicates on any leaf page it looks at is because some
> previous transaction also did a non-HOT update that added a new,
> unchanged index tuple. The new algorithm only runs when the hint is
> passed down from the executor and when the only alternative is to
> split the page (or have a deduplication pass), so clearly there is
> some justification for this assumption -- it really is highly unlikely
> that this update (which is on the verge of splitting the page) just so
> happened to be the first such update that affected the page.
>
> To be blunt: It may be controversial that we're accessing multiple
> heap pages while holding an exclusive lock on a leaf page, in the
> hopes that we can avoid a page split, but without any certainty that
> it'll work out.
>
> Sometimes (maybe even most of the time), this assumption turns out to
> be mostly correct, and we benefit in the obvious way -- no
> "unnecessary" page splits for affected non-unique indexes. Other times
> it won't work out, usually because the duplicates are in fact logical
> duplicates from distinct logical rows.

I think that this optimization can affect low cardinality indexes negatively, but it is hard to estimate impact without tests. Maybe it won't be a big deal, given that we attempt to eliminate old copies not very often and that low cardinality b-trees are already not very useful. Besides, we can always make this thing optional, so that users could tune it to their workload.


I wonder how this new feature will interact with physical replication? A replica may have quite a different performance profile. For example, there can be long-running queries that already prevent vacuum from removing recently dead rows. How will we handle the same situation with this optimized deletion?

-- 
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Re: Deleting older versions in unique indexes to avoid page splits

Peter Geoghegan-4
On Wed, Oct 14, 2020 at 7:07 AM Anastasia Lubennikova
<[hidden email]> wrote:
> The idea seems very promising, especially when extended to handle non-unique indexes too.

Thanks!

> That's exactly what I wanted to discuss after the first letter. If we could make (non)HOT-updates index specific, I think it could improve performance a lot.

Do you mean accomplishing the same goal in heapam, by making the
optimization more intelligent about which indexes need new versions?
We did have a patch that did that in 2017, as you may recall -- this
was called WARM:

https://www.postgresql.org/message-id/flat/CABOikdMNy6yowA%2BwTGK9RVd8iw%2BCzqHeQSGpW7Yka_4RSZ_LOQ%40mail.gmail.com

This didn't go anywhere. I think that this solution is more pragmatic.
It's cheap enough to remove it if a better solution becomes available
in the future. But this is a pretty good solution by all important
measures.

> I think that this optimization can affect low cardinality indexes negatively, but it is hard to estimate impact without tests. Maybe it won't be a big deal, given that we attempt to eliminate old copies not very often and that low cardinality b-trees are already not very useful. Besides, we can always make this thing optional, so that users could tune it to their workload.

Right. The trick is to pay only a fixed low cost (maybe as low as one
heap page access) when we start out, and ratchet it up only if the
first heap page access looks promising. And to avoid posting list
tuples. Regular deduplication takes place when this fails. It's useful
for the usual reasons, but also because this new mechanism learns not
to try the posting list TIDs.

> I wonder how this new feature will interact with physical replication? A replica may have quite a different performance profile.

I think of that as equivalent to having a long running transaction on
the primary. When I first started working on this patch I thought
about having "long running transaction detection". But I quickly
realized that that isn't a meaningful concept. A transaction is only
truly long running relative to the writes that take place that have
obsolete row versions that cannot be cleaned up. It has to be
something we can deal with, but it cannot be meaningfully
special-cased.

--
Peter Geoghegan



Re: Deleting older versions in unique indexes to avoid page splits

Peter Geoghegan-4
On Wed, Oct 14, 2020 at 7:40 AM Peter Geoghegan <[hidden email]> wrote:
> Right. The trick is to pay only a fixed low cost (maybe as low as one
> heap page access) when we start out, and ratchet it up only if the
> first heap page access looks promising.

Just as an example of how the patch can help, consider the following
pgbench variant script:

\set aid1 random_gaussian(1, 100000 * :scale, 2.0)
\set aid2 random_gaussian(1, 100000 * :scale, 2.5)
\set aid3 random_gaussian(1, 100000 * :scale, 2.2)
\set bid random(1, 1 * :scale)
\set tid random(1, 10 * :scale)
\set delta random(-5000, 5000)
BEGIN;
UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid1;
SELECT abalance FROM pgbench_accounts WHERE aid = :aid2;
SELECT abalance FROM pgbench_accounts WHERE aid = :aid3;
END;

(These details you see here are a bit arbitrary; don't worry about the
specifics too much.)

Before running the script with pgbench, I initialized pgbench to scale
1500, and made some changes to the indexing (in order to actually test
the patch). There was no standard pgbench_accounts PK. Instead, I
created a unique index that had an include column, which is enough to
make every update a non-HOT update. I also added two more redundant
non-unique indexes to create more overhead from non-HOT updates. It
looked like this:

create unique index aid_pkey_include_abalance on pgbench_accounts
(aid) include (abalance);
create index one on pgbench_accounts (aid);
create index two on pgbench_accounts (aid);

So 3 indexes on the accounts table.
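
(If you want to reproduce this, one sanity check is that no update
should be HOT with this indexing scheme, since abalance is covered by
the INCLUDE index. That's easy to confirm -- illustrative query:)

-- n_tup_hot_upd should stay at or very near zero for this setup.
SELECT n_tup_upd, n_tup_hot_upd
FROM pg_stat_user_tables
WHERE relname = 'pgbench_accounts';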

I ran the script for two hours and 16 clients with the patch, then for
another two hours with master. After that time, all 3 indexes were
exactly the same size with the patch, but had almost doubled in size
on master:

aid_pkey_include_abalance: 784,264 pages (or ~5.983 GiB)
one: 769,545 pages (or ~5.871 GiB)
two: 770,295 pages (or ~5.876 GiB)

(With the patch, all three indexes were 100% pristine -- they remained
precisely 411,289 pages in size by the end, which is ~3.137 GiB.)
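
(One way to read those sizes off, assuming the default 8 KiB block
size, is simply:)

SELECT relname,
       pg_relation_size(oid) / 8192 AS pages,
       pg_size_pretty(pg_relation_size(oid)) AS size
FROM pg_class
WHERE relname IN ('aid_pkey_include_abalance', 'one', 'two');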

Note that traditional deduplication is used by the indexes I've called
"one" and "two" here, but not the include index called
"aid_pkey_include_abalance". But it makes little difference, for
reasons that will be obvious if you think about what this looks like
at the leaf page level. Cases that Postgres 13 deduplication does
badly with are often the ones that this new mechanism does well with.
Deduplication by deleting and by merging are truly complementary -- I
haven't just structured the code that way because it was convenient to
use dedup infrastructure just to get the dups at the start. (Yes, it
*was* convenient, but there clearly are cases where each mechanism
competes initially, before nbtree converges on the best strategy at
the local level. So FWIW this patch is a natural and logical extension
of the deduplication work in my mind.)

The TPS/throughput is about what you'd expect for the two hour run:

18,988.762398 TPS for the patch
11,123.551707 TPS for the master branch.

This is a ~1.7x improvement, but I can get more than 3x by changing
the details at the start -- just add more indexes. I don't think that
the precise throughput difference you see here matters. The important
point is that we've more or less fixed a pathological set of behaviors
that have poorly understood cascading effects. Full disclosure: I rate
limited pgbench to 20k for this run, which probably wasn't significant
because neither patch nor master hit that limit for long.

Big latency improvements for that same run, too:

Patch:

statement latencies in milliseconds:
         0.001  \set aid1 random_gaussian(1, 100000 * :scale, 2.0)
         0.000  \set aid2 random_gaussian(1, 100000 * :scale, 2.5)
         0.000  \set aid3 random_gaussian(1, 100000 * :scale, 2.2)
         0.000  \set bid random(1, 1 * :scale)
         0.000  \set tid random(1, 10 * :scale)
         0.000  \set delta random(-5000, 5000)
         0.057  BEGIN;
         0.294  UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid1;
         0.204  SELECT abalance FROM pgbench_accounts WHERE aid = :aid2;
         0.195  SELECT abalance FROM pgbench_accounts WHERE aid = :aid3;
         0.090  END;

Master:

statement latencies in milliseconds:
         0.002  \set aid1 random_gaussian(1, 100000 * :scale, 2.0)
         0.001  \set aid2 random_gaussian(1, 100000 * :scale, 2.5)
         0.001  \set aid3 random_gaussian(1, 100000 * :scale, 2.2)
         0.001  \set bid random(1, 1 * :scale)
         0.000  \set tid random(1, 10 * :scale)
         0.001  \set delta random(-5000, 5000)
         0.084  BEGIN;
         0.604  UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid1;
         0.317  SELECT abalance FROM pgbench_accounts WHERE aid = :aid2;
         0.311  SELECT abalance FROM pgbench_accounts WHERE aid = :aid3;
         0.120  END;

Note that the mechanism added by the patch naturally limits the number
of versions that are in the index for each logical row, which seems
much more important than the total amount of garbage tuples cleaned
up. It's great that the index is half its original size, but even that
is less important than the effect of more or less bounding the worst
case number of heap pages accessed by point index scans. Even though
this patch shows big performance improvements (as well as very small
performance regressions for small indexes with skew), the patch is
mostly about stability. I believe that Postgres users want greater
stability and predictability in this area more than anything else.

The behavior of the system as a whole that we see for the master
branch here is not anywhere near linear. Page splits are of course
expensive, but they also occur in distinct waves [1] and have lasting
consequences. They're very often highly correlated, with clear tipping
points, so you see relatively sudden slow downs in the real world.
Worse still, with skew the number of hot pages that you have can
double in a short period of time. This very probably happens at the
worst possible time for the user, since the database was likely
already organically experiencing very high index writes at the point
of experiencing the first wave of splits (splits correlated both
within and across indexes on the same table). And, from that point on,
the number of FPIs for the same workload also doubles forever (or at
least until REINDEX).

[1] https://btw.informatik.uni-rostock.de/download/tagungsband/B2-2.pdf
--
Peter Geoghegan



Re: Deleting older versions in unique indexes to avoid page splits

Victor Yegorov
On Fri, 16 Oct 2020 at 21:12, Peter Geoghegan <[hidden email]> wrote:
> I ran the script for two hours and 16 clients with the patch, then for
> another two hours with master. After that time, all 3 indexes were
> exactly the same size with the patch, but had almost doubled in size
> on master:
>
> aid_pkey_include_abalance: 784,264 pages (or ~5.983 GiB)
> one: 769,545 pages (or ~5.871 GiB)
> two: 770,295 pages (or ~5.876 GiB)
>
> (With the patch, all three indexes were 100% pristine -- they remained
> precisely 411,289 pages in size by the end, which is ~3.137 GiB.)
>
> The TPS/throughput is about what you'd expect for the two hour run:
>
> 18,988.762398 TPS for the patch
> 11,123.551707 TPS for the master branch.

I really like these results, great work!

I'm also wondering how IO numbers changed due to these improvements, shouldn't be difficult to look into.

Peter, according to cfbot the patch no longer compiles.
Can you send an update, please?


--
Victor Yegorov

Re: Deleting older versions in unique indexes to avoid page splits

Peter Geoghegan-4
On Fri, Oct 16, 2020 at 1:00 PM Victor Yegorov <[hidden email]> wrote:
> I really like these results, great work!

Thanks Victor!

> I'm also wondering how IO numbers changed due to these improvements, shouldn't be difficult to look into.

Here is the pg_statio_user_indexes for patch for the same run:

 schemaname |     relname      |       indexrelname        | idx_blks_read | idx_blks_hit
------------+------------------+---------------------------+---------------+---------------
 public     | pgbench_accounts | aid_pkey_include_abalance |    12,828,736 |   534,819,826
 public     | pgbench_accounts | one                       |    12,750,275 |   534,486,742
 public     | pgbench_accounts | two                       |     2,474,893 | 2,216,047,568
(3 rows)

And for master:

 schemaname |     relname      |       indexrelname        | idx_blks_read | idx_blks_hit
------------+------------------+---------------------------+---------------+---------------
 public     | pgbench_accounts | aid_pkey_include_abalance |    29,526,568 |   292,705,432
 public     | pgbench_accounts | one                       |    28,239,187 |   293,164,160
 public     | pgbench_accounts | two                       |     6,505,615 | 1,318,164,692
(3 rows)

Here is pg_statio_user_tables for the patch:

 schemaname |     relname      | heap_blks_read | heap_blks_hit | idx_blks_read | idx_blks_hit
------------+------------------+----------------+---------------+---------------+---------------
 public     | pgbench_accounts |    123,195,496 |   696,805,485 |    28,053,904 | 3,285,354,136
 public     | pgbench_branches |             11 |         1,553 |               |
 public     | pgbench_history  |              0 |             0 |               |
 public     | pgbench_tellers  |             86 |        15,416 |               |
(4 rows)

And the pg_statio_user_tables for master:

 schemaname |     relname      | heap_blks_read | heap_blks_hit | idx_blks_read | idx_blks_hit
------------+------------------+----------------+---------------+---------------+---------------
 public     | pgbench_accounts |    106,502,089 |   334,875,058 |    64,271,370 | 1,904,034,284
 public     | pgbench_branches |             11 |         1,553 |               |
 public     | pgbench_history  |              0 |             0 |               |
 public     | pgbench_tellers  |             86 |        15,416 |               |
(4 rows)
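
(For reference, numbers like these come straight out of the cumulative
statistics views, along these lines:)

SELECT schemaname, relname, indexrelname, idx_blks_read, idx_blks_hit
FROM pg_statio_user_indexes
WHERE relname = 'pgbench_accounts';

SELECT schemaname, relname, heap_blks_read, heap_blks_hit,
       idx_blks_read, idx_blks_hit
FROM pg_statio_user_tables;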

Of course, it isn't fair to make a direct comparison because we're
doing ~1.7x more work with the patch. But even still, the
idx_blks_read is less than half with the patch.

BTW, the extra heap_blks_hit from the patch are not only due to the
fact that the system does more directly useful work. It's also due to
the extra garbage collection triggered in indexes. The same is
probably *not* true with heap_blks_read, though. I minimize the number
of heap pages accessed by the new cleanup mechanism each time, and
temporal locality will help a lot. I think that we delete index
entries pointing to garbage in the heap at pretty predictable
intervals. Heap pages full of LP_DEAD line pointer garbage only get
processed a few times close together in time, after which they're
bound to either get VACUUM'd or get evicted from shared buffers.

> Peter, according to cfbot the patch no longer compiles.
> Can you send an update, please?

Attached is v3, which is rebased against the master branch as of
today. No real changes, though.

--
Peter Geoghegan

Attachment: v3-0001-Add-delete-deduplication-to-nbtree.patch (78K)

Re: Deleting older versions in unique indexes to avoid page splits

Peter Geoghegan-4
On Fri, Oct 16, 2020 at 1:58 PM Peter Geoghegan <[hidden email]> wrote:
> Attached is v3, which is rebased against the master branch as of
> today. No real changes, though.

And now here's v4.

This version adds handling of posting list tuples, which I was
skipping over before. Highly contended leaf pages with posting list
tuples could still sometimes get "logically unnecessary" page splits
in v3, which this seems to fix (barring the most extreme cases of
version churn, where the patch cannot and should not prevent page
splits). Actually, posting list tuple handling was something that we
had in v1 but lost in v2, because v2 changed our general strategy to
focus on what is convenient to the heap/tableam, which is the most
important thing by far (the cost of failing must be a low, fixed, well
understood and well targeted cost). The fix was to include TIDs from
posting lists, while marking them "non-promising". Only plain tuples
that are duplicates are considered promising. Only promising tuples
influence where we look for tuples to kill most of the time. The
exception is when there is an equal number of promising tuples on two
heap pages, where we tiebreak on the total number of TIDs that point
to the heap page from the leaf page.

I seem to be able to cap the costs paid when the optimization doesn't
work out to the extent that we can get away with visiting only *one*
heap page before giving up. And, we now never visit more than 3 total
(2 is the common case when the optimization is really effective). This
may sound conservative -- because it is -- but it seems to work in
practice. I may change my mind about that and decide to be less
conservative, but so far all of the evidence I've seen suggests that
it doesn't matter -- the heuristics seem to really work. Might as well
be completely conservative. I'm *sure* that we can afford one heap
access here -- we currently *always* visit at least one heap tuple in
roughly the same way during each and every unique index tuple insert
(not just when the page fills).

Posting list TIDs are not the only kind of TIDs that are marked
non-promising. We now also include TIDs from non-duplicate tuples. So
we're prepared to kill any of the TIDs on the page, though we only
really expect it to happen with the "promising" tuples (duplicates).
But why not be open to the possibility of killing some extra TIDs in
passing? We don't have to visit any extra heap pages to do so, so it's
practically free. Empirical evidence shows that this happens quite
often.

Here's why this posting list tuple strategy is a good one: we consider
posting list tuple TIDs non-promising to represent that we think that
there are probably actually multiple logical rows involved, or at
least to represent that they didn't work -- simple trial and error
suggests that they aren't very "promising", whatever the true reason
happens to be. But why not "keep an open mind" about the TIDs not each
being for distinct logical rows? If it just so happens that the
posting list TIDs really were multiple versions of the same logical
row all along, then there is a reasonable chance that there'll be even
more versions on that same heap page later on. When this happens and
when we end up back on the same B-Tree leaf page to think about dedup
deletion once again, it's pretty likely that we'll also naturally end
up looking into the later additional versions on this same heap page
from earlier. At which point we won't miss the opportunity to check
the posting list TIDs in passing. So we get to delete the posting
list after all!

(If you're thinking "but we probably won't get lucky like that", then
consider that it doesn't have to happen on the next occasion when
delete deduplication happens on the same leaf page. Just at some point
in the future. This is due to the way we visit the heap pages that
look most promising first. It might take a couple of rounds of dedup
deletion to get back to the same heap page, but that's okay. The
simple heuristics proposed in the patch seem to have some really
interesting and useful consequences in practice. It's hard to quantify
how important these kinds of accidents of locality will be. I haven't
targeted this or that effect -- my heuristics are dead simple, and
based almost entirely on keeping costs down. You can think of it as
"increase the number of TIDs to increase our chances of success" if
you prefer.)

The life cycle of logical rows/heap tuples seems to naturally lead to
these kinds of opportunities. Obviously heapam is naturally very keen
on storing related versions on the same heap block already, without
any input from this patch. The patch is still evolving, and the
overall structure and interface certainly still needs work -- I've
focussed on the algorithm so far. I could really use some feedback on
high level direction, though. It's a relatively short patch, even with
all of my README commentary. But...it's not an easy concept to digest.

Note: I'm not really sure if it's necessary to provide specialized
qsort routines for the sorting we do to lists of TIDs, etc. I did do
some experimentation with that, and it's an open question. So for now
I rely on the patch that Thomas Munro posted to do that a little while
back, which is why that's included here. The question of whether this
is truly needed is unsettled.

--
Peter Geoghegan

Attachment: v4-0001-Add-sort_template.h-for-making-fast-sort-function.patch (18K)
Attachment: v4-0002-Add-delete-deduplication-to-nbtree.patch (86K)

Re: Deleting older versions in unique indexes to avoid page splits

Robert Haas
In reply to this post by Peter Geoghegan-4
On Wed, Oct 7, 2020 at 7:48 PM Peter Geoghegan <[hidden email]> wrote:
> To be blunt: It may be controversial that we're accessing multiple
> heap pages while holding an exclusive lock on a leaf page, in the
> hopes that we can avoid a page split, but without any certainty that
> it'll work out.

That certainly isn't great. I mean, it might not be too terrible,
because a leaf index page isn't nearly as potentially hot as a VM
page or a clog page, but it hurts interruptibility and risks hurting
concurrency. If it were possible to arrange to hold only a pin on
the page during all this rather than a lock, it would be better. I'm
not sure how realistic that is, though.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Deleting older versions in unique indexes to avoid page splits

Peter Geoghegan-4
On Wed, Oct 21, 2020 at 8:25 AM Robert Haas <[hidden email]> wrote:
> That certainly isn't great. I mean, it might not be too terrible,
> because a leaf index page isn't nearly as potentially hot as a VM
> page or a clog page, but it hurts interruptibility and risks hurting
> concurrency. If it were possible to arrange to hold only a pin on
> the page during all this rather than a lock, it would be better. I'm
> not sure how realistic that is, though.

I don't think that it's realistic. Well, technically you could do
something like that, but you'd end up with some logically equivalent
mechanism which would probably be slower. As you know, in nbtree pins
are generally much less helpful than within heapam (you cannot read a
page without a shared buffer lock, no matter what). Holding a pin only
provides a very weak guarantee about VACUUM and TID recycling that
usually doesn't come up.

Bear in mind that we actually do practically the same thing all the
time with the current LP_DEAD setting stuff, where we need to call
compute_xid_horizon_for_tuples/heap_compute_xid_horizon_for_tuples
with a leaf buffer lock held in almost the same way. That's actually
potentially far worse if you look at it in isolation, because you
could potentially have hundreds of heap pages, whereas this is just 1
- 3. (BTW, next version will also do that work in passing, so you're
practically guaranteed to do less with a buffer lock held compared to
the typical case of nbtree LP_DEAD setting, even without counting how
the LP_DEAD bits get set in the first place.)

I could also point out that something very similar happens in
_bt_check_unique().

Also bear in mind that the alternative is pretty much a page split, which means:

* Locking the leaf page

* Then obtaining relation extension lock

* Locking to create new right sibling

* Releasing relation extension lock

* Locking original right sibling page

* Release original right sibling page

* Release new right sibling page

* Lock parent page

* Release original now-split page

* Release parent page

(I will refrain from going into all of the absurd and near-permanent
secondary costs that just giving up and splitting the page imposes for
now. I didn't even include all of the information about locking --
there is one thing that didn't seem worth mentioning.)

The key concept here is of course asymmetry. The asymmetry here is not
only favorable; it's just outrageous. The other key concept is that it's
fundamentally impossible to pay more than a very small fixed cost
without getting a benefit.

That said, I accept that there is still some uncertainty that all
workloads that get a benefit will be happy with the trade-off. I am
still fine tuning how this works in cases with high contention. I
welcome any help with that part. But note that this doesn't
necessarily have much to do with the heap page accesses. It's not
always strictly better to never have any bloat at all (it's pretty
close to that, but not quite). We saw this with the Postgres 12 work,
where small TPC-C test cases had some queries go slower simply because
a small highly contended index did not get bloated due to a smarter
split algorithm. There is no reason to believe that it had anything to
do with the cost of making better decisions. It was the decisions
themselves.

I don't want to completely prevent "version driven page splits"
(though a person could reasonably imagine that that is in fact my
precise goal); rather, I want to make non-HOT updaters work to prove
that it's almost certainly necessary to split the page due to version
churn - then and only then should it be accepted. Currently we meekly
roll over and let non-hot updaters impose negative externalities on
the system as a whole. The patch usually clearly benefits even
workloads that consist entirely of non-hot updaters. Negative
externalities are only good for the individual trying to impose costs
on the collective when they can be a true freeloader. It's always bad
for the collective, but it's even bad for the bad actors once they're
more than a small minority.

Currently non-hot updaters are not merely selfish to the extent that
they impose a downside on the collective or the system as a whole that
is roughly proportionate to the upside benefit they get. Not cleaning
up their mess as they go creates a downside that is a huge multiple of
any possible upside for them. To me this seems incontrovertible.
Worrying about the precise extent to which this is true in each
situation doesn't seem particularly productive to me. Whatever the
actual extent of the imbalance is, the solution is that you don't let
them do that.

This patch is not really about overall throughput. It could be
justified on that basis, but that's not how I like to think of it.
Rather, it's about providing a stabilizing backstop mechanism, which
tends to bound the amount of index bloat and the number of versions in
each index for each *logical row* -- that's the most important benefit
of the patch. There are workloads that will greatly benefit despite
only invoking the new mechanism very occasionally, as a backstop. And
even cases with a fair amount of contention don't really use it that
often (which is why the heap page access cost is pretty much a
question about specific high contention patterns only). The proposed
new cleanup mechanism may only be used in certain parts of the key
space for certain indexes at certain times, in a bottom-up fashion. We
don't have to be eager about cleaning up bloat most of the time, but
it's also true that there are cases where we ought to work very hard
at it in a localized way.

This explanation may sound unlikely, but the existing behaviors taken
together present us with outrageous cost/benefit asymmetry, arguably
in multiple dimensions.

I think that having this backstop cleanup mechanism (and likely others
in other areas) will help to make the assumptions underlying
autovacuum scheduling much more reasonable in realistic settings. Now
it really is okay that autovacuum doesn't really care about the needs
of queries, and is largely concerned with macro level things like free
space management. Its top-down approach isn't so bad once it has true
bottom-up complementary mechanisms. The LP_DEAD microvacuum stuff is
nice because it marks things as dead in passing, pretty much for free.
That's not enough on its own -- it's no backstop. The current LP_DEAD
stuff appears to work rather well, until one day it suddenly doesn't
and you curse Postgres for it. I could go on about the non-linear
nature of the system as a whole, hidden tipping points, and other
stuff like that. But I won't right now.

--
Peter Geoghegan



Re: Deleting older versions in unique indexes to avoid page splits

Robert Haas
On Wed, Oct 21, 2020 at 3:36 PM Peter Geoghegan <[hidden email]> wrote:

> Bear in mind that we actually do practically the same thing all the
> time with the current LP_DEAD setting stuff, where we need to call
> compute_xid_horizon_for_tuples/heap_compute_xid_horizon_for_tuples
> with a leaf buffer lock held in almost the same way. That's actually
> potentially far worse if you look at it in isolation, because you
> could potentially have hundreds of heap pages, whereas this is just 1
> - 3. (BTW, next version will also do that work in passing, so you're
> practically guaranteed to do less with a buffer lock held compared to
> the typical case of nbtree LP_DEAD setting, even without counting how
> the LP_DEAD bits get set in the first place.)

That's fair. It's not that I'm trying to enforce some absolute coding
rule, as if I even had the right to do such a thing. But, people have
sometimes proposed patches which would have caused major regression in
this area and I think it's really important that we avoid that.
Normally, it doesn't matter: I/O requests complete quickly and
everything is fine. But, on a system where things are malfunctioning,
it makes a big difference whether you can regain control by hitting
^C. I expect you've probably at some point had the experience of being
unable to recover control of a terminal window because some process
was stuck in wait state D on the kernel level, and you probably
thought, "well, this sucks." It's even worse if the kernel's entire
process table fills up with such processes. This kind of thing is
essentially the same issue at the database level, and it's smart to do
what we can to mitigate it.

But that being said, I'm not trying to derail this patch. It isn't,
and shouldn't be, the job of this patch to solve that problem. It's
just better if it doesn't regress things, or maybe even (as you say)
makes them a little better. I think the idea you've got here is
basically good, and a lot of it comes down to how well it works in
practice. I completely agree that looking at amortized cost rather
than worst-case cost is a reasonable principle here; you can't take
that to a ridiculous extreme because people also care about
consistency of performance, but it seems pretty clear from your
description that your patch should not create any big problem in that
area, because the worst-case number of extra buffer accesses for a
single operation is tightly bounded. And, of course, containing index
bloat is its own reward.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Deleting older versions in unique indexes to avoid page splits

Simon Riggs
In reply to this post by Peter Geoghegan-4
On Fri, 16 Oct 2020 at 20:12, Peter Geoghegan <[hidden email]> wrote:

> The TPS/throughput is about what you'd expect for the two hour run:
>
> 18,988.762398 TPS for the patch
> 11,123.551707 TPS for the master branch.

Very good.

> Patch:
>
> statement latencies in milliseconds:
>          0.294  UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid1;
>
> Master:
>
> statement latencies in milliseconds:
>          0.604  UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid1;

The average latency is x2. What is the distribution of latencies?
Occasional very long or all uniformly x2?

I would guess that holding the page locks will also slow down SELECT
workload, so I think you should also report that workload as well.

Hopefully that will be better in the latest version.

I wonder whether we can put this work into a background process rather
than pay the cost in the foreground? Perhaps that might not need us to
hold page locks??

--
Simon Riggs                http://www.EnterpriseDB.com/



Re: Deleting older versions in unique indexes to avoid page splits

Peter Geoghegan-4
On Thu, Oct 22, 2020 at 10:12 AM Simon Riggs <[hidden email]> wrote:
> > 18,988.762398 TPS for the patch
> > 11,123.551707 TPS for the master branch.
>
> Very good.

I'm happy with this result, but as I said it's not really the point. I
can probably get up to a 5x or more improvement in TPS if I simply add
enough indexes.

The point is that we're preventing pathological behavior. The patch
does not so much add something helpful as subtract something harmful.
You can contrive a case that has as much of that harmful element as
you like.

> The average latency is x2. What is the distribution of latencies?
> Occasional very long or all uniformly x2?

The latency is generally very even with the patch. There is a constant
hum of cleanup by the new mechanism in the case of the benchmark
workload. As opposed to a cascade of page splits, which occur in
clearly distinct correlated waves.

> I would guess that holding the page locks will also slow down SELECT
> workload, so I think you should also report that workload as well.
>
> Hopefully that will be better in the latest version.

But the same benchmark that you're asking about here has two SELECT
statements and only one UPDATE. It already is read-heavy in that
sense. And we see that the latency is also significantly improved for
the SELECT queries.

Even if there was often a performance hit rather than a benefit (which
is definitely not what we see), it would still probably be worth it.
Users create indexes for a reason. I believe that we are obligated to
maintain the indexes to a reasonable degree, and not just when it
happens to be convenient to do so in passing.

> I wonder whether we can put this work into a background process rather
> than pay the cost in the foreground? Perhaps that might not need us to
> hold page locks??

Holding a lock on the leaf page is unavoidable.

This patch is very effective because it intervenes at precisely the
right moment, in precisely the right place, and only there. We don't
really have to understand anything about workload characteristics to
be sure of this, because it's all based on the enormous asymmetries
I've described, which are so huge that it just seems impossible that
anything else could matter. Trying to do any work in a background
process works against this local-first, bottom-up, laissez-faire
strategy. The strength of the design is in how clever it isn't.

--
Peter Geoghegan


heapam and bottom-up garbage collection, keeping version chains short (Was: Deleting older versions in unique indexes to avoid page splits)

Peter Geoghegan-4
In reply to this post by Robert Haas
On Thu, Oct 22, 2020 at 6:18 AM Robert Haas <[hidden email]> wrote:
> But that being said, I'm not trying to derail this patch. It isn't,
> and shouldn't be, the job of this patch to solve that problem. It's
> just better if it doesn't regress things, or maybe even (as you say)
> makes them a little better. I think the idea you've got here is
> basically good, and a lot of it comes down to how well it works in
> practice.

Thanks. I would like to have a more general conversation about how
some of the principles embodied by this patch could be generalized to
heapam in the future. I think that we could do something similar with
pruning, or at least with the truncation of HOT chains. Here are a few
of the principles I'm thinking of, plus some details of how they might
be applied in heapam:

* The most important metric is not the total number of dead tuples in
the whole table. Rather, it's something like the 90th + 99th
percentile length of version chains for each logical row, or maybe
only those logical rows actually accessed by index scans recently.
(Not saying that we should try to target this in some explicit way -
just that it should be kept low with real workloads as a natural
consequence of the design.)
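
As a toy illustration of the distinction (the numbers below are
invented), the total amount of garbage can look completely harmless to
today's thresholds while the rows that index scans actually visit have
long chains:

# Toy numbers, invented for illustration: a table where the *total*
# amount of garbage looks harmless, but the rows that queries actually
# hit have long version chains.
import statistics

# Hypothetical versions-per-logical-row, for all rows and for the "hot"
# subset that recent index scans touched.
all_rows = [1] * 99_000 + [2] * 900 + [25] * 100
hot_rows = [25] * 100 + [2] * 100                 # the contended rows

dead_fraction = (sum(all_rows) - len(all_rows)) / sum(all_rows)
print(f"dead tuple fraction: {dead_fraction:.1%}")   # looks fine to autovacuum
print("p99 chain length, all rows:",
      statistics.quantiles(all_rows, n=100)[98])
print("p99 chain length, recently scanned rows:",
      statistics.quantiles(hot_rows, n=100)[98])     # what index scans feel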

* Have a variety of strategies for dealing with pruning that compete
with each other. Allow them to fight it out organically. Start with
the cheapest thing and work your way up. So for example, truncation of
HOT chains is always preferred, and works in just the same way as it
does today. Then it gets more complicated and more expensive, but in a
way that is a natural adjunct of the current proven design.

I believe that this creates a bit more pain earlier on, but avoids
much more pain later. There is no point powering ahead if we'll end up
hopelessly indebted with garbage tuples in the end.

For heapam we could escalate from regular pruning by using an old
version store for versions that were originally part of HOT chains,
but were found to not fit on the same page at the point of
opportunistic pruning. The old version store is for committed versions
only -- it isn't really much like UNDO. We leave forwarding
information on the original heap page.

We add old committed versions to the version store at the point when a
heap page is about to experience an event that is roughly the
equivalent of a B-Tree page split -- for heapam a "split" is being
unable to keep the same logical row on the same heap page over time in
the presence of many updates. This is our last line of defense against
this so-called page split situation. It occurs after regular pruning
(which is an earlier line of defense) fails to resolve the situation
inexpensively. We can make moving tuples into the old version store
WAL-efficient by making it work like an actual B-Tree page split -- it
can describe the changes in a way that's mostly logical, and based on
the existing page image. And like an actual page split, we're
amortizing costs in a localized way.

It is also possible to apply differential compression to whole HOT
chains rather than storing them in the old version store, for example,
if that happens to look favorable (think of rows with something like a
pgbench_accounts.filler column -- not uncommon). We have options, and
we can add new options in the future as new requirements come to light.
We allow the best option to present itself to us in a bottom-up
fashion. Sometimes the best strategy at the local level is actually a
combination of two different strategies applied alternately over time.
For example, we use differential compression first when we fail to
prune, then we prune the same page later (a little like merge
deduplication in my recent nbtree delete dedup patch).

Maybe each of these two strategies (differential compression +
traditional heap HOT chain truncation) get applied alternately against
the same heap page over time, in a tick-tock fashion. We naturally
avoid resorting to the old version store structure, which is good,
since that is a relatively expensive strategy that we should apply
only as a last resort. This tick-tock behavior is an emergent property
of the workload rather than something planned or intelligent, and yet
it kind of appears to be an intelligent strategy. (Maybe it works that
way permanently in some parts of the heap, or maybe the same heap
blocks only tick-tock like this on Tuesdays. It may be possible for
stuff like that to happen sanely with well chosen simple heuristics
that exploit asymmetry.)
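
To make the shape of that escalation concrete, here is a deliberately
crude toy model. Every name, size, and "strategy" below is invented for
illustration; none of it resembles real heapam code:

# Toy model of escalating from the cheapest strategy to the most
# expensive one when a heap page can no longer keep the next version of
# a row on the page. Strategies, sizes, and behavior are all invented.

def hot_prune(page):
    # Cheapest option (exists today): drop versions that are dead to everyone.
    freed = sum(size for size, dead in page if dead)
    page[:] = [(size, dead) for size, dead in page if not dead]
    return freed

def differential_compress(page):
    # Hypothetical: whole chains compress well when only a few columns change.
    freed = sum(size - size // 3 for size, _ in page)
    page[:] = [(size // 3, dead) for size, dead in page]
    return freed

def move_to_version_store(page):
    # Hypothetical last resort: relocate all but the latest version off-page.
    freed = sum(size for size, _ in page[:-1])
    page[:] = page[-1:]
    return freed

def make_room(page, needed):
    """Escalate until enough space has been freed; give up only at the end."""
    for strategy in (hot_prune, differential_compress, move_to_version_store):
        needed -= strategy(page)
        if needed <= 0:
            return strategy.__name__
    return "give up: put the successor version on some other heap block"

# Five live versions and two dead ones, 120 bytes each; we need 500 bytes.
page = [(120, False)] * 5 + [(120, True)] * 2
print(make_room(page, needed=500))   # escalates past pruning, then stops

The point is only the control flow: the cheap thing runs every time,
the expensive thing runs only after the cheap thing has already failed,
and giving up remains an acceptable outcome.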

* Work with (not against) the way that Postgres decouples the physical
representation of data from the logical contents of the database more
strongly than other DB systems do. But at the same time, work hard
to make the physical representation of the data as close as is
practically possible to an idealized, imaginary logical version of the
data. Do this because it makes queries faster, not because it's
strictly necessary for concurrency control/recovery/whatever.

Concretely, this mostly means that we try our best to keep each
logical row (i.e. the latest physical version or two of each row)
located on the same physical heap block over time, using the
escalation strategy I described or something like it. Notice that
we're imposing a cost on queries that are arguably responsible for
creating garbage, but only when we really can't tolerate more garbage
collection debt. But if it doesn't work out that's okay -- we tried. I
have a feeling that it will in fact mostly work out. Why shouldn't it
be possible to have no more than one or two uncommitted row versions
in a heap page at any given time, just like with my nbtree patch? (I
think that the answer to that question is "weird workloads", but I'm
okay with the idea that they're a little slower or whatever.)

Notice that this makes the visibility map work better in practice. I
also think that the FSM needs to be taught that it isn't a good thing
to reuse a little fragment of space on its own, because it works
against our goal of trying to avoid relocating rows. The current logic
seems focussed on using every little scrap of free space no matter
what, which seems pretty misguided. Penny-wise, pound-foolish.
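
Purely as an illustration of the policy change I mean (this is not how
the real FSM works, and the threshold is made up):

# Toy sketch of the FSM policy change: only hand out a page's free space
# when there is enough room to be genuinely useful (a whole row plus
# slack for future versions of it), instead of packing new rows into
# every leftover scrap. The threshold and structures are invented.

def choose_target_page(free_space_by_page, row_size, slack=2):
    """Pick a page with room for the row plus headroom, or None to extend."""
    needed = row_size * slack
    candidates = [(free, page) for page, free in free_space_by_page.items()
                  if free >= needed]
    if not candidates:
        return None                    # better to extend the relation
    # Prefer the emptiest qualifying page rather than the tightest fit.
    return max(candidates)[1]

free_space = {1: 90, 2: 180, 3: 4000, 4: 60}     # bytes free per heap page
print(choose_target_page(free_space, row_size=120))   # page 3, not the scraps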

Also notice that fighting to keep the same logical row on the same
block has a non-linear payoff. We don't need to give up on that goal
at the first sign of trouble. If it's hard to do a conventional prune
after succeeding a thousand times then it's okay to work harder. Only
a sucker gives up at the first sign of trouble. We're playing a long
game here. If it becomes so hard that even applying a more aggressive
strategy fails, then it's probably also true that it has become
inherently impossible to sustain the original layout. We graciously
accept defeat and cut our losses, having only actually wasted a little
effort to learn that we need to move our incoming successor version to
some other heap block (especially because the cost is amortized across
versions that live on the same heap block).

* Don't try to completely remove VACUUM. Treat its top-down approach
as complementary to the new bottom-up approaches we add.

There is nothing wrong with taking a long time to clean up garbage
tuples in heap pages that have very little garbage total. In fact I
think that it might be a good thing. Why be in a hurry to dirty the
page again? If it becomes a real problem in the short term then the
bottom-up stuff can take care of it. Under this old-but-new paradigm,
maybe VACUUM has to visit indexes a lot less (maybe it just decides
not to sometimes, based on stats about the indexes, like we see today
with vacuum_index_cleanup = off). VACUUM is for making infrequent
"clean sweeps", though it typically leaves most of the work to new
bottom-up approaches, which are well placed to understand the needs of
queries that touch nearby data. Autovacuum does not presume to
understand the needs of queries at all.

It would also be possible for VACUUM to make more regular sweeps over
the version store without disturbing the main relation under this
model. The version store isn't that different to a separate heap
relation, but naturally doesn't require traditional index cleanup or
freezing, and naturally stores things in approximately temporal order.
So we recycle space in the version store in a relatively eager,
circular fashion, because it naturally contains data that favors such
an approach. We make up the fixed cost of using the separate old
version store structure by reducing deferred costs like this. And by
only using it when it is demonstrably helpful.
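
A toy sketch of that recycling behavior, just to illustrate why the
temporal ordering makes it cheap (the real thing would obviously live
on disk and be WAL-logged; everything here is invented):

# Toy sketch of an old version store recycled in a circular, eager
# fashion. Entries arrive in (roughly) temporal order, so everything
# older than the oldest snapshot can be discarded from the tail without
# any per-tuple index cleanup or freezing.
from collections import deque

class OldVersionStore:
    def __init__(self):
        self.entries = deque()      # (xmax, row_id, old_version), append order

    def add(self, xmax, row_id, old_version):
        self.entries.append((xmax, row_id, old_version))

    def recycle(self, oldest_visible_xid):
        # Recycling is just trimming the tail -- no full scan needed,
        # because the structure is naturally in approximately xid order.
        removed = 0
        while self.entries and self.entries[0][0] < oldest_visible_xid:
            self.entries.popleft()
            removed += 1
        return removed

store = OldVersionStore()
for xid in range(100, 200):
    store.add(xmax=xid, row_id=xid % 10, old_version=b"...")
print("recycled:", store.recycle(oldest_visible_xid=150))   # drops xids 100-149
print("still stored:", len(store.entries))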

It might also make sense for us to prioritize putting heap tuples that
represent versions whose indexed columns were changed by update
(basically a non-HOT update) in the version store -- we work extra
hard on that (and just leave behind an LP_DEAD line pointer). That way
VACUUM can do retail index tuple deletion for the index whose columns
were modified when it finds them in the version store (presumably this
nbtree patch of mine works well enough with the logically unchanged
index entries for other indexes that we don't need to go out of our
way).

I'm sure that there are more than a few holes in this sketch of mine.
It's not really worked out, but it has certain properties that are
generally underappreciated -- especially the thing about the worst
case number of versions per logical row being extremely important, as
well as the idea that back pressure can be a good thing when push
comes to shove -- provided it is experienced locally, and only by
queries that update the same very hot logical rows. Back pressure
needs to be proportionate and approximately fair. Another important
point that I want to call out again here is that we should try to
exploit cost/benefit asymmetry opportunistically. That seems to have
worked out extremely well in my recent B-Tree delete dedup patch.

I don't really expect anybody to take this seriously -- especially as
a total blueprint. Maybe some elements of what I've sketched can be
used as part of some future big effort. You've got to start somewhere.
It has to be incremental.

--
Peter Geoghegan


Re: Deleting older versions in unique indexes to avoid page splits

Simon Riggs
In reply to this post by Peter Geoghegan-4
On Thu, 22 Oct 2020 at 18:42, Peter Geoghegan <[hidden email]> wrote:

> > The average latency is x2. What is the distribution of latencies?
> > Occasional very long or all uniformly x2?
>
> The latency is generally very even with the patch. With the benchmark
> workload there is a constant hum of cleanup by the new mechanism, as
> opposed to a cascade of page splits, which occur in clearly distinct,
> correlated waves.

Please publish details of how long a pre-split cleaning operation
takes and what that does to transaction latency. It *might* be true
that the cost of avoiding splits is worth it when balanced against the
cost of splitting, but it might not be.

You've shown us a very nice paper analysing the page split waves, but
we need a similarly detailed analysis so we can understand whether what you
propose is better or not (and in what situations).

> > I would guess that holding the page locks will also slow down SELECT
> > workload, so I think you should also report that workload as well.
> >
> > Hopefully that will be better in the latest version.
>
> But the same benchmark that you're asking about here has two SELECT
> statements and only one UPDATE. It already is read-heavy in that
> sense. And we see that the latency is also significantly improved for
> the SELECT queries.
>
> Even if there was often a performance hit rather than a benefit (which
> is definitely not what we see), it would still probably be worth it.
> Users create indexes for a reason. I believe that we are obligated to
> maintain the indexes to a reasonable degree, and not just when it
> happens to be convenient to do so in passing.

The leaf page locks are held for longer, so we need to perform
sensible tests that show whether or not this has a catastrophic effect
on related workloads.

The SELECT tests proposed need to be aimed at the same table, at the same time.

> The strength of the design is in how clever it isn't.

What it doesn't do could be good or bad, so we need to review more
details of the behavior. Since the whole idea of the patch is to change
behavior, that seems a reasonable ask. I don't have any doubt about
the validity of the approach or coding.

What you've done so far is very good and I am very positive about
this, well done.

--
Simon Riggs                http://www.EnterpriseDB.com/


Re: Deleting older versions in unique indexes to avoid page splits

Peter Geoghegan-4
On Fri, Oct 23, 2020 at 9:03 AM Simon Riggs <[hidden email]> wrote:
> Please publish details of how long a pre-split cleaning operation
> takes and what that does to transaction latency. It *might* be true
> that the cost of avoiding splits is worth it when balanced against the
> cost of splitting, but it might not be.

I don't think that you understand how the patch works. I cannot very
well isolate that cost because the patch is designed to only pay it
when there is a strong chance of getting a much bigger reward, and
when the only alternative is to split the page. When it fails, the
question naturally doesn't come up again for the same two pages that
follow from the page split. As far as I know, the only cases that are
regressed all involve small indexes with lots of contention, which is
not surprising -- and not necessarily due to the heap page accesses;
making indexes smaller sometimes has that effect, even when it happens
due to things like better page split heuristics.

If anybody finds a serious problem with my patch then it'll be a
weakness or hole in the argument I just made -- it won't have much to
do with how expensive any of these operations are in isolation. It
usually isn't sensible to talk about page splits as isolated things.
Most of my work on B-Trees in the past couple of years built on the
observation that sometimes successive page splits are related to each
other in one way or another.

It is a fallacy of composition to think of the patch as a thing that
prevents some page splits. The patch is valuable because it more or
less eliminates *unnecessary* page splits (and makes it so that there
cannot be very many TIDs for each logical row in affected indexes).
The overall effect is not linear. If you added code to artificially
make the mechanism fail randomly 10% of the time (at the point where
it becomes clear that the current operation would otherwise be
successful), that wouldn't make the resulting code 90% as useful as the
original. It would actually make it approximately 0% as useful. On
human timescales the behavioral difference between this hobbled
version of my patch and the master branch would be almost
imperceptible.
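
The non-linearity is easy to demonstrate with a toy simulation (numbers
invented): a page only has to lose once for the split to happen, and a
split is permanent.

# Toy simulation: a mechanism that succeeds "only" 90% of the time at
# the moment a page would otherwise split saves almost nothing over the
# long run, because a split only has to happen once and is irreversible.
import random

random.seed(1)

def pages_that_never_split(n_pages, fillups_per_page, success_rate):
    survivors = 0
    for _ in range(n_pages):
        # The page survives only if the delete works every single time.
        if all(random.random() < success_rate for _ in range(fillups_per_page)):
            survivors += 1
    return survivors

for rate in (1.0, 0.99, 0.9):
    n = pages_that_never_split(n_pages=1000, fillups_per_page=100,
                               success_rate=rate)
    print(f"success rate {rate:.2f}: {n}/1000 pages never split")

With a 100% success rate every page survives indefinitely; at 90%
essentially none do, which is why the hobbled version would be nearly
indistinguishable from master.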

It's obvious that a page split is more expensive than the delete
operation (when it works out). It doesn't need a microbenchmark (and I
really can't think of one that would make any sense). Page splits
typically have WAL records that are ~4KB in size, whereas the
opportunistic delete records are almost always less than 100 bytes,
and typically close to 64 bytes -- which is the same size as most
individual leaf page insert WAL records. Plus you have almost double
the FPIs going forward with the page split.
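
Back-of-the-envelope arithmetic with those numbers (purely illustrative
-- real WAL volume depends on the workload, how often FPIs are taken,
and checkpoint timing):

# Back-of-the-envelope WAL arithmetic using the approximate record sizes
# mentioned above. All numbers are rough and purely illustrative.
SPLIT_RECORD = 4096      # typical nbtree page split WAL record, ~4KB
DELETE_RECORD = 64       # typical opportunistic delete record, <100 bytes
FPI = 8192               # one extra full-page image per checkpoint cycle
                         # after the split, because there is one more page

deletes_instead_of_split = 5            # invented: a handful of delete passes
split_path = SPLIT_RECORD + FPI
delete_path = deletes_instead_of_split * DELETE_RECORD

print(f"split:   ~{split_path} bytes of WAL, plus a permanently bigger index")
print(f"deletes: ~{delete_path} bytes of WAL")
print(f"ratio:   ~{split_path / delete_path:.0f}x")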

> You've shown us a very nice paper analysing the page split waves, but
> we need a similarly detailed analysis so we can understand whether what you
> propose is better or not (and in what situations).

That paper was just referenced in passing. It isn't essential to the
main argument.

> The leaf page locks are held for longer, so we need to perform
> sensible tests that show whether or not this has a catastrophic effect
> on related workloads.
>
> The SELECT tests proposed need to be aimed at the same table, at the same time.

But that's exactly what I did the first time!

I had two SELECT statements against the same table. They use almost
the same distribution as the UPDATE, so that they'd hit the same part
of the key space without it being exactly the same as the UPDATE from
the same xact in each case (I thought that if it was exactly the same
part of the table then that might unfairly favor my patch).

> > The strength of the design is in how clever it isn't.
>
> What it doesn't do could be good or bad, so we need to review more
> details of the behavior. Since the whole idea of the patch is to change
> behavior, that seems a reasonable ask. I don't have any doubt about
> the validity of the approach or coding.

I agree, but the patch can't be judged as just the sum of its parts. You need to talk
about a workload or a set of conditions, and how things develop over
time.

--
Peter Geoghegan


Re: Deleting older versions in unique indexes to avoid page splits

Simon Riggs
On Fri, 23 Oct 2020 at 18:14, Peter Geoghegan <[hidden email]> wrote:

> It's obvious that a page split is more expensive than the delete
> operation (when it works out).

The problem I highlighted is that the average UPDATE latency is x2
what it is on current HEAD. That is not consistent with the reported
TPS, so it remains an issue and that isn't obvious.

> It doesn't need a microbenchmark (and I
> really can't think of one that would make any sense).

I'm asking for detailed timings so we can understand the latency
issue. I didn't ask for a microbenchmark.

I celebrate your results, but we do need to understand the issue, somehow.

--
Simon Riggs                http://www.EnterpriseDB.com/


Re: Deleting older versions in unique indexes to avoid page splits

Peter Geoghegan-4
On Sat, Oct 24, 2020 at 2:55 AM Simon Riggs <[hidden email]> wrote:
> The problem I highlighted is that the average UPDATE latency is x2
> what it is on current HEAD. That is not consistent with the reported
> TPS, so it remains an issue and that isn't obvious.

Why do you say that? I reported that the UPDATE latency with the patch
is less than half of what it is on master for the benchmark.

There probably are some workloads with worse latency and throughput,
but generally only with high contention/small indexes. I'll try to
fine-tune those, but some amount of it is probably inevitable. On
average, query latency is quite a lot lower with the patch (where it is
affected at all -- the mechanism is only used with non-HOT updates).

--
Peter Geoghegan