Deleting older versions in unique indexes to avoid page splits

Previous Topic Next Topic
classic Classic list List threaded Threaded
1 message Options
Reply | Threaded
Open this post in threaded view

Deleting older versions in unique indexes to avoid page splits

Peter Geoghegan-4
Attached is a POC patch that teaches nbtree to delete old duplicate
versions from unique indexes. The optimization targets non-HOT
duplicate version bloat. Although the patch is rather rough, it
nevertheless manages to more or less eliminate a whole class of index
bloat: Unique index bloat from non-HOT updates in workloads where no
transaction lasts for more than a few seconds. For example, it
eliminates index bloat with a custom pgbench workload that uses an
INCLUDE unique index on pgbench_accounts.aid (with abalance as the
non-key attribute), instead of the usual accounts primary key.
Similarly, with a standard pgbench_accounts primary key alongside an
extra non-unique index on abalance, the primary key will never have
any page splits with the patch applied. It's almost as if the updates
were actually HOT updates, at least if you focus on the unique index
(assuming that there are no long-running transactions).

The patch repurposes the deduplication infrastructure to delete
duplicates within unique indexes, provided they're actually safe to
VACUUM. This is somewhat different to the _bt_unique_check() LP_DEAD
bit setting stuff, in that we have to access heap pages that we
probably would not have to access otherwise -- it's something that we
go out of our way to make happen at the point that the page is about
to split, not something that happens in passing at no extra cost. The
general idea is to exploit the fact that duplicates in unique indexes
are usually deadwood.

We only need to "stay one step ahead" of the bloat to avoid all page
splits in many important cases. So we usually only have to access a
couple of heap pages to avoid a page split in each case. In
traditional serial/identity column primary key indexes, any page split
that happens that isn't a split of the current rightmost page must be
caused by version churn. It should be possible to avoid these
"unnecessary" page splits altogether (again, barring long-running

I would like to get early feedback on high level direction. While the
patch seems quite promising, I am uncertain about my general approach,
and how it might fit into some much broader effort to control bloat in

There are some clear downsides to my approach. The patch has grotty
heuristics that try to limit the extra work performed to avoid page
splits -- the cost of accessing additional heap pages while a buffer
lock is held on the leaf page needs to be kept. under control. No
doubt this regresses some workloads without giving them a clear
benefit. Also, the optimization only ever gets used with unique
indexes, since they're the only case where a duplicate naturally
suggests version churn, which can be targeted fairly directly, and
without wasting too many cycles when it doesn't work out.

It's not at all clear how we could do something like this with
non-unique indexes. One related-though-distinct idea that might be
worth considering occurs to me: teach nbtree to try to set LP_DEAD
bits in non-unique indexes, in about the same way as it will in
_bt_check_unique() for unique indexes. Perhaps the executor could hint
to btinsert()/aminsert() that it's inserting a duplicate caused by a
non-HOT update, so it's worth trying to LP_DEAD nearby duplicates --
especially if they're on the same heap page as the incoming item.

There is a wholly separate question about index bloat that is of long
term strategic importance to the Postgres project: what should we do
about long running transactions? I tend to think that we can address
problems in that area by indicating that it is safe to delete
"intermediate" versions -- tuples that are not too old to be seen by
the oldest transaction, that are nevertheless not needed (they're too
new to be interesting to the old transaction's snapshot, but also too
old to be interesting to any other snapshot). Perhaps this
optimization could be pursued in a phased fashion, starting with index
AMs, where it seems less scary.

I recently read a paper that had some ideas about what we could do
here [1]. IMV it is past time that we thrashed together a "remove
useless intermediate versions" design that is compatible with the
current heapam design.

Peter Geoghegan

0001-Non-opportunistically-delete-B-Tree-items.patch (19K) Download Attachment