Attached is a POC patch that teaches nbtree to delete old duplicate
versions from unique indexes. The optimization targets non-HOT duplicate version bloat. Although the patch is rather rough, it nevertheless manages to more or less eliminate a whole class of index bloat: Unique index bloat from non-HOT updates in workloads where no transaction lasts for more than a few seconds.

For example, it eliminates index bloat with a custom pgbench workload that uses an INCLUDE unique index on pgbench_accounts.aid (with abalance as the non-key attribute), instead of the usual accounts primary key. Similarly, with a standard pgbench_accounts primary key alongside an extra non-unique index on abalance, the primary key will never have any page splits with the patch applied. It's almost as if the updates were actually HOT updates, at least if you focus on the unique index (assuming that there are no long-running transactions).

The patch repurposes the deduplication infrastructure to delete duplicates within unique indexes, provided they're actually safe to VACUUM. This is somewhat different to the _bt_check_unique() LP_DEAD bit setting stuff, in that we have to access heap pages that we probably would not have to access otherwise -- it's something that we go out of our way to make happen at the point that the page is about to split, not something that happens in passing at no extra cost.

The general idea is to exploit the fact that duplicates in unique indexes are usually deadwood. We only need to "stay one step ahead" of the bloat to avoid all page splits in many important cases. So we usually only have to access a couple of heap pages to avoid a page split in each case. In traditional serial/identity column primary key indexes, any page split that happens that isn't a split of the current rightmost page must be caused by version churn. It should be possible to avoid these "unnecessary" page splits altogether (again, barring long-running transactions).

I would like to get early feedback on the high level direction. While the patch seems quite promising, I am uncertain about my general approach, and how it might fit into some much broader effort to control bloat in general.

There are some clear downsides to my approach. The patch has grotty heuristics that try to limit the extra work performed to avoid page splits -- the cost of accessing additional heap pages while a buffer lock is held on the leaf page needs to be kept under control. No doubt this regresses some workloads without giving them a clear benefit. Also, the optimization only ever gets used with unique indexes, since they're the only case where a duplicate naturally suggests version churn, which can be targeted fairly directly, and without wasting too many cycles when it doesn't work out. It's not at all clear how we could do something like this with non-unique indexes.

One related-though-distinct idea that might be worth considering occurs to me: teach nbtree to try to set LP_DEAD bits in non-unique indexes, in about the same way as it will in _bt_check_unique() for unique indexes. Perhaps the executor could hint to btinsert()/aminsert() that it's inserting a duplicate caused by a non-HOT update, so it's worth trying to LP_DEAD nearby duplicates -- especially if they're on the same heap page as the incoming item.

There is a wholly separate question about index bloat that is of long term strategic importance to the Postgres project: what should we do about long running transactions?
I tend to think that we can address problems in that area by indicating that it is safe to delete "intermediate" versions -- tuples that are not too old to be seen by the oldest transaction, that are nevertheless not needed (they're too new to be interesting to the old transaction's snapshot, but also too old to be interesting to any other snapshot). Perhaps this optimization could be pursued in a phased fashion, starting with index AMs, where it seems less scary.

I recently read a paper that had some ideas about what we could do here [1]. IMV it is past time that we thrashed together a "remove useless intermediate versions" design that is compatible with the current heapam design.

[1] https://dl.acm.org/doi/pdf/10.1145/3318464.3389714

--
Peter Geoghegan
On Tue, Jun 30, 2020 at 5:03 PM Peter Geoghegan <[hidden email]> wrote:
> Attached is a POC patch that teaches nbtree to delete old duplicate
> versions from unique indexes. The optimization targets non-HOT
> duplicate version bloat. Although the patch is rather rough, it
> nevertheless manages to more or less eliminate a whole class of index
> bloat: Unique index bloat from non-HOT updates in workloads where no
> transaction lasts for more than a few seconds.

I'm slightly surprised that this thread didn't generate more interest back in June. After all, maintaining the pristine initial state of (say) a primary key index even after many high throughput non-HOT updates (i.e. avoiding "logically unnecessary" page splits entirely) is quite appealing. It arguably goes some way towards addressing long held criticisms of our approach to MVCC. Especially if it can be generalized to all b-tree indexes -- the Uber blog post mentioned tables that have several indexes, which presumably means that there can be no HOT updates (the author of the blog post didn't seem to be aware of HOT at all).

I've been trying to generalize my approach to work with all indexes. I think that I can find a strategy that is largely effective at preventing version churn page splits that take place with workloads that have many non-HOT updates, without any serious downsides for workloads that do not benefit. I want to get feedback on that now, since I expect that it will be controversial. Teaching indexes about how tuples are versioned or chaining tuples seems like a non-starter, so the trick will be to come up with something that behaves in approximately the same way in the cases where it helps.

The approach I have in mind is to pass down a hint from the executor to btinsert(), which lets nbtree know that the index tuple insert is in fact associated with a non-HOT update. This hint is only given when the update did not logically modify the columns that the index covers (so presumably the majority of unchanged indexes get the hint, but not the one or two indexes whose columns were modified by our update statement -- or maybe the non-HOT update was caused by not being able to fit a new version on the same heap page, in which case all the btinsert() calls/all the indexes on the table get the hint).

Of course, this hint makes it all but certain that the index tuple is the successor for some other index tuple located on the same leaf page. We don't actually include information about which other existing tuple it is, since it pretty much doesn't matter. Even if we did, we definitely cannot opportunistically delete it, because it needs to stay in the index at least until our transaction commits (this should be obvious). Actually, we already try to delete it from within _bt_check_unique() today for unique indexes -- we just never opportunistically mark it dead at that point (as I said, it's needed until the xact commits at the earliest).

Here is the maybe-controversial part: The algorithm initially assumes that all indexes more or less have the same properties as unique indexes from a versioning point of view, even though that's clearly not true. That is, it initially assumes that the only reason why there can be duplicates on any leaf page it looks at is because some previous transaction also did a non-HOT update that added a new, unchanged index tuple.
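To make the hint concrete, the executor-side test amounts to something like the following toy stand-alone model (the names here are made up purely for illustration; this is not the patch's code): an index only gets the hint when the UPDATE was non-HOT and none of that index's columns were logically modified.

#include <stdbool.h>
#include <stdio.h>

static bool
column_in(int col, const int *cols, int ncols)
{
    for (int i = 0; i < ncols; i++)
        if (cols[i] == col)
            return true;
    return false;
}

/* true => pass the "likely version churn" hint for this index's insert */
static bool
hint_for_index(bool hot_update, const int *updated, int nupdated,
               const int *indexed, int nindexed)
{
    if (hot_update)
        return false;            /* no index tuples inserted at all */
    for (int i = 0; i < nupdated; i++)
        if (column_in(updated[i], indexed, nindexed))
            return false;        /* this index was logically modified */
    return true;
}

int
main(void)
{
    int updated[] = {2};         /* say, pgbench_accounts.abalance   */
    int pk[] = {1};              /* an index on aid                  */
    int bal[] = {2};             /* a hypothetical index on abalance */

    printf("hint for index on aid:      %d\n",
           hint_for_index(false, updated, 1, pk, 1));   /* 1: unchanged */
    printf("hint for index on abalance: %d\n",
           hint_for_index(false, updated, 1, bal, 1));  /* 0: modified  */
    return 0;
}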
The new algorithm only runs when the hint is passed down from the executor and when the only alternative is to split the page (or have a deduplication pass), so clearly there is some justification for this assumption -- it really is highly unlikely that this update (which is on the verge of splitting the page) just so happened to be the first such update that affected the page. It's extremely likely that there will be at least a couple of other tuples like that on the page, and probably quite a few more. And even if we only manage to free one or two tuples, that'll still generally be enough to fit the incoming tuple. In general that is usually quite valuable. Even with a busy database, that might buy us minutes or hours before the question of splitting the same leaf page arises again. By the time that happens, longer running transactions may have committed, VACUUM may have run, etc. Like unique index deduplication, this isn't about freeing space -- it's about buying time.

To be blunt: It may be controversial that we're accessing multiple heap pages while holding an exclusive lock on a leaf page, in the hopes that we can avoid a page split, but without any certainty that it'll work out. Sometimes (maybe even most of the time), this assumption turns out to be mostly correct, and we benefit in the obvious way -- no "unnecessary" page splits for affected non-unique indexes. Other times it won't work out, usually because the duplicates are in fact logical duplicates from distinct logical rows.

When the new deletion thing doesn't work out, the situation works itself out in the obvious way: we get a deduplication pass. If that doesn't work out we get a page split. So we have three legitimate strategies for resolving the "pressure" against a leaf page: last minute emergency duplicate checks + deletion (the new thing), a deduplication pass, or a page split. The strategies are in competition with each other (though only when we have non-HOT updates).

We're only willing to access a fixed number of heap pages (heap pages pointed to by duplicate index tuples) to try to delete some index tuples and avoid a split, and only in the specific context of the hint being received at the point a leaf page fills and it looks like we might have to split it.

I think that we should probably avoid doing anything with posting list tuples left behind by deduplication, except maybe if there are only 2 or 3 TIDs -- just skip over them. That way, cases with duplicates across logical rows (not version churn) tend to get made into a posting list early (e.g. during an index build), so we don't even bother trying to delete from them later. Non-posting-list duplicates suggest recency, which suggests version churn -- those dups must at least have come after the most recent deduplication pass.

Plus we have heuristics that maximize the chances of finding versions to kill. And we tend to look at the same blocks again and again -- like in the patch I posted, we look at the value with the most dups for things to kill first, and so on. So repeated version deletion passes won't end up accessing totally different heap blocks each time, unless they're successful at killing old versions. (I think that this new feature should be framed as extending the deduplication infrastructure to do deletes -- so it can only be used on indexes that use deduplication.)

Even if this new mechanism ends up slowing down non-HOT updates noticeably -- which is *not* something that I can see with my benchmarks now -- that could still be okay.
I think that there is something sensible about imposing a penalty on non-HOT update queries that can cause problems for everybody today. Why shouldn't they have to clean up their own mess? I think that it buys us a lot to condition cleanup on avoiding page splits, because any possible penalty is only paid in cases where there isn't something else that keeps the bloat under control. If something like the kill_prior_tuple mechanism mostly keeps bloat under control already, then we'll resolve the situation that way instead.

An important point about this technique is that it's just a backstop, so it can run very infrequently while still preventing an index from growing -- an index that can double in size today. If existing mechanisms against "logically unnecessary" page splits are 99% effective today, then they may still be almost useless to users -- your index still doubles in size. It just takes a little longer in Postgres 13 (with deduplication) compared to Postgres 12. So there is a really huge asymmetry that we still aren't doing enough about, even with deduplication.

Deduplication cannot prevent the first "wave" of page splits with primary key style indexes due to the dimensions of the index tuples on the page. The first wave may be all that matters (deduplication does help more when things get really bad, but I'd like to avoid "merely bad" performance characteristics, too).

Consider the simplest possible real world example. If we start out with 366 items on a leaf page initially (the actual number with default fillfactor + 64-bit alignment for the pgbench indexes), we can store another 40 non-HOT dups on the same leaf page before the page splits. We only save 4 bytes by merging a dup into one of the 366 original tuples. It's unlikely that many of the 40 dups that go on the page will be duplicates of *each other*, and deduplication only really starts to save space when posting list tuples have 4 or 5 TIDs each. So eventually all of the original leaf pages split when they really shouldn't, despite the availability of deduplication.

--
Peter Geoghegan
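As a rough sanity check of the 366 / 40 figures, the page geometry works out as in the small stand-alone model below (stock 8192 byte pages, 24 byte page header, 16 byte nbtree special space, 16 byte pgbench index tuples plus 4 byte line pointers, default leaf fillfactor of 90 -- an illustration only, not code from the patch):

#include <stdio.h>

int
main(void)
{
    const int page_size   = 8192;
    const int page_header = 24;        /* page header                         */
    const int special     = 16;        /* nbtree special space                */
    const int tuple       = 16;        /* 8 byte header + int4 key, MAXALIGNed */
    const int line_ptr    = 4;
    const int item        = tuple + line_ptr;
    const int fillfactor  = 90;        /* default leaf fillfactor             */
    const int usable      = page_size - page_header - special;

    int initial = (usable * fillfactor / 100) / item;   /* CREATE INDEX leaves this many */
    int spare   = usable - initial * item - item;       /* minus one slot for the high key */
    int extra   = spare / item;                         /* dups that still fit before a split */

    printf("items right after CREATE INDEX:  %d\n", initial);  /* ~366 */
    printf("extra non-HOT dups before split: %d\n", extra);    /* ~40  */

    /*
     * Merging one dup into an existing plain tuple turns 2 x (16 + 4) = 40
     * bytes into one ~32 byte two-TID posting tuple plus one 4 byte line
     * pointer -- only about 4 bytes saved, as noted above.
     */
    return 0;
}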
> On 8 Oct 2020, at 04:48, Peter Geoghegan <[hidden email]> wrote:
>
> On Tue, Jun 30, 2020 at 5:03 PM Peter Geoghegan <[hidden email]> wrote:
>> Attached is a POC patch that teaches nbtree to delete old duplicate
>> versions from unique indexes. The optimization targets non-HOT
>> duplicate version bloat. Although the patch is rather rough, it
>> nevertheless manages to more or less eliminate a whole class of index
>> bloat: Unique index bloat from non-HOT updates in workloads where no
>> transaction lasts for more than a few seconds.
>
> I'm slightly surprised that this thread didn't generate more interest
> back in June.

The idea looks very interesting. It resembles GiST microvacuum: GiST tries to vacuum a single page before splitting it.

I'm curious how the cost of page deduplication compares to the cost of a page split? Should we do deduplication if the page will still remain 99% full?

Best regards, Andrey Borodin.
On Mon, Oct 12, 2020 at 3:47 AM Andrey Borodin <[hidden email]> wrote:
> The idea looks very interesting.
> It resembles GiST microvacuum: GiST tries to vacuum a single page before splitting it.

AFAICT the GiST microvacuum mechanism is based on the one in nbtree, which is based on setting LP_DEAD bits when index scans find that the TIDs are dead-to-all. That's easy to justify, because it is easy and cheap to save future index scans the trouble of following the TIDs just to discover the same thing for themselves.

The difference here is that we're simply making an intelligent guess -- there have been no index scans, but we're going to do a kind of special index scan at the last minute to see if we can avoid a page split. I think that this is okay because in practice we can do it in a reasonably targeted way. We will never do it with INSERTs, for example (except in unique indexes, though only when we know that there are duplicates because we saw them in _bt_check_unique()). In the worst case we make non-HOT updates a bit slower...and I'm not even sure that that's a bad thing when we don't manage to delete anything. After all, non-HOT updates impose a huge cost on the system. They have "negative externalities".

Another way to look at it is that the mechanism I propose to add takes advantage of "convexity" [1]. (Actually, several of my indexing ideas are based on similar principles -- like the nbtsplitloc.c stuff.)

Attached is v2. It controls the cost of visiting the heap by finding the heap page that has the most TIDs that we might be able to kill (among those that are duplicates on a leaf page). It also adds a hinting mechanism to the executor to avoid uselessly applying the optimization with INSERTs.

> I'm curious how the cost of page deduplication compares to the cost of a page split? Should we do deduplication if the page will still remain 99% full?

It depends on how you measure it, but in general I would say that the cost of traditional Postgres 13 deduplication is much lower. Especially as measured in WAL volume. I also believe that the same is true for this new delete deduplication mechanism. The way we determine which heap pages to visit maximizes the chances that we'll get lucky while minimizing the cost (currently we don't visit more than 2 heap pages unless we get at least one kill -- I think I could be more conservative here without losing much). A page split also has to exclusive lock two other pages (the right sibling page and parent page), so even the locking is perhaps better.

The attached patch can completely or almost completely avoid index bloat in extreme cases with non-HOT updates. This can easily increase throughput by 2x or more, depending on how extreme you make it (i.e. how many indexes you have). It seems like the main problem caused by non-HOT updates is in fact page splits themselves. It is not so much a problem with dirtying of pages.

You can test this with a benchmark like the one that was used for WARM back in 2017:

https://www.postgresql.org/message-id/flat/CABOikdMNy6yowA%2BwTGK9RVd8iw%2BCzqHeQSGpW7Yka_4RSZ_LOQ%40mail.gmail.com

I suspect that it's maybe even more effective than WARM was with this benchmark.

[1] https://fooledbyrandomness.com/ConvexityScience.pdf

--
Peter Geoghegan
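To make the v2 cost control concrete, the targeting rule described above behaves roughly like the toy stand-alone model below (simplified, with made-up names; it is not code from the patch): group the leaf page's duplicate TIDs by heap block, visit the block with the most duplicates first, and never visit a third heap page unless at least one tuple has already been killed.

#include <stdio.h>
#include <stdlib.h>

typedef struct BlockScore
{
    unsigned int block;     /* heap block number              */
    int          ndups;     /* duplicate TIDs pointing at it  */
} BlockScore;

static int
cmp_score(const void *a, const void *b)
{
    const BlockScore *ba = a;
    const BlockScore *bb = b;

    return bb->ndups - ba->ndups;       /* most duplicates first */
}

/* stand-in for visiting a heap block and checking which dups are dead */
static int
kill_dead_dups(unsigned int block)
{
    return block % 2 ? 2 : 0;           /* pretend odd blocks have dead dups */
}

int
main(void)
{
    BlockScore scores[] = {{17, 1}, {4, 5}, {9, 3}};
    int        nblocks = 3;
    int        killed = 0;

    qsort(scores, nblocks, sizeof(BlockScore), cmp_score);

    for (int i = 0; i < nblocks; i++)
    {
        if (i >= 2 && killed == 0)
            break;                      /* no 3rd heap page without a kill */
        killed += kill_dead_dups(scores[i].block);
        printf("visited heap block %u, killed so far: %d\n",
               scores[i].block, killed);
    }
    printf("%s\n", killed > 0 ? "split (probably) avoided"
                              : "fall back to dedup pass / page split");
    return 0;
}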
On 08.10.2020 02:48, Peter Geoghegan wrote:
> On Tue, Jun 30, 2020 at 5:03 PM Peter Geoghegan <[hidden email]> wrote:
>> Attached is a POC patch that teaches nbtree to delete old duplicate versions from unique indexes. The optimization targets non-HOT duplicate version bloat. Although the patch is rather rough, it nevertheless manages to more or less eliminate a whole class of index bloat: Unique index bloat from non-HOT updates in workloads where no transaction lasts for more than a few seconds.
> I'm slightly surprised that this thread didn't generate more interest back in June. After all, maintaining the pristine initial state of (say) a primary key index even after many high throughput non-HOT updates (i.e. avoiding "logically unnecessary" page splits entirely) is quite appealing. It arguably goes some way towards addressing long held criticisms of our approach to MVCC. Especially if it can be generalized to all b-tree indexes -- the Uber blog post mentioned tables that have several indexes, which presumably means that there can be no HOT updates (the author of the blog post didn't seem to be aware of HOT at all).

The idea seems very promising, especially when extended to handle non-unique indexes too.

> I've been trying to generalize my approach to work with all indexes. I think that I can find a strategy that is largely effective at preventing version churn page splits that take place with workloads that have many non-HOT updates, without any serious downsides for workloads that do not benefit. I want to get feedback on that now, since I expect that it will be controversial. Teaching indexes about how tuples are versioned or chaining tuples seems like a non-starter, so the trick will be to come up with something that behaves in approximately the same way in the cases where it helps. The approach I have in mind is to pass down a hint from the executor to btinsert(), which lets nbtree know that the index tuple insert is in fact associated with a non-HOT update. This hint is only given when the update did not logically modify the columns that the index covers.

That's exactly what I wanted to discuss after the first letter. If we could make (non)HOT-updates index specific, I think it could improve performance a lot.

> Here is the maybe-controversial part: The algorithm initially assumes that all indexes more or less have the same properties as unique indexes from a versioning point of view, even though that's clearly not true. That is, it initially assumes that the only reason why there can be duplicates on any leaf page it looks at is because some previous transaction also did a non-HOT update that added a new, unchanged index tuple. The new algorithm only runs when the hint is passed down from the executor and when the only alternative is to split the page (or have a deduplication pass), so clearly there is some justification for this assumption -- it really is highly unlikely that this update (which is on the verge of splitting the page) just so happened to be the first such update that affected the page.

I think that this optimization can affect low cardinality indexes negatively, but it is hard to estimate the impact without tests. Maybe it won't be a big deal, given that we attempt to eliminate old copies not very often and that low cardinality b-trees are already not very useful. Besides, we can always make this thing optional, so that users could tune it to their workload.

> To be blunt: It may be controversial that we're accessing multiple heap pages while holding an exclusive lock on a leaf page, in the hopes that we can avoid a page split, but without any certainty that it'll work out. Sometimes (maybe even most of the time), this assumption turns out to be mostly correct, and we benefit in the obvious way -- no "unnecessary" page splits for affected non-unique indexes. Other times it won't work out, usually because the duplicates are in fact logical duplicates from distinct logical rows.

I wonder how this new feature will interact with physical replication? A replica may have quite a different performance profile.
--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On Wed, Oct 14, 2020 at 7:07 AM Anastasia Lubennikova
<[hidden email]> wrote:
> The idea seems very promising, especially when extended to handle non-unique indexes too.

Thanks!

> That's exactly what I wanted to discuss after the first letter. If we could make (non)HOT-updates index specific, I think it could improve performance a lot.

Do you mean accomplishing the same goal in heapam, by making the optimization more intelligent about which indexes need new versions? We did have a patch that did that back in 2017, as you may recall -- this was called WARM:

https://www.postgresql.org/message-id/flat/CABOikdMNy6yowA%2BwTGK9RVd8iw%2BCzqHeQSGpW7Yka_4RSZ_LOQ%40mail.gmail.com

This didn't go anywhere. I think that this solution is more pragmatic. It's cheap enough to remove it if a better solution becomes available in the future. But this is a pretty good solution by all important measures.

> I think that this optimization can affect low cardinality indexes negatively, but it is hard to estimate the impact without tests. Maybe it won't be a big deal, given that we attempt to eliminate old copies not very often and that low cardinality b-trees are already not very useful. Besides, we can always make this thing optional, so that users could tune it to their workload.

Right. The trick is to pay only a fixed low cost (maybe as low as one heap page access) when we start out, and ratchet it up only if the first heap page access looks promising. And to avoid posting list tuples. Regular deduplication takes place when this fails. It's useful for the usual reasons, but also because this new mechanism learns not to try the posting list TIDs.

> I wonder how this new feature will interact with physical replication? A replica may have quite a different performance profile.

I think of that as equivalent to having a long running transaction on the primary. When I first started working on this patch I thought about having "long running transaction detection". But I quickly realized that that isn't a meaningful concept. A transaction is only truly long running relative to the writes that take place that have obsolete row versions that cannot be cleaned up. It has to be something we can deal with, but it cannot be meaningfully special-cased.

--
Peter Geoghegan
On Wed, Oct 14, 2020 at 7:40 AM Peter Geoghegan <[hidden email]> wrote:
> Right. The trick is to pay only a fixed low cost (maybe as low as one
> heap page access) when we start out, and ratchet it up only if the
> first heap page access looks promising.

Just as an example of how the patch can help, consider the following pgbench variant script:

\set aid1 random_gaussian(1, 100000 * :scale, 2.0)
\set aid2 random_gaussian(1, 100000 * :scale, 2.5)
\set aid3 random_gaussian(1, 100000 * :scale, 2.2)
\set bid random(1, 1 * :scale)
\set tid random(1, 10 * :scale)
\set delta random(-5000, 5000)
BEGIN;
UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid1;
SELECT abalance FROM pgbench_accounts WHERE aid = :aid2;
SELECT abalance FROM pgbench_accounts WHERE aid = :aid3;
END;

(These details you see here are a bit arbitrary; don't worry about the specifics too much.)

Before running the script with pgbench, I initialized pgbench to scale 1500, and made some changes to the indexing (in order to actually test the patch). There was no standard pgbench_accounts PK. Instead, I created a unique index that had an include column, which is enough to make every update a non-HOT update. I also added two more redundant non-unique indexes to create more overhead from non-HOT updates. It looked like this:

create unique index aid_pkey_include_abalance on pgbench_accounts (aid) include (abalance);
create index one on pgbench_accounts (aid);
create index two on pgbench_accounts (aid);

So 3 indexes on the accounts table.

I ran the script for two hours and 16 clients with the patch, then for another two hours with master. After that time, all 3 indexes were exactly the same size with the patch, but had almost doubled in size on master:

aid_pkey_include_abalance: 784,264 pages (or ~5.983 GiB)
one: 769,545 pages (or ~5.871 GiB)
two: 770,295 pages (or ~5.876 GiB)

(With the patch, all three indexes were 100% pristine -- they remained precisely 411,289 pages in size by the end, which is ~3.137 GiB.)

Note that traditional deduplication is used by the indexes I've called "one" and "two" here, but not the include index called "aid_pkey_include_abalance". But it makes little difference, for reasons that will be obvious if you think about what this looks like at the leaf page level. Cases that Postgres 13 deduplication does badly with are often the ones that this new mechanism does well with. Deduplication by deleting and by merging are truly complementary -- I haven't just structured the code that way because it was convenient to use dedup infrastructure just to get the dups at the start. (Yes, it *was* convenient, but there clearly are cases where each mechanism competes initially, before nbtree converges on the best strategy at the local level. So FWIW this patch is a natural and logical extension of the deduplication work in my mind.)

The TPS/throughput is about what you'd expect for the two hour run:

18,988.762398 TPS for the patch
11,123.551707 TPS for the master branch.

This is a ~1.7x improvement, but I can get more than 3x by changing the details at the start -- just add more indexes. I don't think that the precise throughput difference you see here matters. The important point is that we've more or less fixed a pathological set of behaviors that have poorly understood cascading effects.

Full disclosure: I rate limited pgbench to 20k for this run, which probably wasn't significant because neither patch nor master hit that limit for long.
Big latency improvements for that same run, too:

Patch:

statement latencies in milliseconds:
     0.001  \set aid1 random_gaussian(1, 100000 * :scale, 2.0)
     0.000  \set aid2 random_gaussian(1, 100000 * :scale, 2.5)
     0.000  \set aid3 random_gaussian(1, 100000 * :scale, 2.2)
     0.000  \set bid random(1, 1 * :scale)
     0.000  \set tid random(1, 10 * :scale)
     0.000  \set delta random(-5000, 5000)
     0.057  BEGIN;
     0.294  UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid1;
     0.204  SELECT abalance FROM pgbench_accounts WHERE aid = :aid2;
     0.195  SELECT abalance FROM pgbench_accounts WHERE aid = :aid3;
     0.090  END;

Master:

statement latencies in milliseconds:
     0.002  \set aid1 random_gaussian(1, 100000 * :scale, 2.0)
     0.001  \set aid2 random_gaussian(1, 100000 * :scale, 2.5)
     0.001  \set aid3 random_gaussian(1, 100000 * :scale, 2.2)
     0.001  \set bid random(1, 1 * :scale)
     0.000  \set tid random(1, 10 * :scale)
     0.001  \set delta random(-5000, 5000)
     0.084  BEGIN;
     0.604  UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid1;
     0.317  SELECT abalance FROM pgbench_accounts WHERE aid = :aid2;
     0.311  SELECT abalance FROM pgbench_accounts WHERE aid = :aid3;
     0.120  END;

Note that the mechanism added by the patch naturally limits the number of versions that are in the index for each logical row, which seems much more important than the total amount of garbage tuples cleaned up. It's great that the index is about half the size it is on master, but even that is less important than the effect of more or less bounding the worst case number of heap pages accessed by point index scans.

Even though this patch shows big performance improvements (as well as very small performance regressions for small indexes with skew), the patch is mostly about stability. I believe that Postgres users want greater stability and predictability in this area more than anything else.

The behavior of the system as a whole that we see for the master branch here is not anywhere near linear. Page splits are of course expensive, but they also occur in distinct waves [1] and have lasting consequences. They're very often highly correlated, with clear tipping points, so you see relatively sudden slow downs in the real world. Worse still, with skew the number of hot pages that you have can double in a short period of time. This very probably happens at the worst possible time for the user, since the database was likely already organically experiencing very high index writes at the point of experiencing the first wave of splits (splits correlated both within and across indexes on the same table). And, from that point on, the number of FPIs for the same workload also doubles forever (or at least until REINDEX).

[1] https://btw.informatik.uni-rostock.de/download/tagungsband/B2-2.pdf

--
Peter Geoghegan
On Fri, 16 Oct 2020 at 21:12, Peter Geoghegan <[hidden email]> wrote:
> I ran the script for two hours and 16 clients with the patch, then for
> another two hours with master.

I really like these results, great work!

I'm also wondering how the IO numbers changed due to these improvements; it shouldn't be difficult to look into.

Peter, according to cfbot the patch no longer compiles.
Can you send an update, please?

--
Victor Yegorov
On Fri, Oct 16, 2020 at 1:00 PM Victor Yegorov <[hidden email]> wrote:
> I really like these results, great work!

Thanks Victor!

> I'm also wondering how the IO numbers changed due to these improvements; it shouldn't be difficult to look into.

Here is pg_statio_user_indexes for the patch for the same run:

 schemaname |     relname      |       indexrelname        | idx_blks_read | idx_blks_hit
------------+------------------+---------------------------+---------------+---------------
 public     | pgbench_accounts | aid_pkey_include_abalance |    12,828,736 |   534,819,826
 public     | pgbench_accounts | one                       |    12,750,275 |   534,486,742
 public     | pgbench_accounts | two                       |     2,474,893 | 2,216,047,568
(3 rows)

And for master:

 schemaname |     relname      |       indexrelname        | idx_blks_read | idx_blks_hit
------------+------------------+---------------------------+---------------+---------------
 public     | pgbench_accounts | aid_pkey_include_abalance |    29,526,568 |   292,705,432
 public     | pgbench_accounts | one                       |    28,239,187 |   293,164,160
 public     | pgbench_accounts | two                       |     6,505,615 | 1,318,164,692
(3 rows)

Here is pg_statio_user_tables for the patch:

 schemaname |     relname      | heap_blks_read | heap_blks_hit | idx_blks_read | idx_blks_hit  | toast_blks_read | toast_blks_hit | tidx_blks_read | tidx_blks_hit
------------+------------------+----------------+---------------+---------------+---------------+-----------------+----------------+----------------+---------------
 public     | pgbench_accounts |    123,195,496 |   696,805,485 |    28,053,904 | 3,285,354,136 |                 |                |                |
 public     | pgbench_branches |             11 |         1,553 |               |               |                 |                |                |
 public     | pgbench_history  |              0 |             0 |               |               |                 |                |                |
 public     | pgbench_tellers  |             86 |        15,416 |               |               |                 |                |                |
(4 rows)

And pg_statio_user_tables for master:

 schemaname |     relname      | heap_blks_read | heap_blks_hit | idx_blks_read | idx_blks_hit  | toast_blks_read | toast_blks_hit | tidx_blks_read | tidx_blks_hit
------------+------------------+----------------+---------------+---------------+---------------+-----------------+----------------+----------------+---------------
 public     | pgbench_accounts |    106,502,089 |   334,875,058 |    64,271,370 | 1,904,034,284 |                 |                |                |
 public     | pgbench_branches |             11 |         1,553 |               |               |                 |                |                |
 public     | pgbench_history  |              0 |             0 |               |               |                 |                |                |
 public     | pgbench_tellers  |             86 |        15,416 |               |               |                 |                |                |
(4 rows)

Of course, it isn't fair to make a direct comparison because we're doing ~1.7x more work with the patch. But even still, the idx_blks_read is less than half with the patch.

BTW, the extra heap_blks_hit from the patch are not only due to the fact that the system does more directly useful work. It's also due to the extra garbage collection triggered in indexes. The same is probably *not* true with heap_blks_read, though. I minimize the number of heap pages accessed by the new cleanup mechanism each time, and temporal locality will help a lot. I think that we delete index entries pointing to garbage in the heap at pretty predictable intervals. Heap pages full of LP_DEAD line pointer garbage only get processed a few times close together in time, after which they're bound to either get VACUUM'd or get evicted from shared buffers.

> Peter, according to cfbot the patch no longer compiles.
> Can you send an update, please?

Attached is v3, which is rebased against the master branch as of today. No real changes, though.

--
Peter Geoghegan
On Fri, Oct 16, 2020 at 1:58 PM Peter Geoghegan <[hidden email]> wrote:
> Attached is v3, which is rebased against the master branch as of
> today. No real changes, though.

And now here's v4. This version adds handling of posting list tuples, which I was skipping over before. Highly contended leaf pages with posting list tuples could still sometimes get "logically unnecessary" page splits in v3, which this seems to fix (barring the most extreme cases of version churn, where the patch cannot and should not prevent page splits).

Actually, posting list tuple handling was something that we had in v1 but lost in v2, because v2 changed our general strategy to focus on what is convenient to the heap/tableam, which is the most important thing by far (the cost of failing must be a low, fixed, well understood and well targeted cost). The fix was to include TIDs from posting lists, while marking them "non-promising". Only plain tuples that are duplicates are considered promising. Only promising tuples influence where we look for tuples to kill most of the time. The exception is when there is an equal number of promising tuples on two heap pages, where we tiebreak on the total number of TIDs that point to the heap page from the leaf page.

I seem to be able to cap the costs paid when the optimization doesn't work out to the extent that we can get away with visiting only *one* heap page before giving up. And, we now never visit more than 3 total (2 is the common case when the optimization is really effective). This may sound conservative -- because it is -- but it seems to work in practice. I may change my mind about that and decide to be less conservative, but so far all of the evidence I've seen suggests that it doesn't matter -- the heuristics seem to really work. Might as well be completely conservative. I'm *sure* that we can afford one heap access here -- we currently *always* visit at least one heap tuple in roughly the same way during each and every unique index tuple insert (not just when the page fills).

Posting list TIDs are not the only kind of TIDs that are marked non-promising. We now also include TIDs from non-duplicate tuples. So we're prepared to kill any of the TIDs on the page, though we only really expect it to happen with the "promising" tuples (duplicates). But why not be open to the possibility of killing some extra TIDs in passing? We don't have to visit any extra heap pages to do so, so it's practically free. Empirical evidence shows that this happens quite often.

Here's why this posting list tuple strategy is a good one: we consider posting list tuple TIDs non-promising to represent that we think that there are probably actually multiple logical rows involved, or at least to represent that they didn't work -- simple trial and error suggests that they aren't very "promising", whatever the true reason happens to be. But why not "keep an open mind" about the TIDs not each being for distinct logical rows? If it just so happens that the posting list TIDs really were multiple versions of the same logical row all along, then there is a reasonable chance that there'll be even more versions on that same heap page later on. When this happens and when we end up back on the same B-Tree leaf page to think about dedup deletion once again, it's pretty likely that we'll also naturally end up looking into the later additional versions on this same heap page from earlier. At which point we won't miss the opportunity to check the posting list TIDs in passing. So we get to delete the posting list after all!
(If you're thinking "but we probably won't get lucky like that", then consider that it doesn't have to happen on the next occasion when delete deduplication happens on the same leaf page. Just at some point in the future. This is due to the way we visit the heap pages that look most promising first. It might take a couple of rounds of dedup deletion to get back to the same heap page, but that's okay. The simple heuristics proposed in the patch seem to have some really interesting and useful consequences in practice. It's hard to quantify how important these kinds of accidents of locality will be. I haven't targeted this or that effect -- my heuristics are dead simple, and based almost entirely on keeping costs down. You can think of it as "increase the number of TIDs to increase our chances of success" if you prefer.)

The life cycle of logical rows/heap tuples seems to naturally lead to these kinds of opportunities. Obviously heapam is naturally very keen on storing related versions on the same heap block already, without any input from this patch.

The patch is still evolving, and the overall structure and interface certainly still needs work -- I've focussed on the algorithm so far. I could really use some feedback on high level direction, though. It's a relatively short patch, even with all of my README commentary. But...it's not an easy concept to digest.

Note: I'm not really sure if it's necessary to provide specialized qsort routines for the sorting we do to lists of TIDs, etc. I did do some experimentation with that, and it's an open question. So for now I rely on the patch that Thomas Munro posted to do that a little while back, which is why that's included here. The question of whether this is truly needed is unsettled.

--
Peter Geoghegan
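The v4 ordering rule described above (promising TIDs -- plain duplicates -- first, with ties broken by the total number of TIDs pointing at the heap block, which also counts the non-promising posting list and non-duplicate TIDs) comes down to a comparator along these lines; a sketch with invented names, not the patch's code:

#include <stdio.h>
#include <stdlib.h>

typedef struct BlockGroup
{
    unsigned int block;        /* heap block number                             */
    int          npromising;   /* plain duplicate TIDs pointing at the block    */
    int          ntotal;       /* all TIDs pointing at the block, promising or not */
} BlockGroup;

static int
cmp_block_group(const void *a, const void *b)
{
    const BlockGroup *ga = a;
    const BlockGroup *gb = b;

    if (ga->npromising != gb->npromising)
        return gb->npromising - ga->npromising;   /* more promising TIDs first */
    return gb->ntotal - ga->ntotal;               /* tiebreak on total TIDs    */
}

int
main(void)
{
    /* blocks 7 and 12 tie on promising TIDs; 12 wins on total TIDs */
    BlockGroup groups[] = {{3, 1, 1}, {12, 2, 6}, {7, 2, 2}};

    qsort(groups, 3, sizeof(BlockGroup), cmp_block_group);
    for (int i = 0; i < 3; i++)
        printf("visit order %d: block %u\n", i + 1, groups[i].block);
    return 0;
}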
On Wed, Oct 7, 2020 at 7:48 PM Peter Geoghegan <[hidden email]> wrote:
> To be blunt: It may be controversial that we're accessing multiple
> heap pages while holding an exclusive lock on a leaf page, in the
> hopes that we can avoid a page split, but without any certainty that
> it'll work out.

That certainly isn't great. I mean, it might not be too terrible, because a leaf index page isn't nearly as potentially hot as a VM page or a clog page, but it hurts interruptibility and risks hurting concurrency. But if it were possible to arrange to hold only a pin on the page during all this rather than a lock, it would be better. I'm not sure how realistic that is, though.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, Oct 21, 2020 at 8:25 AM Robert Haas <[hidden email]> wrote:
> That certainly isn't great. I mean, it might not be too terrible,
> because a leaf index page isn't nearly as potentially hot as a VM
> page or a clog page, but it hurts interruptibility and risks hurting
> concurrency. But if it were possible to arrange to hold only a pin on
> the page during all this rather than a lock, it would be better. I'm
> not sure how realistic that is, though.

I don't think that it's realistic. Well, technically you could do something like that, but you'd end up with some logically equivalent mechanism which would probably be slower. As you know, in nbtree pins are generally much less helpful than within heapam (you cannot read a page without a shared buffer lock, no matter what). Holding a pin only provides a very weak guarantee about VACUUM and TID recycling that usually doesn't come up.

Bear in mind that we actually do practically the same thing all the time with the current LP_DEAD setting stuff, where we need to call compute_xid_horizon_for_tuples/heap_compute_xid_horizon_for_tuples with a leaf buffer lock held in almost the same way. That's actually potentially far worse if you look at it in isolation, because you could potentially have hundreds of heap pages, whereas this is just 1 - 3. (BTW, next version will also do that work in passing, so you're practically guaranteed to do less with a buffer lock held compared to the typical case of nbtree LP_DEAD setting, even without counting how the LP_DEAD bits get set in the first place.)

I could also point out that something very similar happens in _bt_check_unique().

Also bear in mind that the alternative is pretty much a page split, which means:

* Locking the leaf page
* Then obtaining relation extension lock
* Locking to create new right sibling
* Releasing relation extension lock
* Locking original right sibling page
* Release original right sibling page
* Release new right sibling page
* Lock parent page
* Release original now-split page
* Release parent page

(I will refrain from going into all of the absurd and near-permanent secondary costs that just giving up and splitting the page imposes for now. I didn't even include all of the information about locking -- there is one thing that didn't seem worth mentioning.)

The key concept here is of course asymmetry. The asymmetry here is not only favorable; it's just outrageous. The other key concept is that it's fundamentally impossible to pay more than a very small fixed cost without getting a benefit.

That said, I accept that there is still some uncertainty that all workloads that get a benefit will be happy with the trade-off. I am still fine tuning how this works in cases with high contention. I welcome any help with that part. But note that this doesn't necessarily have much to do with the heap page accesses. It's not always strictly better to never have any bloat at all (it's pretty close to that, but not quite). We saw this with the Postgres 12 work, where small TPC-C test cases had some queries go slower simply because a small highly contended index did not get bloated due to a smarter split algorithm. There is no reason to believe that it had anything to do with the cost of making better decisions. It was the decisions themselves.

I don't want to completely prevent "version driven page splits" (though a person could reasonably imagine that that is in fact my precise goal); rather, I want to make non-HOT updates work to prove that it's almost certainly necessary to split the page due to version churn -- then and only then should it be accepted.
Currently we meekly roll over and let non-HOT updaters impose negative externalities on the system as a whole. The patch usually clearly benefits even workloads that consist entirely of non-HOT updaters. Negative externalities are only good for the individual trying to impose costs on the collective when they can be a true freeloader. It's always bad for the collective, but it's even bad for the bad actors once they're more than a small minority.

Currently non-HOT updaters are not merely selfish to the extent that they impose a downside on the collective or the system as a whole that is roughly proportionate to the upside benefit they get. Not cleaning up their mess as they go creates a downside that is a huge multiple of any possible upside for them. To me this seems incontrovertible. Worrying about the precise extent to which this is true in each situation doesn't seem particularly productive to me. Whatever the actual extent of the imbalance is, the solution is that you don't let them do that.

This patch is not really about overall throughput. It could be justified on that basis, but that's not how I like to think of it. Rather, it's about providing a stabilizing backstop mechanism, which tends to bound the amount of index bloat and the number of versions in each index for each *logical row* -- that's the most important benefit of the patch. There are workloads that will greatly benefit despite only invoking the new mechanism very occasionally, as a backstop. And even cases with a fair amount of contention don't really use it that often (which is why the heap page access cost is pretty much a question about specific high contention patterns only). The proposed new cleanup mechanism may only be used in certain parts of the key space for certain indexes at certain times, in a bottom-up fashion. We don't have to be eager about cleaning up bloat most of the time, but it's also true that there are cases where we ought to work very hard at it in a localized way.

This explanation may sound unlikely, but the existing behaviors taken together present us with outrageous cost/benefit asymmetry, arguably in multiple dimensions.

I think that having this backstop cleanup mechanism (and likely others in other areas) will help to make the assumptions underlying autovacuum scheduling much more reasonable in realistic settings. Now it really is okay that autovacuum doesn't really care about the needs of queries, and is largely concerned with macro level things like free space management. Its top down approach isn't so bad once it has true bottom up complementary mechanisms.

The LP_DEAD microvacuum stuff is nice because it marks things as dead in passing, pretty much for free. That's not enough on its own -- it's no backstop. The current LP_DEAD stuff appears to work rather well, until one day it suddenly doesn't and you curse Postgres for it. I could go on about the non-linear nature of the system as a whole, hidden tipping points, and other stuff like that. But I won't right now.

--
Peter Geoghegan
On Wed, Oct 21, 2020 at 3:36 PM Peter Geoghegan <[hidden email]> wrote:
> Bear in mind that we actually do practically the same thing all the
> time with the current LP_DEAD setting stuff, where we need to call
> compute_xid_horizon_for_tuples/heap_compute_xid_horizon_for_tuples
> with a leaf buffer lock held in almost the same way. That's actually
> potentially far worse if you look at it in isolation, because you
> could potentially have hundreds of heap pages, whereas this is just 1
> - 3. (BTW, next version will also do that work in passing, so you're
> practically guaranteed to do less with a buffer lock held compared to
> the typical case of nbtree LP_DEAD setting, even without counting how
> the LP_DEAD bits get set in the first place.)

That's fair. It's not that I'm trying to enforce some absolute coding rule, as if I even had the right to do such a thing. But, people have sometimes proposed patches which would have caused major regressions in this area and I think it's really important that we avoid that. Normally, it doesn't matter: I/O requests complete quickly and everything is fine. But, on a system where things are malfunctioning, it makes a big difference whether you can regain control by hitting ^C. I expect you've probably at some point had the experience of being unable to recover control of a terminal window because some process was stuck in wait state D on the kernel level, and you probably thought, "well, this sucks." It's even worse if the kernel's entire process table fills up with such processes. This kind of thing is essentially the same issue at the database level, and it's smart to do what we can to mitigate it.

But that being said, I'm not trying to derail this patch. It isn't, and shouldn't be, the job of this patch to solve that problem. It's just better if it doesn't regress things, or maybe even (as you say) makes them a little better. I think the idea you've got here is basically good, and a lot of it comes down to how well it works in practice. I completely agree that looking at amortized cost rather than worst-case cost is a reasonable principle here; you can't take that to a ridiculous extreme because people also care about consistency of performance, but it seems pretty clear from your description that your patch should not create any big problem in that area, because the worst-case number of extra buffer accesses for a single operation is tightly bounded. And, of course, containing index bloat is its own reward.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Fri, 16 Oct 2020 at 20:12, Peter Geoghegan <[hidden email]> wrote:
> The TPS/throughput is about what you'd expect for the two hour run:
>
> 18,988.762398 TPS for the patch
> 11,123.551707 TPS for the master branch.

Very good.

> Patch:
>
> statement latencies in milliseconds:
>      0.294  UPDATE pgbench_accounts SET abalance = abalance +
> :delta WHERE aid = :aid1;
>
> Master:
>
> statement latencies in milliseconds:
>      0.604  UPDATE pgbench_accounts SET abalance = abalance +
> :delta WHERE aid = :aid1;

The average latency is x2. What is the distribution of latencies? Occasional very long or all uniformly x2?

I would guess that holding the page locks will also slow down SELECT workload, so I think you should also report that workload as well. Hopefully that will be better in the latest version.

I wonder whether we can put this work into a background process rather than pay the cost in the foreground? Perhaps that might not need us to hold page locks??

--
Simon Riggs                http://www.EnterpriseDB.com/
On Thu, Oct 22, 2020 at 10:12 AM Simon Riggs <[hidden email]> wrote:
> > 18,988.762398 TPS for the patch
> > 11,123.551707 TPS for the master branch.
>
> Very good.

I'm happy with this result, but as I said it's not really the point. I can probably get up to a 5x or more improvement in TPS if I simply add enough indexes. The point is that we're preventing pathological behavior. The patch does not so much add something helpful as subtract something harmful. You can contrive a case that has as much of that harmful element as you like.

> The average latency is x2. What is the distribution of latencies?
> Occasional very long or all uniformly x2?

The latency is generally very even with the patch. There is a constant hum of cleanup by the new mechanism in the case of the benchmark workload. As opposed to a cascade of page splits, which occur in clearly distinct correlated waves.

> I would guess that holding the page locks will also slow down SELECT
> workload, so I think you should also report that workload as well.
>
> Hopefully that will be better in the latest version.

But the same benchmark that you're asking about here has two SELECT statements and only one UPDATE. It already is read-heavy in that sense. And we see that the latency is also significantly improved for the SELECT queries.

Even if there was often a performance hit rather than a benefit (which is definitely not what we see), it would still probably be worth it. Users create indexes for a reason. I believe that we are obligated to maintain the indexes to a reasonable degree, and not just when it happens to be convenient to do so in passing.

> I wonder whether we can put this work into a background process rather
> than pay the cost in the foreground? Perhaps that might not need us to
> hold page locks??

Holding a lock on the leaf page is unavoidable. This patch is very effective because it intervenes at precisely the right moment in precisely the right place only. We don't really have to understand anything about workload characteristics to be sure of this, because it's all based on the enormous asymmetries I've described, which are so huge that it just seems impossible that anything else could matter.

Trying to do any work in a background process works against this local-first, bottom-up laissez faire strategy. The strength of the design is in how clever it isn't.

--
Peter Geoghegan
On Thu, Oct 22, 2020 at 6:18 AM Robert Haas <[hidden email]> wrote:
> But that being said, I'm not trying to derail this patch. It isn't, > and shouldn't be, the job of this patch to solve that problem. It's > just better if it doesn't regress things, or maybe even (as you say) > makes them a little better. I think the idea you've got here is > basically good, and a lot of it comes down to how well it works in > practice. Thanks. I would like to have a more general conversation about how some of the principles embodied by this patch could be generalized to heapam in the future. I think that we could do something similar with pruning, or at least with the truncation of HOT chains. Here are a few of the principles I'm thinking of, plus some details of how they might be applied in heapam: * The most important metric is not the total amount of dead tuples in the whole table. Rather, it's something like the 90th + 99th percentile length of version chains for each logical row, or maybe only those logical rows actually accessed by index scans recently. (Not saying that we should try to target this in some explicit way - just that it should be kept low with real workloads as a natural consequence of the design.) * Have a variety of strategies for dealing with pruning that compete with each other. Allow them to fight it out organically. Start with the cheapest thing and work your way up. So for example, truncation of HOT chains is always preferred, and works in just the same way as it does today. Then it gets more complicated and more expensive, but in a way that is a natural adjunct of the current proven design. I believe that this creates a bit more pain earlier on, but avoids much more pain later. There is no point powering ahead if we'll end up hopelessly indebted with garbage tuples in the end. For heapam we could escalate from regular pruning by using an old version store for versions that were originally part of HOT chains, but were found to not fit on the same page at the point of opportunistic pruning. The old version store is for committed versions only -- it isn't really much like UNDO. We leave forwarding information on the original heap page. We add old committed versions to the version store at the point when a heap page is about to experience an event that is roughly the equivalent of a B-Tree page split -- for heapam a "split" is being unable to keep the same logical row on the same heap page over time in the presence of many updates. This is our last line of defense against this so-called page split situation. It occurs after regular pruning (which is an earlier line of defense) fails to resolve the situation inexpensively. We can make moving tuples into the old version store WAL efficient by making it work like an actual B-Tree page split -- it can describe the changes in a way that's mostly logical, and based on the existing page image. And like an actual page split, we're amortizing costs in a localized way. It is also possible to apply differential compression to whole HOT chains rather than storing them in the old version store, for example, if that happens to look favorable (think of rows with something like a pgbench_accounts.filler column -- not uncommon). We have options, we can add new options in the future as new requirements come to light. We allow the best option to present itself to us in a bottom-up fashion. Sometimes the best strategy at the local level is actually a combination of two different strategies applied alternately over time. 
For example, we use differential compression first when we fail to prune, then we prune the same page later (a little like merge deduplication in my recent nbtree delete dedup patch). Maybe each of these two strategies (differential compression + traditional heap HOT chain truncation) get applied alternately against the same heap page over time, in a tick-tock fashion. We naturally avoid availing of the old version store structure, which is good, since that is a relatively expensive strategy that we should apply only as a last resort. This tick-tock behavior is an emergent property of the workload rather than something planned or intelligent, and yet it kind of appears to be an intelligent strategy. (Maybe it works that way permanently in some parts of the heap, or maybe the same heap blocks only tick-tock like this on Tuesdays. It may be possible for stuff like that to happen sanely with well chosen simple heuristics that exploit asymmetry.) * Work with (not against) the way that Postgres strongly decouples the physical representation of data from the logical contents of the database compared to other DB systems. But at the same time, work hard to make the physical representation of the data as close as is practically possible to an idealized, imaginary logical version of the data. Do this because it makes queries faster, not because it's strictly necessary for concurrency control/recovery/whatever. Concretely, this mostly means that we try our best to keep each logical row (i.e. the latest physical version or two of each row) located on the same physical heap block over time, using the escalation strategy I described or something like it. Notice that we're imposing a cost on queries that are arguably responsible for creating garbage, but only when we really can't tolerate more garbage collection debt. But if it doesn't work out that's okay -- we tried. I have a feeling that it will in fact mostly work out. Why shouldn't it be possible to have no more than one or two uncommitted row versions in a heap page at any given time, just like with my nbtree patch? (I think that the answer to that question is "weird workloads", but I'm okay with the idea that they're a little slower or whatever.) Notice that this makes the visibility map work better in practice. I also think that the FSM needs to be taught that it isn't a good thing to reuse a little fragment of space on its own, because it works against our goal of trying to avoid relocating rows. The current logic seems focussed on using every little scrap of free space no matter what, which seems pretty misguided. Penny-wise, pound-foolish. Also notice that fighting to keep the same logical row on the same block has a non-linear payoff. We don't need to give up on that goal at the first sign of trouble. If it's hard to do a conventional prune after succeeding a thousand times then it's okay to work harder. Only a sucker gives up at the first sign of trouble. We're playing a long game here. If it becomes so hard that even applying a more aggressive strategy fails, then it's probably also true that it has become inherently impossible to sustain the original layout. We graciously accept defeat and cut our losses, having only actually wasted a little effort to learn that we need to move our incoming successor version to some other heap block (especially because the cost is amortized across versions that live on the same heap block). * Don't try to completely remove VACUUM. 
* Don't try to completely remove VACUUM. Treat its top-down approach as
complementary to the new bottom-up approaches we add. There is nothing
wrong with taking a long time to clean up garbage tuples in heap pages
that have very little garbage in total. In fact I think that it might be
a good thing. Why be in a hurry to dirty the page again? If it becomes a
real problem in the short term then the bottom-up stuff can take care of
it. Under this old-but-new paradigm, maybe VACUUM has to visit indexes a
lot less (maybe it just decides not to sometimes, based on stats about
the indexes, like we see today with vacuum_index_cleanup = off). VACUUM
is for making infrequent "clean sweeps", though it typically leaves most
of the work to new bottom-up approaches that are well placed to
understand the needs of queries that touch nearby data. Autovacuum does
not presume to understand the needs of queries at all.

It would also be possible under this model for VACUUM to make more
regular sweeps over the version store without disturbing the main
relation. The version store isn't that different from a separate heap
relation, but it naturally doesn't require traditional index cleanup or
freezing, and it naturally stores things in approximately temporal
order. So we recycle space in the version store in a relatively eager,
circular fashion, because it naturally contains data that favors such an
approach. We make up the fixed cost of using the separate old version
store structure by reducing deferred costs like this -- and by only
using it when it is demonstrably helpful.

It might also make sense for us to prioritize putting heap tuples that
represent versions whose indexed columns were changed by an update
(basically a non-HOT update) into the version store -- we work extra
hard on that (and just leave behind an LP_DEAD line pointer). That way
VACUUM can do retail index tuple deletion for the index whose columns
were modified when it finds them in the version store (presumably this
nbtree patch of mine works well enough with the logically unchanged
index entries for other indexes that we don't need to go out of our
way).

I'm sure that there are more than a few holes in this sketch of mine.
It's not really worked out, but it has certain properties that are
generally underappreciated -- especially the thing about the worst-case
number of versions per logical row being extremely important, as well as
the idea that back pressure can be a good thing when push comes to shove
-- provided it is experienced locally, and only by queries that update
the same very hot logical rows. Back pressure needs to be proportionate
and approximately fair.

Another important point that I want to call out again here is that we
should try to exploit cost/benefit asymmetry opportunistically. That
seems to have worked out extremely well in my recent B-Tree delete dedup
patch.

I don't really expect anybody to take this seriously -- especially as a
total blueprint. Maybe some elements of what I've sketched can be used
as part of some future big effort. You've got to start somewhere. It has
to be incremental.

--
Peter Geoghegan
On Thu, 22 Oct 2020 at 18:42, Peter Geoghegan <[hidden email]> wrote:
> > The average latency is x2. What is the distribution of latencies?
> > Occasional very long or all uniformly x2?
>
> The latency is generally very even with the patch. There is a constant
> hum of cleanup by the new mechanism in the case of the benchmark
> workload. As opposed to a cascade of page splits, which occur in
> clearly distinct correlated waves.

Please publish details of how long a pre-split cleaning operation takes
and what that does to transaction latency. It *might* be true that the
cost of avoiding splits is worth it in balance against the cost of
splitting, but it might not. You've shown us a very nice paper analysing
the page split waves, but we need a similar detailed analysis so we can
understand if what you propose is better or not (and in what
situations).

> > I would guess that holding the page locks will also slow down SELECT
> > workload, so I think you should also report that workload as well.
> >
> > Hopefully that will be better in the latest version.
>
> But the same benchmark that you're asking about here has two SELECT
> statements and only one UPDATE. It already is read-heavy in that
> sense. And we see that the latency is also significantly improved for
> the SELECT queries.
>
> Even if there was often a performance hit rather than a benefit (which
> is definitely not what we see), it would still probably be worth it.
> Users create indexes for a reason. I believe that we are obligated to
> maintain the indexes to a reasonable degree, and not just when it
> happens to be convenient to do so in passing.

The leaf page locks are held for longer, so we need to perform sensible
tests that show if this has a catastrophic effect on related workloads,
or not. The SELECT tests proposed need to be aimed at the same table, at
the same time.

> The strength of the design is in how clever it isn't.

What it doesn't do could be good or bad, so we need to review more
details on behavior. Since the whole idea of the patch is to change
behavior, that seems a reasonable ask. I don't have any doubt about the
validity of the approach or coding.

What you've done so far is very good and I am very positive about this,
well done.

--
Simon Riggs
http://www.EnterpriseDB.com/
On Fri, Oct 23, 2020 at 9:03 AM Simon Riggs <[hidden email]> wrote:
> Please publish details of how long a pre-split cleaning operation
> takes and what that does to transaction latency. It *might* be true
> that the cost of avoiding splits is worth it in balance against the
> cost of splitting, but it might not.

I don't think that you understand how the patch works. I cannot very
well isolate that cost, because the patch is designed to only pay it
when there is a strong chance of getting a much bigger reward, and when
the only alternative is to split the page. When it fails, the question
naturally doesn't come up again for the two pages that result from the
page split.

As far as I know, the only cases that are regressed all involve small
indexes with lots of contention, which is not surprising. And not
necessarily due to the heap page accesses -- making indexes smaller
sometimes has that effect, even when it happens due to things like
better page split heuristics.

If anybody finds a serious problem with my patch then it'll be a
weakness or hole in the argument I just made -- it won't have much to do
with how expensive any of these operations are in isolation. It usually
isn't sensible to talk about page splits as isolated events. Most of my
work on B-Trees in the past couple of years has built on the observation
that successive page splits are sometimes related to each other in one
way or another.

It is a fallacy of composition to think of the patch as a thing that
prevents some page splits. The patch is valuable because it more or less
eliminates *unnecessary* page splits (and makes it so that there cannot
be very many TIDs for each logical row in affected indexes). The overall
effect is not linear. If you added code to artificially make the
mechanism fail randomly 10% of the time (at the point where it becomes
clear that the current operation would otherwise be successful), that
wouldn't make the resulting code 90% as useful as the original. It would
actually make it approximately 0% as useful. On human timescales the
behavioral difference between this hobbled version of my patch and the
master branch would be almost imperceptible.

It's obvious that a page split is more expensive than the delete
operation (when it works out). It doesn't need a microbenchmark (and I
really can't think of one that would make any sense). Page splits
typically have WAL records that are ~4KB in size, whereas the
opportunistic delete records are almost always less than 100 bytes, and
typically close to 64 bytes -- which is the same size as most individual
leaf page insert WAL records. Plus you have almost double the FPIs going
forward with the page split.

> You've shown us a very nice paper analysing the page split waves, but
> we need a similar detailed analysis so we can understand if what you
> propose is better or not (and in what situations).

That paper was just referenced in passing. It isn't essential to the
main argument.

> The leaf page locks are held for longer, so we need to perform
> sensible tests that show if this has a catastrophic effect on related
> workloads, or not.
>
> The SELECT tests proposed need to be aimed at the same table, at the
> same time.

But that's exactly what I did the first time! I had two SELECT
statements against the same table. They use almost the same distribution
as the UPDATE, so that they'd hit the same part of the key space without
being exactly the same as the UPDATE from the same xact in each case (I
thought that if it was exactly the same part of the table, that might
unfairly favor my patch).
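To put rough numbers on the WAL comparison above (taking the ~4KB split
record and ~64-100 byte delete record figures at face value, and
assuming the extra full-page image is on the order of the 8KB block
size):

    page split:        ~4096 byte split record + ~8192 byte extra FPI = ~12KB of WAL
    successful delete: ~64-100 byte delete record
    rough ratio:       around two orders of magnitude less WAL when the delete works out

This is only back-of-envelope arithmetic -- actual WAL volume depends on
checkpoint timing, FPI compression, and how often the mechanism
succeeds.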
> > The strength of the design is in how clever it isn't.
>
> What it doesn't do could be good or bad so we need to review more
> details on behavior. Since the whole idea of the patch is to change
> behavior, that seems a reasonable ask. I don't have any doubt about
> the validity of the approach or coding.

I agree, but the patch isn't the sum of its parts. You need to talk
about a workload or a set of conditions, and how things develop over
time.

--
Peter Geoghegan
On Fri, 23 Oct 2020 at 18:14, Peter Geoghegan <[hidden email]> wrote:
> It's obvious that a page split is more expensive than the delete
> operation (when it works out).

The problem I highlighted is that the average UPDATE latency is x2 what
it is on current HEAD. That is not consistent with the reported TPS, so
it remains an issue and that isn't obvious.

> It doesn't need a microbenchmark (and I really can't think of one that
> would make any sense).

I'm asking for detailed timings so we can understand the latency issue.
I didn't ask for a microbenchmark. I celebrate your results, but we do
need to understand the issue, somehow.

--
Simon Riggs
http://www.EnterpriseDB.com/
On Sat, Oct 24, 2020 at 2:55 AM Simon Riggs <[hidden email]> wrote:
> The problem I highlighted is that the average UPDATE latency is x2
> what it is on current HEAD. That is not consistent with the reported
> TPS, so it remains an issue and that isn't obvious.

Why do you say that? I reported that the UPDATE latency is less than
half for the benchmark.

There probably are some workloads with worse latency and throughput, but
generally only with high contention/small indexes. I'll try to fine-tune
those, but some amount of that is probably inevitable. On average, query
latency is quite a lot lower with the patch (where it is affected at all
-- the mechanism is only used with non-HOT updates).

--
Peter Geoghegan