I've not yet been involved in postgresql's development process, so here's a first. Please find attached a patch for improving the BT binary search speeds, with accompanying performance data.
v1-0001 adapts _bt_compare and _bt_binsrch* to make use of static properties of L&Y-style nbtree indexes to speed up finding an initial position in the nbtee.
I alter the logic from _bt_compare to accept a 'start-comparing-at-column' argument, and to also return which column the comparison result came from. This allows us to keep track of which columns we've already checked for equality, and when getting deeper into the index this allows us to skip checking the already equal key columns.
Specifically, when binary-searching through a page, we keep track of which column was checked for the high-key, and which for the low-key. The first column of these that is not yet equal will be used for the next comparison. Any columns before this column must be equal, as both the high- and the low-keys in that page have already been validated as having an equal prefix. The column number is then passed on through the insertion key, allowing the optimization to be used in the climbing of the nbtree index.
v1-0001 will be especially performant in nbtree indexes with a relatively low initial cardinality and high compare cost. My own performance testing was done on a laptop (4c/8t), after first getting it warm with other benchmarks & compilations, so the results are a bit unstable.
While testing this I also noticed that there were a lot of pipeline stalls around the arguments and results of _bt_compare in the hot loops of _bt_binsrch* where the new functionality would be used, so I've moved the core logic of _bt_compare to a static inline function _bt_compare_inline, which helped the code to go from a 2% TPS decrease for single-column indexes to ~ 8% TPS increase for an insert-only workload, and for 3-column text all-collated indexes a TPS increase of 20% on my system. This also allowed me to keep the signature of _bt_compare the same as it was, limiting the scope of the changes to only the named functions.
Finally, this could be a great start on prefix truncation for btree indexes, though that is _not_ the goal of this patch. This patch skips, but does not truncate, the common prefix.
Matthias van de Meent
P.S. One more thing I noticed in benchmarking is that a significant part of the costs of non-default collations is in looking up the collation twice in the collation cache. My guess from flame graphs is that there could be a large throughput increase for short strings if the collation lookup from lc_collate_is_c() in varstr_cmp could be reused in the subsequent call to pg_newlocale_from_collation()
On Fri, Sep 11, 2020 at 7:57 AM Matthias van de Meent
<[hidden email]> wrote:
> I've not yet been involved in postgresql's development process, so here's a first. Please find attached a patch for improving the BT binary search speeds, with accompanying performance data.
> v1-0001 adapts _bt_compare and _bt_binsrch* to make use of static properties of L&Y-style nbtree indexes to speed up finding an initial position in the nbtee.
I wrote a patch that implemented the same idea, which is sometimes
called dynamic prefix truncation. I found some very subtle problems
with it, though. Concurrent page deletions could occur, which could
cause a scan that skips a prefix to miss that the page it landed on
doesn't have the same common prefix anymore.
> there could be a large throughput increase for short strings if the collation lookup from lc_collate_is_c() in varstr_cmp could be reused in the subsequent call to pg_newlocale_from_collation()
Thank you for pointing me to that thread, I was not yet familiar with it. It took me some time to get familiar with the Lanin and Shasha  paper, but the concerns regarding concurrent page deletion are indeed problematic for algorithmic prefix truncation, and do make it near impossible to use algorithmic prefix truncation without resetting the accumulated prefix every page.
At that, I will retract this current patch, and (unless someone's already working on it) start research on implementing physical prefix truncation through deduplicating the prefix shared with a page's highkey. This would likely work fine for inner nodes (there are still flag bits left unused, and the concerns related to deduplication of equal-but-distinct values do not apply there), though I'm uncertain about the ability to truncate duplicate prefixes in leaf nodes as it is basically prefix deduplication with the same problems attached as 'normal' deduplication.