[PATCH] Full support for index LP_DEAD hint bits on standby


[PATCH] Full support for index LP_DEAD hint bits on standby

Michail Nikolaev
Hello, hackers.


Execution of queries to hot standby is one of the most popular ways to
scale application workload. Most of the modern Postgres installations
have two standby nodes for high-availability support. So, utilization
of replica's CPU seems to be a reasonable idea.
At the same time, some queries (index scans) can be much slower on a
hot standby than on the primary. This happens because the LP_DEAD
index hint bit mechanism is ignored in index scans during recovery.
This is done for good reasons, of course [1]:

     * We do this because the xmin on the primary node could easily be
     * later than the xmin on the standby node, so that what the primary
     * thinks is killed is supposed to be visible on standby. So for correct
     * MVCC for queries during recovery we must ignore these hints and check
     * all tuples.

Also, according to [2] and cases like [3], it seems to be a good idea
to support "ignore_killed_tuples" on standby.

The goal of this patch is to provide full support for index hint bits
on hot standby. The mechanism should be based on well-tested
functionality and not cause a lot of recovery conflicts.

This thread is the continuation (and partly a copy-paste) of the old
one [4].


The standby itself can set and read hint bits during recovery, and
such bits are even correct according to standby visibility rules. The
problem is the full-page-write (FPW) WAL records coming from the
primary: such records can carry hint bits that are invalid according
to the standby's xmin.

So, if we can be sure a scan never sees an invalid hint bit from the
primary, the problem is solved. We would even be able to allow the
standby to set its own LP_DEAD bits.

The idea is simple: let the WAL somehow log the hint bits before the
FPW. That could generate a lot of additional WAL records, however...

But there are ways to avoid that:
1) Send only one `latestRemovedXid` for all tuples marked dead during
a page scan.
2) Remember the latest sent `latestRemovedXid` in shared memory, and
optimistically skip WAL records with older xid values [5].

Such WAL records would cause a lot of recovery conflicts on standbys.
But we can be clever here: use the hint bits only if
hot_standby_feedback is enabled and effective on the standby. If HSF
is effective, conflicts are not possible. If HSF is off, the standby
ignores both the hint bits and the additional conflict resolution. The
key point is that HSF is just an optimization and has nothing to do
with MVCC correctness.


The patch introduces a new WAL record (named
XLOG_INDEX_HINT_BITS_HORIZON) to define the xmin horizon a standby's
snapshot must satisfy in order to use LP_DEAD bits in an index scan.

`table_index_fetch_tuple` now returns a `latest_removed_xid` value in
addition to `all_dead`. This value is used to advance
`killedLatestRemovedXid` at the time `killedItems` is updated.

The primary sends the value of `killedLatestRemovedXid` in an
XLOG_INDEX_HINT_BITS_HORIZON record before it marks the page dirty
after setting LP_DEAD bits on the index page.

The new WAL record is always sent before a possible FPW. Such a record
only needs to be sent if its `latestRemovedXid` is newer than the one
previously sent for the current database.

There is a new flag in the PGPROC structure,
`indexIgnoreKilledTuples`. If the flag is set to true, standby queries
will use LP_DEAD bits in index scans. In that case the snapshot is
required to satisfy the horizon pushed by the
XLOG_INDEX_HINT_BITS_HORIZON records.

It is safe to set `indexIgnoreKilledTuples` to either value from a
correctness perspective, but `true` can cause recovery conflicts. It
is a kind of compromise: use LP_DEAD bits but be subject to
XLOG_INDEX_HINT_BITS_HORIZON conflicts, or vice versa.

How do we make the right decision about this compromise? It is pretty
simple: if `hot_standby_feedback` is on and the primary has confirmed
that feedback has been received, set `indexIgnoreKilledTuples` (see
`GetSnapshotIndexIgnoreKilledTuples`).

While feedback is working as expected, a query will never be cancelled
because of the new horizon records.
To support cascading standby setups (where the feedback chain may be
broken in the middle), an additional byte was added to the keep-alive
message of the feedback protocol. This byte is used to make sure our
xmin is honored by the primary (see
`sender_propagates_feedback_to_primary`). Also, the WAL sender now
always sends a keep-alive after receiving a feedback message.

So, this way, it is safe to use LP_DEAD bits received from the primary
when we want to.

And, as a result, it is safe to set LP_DEAD bits on standby.
Even if:
* the primary changes vacuum_defer_cleanup_age
* the standby is restarted
* the standby is promoted to primary
* a base backup is taken from the standby
* the standby is serving queries during recovery
– nothing can go wrong here.

This is because `HeapTupleIsSurelyDead` (and hence index LP_DEAD)
requires the *heap* hint bits to already be set on the standby. So,
the same code that decides to set hint bits on the heap (which
standbys have done for a long time already) also makes the decision
for the index.

It is not possible to find an ideal performance test for this kind of
optimization, but a possible example is in the attachment. It is a
standard pgbench schema with an additional index on balance and random
balance values.
On the primary the test does the following:
1) transfer some money from one random account among the top 100
richest to one random account among the top 100 poorest.
2) calculate the amount of money in the top 10 richest and top 10
poorest accounts (including an additional field to avoid an
index-only scan).
On the standby only step 2 is used.

The patched version is about 9x faster for standby queries: 455 TPS
unpatched versus 4192 TPS patched on my system. There is no visible
difference for the primary.
To estimate the additional amount of WAL logs, I have checked records
in WAL-segments during different conditions:
(pg_waldump pgdata/pg_wal/XXX | grep INDEX_HINT_BITS_HORIZON | wc -l)

- hot_standby_feedback=off - 5181 of 226274 records ~2%
- hot_standby_feedback=on (without load on standby) - 70 of 202594
records ~ 0.03%
- hot_standby_feedback=on (with load on standby) - 17 of 70504 records ~ 0.02%

So, with HSF=on (the default) the WAL increase is not significant.
Also, for HSF=off it should be possible to radically reduce the number
of additional WAL records by using the `latestRemovedXid` from other
records (like Heap2/CLEAN) in the "send only newer xid" optimization
(I have skipped this for now for simplicity).


The only thing we pay is a few additional WAL records and some
moderate additional code complexity. But support for hint bits on
standby is a huge advantage for many workloads: I was able to get more
than a 900% performance boost (which is not surprising – index hint
bits are simply a great optimization). And it works for almost all
index types out of the box.

Another major point: everything is based on old, well-tested
mechanics – query cancellation because of snapshot conflicts, setting
heap hint bits on standby, and hot standby feedback.


[1] - https://www.postgresql.org/message-id/flat/7067.1529246768%40sss.pgh.pa.us#d9e2e570ba34fc96c4300a362cbe8c38
[2] - https://www.postgresql.org/message-id/flat/12843.1529331619%40sss.pgh.pa.us#6df9694fdfd5d550fbb38e711d162be8
[3] - https://www.postgresql.org/message-id/flat/20170428133818.24368.33533%40wrigleys.postgresql.org
[4] - https://www.postgresql.org/message-id/flat/CANtu0ohOvgteBYmCMc2KERFiJUvpWGB0bRTbK_WseQH-L1jkrQ%40mail.gmail.com
[5] - https://www.postgresql.org/message-id/flat/CANtu0oigC0%2BH0UkxktyovdLLU67ikM0%2BDw3J4EQqiDDeGhcwsQ%40mail.gmail.com

Attachments: code.patch (68K), docs.patch (7K), test.patch (14K), pefttest.tar.gz (956 bytes)

Re: [PATCH] Full support for index LP_DEAD hint bits on standby

Michail Nikolaev
Hello, everyone.

Oh, I just realized that I was too naive in allowing the standby to
set LP_DEAD bits this way.
There is a possible consistency problem in the case of a low
minRecoveryPoint value (because hint bits do not move PageLSN
forward).
Something like this:

<-----------minRecoveryPoint will go here
               REPLICA SCANS INDEX AND SETS hint bits (index_lsn=10)
               INDEX IS FLUSHED (minRecoveryPoint=index_lsn=10)

On crash recovery, the standby will be able to handle queries after
LSN=10, but the index page contains hint bits from the future.
So, I need to think here.



Re: [PATCH] Full support for index LP_DEAD hint bits on standby

Michail Nikolaev
Hello, hackers.

I think I was able to fix the issue related to minRecoveryPoint and crash recovery. To make sure the standby will be consistent after crash recovery, we need to take the current value of minRecoveryPoint into account while setting LP_DEAD hints (in almost the same way as is already done for *heap* hint bits).

I have introduced a new structure, IndexHintBitsData:

    /* guaranteed not visible to all backends */
    bool all_dead;

    /* latest removed xid, if known */
    TransactionId latest_removed_xid;

    /* LSN of the page where the dead tuple is located */
    XLogRecPtr page_lsn;
This structure is filled by the `heap_hot_search_buffer` function. Afterwards, we decide whether or not to set `kill_prior_tuple` depending on its content (by calling `IsMarkBufferDirtyIndexHintAllowed`).

For the primary, it is always safe to set LP_DEAD in the index if `all_dead` is true.

In the case of a standby, we need to check `latest_removed_xid` first (if available). If the commit LSN of the latest removed xid is already lower than minRecoveryPoint (checked via `XLogNeedsFlush`), it is safe to set `kill_prior_tuple`.

Sometimes we are not sure about the latest removed xid – the heap record could have been marked dead by an XLOG_HEAP2_CLEAN record, for example. In such a case we check the LSN of the *heap* page containing the tuple (the LSN could already have been updated by other transactions, but that does not matter in this situation). If the page LSN is lower than minRecoveryPoint, it is safe to set LP_DEAD in the index too. Otherwise, we just leave the index tuple alive.

So, to bring it all together:

* Normal operation, proc->indexIgnoreKilledTuples is true:
      It is safe for standby to use hint bits from the primary FPI because of XLOG_INDEX_HINT_BITS_HORIZON conflict resolution.
      It is safe for standby to set its index hint bits because `ComputeXidHorizons` honors other read-only procs xmin and lowest xid on primary (`KnownAssignedXidsGetOldestXmin`).

* Normal operation, proc->indexIgnoreKilledTuples is false:
      Index hint bits are never set or taken into account.

* Crash recovery, proc->indexIgnoreKilledTuples is true:
      It is safe for standby to use hint bits from the primary FPW because XLOG_INDEX_HINT_BITS_HORIZON is always logged before the FPI, and the commit record of the transaction that removed the tuple is logged before XLOG_INDEX_HINT_BITS_HORIZON. So, if an FPI with hints was flushed (and taken into account by minRecoveryPoint), both the transaction-remover and the horizon records are replayed before any reading queries.
      It is safe for standby to set its own hint bits because they can be set only if the commit record of the transaction-remover is lower than minRecoveryPoint, or the LSN of the heap page with the removed tuples is lower than minRecoveryPoint.

* Crash recovery, proc->indexIgnoreKilledTuples is false:
      Index hint bits are never set or taken into account.

So, now it seems correct to me.

Another interesting point: the position of minRecoveryPoint now affects performance a lot. This already happens (because of *heap* hint bits), but after the patch it is even more noticeable. Is there any reason to keep minRecoveryPoint at a low value?

Rebased and updated patch in attachment.

Will be happy if someone could recheck my ideas or even the code :)

Thanks a lot,

Attachments: test.patch (14K), docs.patch (8K), code.patch (73K)

Re: [PATCH] Full support for index LP_DEAD hint bits on standby

Michail Nikolaev
Hello, everyone.

After some correspondence with Peter Geoghegan (1), and building on
his ideas, I have reworked the patch a lot. It is now much simpler,
with even better performance (no new WAL records or conflict
resolution, and hot standby feedback is no longer involved).

The idea is pretty simple now: mark pages whose LP_DEAD hints are
"standby-safe" with a bit in btpo_flags (BTP_LP_SAFE_ON_STANDBY, with
similar flags for GiST and hash).

If the standby wants to set LP_DEAD, it first checks
BTP_LP_SAFE_ON_STANDBY on the page. If the flag is not set, all
"primary" hints are removed first, and then the flag is set (with a
memory barrier to avoid memory-ordering issues in concurrent scans).
The standby also checks BTP_LP_SAFE_ON_STANDBY to decide whether
tuples marked LP_DEAD can be ignored during a scan.

Of course, it is not that easy. If the standby has been promoted (or
the primary was restored from a standby backup), it is still possible
to receive an FPI with this flag set in the WAL. So, the main problem
is still there.

But we can simply clear this flag while applying the FPI, because the
page remains dirty after that anyway! This should not cause any
checksum, consistency, or pg_rewind issues, as explained in (2).
Semantically it is the same as setting a hint bit one millisecond
after the FPI was applied (while the page is still dirty after FPI
replay) – and the standby already does this with *heap* hint bits.

Also, the TAP test attached to (2) shows how easy it already is to
flush a hint bit set by the standby and end up with a checksum
different from the primary's.

If the standby has been promoted (or the primary was restored from a
standby backup), it is safe to use LP_DEAD bits with or without
BTP_LP_SAFE_ON_STANDBY on a page. But for accuracy,
BTP_LP_SAFE_ON_STANDBY is cleared by the primary if found.

Also, we should take minRecoveryPoint into account, as described in
(3), to avoid consistency issues during crash recovery.
Also, as far as I know, there is no practical reason to keep
minRecoveryPoint at a low value. So, there is an optional patch that
moves minRecoveryPoint forward on each xl_running_xacts record (to
allow the standby to set hint bits and LP_DEAD bits more
aggressively). That is about every 15 seconds.

Some graphs showing performance testing results on my PC are in the
attachment (the test is taken from (4)). Each test ran for 10 minutes.
The additional primary performance is probably just measurement error,
but the standby performance gain is huge.

Feel free to ask if you need more proof about correctness.


[1] - https://www.postgresql.org/message-id/flat/CAH2-Wz%3D-BoaKgkN-MnKj6hFwO1BOJSA%2ByLMMO%2BLRZK932fNUXA%40mail.gmail.com#6d7cdebd68069cc493c11b9732fd2040
[2] - https://www.postgresql.org/message-id/flat/CANtu0oiAtteJ%2BMpPonBg6WfEsJCKrxuLK15P6GsaGDcYGjefVQ%40mail.gmail.com#091fca433185504f2818d5364819f7a4
[3] - https://www.postgresql.org/message-id/flat/CANtu0oh28mX5gy5jburH%2Bn1mcczK5_dCQnhbBnCM%3DPfqh-A26Q%40mail.gmail.com#ecfe5a331a3058f895c0cba698fbc4d3
[4] - https://www.postgresql.org/message-id/flat/CANtu0oiP18H31dSaEzn0B0rW6tA_q1G7%3D9Y92%2BUS_WHGOoQevg%40mail.gmail.com

Attachments: code_optional.patch (770 bytes), doc.patch (3K), test.patch (12K), code.patch (63K), performance_testing.png (25K)