Thoughts on "killed tuples" index hint bits support on standby

Thoughts on "killed tuples" index hint bits support on standby

Michail Nikolaev
Hello, hackers.

Currently, hint bits in index pages (for dead tuples) are set and taken
into account only on the primary server; the standby simply ignores
them. This is done for good reasons (see RelationGetIndexScan and [1]):

        * We do this because the xmin on the primary node could easily be
        * later than the xmin on the standby node, so that what the primary
        * thinks is killed is supposed to be visible on standby. So for correct
        * MVCC for queries during recovery we must ignore these hints and check
        * all tuples.

Also, according to [2] and cases like [3], it seems to be a good idea to
support "ignore_killed_tuples" on standby.

I believe there is a way to support it correctly with a reasonable
amount of changes.

The first thing to consider: checksums and wal_log_hints are widely used
these days. So, at any moment the primary could send an FPW page with
new "killed tuples" hints and overwrite hints set by the standby.
Moreover, it is not possible to distinguish whether hints were set by
the primary or the standby.

This is where hot_standby_feedback comes into play. The primary node
takes the xmin of hot_standby_feedback replicas into account
(RecentGlobalXmin) while setting "killed tuples" bits.  So, if
hot_standby_feedback has been enabled on a standby for a while, it could
safely trust hint bits from the primary.
Also, the standby could set its own hints using the xmin it sends to the
primary during feedback (but without marking the page as dirty).

Of course, all is not so easy; there are a few things and corner cases
to take care of:
* It looks like RecentGlobalXmin could be moved backwards when a new
replica with a lower xmin connects (or when some replica switches to
hot_standby_feedback=on). We must ensure RecentGlobalXmin moves
strictly forward.
* hot_standby_feedback could be enabled on the fly. In such a case we
need to distinguish transactions which are safe or unsafe to use
hints with. The standby could receive a fresh RecentGlobalXmin as a
response to its feedback message. All standby transactions with xmin >=
RecentGlobalXmin are then safe to use hints.
* hot_standby_feedback could be disabled on the fly. In that situation
the standby needs to continue sending feedback while canceling all
queries with ignore_killed_tuples=true. Once all such queries are
canceled, feedback is no longer needed and can be disabled.
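
As an illustration of the first point, the forward-only horizon could be
enforced with a simple clamp. This is a hedged sketch, not PostgreSQL
code: the names and the plain integer comparison (ignoring xid
wraparound) are my own simplifications.

```c
#include <stdint.h>

typedef uint32_t TransactionId;

/* Cached horizon, only ever advanced: a newly connected replica
 * reporting an older xmin must not move it backwards. */
static TransactionId cached_killed_tuples_horizon = 0;

TransactionId advance_horizon(TransactionId incoming)
{
    if (incoming > cached_killed_tuples_horizon)
        cached_killed_tuples_horizon = incoming;  /* move strictly forward */
    /* otherwise keep the previously seen, later horizon */
    return cached_killed_tuples_horizon;
}
```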

Could someone validate my thoughts, please? If the idea is mostly
correct, I could try to implement and test it.

[1] - https://www.postgresql.org/message-id/flat/7067.1529246768%40sss.pgh.pa.us#d9e2e570ba34fc96c4300a362cbe8c38
[2] - https://www.postgresql.org/message-id/flat/12843.1529331619%40sss.pgh.pa.us#6df9694fdfd5d550fbb38e711d162be8
[3] - https://www.postgresql.org/message-id/flat/20170428133818.24368.33533%40wrigleys.postgresql.org



Re: Thoughts on "killed tuples" index hint bits support on standby

Andres Freund
Hi,

On 2020-01-16 14:30:12 +0300, Michail Nikolaev wrote:
> The first thing to consider: checksums and wal_log_hints are widely used
> these days. So, at any moment the primary could send an FPW page with
> new "killed tuples" hints and overwrite hints set by the standby.
> Moreover, it is not possible to distinguish whether hints were set by
> the primary or the standby.

Note that the FPIs are only going to be sent for the first write to a
page after a checkpoint. I don't think you're suggesting we rely on them
for correctness (this far into the email at least), but it still seems
worthwhile to point out.


> This is where hot_standby_feedback comes into play. The primary node
> takes the xmin of hot_standby_feedback replicas into account
> (RecentGlobalXmin) while setting "killed tuples" bits.  So, if
> hot_standby_feedback has been enabled on a standby for a while, it could
> safely trust hint bits from the primary.

Well, not easily. There's no guarantee that the node it reports
hot_standby_feedback to is actually the primary. It could be a
cascading replica setup that doesn't report hot_standby_feedback
upstream. Also, hot_standby_feedback only takes effect while a streaming
connection is active; if that is even temporarily interrupted, the
primary will lose all knowledge of the standby's horizon - unless
replication slots are in use, that is.

Additionally, we also need to handle the case where the replica has
restarted and is recovering using local WAL and/or archive-based
recovery. In that case the standby could already have sent a *newer*
horizon as feedback upstream, but currently again have an older view. It
is entirely possible that the standby is consistent and queryable in
such a state (if nothing forced disk writes during WAL replay,
minRecoveryLSN will not be set to something too new).


> Also, the standby could set its own hints using the xmin it sends to the
> primary during feedback (but without marking the page as dirty).

We do something similar for heap hint bits already...


> Of course, all is not so easy; there are a few things and corner cases
> to take care of:
> * It looks like RecentGlobalXmin could be moved backwards when a new
> replica with a lower xmin connects (or when some replica switches to
> hot_standby_feedback=on). We must ensure RecentGlobalXmin moves
> strictly forward.

I'm not sure this is a problem. If that happens we cannot rely on the
different xmin horizon anyway, because action may have been taken on the
old RecentGlobalXmin. Thus we need to be safe against that anyway.


> * hot_standby_feedback could be enabled on the fly. In such a case we
> need to distinguish transactions which are safe or unsafe to use
> hints with. The standby could receive a fresh RecentGlobalXmin as a
> response to its feedback message. All standby transactions with xmin >=
> RecentGlobalXmin are then safe to use hints.
> * hot_standby_feedback could be disabled on the fly. In that situation
> the standby needs to continue sending feedback while canceling all
> queries with ignore_killed_tuples=true. Once all such queries are
> canceled, feedback is no longer needed and can be disabled.

I don't think we can rely on hot_standby_feedback at all. We can use it
to avoid unnecessary cancellations, etc., and even assume it's set up
reasonably for some configurations, but there always needs to be an
independent correctness backstop.



I think it might be more promising to improve the kill logic based
on the WAL-logged horizons from the primary. All we need to do, I think,
is to use a more conservatively computed RecentGlobalXmin when
determining whether tuples are dead. We already regularly log an
xl_running_xacts; adding information about the primary's horizon to
that, and stashing it in shared memory on the standby, shouldn't be too
hard.  Then we can make correct, albeit likely overly pessimistic,
visibility determinations about tuples, and go on to set LP_DEAD.

There are some complexities around how to avoid unnecessary query
cancellations. We'd not want to trigger recovery conflicts based on the
new xl_running_xacts field, as that'd make the conflict rate go through
the roof - but I think we could safely use the logical minimum of the
local RecentGlobalXmin and the primary's.
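
Taking that "logical minimum" of two horizons needs the modulo-2^32 xid
comparison rather than a plain integer one. A rough sketch, modeled on
how PostgreSQL compares normal xids (the helper names here are mine, and
special xids below FirstNormalTransactionId are ignored):

```c
#include <stdint.h>
#include <stdbool.h>

typedef uint32_t TransactionId;

/* Wraparound-aware "a is older than b", comparing xids on the
 * modulo-2^32 circle the way normal xids are compared. */
static bool xid_precedes(TransactionId a, TransactionId b)
{
    return (int32_t) (a - b) < 0;
}

/* Logical minimum: the older of the local horizon and the one the
 * primary logged, i.e. the more conservative of the two. */
TransactionId conservative_horizon(TransactionId local_xmin,
                                   TransactionId primary_xmin)
{
    return xid_precedes(local_xmin, primary_xmin) ? local_xmin
                                                  : primary_xmin;
}
```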

That should allow us to set additional LP_DEAD safely, I believe. We
could even rely on those LP_DEAD bits. But:

I'm less clear on how we can make sure that we can *rely* on LP_DEAD to
skip over entries during scans, however. The bits set as described above
would be safe, but we can also see LP_DEAD set by the primary (and even
by upstream cascading standbys, at least in the case of new base backups
taken from them), due to their not being WAL-logged. As we don't WAL-log
them, there is no conflict associated with the LP_DEADs being set.  My
gut feeling is that it's going to be very hard to get around this,
without adding WAL logging for _bt_killitems et al (including an
interface for kill_prior_tuple to report the used horizon to the index).


I'm wondering if we could recycle BTPageOpaqueData.xact to store the
horizon applying to killed tuples on the page. We don't need to store
the level for leaf pages, because we have BTP_LEAF, so we could make
space for that (potentially signalled by a new BTP flag).  Obviously we
have to be careful with storing xids in the index, due to the potential
wraparound danger - but I think such a page would have to be vacuumed
anyway, before a potential wraparound.  I think we could safely unset
the xid during nbtree single-page cleanup, and vacuum, by making sure no
LP_DEAD entries survive, and by including the horizon in the generated
WAL record.

That still doesn't fully allow us to set LP_DEAD on standbys, however -
but it'd allow us to take the primary's LP_DEADs into account on a
standby. I think we'd run into torn-page issues if we were to do so
without WAL logging, because we'd rely on the LP_DEAD bits and
BTPageOpaqueData.xact being in sync.  I *think* we might be safe to do
so *iff* the page's LSN indicates that there has been a WAL record
covering it since the last redo location.
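
The "page LSN newer than the last redo location" test sketched here
would look roughly like the following. The names are invented; this is a
sketch of the safety condition only, not actual nbtree code:

```c
#include <stdint.h>
#include <stdbool.h>

typedef uint64_t XLogRecPtr;

/* Sketch: LP_DEAD bits and the page's xact horizon can only be trusted
 * together if some WAL record covering this page was written since the
 * last redo point -- otherwise a torn write could leave the two out of
 * sync after a crash. */
bool killed_bits_trustworthy(XLogRecPtr page_lsn, XLogRecPtr last_redo_ptr)
{
    return page_lsn >= last_redo_ptr;
}
```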


I only had my first cup of coffee for the day, so I might also just be
entirely off base here.

Greetings,

Andres Freund



Re: Thoughts on "killed tuples" index hint bits support on standby

Peter Geoghegan-4
On Thu, Jan 16, 2020 at 9:54 AM Andres Freund <[hidden email]> wrote:
> I don't think we can rely on hot_standby_feedback at all. We can use it
> to avoid unnecessary cancellations, etc., and even assume it's set up
> reasonably for some configurations, but there always needs to be an
> independent correctness backstop.

+1

> I'm less clear on how we can make sure that we can *rely* on LP_DEAD to
> skip over entries during scans, however. The bits set as described above
> would be safe, but we can also see LP_DEAD set by the primary (and even
> by upstream cascading standbys, at least in the case of new base backups
> taken from them), due to their not being WAL-logged. As we don't WAL-log
> them, there is no conflict associated with the LP_DEADs being set.  My
> gut feeling is that it's going to be very hard to get around this,
> without adding WAL logging for _bt_killitems et al (including an
> interface for kill_prior_tuple to report the used horizon to the index).

I agree.

What about calling _bt_vacuum_one_page() more often than strictly
necessary to avoid a page split on the primary? The B-Tree
deduplication patch sometimes does that, albeit for completely
unrelated reasons. (We don't want to have to unset an LP_DEAD bit in
the case when a new/incoming duplicate tuple has a TID that overlaps
with the posting list range of some existing duplicate posting list
tuple.)

I have no idea how you'd determine that it was time to call
_bt_vacuum_one_page(). Seems worth considering.

> I'm wondering if we could recycle BTPageOpaqueData.xact to store the
> horizon applying to killed tuples on the page. We don't need to store
> the level for leaf pages, because we have BTP_LEAF, so we could make
> space for that (potentially signalled by a new BTP flag).  Obviously we
> have to be careful with storing xids in the index, due to the potential
> wraparound danger - but I think such a page would have to be vacuumed
> anyway, before a potential wraparound.

You would think that, but unfortunately we don't currently do it that
way. We store XIDs in deleted leaf pages that can sometimes be missed
until the next wraparound.

We need to do something like commit
6655a7299d835dea9e8e0ba69cc5284611b96f29, but for B-Tree. It's
somewhere on my TODO list.

> I think we could safely unset
> the xid during nbtree single-page cleanup, and vacuum, by making sure no
> LP_DEAD entries survive, and by including the horizon in the generated
> WAL record.
>
> That still doesn't fully allow us to set LP_DEAD on standbys, however -
> but it'd allow us to take the primary's LP_DEADs into account on a
> standby. I think we'd run into torn-page issues if we were to do so
> without WAL logging, because we'd rely on the LP_DEAD bits and
> BTPageOpaqueData.xact being in sync.  I *think* we might be safe to do
> so *iff* the page's LSN indicates that there has been a WAL record
> covering it since the last redo location.

That sounds like a huge mess.

--
Peter Geoghegan



Re: Thoughts on "killed tuples" index hint bits support on standby

Michail Nikolaev
Hello again.

Andres, Peter, thanks for your comments.

Some of the issues you mentioned (feedback reported to another cascading
standby, processing queries after a restart when a newer xid was already
reported) could be fixed in the proposed design, but your intention to
have an "independent correctness backstop" is the right thing to do.

So, I was thinking about another approach which is:
* still not too tricky to implement
* easy to understand
* does not rely on hot_standby_feedback for correctness, but only for efficiency
* could be used with any kind of index
* does not generate a lot of WAL

Let's add a new type of WAL record like "some index killed-tuple hint
bits were set according to RecentGlobalXmin=x" (without specifying a
page or even a relation). Let's call 'x' the 'LastKilledIndexTuplesXmin'
and track it in standby memory. It is sent only in the case of
wal_log_hints=true. If the hints cause an FPW, it is sent before the FPW
record. Also, it is not required to write such WAL every time the
primary marks an index tuple as dead; it needs to be done only when
'LastKilledIndexTuplesXmin' changes (moves forward).

On the standby, such a record is used to cancel queries. If a
transaction is executed with "ignore_killed_tuples==true" (set at
snapshot creation) and its xid is less than the received
LastKilledIndexTuplesXmin - just cancel the query (because it could rely
on an invalid hint bit). So, technically it would be correct to use
hints received from the master to skip tuples according to MVCC, but
"the conflict rate would go through the roof".
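
The cancellation rule just described amounts to a single check when such
a record is replayed. This is a sketch with invented names; a real patch
would route this through the existing recovery-conflict machinery:

```c
#include <stdint.h>
#include <stdbool.h>

typedef uint32_t TransactionId;

/* Sketch: on replay of a "LastKilledIndexTuplesXmin = x" record, any
 * snapshot that uses index hint bits and whose xmin is older than x
 * may have trusted an LP_DEAD bit that is invalid for it, so the
 * query must be canceled. Plain comparison, xid wraparound ignored. */
bool must_cancel(bool ignore_killed_tuples,
                 TransactionId snapshot_xmin,
                 TransactionId last_killed_index_tuples_xmin)
{
    return ignore_killed_tuples &&
           snapshot_xmin < last_killed_index_tuples_xmin;
}
```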

To avoid any real conflicts the standby sets
    ignore_killed_tuples = (hot_standby_feedback is on)
                           AND (wal_log_hints is on on the primary)
                           AND (standby's new snapshot xid >= last
                                LastKilledIndexTuplesXmin received)
                           AND (hot_standby_feedback is reported
                                directly to the master).
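
That condition maps onto a small boolean helper. All parameter names
here are invented for illustration:

```c
#include <stdint.h>
#include <stdbool.h>

typedef uint32_t TransactionId;

/* Sketch of the standby-side decision: use index LP_DEAD bits only
 * when the feedback loop makes conflicts effectively impossible.
 * Plain comparison, xid wraparound ignored. */
bool use_index_killed_tuples(bool hot_standby_feedback_on,
                             bool primary_wal_log_hints_on,
                             bool feedback_reported_directly_to_primary,
                             TransactionId snapshot_xid,
                             TransactionId last_killed_index_tuples_xmin)
{
    return hot_standby_feedback_on &&
           primary_wal_log_hints_on &&
           feedback_reported_directly_to_primary &&
           snapshot_xid >= last_killed_index_tuples_xmin;
}
```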

So, the hot_standby_feedback loop effectively eliminates any conflicts
(because LastKilledIndexTuplesXmin is technically RecentGlobalXmin in
that case). But if feedback is broken for some reason, the query
cancellation logic will keep everything safe.

For correctness, LastKilledIndexTuplesXmin (and, as a consequence,
RecentGlobalXmin) should only be moved forward.

To set killed bits on the standby, we should check tuple visibility
against the last LastKilledIndexTuplesXmin received. This is just like
the master setting these bits according to its own state - so it is even
safe to transfer them to another standby.

Does it look better now?

Thanks, Michail.



Re: Thoughts on "killed tuples" index hint bits support on standby

Peter Geoghegan-4
Hi Michail,

On Fri, Jan 24, 2020 at 6:16 AM Michail Nikolaev
<[hidden email]> wrote:
> Some of the issues you mentioned (feedback reported to another cascading
> standby, processing queries after a restart when a newer xid was already
> reported) could be fixed in the proposed design, but your intention to
> have an "independent correctness backstop" is the right thing to do.

Attached is a very rough POC patch of my own, which makes item
deletion occur "non-opportunistically" in unique indexes. The idea is
that we exploit the uniqueness property of unique indexes to identify
"version churn" from non-HOT updates. If any single value on a leaf
page has several duplicates, then there is a good chance that we can
safely delete some of them. It's worth going to the heap to check
whether that's safe selectively, at the point where we'd usually have
to split the page. We only have to free one or two items to avoid
splitting the page. If we can avoid splitting the page immediately, we
may well avoid splitting it indefinitely, or forever.

This approach seems to be super effective. It can leave the PK on
pgbench_accounts in pristine condition (no page splits) after many
hours with a pgbench-like workload that makes all updates non-HOT
updates. Even with many clients, and a skewed distribution. Provided
the index isn't tiny to begin with, we can always keep up with
controlling index bloat -- once the client backends themselves begin
to take active responsibility for garbage collection, rather than just
treating it as a nice-to-have. I'm pretty sure that I'm going to be
spending a lot of time developing this approach, because it really
works.

This seems fairly relevant to what you're doing. It makes almost all
index cleanup occur using the new delete infrastructure for some of
the most interesting workloads where deletion takes place in client
backends. In practice, a standby will almost be in the same position
as the primary in a workload that this approach really helps with,
since setting the LP_DEAD bit itself doesn't really need to happen (we
can go straight to deleting the items in the new deletion path).

To address the questions you've asked: I don't really like the idea of
introducing new rules around tuple visibility and WAL logging to set
more LP_DEAD bits like this at all. It seems very complicated. I
suspect that we'd be better off introducing ways of making the actual
deletes occur sooner on the primary, possibly much sooner, avoiding
any need for special infrastructure on the standby. This is probably
not limited to the special unique index case that my patch focuses on
-- we can probably push this general approach forward in a number of
different ways. I just started with unique indexes because that seemed
most promising. I have only worked on the project for a few days. I
don't really know how it will evolve.

--
Peter Geoghegan

Attachment: 0001-Non-opportunistically-delete-B-Tree-items.patch (16K)

Re: Thoughts on "killed tuples" index hint bits support on standby

Michail Nikolaev
Hello, Peter.

Thanks for your feedback.

> Attached is a very rough POC patch of my own, which makes item
> deletion occur "non-opportunistically" in unique indexes. The idea is
> that we exploit the uniqueness property of unique indexes to identify
> "version churn" from non-HOT updates. If any single value on a leaf
> page has several duplicates, then there is a good chance that we can
> safely delete some of them. It's worth going to the heap to check
> whether that's safe selectively, at the point where we'd usually have
> to split the page. We only have to free one or two items to avoid
> splitting the page. If we can avoid splitting the page immediately, we
> may well avoid splitting it indefinitely, or forever.
Yes, it is a brilliant idea to use uniqueness to avoid bloating the index. I am
not able to understand all the code now, but I’ll check the patch in more
detail later.

> This seems fairly relevant to what you're doing. It makes almost all
> index cleanup occur using the new delete infrastructure for some of
> the most interesting workloads where deletion takes place in client
> backends. In practice, a standby will almost be in the same position
> as the primary in a workload that this approach really helps with,
> since setting the LP_DEAD bit itself doesn't really need to happen (we
> can go straight to deleting the items in the new deletion path).

> This is probably
> not limited to the special unique index case that my patch focuses on
> -- we can probably push this general approach forward in a number of
> different ways. I just started with unique indexes because that seemed
> most promising. I have only worked on the project for a few days. I
> don't really know how it will evolve.

Yes, it is relevant, but I think it is «located in a different plane»
and they complement each other. Most indexes are not unique these days,
and most standbys (and even primaries) have long-lived snapshots (up to
minutes or hours) – so multiple versions of index records are still
required for them. Even if we could avoid multiple versions somehow, it
could lead to a very high rate of query cancellations on standby.

> To address the questions you've asked: I don't really like the idea of
> introducing new rules around tuple visibility and WAL logging to set
> more LP_DEAD bits like this at all. It seems very complicated.

I don’t think it is too complicated. I have polished the idea a little
and now it even looks elegant to me :) I’ll try to explain the concept
briefly (there are no new visibility rules or changes to set more
LP_DEAD bits than now – everything is based on well-tested mechanics):

1) There is a horizon of xmin values that the primary currently pushes
to a standby. All the standby’s snapshots are required to satisfy this
horizon to access the heap and indexes. This is done by
*ResolveRecoveryConflictWithSnapshot* and the corresponding WAL records
(for example, XLOG_BTREE_DELETE).

2) We could introduce a new WAL record (named XLOG_INDEX_HINT in the patch for
now) to define a horizon of xmin required for standby’s snapshot to use LP_DEAD
bits in the indexes.

3) The master sends XLOG_INDEX_HINT when it sets an LP_DEAD bit on an
index page (but before the possible FPW caused by hints) by calling
*LogIndexHintIfNeeded*. Such a record needs to be sent only if the new
xmin value is greater than the one sent before. I made tests to estimate
the amount of new WAL – it is really small (especially compared to the
FPW writes done because of LP_DEAD bits being set).

4) A new XLOG_INDEX_HINT record contains only a database id and the
value of *latestIndexHintXid* (the new horizon position). For
simplicity, the primary could just set the horizon to
*RecentGlobalXmin*. But for now, in the patch, the horizon value is
extracted from the heap in *HeapTupleIsSurelyDead* to reduce the number
of XLOG_INDEX_HINT records even more.


5) There is a new field in the PGPROC structure -
*indexIgnoreKilledTuples*. If it is set to true, standby queries are
going to use LP_DEAD bits in index scans. In such a case the snapshot is
required to satisfy the new LP_DEAD horizon pushed by XLOG_INDEX_HINT
records. This is done by the same mechanism as used for the heap -
*ResolveRecoveryConflictWithSnapshot*.

6) The major thing here: it is safe to set *indexIgnoreKilledTuples* to
both ‘true’ and ‘false’ from the perspective of correctness. It is just
a kind of performance compromise – use LP_DEAD bits but be aware of the
XLOG_INDEX_HINT horizon, or vice versa.

7) How do we make the right decision about this compromise? It is pretty
simple: if hot_standby_feedback is on and the primary has confirmed that
our feedback was received, then set *indexIgnoreKilledTuples* to ‘true’
– since as long as feedback is working as expected, the query will never
be canceled by the XLOG_INDEX_HINT horizon!

8) To support cascading standby setups (with a possible break in the
feedback chain), an additional byte is added to the ‘keep-alive’ message
of the feedback protocol.

9) So, at the moment we are safe to use LP_DEAD bits received from the
primary when we want to.

10) What about setting LP_DEAD bits on the standby? The main thing here:
*RecentGlobalXmin* on the standby is always lower than the
XLOG_INDEX_HINT horizon by definition – the standby is always behind the
primary. So, if something looks dead on the standby, it is definitely
dead on the primary.
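
That implication can be written down as a tiny predicate. A hedged
sketch with invented names, plain comparisons, and xid wraparound
ignored:

```c
#include <stdint.h>
#include <stdbool.h>

typedef uint32_t TransactionId;

/* Sketch: a tuple the standby sees as dead under its own (lagging)
 * RecentGlobalXmin is also dead under the primary's XLOG_INDEX_HINT
 * horizon, because the standby's horizon never gets ahead of the
 * primary's. So it may set the index LP_DEAD bit for such a tuple. */
bool standby_may_set_lp_dead(TransactionId tuple_xmax,
                             TransactionId standby_recent_global_xmin)
{
    return tuple_xmax < standby_recent_global_xmin;
}
```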

11) Even if:

* the primary changes vacuum_defer_cleanup_age
* standby restarted
* standby promoted to the primary
* base backup taken from standby
* standby is serving queries during recovery
– nothing could go wrong here.

This is because *HeapTupleIsSurelyDead* (and index LP_DEAD as a result)
needs the *heap* hint bits to already be set on the standby. So, the
same code decides whether to set hint bits in the heap (which has
already been done on standbys for a long time) and in the index.

So, the only thing we pay is a few additional bytes of WAL and some
moderate additional code complexity. But the support of hint bits on
standby is a huge advantage for many workloads. I was able to get more
than a 1000% performance boost (and it is not surprising – index hint
bits are just a great optimization). And it works for almost all index
types out of the box.

Another major thing here – everything is based on old, well-tested mechanics:
query cancelation because of snapshot conflicts and setting heap hint bits on
standby.

Most of the patch consists of technical changes to support new query
cancellation counters, the new WAL record, the new PGPROC field, and so
on. There are some places I am not sure about yet, and the naming is bad
– it is still a POC.

Thanks,
Michail.

Attachment: standby-hint-bits-poc.patch (66K)

Re: Thoughts on "killed tuples" index hint bits support on standby

Peter Geoghegan-4
On Wed, Apr 8, 2020 at 5:23 AM Michail Nikolaev
<[hidden email]> wrote:
> Yes, it is a brilliant idea to use uniqueness to avoid bloating the index. I am
> not able to understand all the code now, but I’ll check the patch in more
> detail later.

The design is probably simpler than you imagine. The algorithm tries
to be clever about unique indexes in various ways, but don't get
distracted by those details. At a high level we're merely performing
atomic actions that can already happen, and already use the same high
level rules.

There is LP_DEAD bit setting that's similar to what often happens in
_bt_check_unique() already, plus there is a new way in which we can
call _bt_delitems_delete(). Importantly, we're only changing the time
and the reasons for doing these things.

> Yes, it is relevant, but I think it is «located in a different plane» and
> complement each other. Because most of the indexes are not unique these days
> and most of the standbys (and even primaries) have long snapshots (up to
> minutes, hours) – so, multiple versions of index records are still required for
> them. Even if we could avoid multiple versions somehow - it could lead to a very
> high rate of query cancelations on standby.

I admit that I didn't really understand what you were trying to do
initially. I now understand a little better.

I think that there is something to be said for calling
_bt_delitems_delete() more frequently on the primary, not necessarily
just for the special case of unique indexes. That might also help
standbys, at the cost of more recovery conflicts. I admit that it's
unclear how much that would help with the cases that you seem to
really care about. I'm not going to argue that the kind of thing I'm
talking about (actually doing deletion more frequently on the primary)
is truly a replacement for your patch, even if it were generalized
further than my POC patch -- it is not a replacement. At best, it is a
simpler way of "sending the LP_DEAD bits on the primary to the standby
sooner". Even still, I cannot help but wonder how much value there is
in just doing this much (or figuring out some way to make LP_DEAD bits
from the primary usable on the standby). That seems like a far less
risky project, even if it is less valuable to some users.

Let me make sure I understand your position:

You're particularly concerned about cases where there are relatively
few page splits, and the standby has to wait for VACUUM to run on the
primary before dead index tuples get cleaned up. The primary itself
probably has no problem with setting LP_DEAD bits to avoid having
index scans visiting the heap unnecessarily. Or maybe the queries are
different on the standby anyway, so it matters to the standby that
certain index pages get LP_DEAD bits set quickly, though not to the
primary (at least not for the same pages). Setting the LP_DEAD bits on
the standby (in about the same way as we can already on the primary)
is a "night and day" level difference. And we're willing to account
for FPIs on the primary (and the LP_DEAD bits set there) just to be
able to also set LP_DEAD bits on the standby.

Right?

--
Peter Geoghegan



Re: Thoughts on "killed tuples" index hint bits support on standby

Michail Nikolaev
Hello, Peter.

> Let me make sure I understand your position:

> You're particularly concerned about cases where there are relatively
> few page splits, and the standby has to wait for VACUUM to run on the
> primary before dead index tuples get cleaned up. The primary itself
> probably has no problem with setting LP_DEAD bits to avoid having
> index scans visiting the heap unnecessarily. Or maybe the queries are
> different on the standby anyway, so it matters to the standby that
> certain index pages get LP_DEAD bits set quickly, though not to the
> primary (at least not for the same pages). Setting the LP_DEAD bits on
> the standby (in about the same way as we can already on the primary)
> is a "night and day" level difference.
> Right?

Yes, exactly.

My initial attempts were too naive (the first and second emails), but
you and Andres gave me some hints on how to make it reliable.

The main goal is to make the standby able to use and set LP_DEAD bits
almost as the primary does. Of course, the standby could receive
LP_DEAD bits with an FPI from the primary at any moment - so some kind
of cancellation logic is required. Also, we should keep the frequency of
query cancellation at the same level - for that reason LP_DEAD bits are
better used only by standbys with hot_standby_feedback enabled. So, I am
just repeating myself from the previous email here.

> And we're willing to account
> for FPIs on the primary (and the LP_DEAD bits set there) just to be
> able to also set LP_DEAD bits on the standby.

Yes, metaphorically speaking, the master sends a WAL record with the
message: "Attention, it is possible to receive an FPI from me with
LP_DEAD set for a tuple with xmax=ABCD; so, if you are using LP_DEAD,
your xmin should be greater, or you should cancel yourself". And such a
message is required only when this horizon moves forward.

And... it looks like it works - queries are much faster, the results
look correct, additional WAL traffic is low, cancellations are at the
same level... As far as I can see, the basic concept is correct and
effective (but of course, I could have missed something).

The patch is hard to review as-is - I'll try to split it into several
patches later. And of course, a lot of polishing is required (and there
are a few places I am not sure about yet).

Thanks,
Michail.



Re: Thoughts on "killed tuples" index hint bits support on standby

Michail Nikolaev
Hello, hackers.

Sorry for necroposting, but if someone is interested - I hope the patch is now ready, and it is available in the other thread [1].

Thanks,
Michail.

[1] https://www.postgresql.org/message-id/flat/CANtu0oiP18H31dSaEzn0B0rW6tA_q1G7%3D9Y92%2BUS_WHGOoQevg%40mail.gmail.com

Re: Thoughts on "killed tuples" index hint bits support on standby

Peter Geoghegan-4
On Wed, Jan 27, 2021 at 11:30 AM Michail Nikolaev
<[hidden email]> wrote:
> Sorry for necroposting, but if someone is interested - I hope the patch is ready now and available in the other thread (1).

I wonder if it would help to not actually use the LP_DEAD bit for
this. Instead, you could use the currently-unused-in-indexes
LP_REDIRECT bit. The general idea here is that this status bit is
interpreted as a "recently dead" bit in nbtree. It is always something
that is true *relative* to some *ephemeral* coarse-grained definition
of recently dead. Whether or not "recently dead" means "dead to my
particular MVCC snapshot" can be determined using some kind of
in-memory state that won't survive a crash (or a per-index-page
epoch?). We still have LP_DEAD bits in this world, which work in the
same way as now (and so unambiguously indicate that the index tuple is
dead-to-all, at least on the primary). I think that this idea of a
"recently dead" bit might be able to accomplish a few goals all at
once, including your specific goal of setting "hint bits" in indexes.

The issue here is that it would also be nice to use a "recently dead"
bit on the primary, but if you do that then maybe you're back to the
original problem. OTOH, I also think that we could repurpose the
LP_NORMAL bit in index AMs, so we could potentially have 3 different
definitions of dead-ness without great difficulty!

Perhaps this doesn't seem all that helpful, since I am expanding scope
here for a project that is already very difficult. And maybe it just
isn't helpful -- I don't know. But it is at least possible that
expanding scope could actually *help* your case a lot, even if you
only ever care about your original goal. My personal experience with
nbtree patches is just that: sometimes *expanding* scope actually
makes things *easier*, not harder. This is sometimes called "The
Inventor's Paradox":

https://en.wikipedia.org/wiki/Inventor%27s_paradox

Consider the example of my work on nbtree in PostgreSQL 12. It was
actually about 6 or 7 enhancements, not just 1 big enhancement -- it
is easy to forget that now. I think that it was *necessary* to add at
least 5 of these enhancements at the same time (maybe 2 or so really
were optional). This is deeply counter-intuitive, but still seems to
be true in my case. The specific reasons why I believe that this is
true of the PostgreSQL 12 work are complicated, but it boils down to
this: the ~5 related-though-distinct enhancements were individually
not all that compelling (certainly not compelling enough to make
on-disk changes for), but were much greater than the sum of their
parts when considered together. Maybe I got lucky there.

More generally, the nbtree stuff that I worked on in 12, 13, and now
14 actually feels like one big project to me. I will refrain from
explaining exactly why that is right now, but you might be very
surprised at how closely related it all is. I didn't exactly plan it
that way, but trying to see things in the most general terms turned
out to be a huge asset to me. If there are several independent reasons
to move in one particular direction all at once, you can generally
afford to be wrong about a lot of things without it truly harming
anybody. Plus it's easy to see that you will not frustrate future work
that is closely related but distinct when that future work is
*directly enabled* by what you're doing.

What's more, the extra stuff I'm talking about probably has a bunch of
other benefits on the primary, if done well. Consider how the deletion
stuff with LP_DEAD bits now considers "extra" index tuples to delete
when they're close by. We could even do something like that with these
LP_REDIRECT/recently dead bits on the primary.

I understand that it's hard to have a really long term outlook. I
think that that's almost necessary when working on a project like
this, though.

Don't forget that this works both ways. Maybe a person that wanted to
do this "recently dead" stuff (which sounds really interesting to me
right now) would similarly be compelled to consider the bigger
picture, including the question of setting hint bits on standbys --
this other person had better not make that harder in the future, for
the same reasons (obviously what you want to do here makes sense, it
just isn't everything to everybody). This is yet another way in which
expanding scope can help.

--
Peter Geoghegan



Re: Thoughts on "killed tuples" index hint bits support on standby

Peter Geoghegan-4
On Wed, Jan 27, 2021 at 5:24 PM Peter Geoghegan <[hidden email]> wrote:
> The issue here is that it would also be nice to use a "recently dead"
> bit on the primary, but if you do that then maybe you're back to the
> original problem. OTOH, I also think that we could repurpose the
> LP_NORMAL bit in index AMs, so we could potentially have 3 different
> definitions of dead-ness without great difficulty!

To be clear, what I mean is that you currently have two bits in line
pointers. In an nbtree leaf page we only currently use one -- the
LP_DEAD bit. But bringing the second bit into it allows us to have a
representation for two additional states (not just one), since of
course the meaning of the second bit can be interpreted using the
LP_DEAD bit. You could, for example, have encodings for each of the
following distinct per-LP states, while still preserving on-disk
compatibility:

"Not known to be dead in any sense" (0), "Unambiguously dead to all"
(what we now simply call LP_DEAD), "recently dead on standby"
(currently-spare bit is set), and "recently dead on primary" (both
'lp_flags' bits set).

Applying FPIs on the standby would have to be careful to preserve a
standby-only bit. I'm probably not thinking of every subtlety, but
"putting all of the pieces of the puzzle together for further
consideration" is likely to be a useful exercise.

--
Peter Geoghegan



Re: Thoughts on "killed tuples" index hint bits support on standby

Michail Nikolaev
Hello, Peter.


> I wonder if it would help to not actually use the LP_DEAD bit for
> this. Instead, you could use the currently-unused-in-indexes
> LP_REDIRECT bit.

Hm… Sounds very promising - an additional bit is a lot in this situation.

> Whether or not "recently dead" means "dead to my
> particular MVCC snapshot" can be determined using some kind of
> in-memory state that won't survive a crash (or a per-index-page
> epoch?).

Do you have any additional information about this idea? (maybe some thread). What kind of “in-memory state that won't survive a crash” and how to deal with flushed bits after the crash?

> "Not known to be dead in any sense" (0), "Unambiguously dead to all"
> (what we now simply call LP_DEAD), "recently dead on standby"
> (currently-spare bit is set), and "recently dead on primary" (both
> 'lp_flags' bits set).

Hm. What is about this way:

    10 - dead to all on standby (LP_REDIRECT)
    11 - dead to all on primary (LP_DEAD)
    01 - future “recently DEAD” on primary (LP_NORMAL)

In such a case standby could just always ignore all LP_DEAD bits from primary (standby will lose its own hint after FPI - but it is not a big deal). So, we don’t need any conflict resolution (and any additional WAL records). Also, hot_standby_feedback-related stuff is not required anymore. All we need to do (without details of course) - is correctly check if it is safe to set LP_REDIRECT on standby according to `minRecoveryPoint` (to avoid consistency issues during crash recovery). Or, probably, introduce some kind of `indexHintMinRecoveryPoint`.

Also, looks like both GIST and HASH indexes also do not use LP_REDIRECT.

So, it will remove more than 80% of the current patch complexity!

Also, btw, do you know of any reason to keep minRecoveryPoint at a low value? Because it already blocks the standby from setting hint bits in the *heap*. And, probably, it will block the standby from setting *index* hint bits aggressively.

Thanks a lot,
Michail.


Re: Thoughts on "killed tuples" index hint bits support on standby

Peter Geoghegan-4
On Thu, Jan 28, 2021 at 10:16 AM Michail Nikolaev
<[hidden email]> wrote:
> > I wonder if it would help to not actually use the LP_DEAD bit for
> > this. Instead, you could use the currently-unused-in-indexes
> > LP_REDIRECT bit.
>
> Hm… Sound very promising - an additional bit is a lot in this situation.

Yeah, it would help a lot. But those bits are precious. So it makes
sense to think about what to do with both of them in index AMs at the
same time.  Otherwise we risk missing some important opportunity.

> > Whether or not "recently dead" means "dead to my
> > particular MVCC snapshot" can be determined using some kind of
> > in-memory state that won't survive a crash (or a per-index-page
> > epoch?).
>
> Do you have any additional information about this idea? (maybe some thread). What kind of “in-memory state that won't survive a crash” and how to deal with flushed bits after the crash?

Honestly, that part wasn't very well thought out. A lot of things might work.

Some kind of "recently dead" bit is easier on the primary. If we have
recently dead bits set on the primary (using a dedicated LP bit for
original execution recently-dead-ness), then we wouldn't even
necessarily have to change anything about index scans/visibility at
all. There would still be a significant benefit if we simply used the
recently dead bits when considering which heap blocks nbtree simple
deletion will visit inside _bt_deadblocks() -- in practice there would
probably be no real downside from assuming that the recently dead bits
are now fully dead (it would sometimes be wrong, but not enough to
matter, probably only when there is a snapshot held for way way too
long).

Deletion in indexes can work well while starting off with only an
*approximate* idea of which index tuples will be safe to delete --
this is a high level idea behind my recent commit d168b666823. It
seems very possible that that could be pushed even further here on the
primary.

On standbys (which set standby recently dead bits) it will be
different, because you need "index hint bits" set that are attuned to
the workload on the standby, and because you don't ever use the bit to
help with deleting anything on the standby (that all happens during
original execution).

BTW, what happens when the page splits on the primary? Does your
patch "move over" the LP_DEAD bits to each half of the split?

> Hm. What is about this way:
>
>     10 - dead to all on standby (LP_REDIRECT)
>     11 - dead to all on primary (LP_DEAD)
>     01 - future “recently DEAD” on primary (LP_NORMAL)

Not sure.

> Also, looks like both GIST and HASH indexes also do not use LP_REDIRECT.

Right -- if we were to do this, the idea would be that it would apply
to all index AMs that currently have (or will ever have) something
like the LP_DEAD bit stuff. The GiST and hash support for index
deletion is directly based on the original nbtree version, and there
is no reason why we cannot eventually do all this stuff in at least
those three AMs.

There are already some line-pointer level differences in index AMs:
LP_DEAD items have storage in index AMs, but not in heapam. This
all-table-AMs/all-index-AMs divide in how item pointers work would be
preserved.

> Also, btw, do you know any reason to keep minRecoveryPoint at a low value?

Not offhand.


--
Peter Geoghegan



Re: Thoughts on "killed tuples" index hint bits support on standby

Michail Nikolaev
Hello, Peter.

> Yeah, it would help a lot. But those bits are precious. So it makes
> sense to think about what to do with both of them in index AMs at the
> same time.  Otherwise we risk missing some important opportunity.

Hm. I was trying to "expand the scope" as you said and got an idea... Why we even should do any conflict resolution for hint bits? Or share precious LP bits?

The only way the standby could get an "invalid" hint bit is via an FPI from the primary. We could just use the current *btree_mask* infrastructure and clear all "probably invalid" bits from the primary while applying an FPI on the standby (in `XLogReadBufferForRedoExtended`)!
For binary compatibility, we could use one of the `btpo_flags` bits to mark the page as "primary bits masked". The same way would work for hash/gist too.

No conflicts, only the LP_DEAD bit is used by standby, `btpo_flags` has many free bits for now, easy to implement, page content on primary/standby already differs in these bits...
Looks like an easy and effective solution for me.

What do you think?

>> Also, btw, do you know any reason to keep minRecoveryPoint at a low value?
> Not offhand.

If so, it looks like it is not a bad idea to move minRecoveryPoint forward from time to time (for more aggressive standby index hint bits). For each `xl_running_xacts` (about every 15s), for example.

> BTW, what happens when the page splits on the primary, btw? Does your
> patch "move over" the LP_DEAD bits to each half of the split?

That part is not changed in any way.

Thanks,
Michail.


Re: Thoughts on "killed tuples" index hint bits support on standby

Peter Geoghegan-4
On Sat, Jan 30, 2021 at 9:11 AM Michail Nikolaev
<[hidden email]> wrote:
> > Yeah, it would help a lot. But those bits are precious. So it makes
> > sense to think about what to do with both of them in index AMs at the
> > same time.  Otherwise we risk missing some important opportunity.
>
> Hm. I was trying to "expand the scope" as you said and got an idea... Why we even should do any conflict resolution for hint bits? Or share precious LP bits?

What does it mean today when an LP_DEAD bit is set on a standby? I
don't think that it means nothing at all -- at least not if you think
about it in the most general terms.

In a way it actually means something like "recently dead" even today,
at least in one narrow specific sense: recovery may end, and then we
actually *will* do *something* with the LP_DEAD bit, without directly
concerning ourselves with when or how each LP_DEAD bit was set.

During recovery, we will probably always have to consider the
possibility that LP_DEAD bits that get set on the primary may be
received by a replica through some implementation detail (e.g. LP_DEAD
bits are set in FPIs we replay, maybe even some other thing that
neither of us have thought of). We can't really mask LP_DEAD bits from
the primary in recovery anyway, because of stuff like page-level
checksums. I suspect that it just isn't useful to fight against that.
These preexisting assumptions are baked into everything already.

Why should we assume that *anybody* understands all of the ways in
which that is true?

Even today, "LP_DEAD actually means a limited kind of 'recently dead'
during recovery + hot standby" is something that is true (as I just
went into), but at the same time also has a fuzzy definition. My gut
instinct is to be conservative about that. As I suggested earlier, you
could invent some distinct kind of "recently dead" that achieves your
goals in a way that is 100% orthogonal.

The two unused LP dead bits (unused in indexes, though not tables) are
only precious if we assume that somebody will eventually use them for
something -- if nobody ever uses them then they're 100% useless. The
number of possible projects that might end up using the two bits for
something is not that high -- certainly not more than 3 or 4. Besides,
it is always a good idea to keep the on-disk format as simple and
explicit as possible -- it should be easy to analyze forensically in
the event of bugs or some other kind of corruption, especially by
using tools like amcheck.

As I said in my earlier email, we can even play tricks during page
deletion by treating certain kinds of "recently dead" bits as
approximate things. As a further example, we can "rely" on the
"dead-to-all but only on standby" bits when recovery ends, during a
subsequent write transactions. We can do so by simply using them in
_bt_deadblocks() as if they were LP_DEAD (we'll recheck heap tuples in
heapam.c instead).

> The only way the standby could get an "invalid" hint bit is via an FPI from the primary. We could just use the current *btree_mask* infrastructure and clear all "probably invalid" bits from the primary while applying an FPI on the standby (in `XLogReadBufferForRedoExtended`)!

I don't like that idea. Apart from what I said already, you're
assuming that setting LP_DEAD bits in indexes on the primary won't
eventually have some value on the standby after it is promoted and can
accept writes -- they really do have meaning and value on standbys.
Plus you'll introduce new overhead for this process during replay,
which creates significant overhead -- often most leaf pages have some
LP_DEAD bits set during recovery.

> For binary compatibility, we could use one of `btpo_flags` bits to mark the page as "primary bits masked". The same way would work for hash\gist too.

I agree that there are plenty of btpo_flags bits. However, I have my
doubts about this.

Why shouldn't this break page-level checksums (or wal_log_hints) in
some way? What about pg_rewind, some eventual implementation of
incremental backups, etc? I suspect that it will be necessary to
invent some entirely new concept that is like a hint bit, but works on
standbys (without breaking anything else).

> No conflicts, only the LP_DEAD bit is used by standby, `btpo_flags` has many free bits for now, easy to implement, page content on primary/standby already differs in these bits...
> Looks like an easy and effective solution for me.

Note that the BTP_HAS_GARBAGE flag (which goes in btpo_flags) was
deprecated in commit cf2acaf4. It was never a reliable indicator of
whether or not some LP_DEAD bits are set in the page. And I think that
it never could be made to work like that.

> >> Also, btw, do you know any reason to keep minRecoveryPoint at a low value?
> > Not offhand.
>
> If so, looks like it is not a bad idea to move minRecoveryPoint forward from time to time (for more aggressive standby index hint bits). For each `xl_running_xacts` (about each 15s), for example.

It's necessary for recoverypoints (i.e. the standby equivalent of a
checkpoint) to do that in order to ensure that we won't break
checksums. This whole area is scary to me:

https://postgr.es/m/CABOikdPOewjNL=05K5CbNMxnNtXnQjhTx2F--4p4ruorCjukbA@...

Since, as I said, it's already true that LP_DEAD bits on standbys are
some particular kind of "recently dead bit", even today, any design
that uses LP_DEAD bits in some novel new way (also on the standby) is
very hard to test. It might easily have very subtle bugs -- obviously
a recently dead bit relates to a tuple pointing to a logical row that
is bound to become dead soon enough. The difference between dead and
recently dead is already blurred, and I would rather not go there.

If you invent some entirely new category of standby-only hint bit at a
level below the access method code, then you can use it inside access
method code such as nbtree. Maybe you don't have to play games with
minRecoveryPoint in code like the "if (RecoveryInProgress())" path
from the XLogNeedsFlush() function. Maybe you can do some kind of
rudimentary "masking" for the in recovery case at the point that we
*write out* a buffer (*not* at the point hint bits are initially set)
-- maybe this could happen near to or inside FlushBuffer(), and maybe
only when checksums are enabled? I'm unsure.

> > BTW, what happens when the page splits on the primary, btw? Does your
> > patch "move over" the LP_DEAD bits to each half of the split?
>
> That part is not changed in any way.

Maybe it's okay to assume that it's no loss to throw away hint bits
set on the standby, because in practice deletion on the primary will
usually do the right thing anyway. But you never said that. I think
that you should take an explicit position on this question -- make it
a formal part of your overall design.

--
Peter Geoghegan



Re: Thoughts on "killed tuples" index hint bits support on standby

Peter Geoghegan-4
On Sat, Jan 30, 2021 at 5:39 PM Peter Geoghegan <[hidden email]> wrote:
> If you invent some entirely new category of standby-only hint bit at a
> level below the access method code, then you can use it inside access
> method code such as nbtree. Maybe you don't have to play games with
> minRecoveryPoint in code like the "if (RecoveryInProgress())" path
> from the XLogNeedsFlush() function. Maybe you can do some kind of
> rudimentary "masking" for the in recovery case at the point that we
> *write out* a buffer (*not* at the point hint bits are initially set)
> -- maybe this could happen near to or inside FlushBuffer(), and maybe
> only when checksums are enabled? I'm unsure.

I should point out that hint bits in heap pages are really not like
LP_DEAD bits in indexes -- if they're similar at all then the
similarity is only superficial/mechanistic. In fact, the term "hint
bits in indexes" does not seem useful at all to me, for this reason.

Heap hint bits indicate whether or not the xmin or xmax in a heap
tuple header committed or aborted. We cache the commit or abort status
of one particular XID in the heap tuple header. Notably, this
information alone tells us nothing about whether or not the tuple
should be visible to any given MVCC snapshot. Except perhaps in cases
involving aborted transactions -- but that "exception" is just a
limited special case (and less than 1% of transactions are aborted in
almost all environments anyway).

In contrast, a set LP_DEAD bit in an index is all the information we
need to know that the tuple is dead, and can be ignored completely
(except during hot standby, where at least today we assume nothing
about the status of the tuple, since that would be unsafe). Generally
speaking, the index page LP_DEAD bit is "direct" visibility
information about the tuple, not information about XIDs that are
stored in the tuple header. So a set LP_DEAD bit in an index is
actually like an LP_DEAD-set line pointer in the heap (that's the
closest equivalent in the heap, by far). It's also like a frozen heap
tuple (except it's dead-to-all, not live-to-all).

The difference may be subtle, but it's important here -- it justifies
inventing a whole new type of LP_DEAD-style status bit that gets set
only on standbys. Even today, heap tuples can have hint bits
"independently" set on standbys, subject to certain limitations needed
to avoid breaking things like data page checksums. Hint bits are
ultimately just a thing that remembers the status of transactions that
are known committed or aborted, and so can be set immediately after
the relevant xact commits/aborts (at least on the primary, during
original execution). A long-held MVCC snapshot is never a reason to
not set a hint bit in a heap tuple header (during original execution
or during hot standby/recovery). Of course, a long-held MVCC snapshot
*is* often a reason why we cannot set an LP_DEAD bit in an index.

Conclusion: The whole minRecoveryPoint question that you're trying to
answer to improve things for your patch is just the wrong question.
Because LP_DEAD bits in indexes are not *true* "hint bits". Maybe it
would be useful to set "true hint bits" on standbys earlier, and maybe
thinking about minRecoveryPoint would help with that problem, but that
doesn't help your index-related patch -- because indexes simply don't
have true hint bits.

--
Peter Geoghegan



Re: Thoughts on "killed tuples" index hint bits support on standby

Michail Nikolaev
Hello, Peter.

Thanks a lot for your comments.
Here are some of my thoughts related to the “masked bits” solution and your comments:

> During recovery, we will probably always have to consider the
> possibility that LP_DEAD bits that get set on the primary may be
> received by a replica through some implementation detail (e.g. LP_DEAD
> bits are set in FPIs we replay, maybe even some other thing that
> neither of us have thought of).

It is fine for the standby to receive a page from any source: `btpo_flags` should have some kind of “LP_DEAD safe for standby” bit set to allow new bits to be set and old ones to be read.

> We can't really mask LP_DEAD bits from
> the primary in recovery anyway, because of stuff like page-level
> checksums. I suspect that it just isn't useful to fight against that.

As far as I can see - there is no problem here. Checksums already differ for both heap and index pages on standby and primary. Checksums are calculated before the page is written to disk (not after applying an FPI). So, masking the page while *applying* the FPI is semantically the same as setting a bit in it one nanosecond later.

And `btree_mask` (and other mask functions) already used for consistency checks to exclude LP_DEAD.

> Plus you'll introduce new overhead for this process during replay,
> which creates significant overhead -- often most leaf pages have some
> LP_DEAD bits set during recovery.

I hope it is not too big, because FPIs are not too frequent + the positive effect will easily outweigh the additional CPU cycles of `btree_mask` (and the page is already in the CPU cache at that moment).

> I don't like that idea. Apart from what I said already, you're
> assuming that setting LP_DEAD bits in indexes on the primary won't
> eventually have some value on the standby after it is promoted and can
> accept writes -- they really do have meaning and value on standbys.

I think it is fine to lose part of the LP_DEAD bits on promotion (in practice they can be received only by FPI). They will be set again on the first scan anyway. Also, bits set by the standby may be used by the newly promoted primary if we honor the OldestXmin of the previous primary while setting them (we just need to add OldestXmin to xl_running_xacts and include it in the dead-horizon on standby).

> Why shouldn't this break page-level checksums (or wal_log_hints) in
> some way? What about pg_rewind, some eventual implementation of
> incremental backups, etc? I suspect that it will be necessary to
> invent some entirely new concept that is like a hint bit, but works on
> standbys (without breaking anything else).

As I said before - applying the mask on the *standby* will not break any checksums - because the page is still dirty after that (and it is even possible to call `PageSetChecksumInplace` for additional paranoia). Actual checksums on standby and primary already have different values (probably in most pages, because of LP_DEAD and “classic” hint bits).

> If you invent some entirely new category of standby-only hint bit at a
> level below the access method code, then you can use it inside access
> method code such as nbtree. Maybe you don't have to play games with
> minRecoveryPoint in code like the "if (RecoveryInProgress())" path
> from the XLogNeedsFlush() function. Maybe you can do some kind of
> rudimentary "masking" for the in recovery case at the point that we
> *write out* a buffer (*not* at the point hint bits are initially set)
> -- maybe this could happen near to or inside FlushBuffer(), and maybe
> only when checksums are enabled? I'm unsure.

Not sure I was able to understand your idea here, sorry.

> The difference may be subtle, but it's important here -- it justifies
> inventing a whole new type of LP_DEAD-style status bit that gets set
> only on standbys. Even today, heap tuples can have hint bits
> "independently" set on standbys, subject to certain limitations needed
> to avoid breaking things like data page checksums

Yes, and I see three major ways to implement it in the current infrastructure:

1) Use LP_REDIRECT (or another free value) instead of LP_DEAD on standby
2) Use LP_DEAD on standby, but involve some kind of recovery conflicts (like here - https://www.postgresql.org/message-id/flat/CANtu0oiP18H31dSaEzn0B0rW6tA_q1G7%3D9Y92%2BUS_WHGOoQevg%40mail.gmail.com )
3) Mask index FPI during a replay on hot standby + mark page as “primary LP_DEAD free” in btpo_flags

Of course, each variant requires some special additional things to keep everything safe.

As I see it, in SetHintBits the limitations are related to XLogNeedsFlush (the check of minRecoveryPoint in the case of standby).

> Conclusion: The whole minRecoveryPoint question that you're trying to
> answer to improve things for your patch is just the wrong question.
> Because LP_DEAD bits in indexes are not *true* "hint bits". Maybe it
> would be useful to set "true hint bits" on standbys earlier, and maybe
> thinking about minRecoveryPoint would help with that problem, but that
> doesn't help your index-related patch -- because indexes simply don't
> have true hint bits.

Attention to minRecoveryPoint is required because of possible situation described here - https://www.postgresql.org/message-id/flat/CANtu0ojwFcSQpyCxrGxJuLVTnOBSSrzKuF3cB_yCk0U-X-wpGw%40mail.gmail.com#4d8ef8754b18c5e35146ed589b25bf27
The source of the potential problem is the fact that setting LP_DEAD does not change the page LSN, which could cause consistency issues during crash recovery.

Thanks,
Michail.

Re: Thoughts on "killed tuples" index hint bits support on standby

Peter Geoghegan-4
On Mon, Feb 1, 2021 at 1:19 PM Michail Nikolaev
<[hidden email]> wrote:
> It is fine to receive a page to the standby from any source: `btpo_flags` should have some kind “LP_DEAD safe for standby” bit set to allow new bits to be set and old - read.
>
> > We can't really mask LP_DEAD bits from
> > the primary in recovery anyway, because of stuff like page-level
> > checksums. I suspect that it just isn't useful to fight against that.
>
> As far as I can see - there is no problem here. Checksums already differ for both heap and index pages on standby and primary.

AFAICT that's not true, at least not in any practical sense. See the
comment in the middle of MarkBufferDirtyHint() that begins with "If we
must not write WAL, due to a relfilenode-specific...", and see the
"Checksums" section at the end of src/backend/storage/page/README. The
last paragraph in the README is particularly relevant:

New WAL records cannot be written during recovery, so hint bits set during
recovery must not dirty the page if the buffer is not already dirty, when
checksums are enabled.  Systems in Hot-Standby mode may benefit from hint bits
being set, but with checksums enabled, a page cannot be dirtied after setting a
hint bit (due to the torn page risk). So, it must wait for full-page images
containing the hint bit updates to arrive from the primary.

IIUC the intention is that MarkBufferDirtyHint() is a no-op during hot
standby when we successfully set a hint bit, though only in the
XLogHintBitIsNeeded() case. So we don't really dirty the page within
SetHintBits() in this specific scenario. That is, the buffer header
won't actually get marked BM_DIRTY or BM_JUST_DIRTIED within
MarkBufferDirtyHint() when in Hot Standby + XLogHintBitIsNeeded().
What else could work at all? The only "alternative" is to write an
FPI, just like on the primary -- but writing new WAL records is not
possible during Hot Standby!

A comment within MarkBufferDirtyHint() spells it out directly -- we
can have hint bits set in hot standby independently of the primary,
but it works in a way that makes sure that the hint bits never make it
out to disk:

"We can set the hint, just not dirty the page as a result so the hint
is lost when we evict the page or shutdown"

You may be right in some narrow sense -- checksums can differ on a
standby. But that's like saying that it's sometimes okay to have a
torn page on disk. Yes, it's okay, but only because we expect the
problem during crash recovery, and can reliably repair it.

> Checksums are calculated before the page is written to the disk (not after applying FPI). So, the masking page during *applying* the FPI is semantically the same as setting a bit in it 1 nanosecond after.
>
> And `btree_mask` (and other mask functions) already used for consistency checks to exclude LP_DEAD.

I don't see how that is relevant. btree_mask() is only used by
wal_consistency_checking, which is mostly just for Postgres hackers.

--
Peter Geoghegan



Re: Thoughts on "killed tuples" index hint bits support on standby

Michail Nikolaev
Hello, Peter.

> AFAICT that's not true, at least not in any practical sense. See the
> comment in the middle of MarkBufferDirtyHint() that begins with "If we
> must not write WAL, due to a relfilenode-specific...", and see the
> "Checksums" section at the end of src/backend/storage/page/README. The
> last paragraph in the README is particularly relevant:

I have attached a TAP-test to demonstrate how easily checksums on standby and primary start to differ. The test covers two scenarios, one for heap and one for index pages, with the hint bit set on both the standby and the primary.

Yes, MarkBufferDirtyHint does not mark the page as dirty, so hint bits on a standby can easily be lost. But it leaves the page dirty if it already is (or the page could be marked dirty by WAL replay later). So hint bits can easily be flushed and taken into account during checksum calculation on both standby and primary.

> "We can set the hint, just not dirty the page as a result so the hint
> is lost when we evict the page or shutdown"

Yes, a standby is not allowed to mark a page dirty just because of hint bits, because otherwise we could end up with this:

CHECKPOINT
SET HINT BIT
TORN FLUSH + CRASH = BROKEN CHECKSUM, SERVER FAULT

But this scenario is totally fine:

CHECKPOINT
FPI (page is still dirty)
SET HINT BIT
TORN FLUSH + CRASH = PAGE IS RECOVERED, EVERYTHING IS OK

And, as result, this is fine too:

CHECKPOINT
FPI WITH MASKED LP_DEAD (page is still dirty)
SET HINT BIT
TORN FLUSH + CRASH = PAGE IS RECOVERED + LP_DEAD MASKED AGAIN IF STANDBY

So, my point here is: it is fine to mask LP_DEAD bits during replay, because they already differ between standby and primary. And it is fine to set and flush hint bits (and LP_DEADs) on a standby, because they can already be flushed easily (we just need to consider minRecoveryPoint and, probably, OldestXmin from the primary in the LP_DEAD case to make promotion easy).

>> And `btree_mask` (and other mask functions) already used for consistency checks to exclude LP_DEAD.
> I don't see how that is relevant. btree_mask() is only used by
> wal_consistency_checking, which is mostly just for Postgres hackers.

I was thinking about the possibility of reusing these functions for masking during replay.

Thanks,
Michail.

Attachment: 022_checksum_tests.pl (7K)
Re: Thoughts on "killed tuples" index hint bits support on standby

Michail Nikolaev
Hello, Peter.

If you are interested, a possible patch (based on masking FPIs during
replay) was sent, with some additional explanation and graphs, to [1].
At the moment I am unable to find any "incorrectness" in it.

Thanks again for your comments.

Michail.


[1] https://www.postgresql.org/message-id/flat/CANtu0ohHu1r1xQfTzEJuxeaOMYncG7xRxUQWdH%3DcMXZSf%2Bnzvg%40mail.gmail.com#4c81a4d623d8152f5e8889e97e750eec