[PATCH] Btree BackwardScan race condition on Standby during VACUUM

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
9 messages Options
Reply | Threaded
Open this post in threaded view
|

[PATCH] Btree BackwardScan race condition on Standby during VACUUM

Michail Nikolaev
Hello, hackers.

------ ABSTRACT ------
There is a race condition between btree_xlog_unlink_page and _bt_walk_left.
A lot of versions are affected including 12 and new-coming 13.
Happens only on standby. Seems like could not cause invalid query results.

------ REMARK ------
While working on support for index hint bits on standby [1] I have
started to getting
"ERROR:  could not find left sibling of block XXXX in index XXXX"
during stress tests.

I was sure I have broken something in btree and spent a lot of time
trying to figure what.
And later...  I realized what it is bug in btree since a very old times...
Because of much faster scans with LP_DEAD support on a standby it
happens much more frequently in my case.

------ HOW TO REPRODUCE  ------
It is not easy to reproduce the issue but you can try (tested on
REL_12_STABLE and master):

1) Setup master (sync replica and 'remote_apply' are not required -
just make scripts simpler):
autovacuum = off
synchronous_standby_names = '*'
synchronous_commit = 'remote_apply'

2) Setup standby:
primary_conninfo = 'user=postgres host=127.0.0.1 port=5432
sslmode=prefer sslcompression=0 gssencmode=prefer krbsrvname=postgres
target_session_attrs=any'
port = 6543

3) Prepare pgbench file with content (test.bench):
BEGIN;
select * from pgbench_accounts order by aid desc limit 1;
END;

4) Prepare the index:
./pgbench -i -s 10 -U postgres
./psql -U postgres -c "delete from pgbench_accounts where aid IN
(select aid from pgbench_accounts order by aid desc limit 500000)"

5) Start index scans on the standby:
./pgbench -f test.bench -j 1 -c ${NUMBER_OF_CORES} -n -P 1 -T 10000 -U
postgres -p 6543

6) Run vacuum on the master:
./psql -U postgres -c "vacuum pgbench_accounts"

7) You should see something like this:

> progress: 1.0 s, 5.0 tps, lat 614.530 ms stddev 95.902
> .....
> progress: 5.0 s, 10.0 tps, lat 508.561 ms stddev 82.338
> client 3 script 0 aborted in command 1 query 0: ERROR:  could not find left sibling of block 1451 in index "pgbench_accounts_pkey"
> progress: 6.0 s, 47.0 tps, lat 113.001 ms stddev 55.709
> .....
> progress: 12.0 s, 84.0 tps, lat 48.451 ms stddev 7.238
> client 2 script 0 aborted in command 1 query 0: ERROR:  could not find left sibling of block 2104 in index "pgbench_accounts_pkey"
> progress: 13.0 s, 87.0 tps, lat 39.338 ms stddev 5.417
> .....
> progress: 16.0 s, 158.0 tps, lat 18.988 ms stddev 3.251
> client 4 script 0 aborted in command 1 query 0: ERROR:  could not find left sibling of block 2501 in index "pgbench_accounts_pkey"
I was able to reproduce issue with vanilla PG_12 on different systems
including the Windows machine.
On some servers it happens 100% times. On others - very seldom.

It is possible to radically increase chance to reproduce the issue by
adding a sleep in btree_xlog_unlink_page[7]:
>   + pg_usleep(10 * 1000L);
>
>   /* Rewrite target page as empty deleted page */
>   buffer = XLogInitBufferForRedo(record, 0);

------  WHAT HAPPENS  ------
It is race condition issue between btree_xlog_unlink_page and _bt_walk_left.

btree_xlog_unlink_page removes page from btree changing btpo_prev and
btpo_next of its left and right siblings to point
to the each other and marks target page as removed. All these
operations are done using one-page-at-a-time locking because of[4]:

>    * In normal operation, we would lock all the pages this WAL record
>    * touches before changing any of them.  In WAL replay, it should be okay
>    * to lock just one page at a time, since no concurrent index updates can
>    * be happening, and readers should not care whether they arrive at the
>    * target page or not (since it's surely empty).

_bt_walk_left walks left in very tricky way. Please refer to
src/backend/access/nbtree/README for details[5]:

>    Moving left in a backward scan is complicated because we must consider
>    the possibility that the left sibling was just split (meaning we must find
>    the rightmost page derived from the left sibling), plus the possibility
>    that the page we were just on has now been deleted and hence isn't in the
>    sibling chain at all anymore.

So, this is how race is happens:

0) this is target page (B) and its siblings.
A <---> B <---> C ---> END

1) walreceiver starts btree_xlog_unlink_page for the B. It is changes
the links from C to A and from A to C (I hope my scheme will be
displayed correctly):
A <---- B ----> C ---> END
^               ^
 \_____________/

But B is not marked as BTP_DELETED yet - walreceiver stops at nbtxlog:697[2].

2) other backend starts _bt_walk_left from B.
It checks A, goes to from A to C by updated btpo_next and later sees
end of the btree.
So, next step is to check if B was deleted (nbtsearch:2011)[3] and try
to recover.

But B is not yet deleted! It will be marked as BTP_DELETED after a few
millis by walreceiver but not yet.
So, nbtsearch:2046[6] is happens.


------  HOW TO FIX  ------
The first idea was to mark page as BTP_DELETED before updating siblings links.

Second - to update pages in the following order:
* change btpo_next
* mark as BTP_DELETED
* change btpo_prev

Such a changes fix the exactly described above race condition... but
cause a more tricky ones to start happening.
And I think it is better to avoid any too complex unclear solutions here..

Another idea - to sleep a little waiting walreceiver to mark the page
as deleted. But it seems to feel too ugly. Also it is unclear how long
to wait.

So, I think right way is to lock all three pages as it is done on the
primary. As far as I can see it is not causes any real performance
regression.

Patch is attached (on top of REL_12_STABLE).

Thanks,
Michail.

[1]: https://www.postgresql.org/message-id/flat/CANtu0ohOvgteBYmCMc2KERFiJUvpWGB0bRTbK_WseQH-L1jkrQ%40mail.gmail.com
[2]: https://github.com/postgres/postgres/blob/REL_12_STABLE/src/backend/access/nbtree/nbtxlog.c#L697
[3]: https://github.com/postgres/postgres/blob/REL_12_STABLE/src/backend/access/nbtree/nbtsearch.c#L2011
[4]: https://github.com/postgres/postgres/blob/REL_12_STABLE/src/backend/access/nbtree/nbtxlog.c#L575-L581
[5]: https://github.com/postgres/postgres/blob/REL_12_STABLE/src/backend/access/nbtree/README#L289-L314
[6]: https://github.com/postgres/postgres/blob/REL_12_STABLE/src/backend/access/nbtree/nbtsearch.c#L2046
[7]: https://github.com/postgres/postgres/blob/REL_12_STABLE/src/backend/access/nbtree/nbtxlog.c#L696

btree-race.patch (5K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: [PATCH] Btree BackwardScan race condition on Standby during VACUUM

Andrey M. Borodin
Hi Michail!

Very interesting bug.

> 16 марта 2020 г., в 19:07, Michail Nikolaev <[hidden email]> написал(а):
>
> So, I think right way is to lock all three pages as it is done on the
> primary. As far as I can see it is not causes any real performance
> regression.

It seems to me that it's exactly the same check that I was trying to verify in amcheck patch [0].
But there it was verified inside amcheck, but here it is verified by index scan.

Basically, one cannot check that two vice-versa pointers are in agreement without locking both.
As a result, they must be changed under lock too.

In my view, lock coupling is necessary here. I'm not sure we really need to lock three pages though.

Is there a reason why concurrency protocol on standby should not be exactly the same as on primary?


Best regards, Andrey Borodin.

[0] https://commitfest.postgresql.org/24/2254/

Reply | Threaded
Open this post in threaded view
|

Re: [PATCH] Btree BackwardScan race condition on Standby during VACUUM

Peter Geoghegan-4
In reply to this post by Michail Nikolaev
On Mon, Mar 16, 2020 at 7:08 AM Michail Nikolaev
<[hidden email]> wrote:
> ------ ABSTRACT ------
> There is a race condition between btree_xlog_unlink_page and _bt_walk_left.
> A lot of versions are affected including 12 and new-coming 13.
> Happens only on standby. Seems like could not cause invalid query results.

(CC'ing Heikki, just in case.)

Good catch! I haven't tried to reproduce the problem here just yet,
but your explanation is very easy for me to believe.

As you pointed out, the best solution is likely to involve having the
standby imitate the buffer lock acquisitions that take place on the
primary. We don't do that for page splits and page deletions. I think
that it's okay in the case of page splits, since we're only failing to
perform the same bottom-up lock coupling (I added something about that
specific thing to the README recently). Even btree_xlog_unlink_page()
would probably be safe if we didn't have to worry about backwards
scans, which are really a special case. But we do.

FWIW, while I agree that this issue is more likely to occur due to the
effects of commit 558a9165, especially when running your test case, my
own work on B-Tree indexes for Postgres 12 might also be a factor. I
won't get into the reasons now, since they're very subtle, but I have
observed that the Postgres 12 work tends to make page deletion occur
far more frequently with certain workloads. This was really obvious
when I examined the structure of B-Tree indexes over many hours while
BenchmarkSQL/TPC-C [1] ran, for example.

[1] https://github.com/petergeoghegan/benchmarksql
--
Peter Geoghegan


Reply | Threaded
Open this post in threaded view
|

Re: [PATCH] Btree BackwardScan race condition on Standby during VACUUM

Peter Geoghegan-4
In reply to this post by Andrey M. Borodin
On Mon, Mar 16, 2020 at 10:20 PM Andrey M. Borodin <[hidden email]> wrote:
> It seems to me that it's exactly the same check that I was trying to verify in amcheck patch [0].
> But there it was verified inside amcheck, but here it is verified by index scan.

Maybe we can accept your patch after fixing this bug. My objection to
the patch was that it couples locks in a way that's not compatible
with btree_xlog_unlink_page(). But the problem now seems to have been
btree_xlog_unlink_page() itself. It's possible that there are problems
elsewhere, but my recollection is that btree_xlog_unlink_page() was
the problem.

--
Peter Geoghegan


Reply | Threaded
Open this post in threaded view
|

Re: [PATCH] Btree BackwardScan race condition on Standby during VACUUM

Andrey M. Borodin


> 18 марта 2020 г., в 00:37, Peter Geoghegan <[hidden email]> написал(а):
>
> On Mon, Mar 16, 2020 at 10:20 PM Andrey M. Borodin <[hidden email]> wrote:
>> It seems to me that it's exactly the same check that I was trying to verify in amcheck patch [0].
>> But there it was verified inside amcheck, but here it is verified by index scan.
>
> Maybe we can accept your patch after fixing this bug. My objection to
> the patch was that it couples locks in a way that's not compatible
> with btree_xlog_unlink_page(). But the problem now seems to have been
> btree_xlog_unlink_page() itself. It's possible that there are problems
> elsewhere, but my recollection is that btree_xlog_unlink_page() was
> the problem.

The problem was that btree_xlog_split() and btree_xlog_unlink_page() do not couple locks during fixing left links.
Probably, patch in this thread should fix this in btree_xlog_split() too?

Best regards, Andrey Borodin.

Reply | Threaded
Open this post in threaded view
|

Re: [PATCH] Btree BackwardScan race condition on Standby during VACUUM

Michail Nikolaev
Hello.

>  Probably, patch in this thread should fix this in btree_xlog_split() too?

I have spent some time trying to find any possible race condition
between btree_xlog_split and _bt_walk_left… But I can’t find any.
Also, I have tried to cause any issue by putting pg_sleep put into
btree_xlog_split (between releasing and taking of locks) but without
any luck.

I agree it is better to keep the same locking logic for primary and
standby in general. But it is a possible scope of another patch.

Thanks,
Michail.


Reply | Threaded
Open this post in threaded view
|

Re: [PATCH] Btree BackwardScan race condition on Standby during VACUUM

Peter Geoghegan-4
On Fri, Mar 27, 2020 at 8:58 AM Michail Nikolaev
<[hidden email]> wrote:
> I have spent some time trying to find any possible race condition
> between btree_xlog_split and _bt_walk_left… But I can’t find any.
> Also, I have tried to cause any issue by putting pg_sleep put into
> btree_xlog_split (between releasing and taking of locks) but without
> any luck.

I pushed a commit that tries to clear up some of the details around
how locking works during page splits. See commit 9945ad6e.

> I agree it is better to keep the same locking logic for primary and
> standby in general. But it is a possible scope of another patch.

It seems useful, but only up to a point. We don't need to hold locks
across related atomic operations (i.e. across each phase of a page
split or page deletion). In particular, the lock coupling across page
levels that we perform on the primary when ascending the tree
following a page split doesn't need to occur on standbys. I added
something about this to the nbtree README in commit 9f83468b353.

I'm not surprised that you didn't find any problems in
btree_xlog_split(). It is already conservative about locking the
sibling/child pages. It could hardly be more conservative (though see
the code and comments at the end of btree_xlog_split(), which mention
locking and backwards scans directly).

--
Peter Geoghegan


Reply | Threaded
Open this post in threaded view
|

Re: [PATCH] Btree BackwardScan race condition on Standby during VACUUM

Peter Geoghegan-4
In reply to this post by Michail Nikolaev
On Mon, Mar 16, 2020 at 7:08 AM Michail Nikolaev
<[hidden email]> wrote:
> I was sure I have broken something in btree and spent a lot of time
> trying to figure what.
> And later...  I realized what it is bug in btree since a very old times...
> Because of much faster scans with LP_DEAD support on a standby it
> happens much more frequently in my case.

On second thought, I wonder how commit 558a9165 could possibly be
relevant here. nbtree VACUUM doesn't care about the LP_DEAD bit at
all. Sure, btree_xlog_delete_get_latestRemovedXid() is not going to
have to run on the standby on Postgres 12, but that only ever happened
at the point where we might have to split the page on the primary
(i.e. when _bt_delitems_delete() is called on the primary) anyway.
_bt_delitems_delete()/btree_xlog_delete_get_latestRemovedXid() are not
related to page deletion by VACUUM.

It's true that VACUUM will routinely kill tuples that happen to have
their LP_DEAD bit set, but it isn't actually influenced by the fact
that somebody set (or didn't set) any tuple's LP_DEAD bit. VACUUM has
its own strategy for generating recovery conflicts (it relies on
conflicts generated during the pruning phase of heap VACUUMing).
VACUUM is not willing to generate ad-hoc conflicts (in the style of
_bt_delitems_delete()) just to kill a few more tuples in relatively
uncommon cases -- cases where some LP_DEAD bits were set after a
VACUUM process started, but before the VACUUM process reached an
affected (LP_DEAD bits set) leaf page.

Again, I suspect that the problem is more likely to occur on Postgres
12 in practice because page deletion is more likely to occur on that
version. IOW, due to my B-Tree work for Postgres 12: commit dd299df8,
and related commits. That's probably all that there is to it.

--
Peter Geoghegan


Reply | Threaded
Open this post in threaded view
|

Re: [PATCH] Btree BackwardScan race condition on Standby during VACUUM

Michail Nikolaev
Hello, Peter.

>  I added
>  something about this to the nbtree README in commit 9f83468b353.

I have added some updates to your notes in the updated patch version.

I also was trying to keep the original wrapping of the paragraph, so
the patch looks too wordy.

Thanks,
Michail.

btree-race-and-docs.patch (9K) Download Attachment