Found: some pretty ugly VACUUM bugs

Found: some pretty ugly VACUUM bugs

Tom Lane
I believe I've traced down the cause of the Assert trap in VACUUM FULL
that Teodor reported here:
http://archives.postgresql.org/pgsql-hackers/2005-06/msg01278.php

The case that VACUUM is tripping up on is one in which some concurrent
transaction (call it X1) updates and then later deletes a row.  By the
time VACUUM starts, X1 is committed, but it's still within the
OldestXmin horizon, so there are open transactions that should not see
its effects.  We have two tuples in the table, call 'em T1 and T2,
representing the original and updated states of the row:

        XMIN    XMAX    t_ctid
T1      X0      X1      points to T2
T2      X1      X1      points to T2 (i.e., itself)

(X0 is whichever transaction originally inserted T1; it's old enough
to not be interesting.  The reason T1 must point to T2 is that a READ
COMMITTED transaction that decides to update T1 must be able to find
T2 instead.)

The reason this configuration is troublesome is that
HeapTupleSatisfiesVacuum has this (premature?) optimization in it:

    if (TransactionIdEquals(HeapTupleHeaderGetXmin(tuple),
                            HeapTupleHeaderGetXmax(tuple)))
    {
        /*
         * inserter also deleted it, so it was never visible to anyone
         * else
         */
        return HEAPTUPLE_DEAD;
    }

This code causes VACUUM to delete T2, even though T1 is still pointing
at it --- and because T1's XMAX is past the OldestXmin horizon, T1 will
not be deleted.  The Assert that Teodor saw arose from the following
specific series of events (which can only occur if T2 is on a lower-
numbered page than T1):

        * T2 is removed and its slot is marked free.
        * repair_frag moves some unrelated tuple into T2's slot.
        * repair_frag visits T1, decides it has to move a tuple chain,
          and moves the new occupant of T2's slot (which is already
          wrong anyway).
        * when update_hint_bits scans T2's page, it finds the wrong
          number of MOVED_IN tuples because the tuple that was moved
          into T2's slot is now MOVED_OFF.  This triggers the Assert.

However this is merely the least interesting symptom of the problem.
If the state with T1 pointing to a tuple that is actually unrelated
is allowed to persist, then after the VACUUM is done someone else could
find T1 and then update the new occupant of T2's slot under the mistaken
assumption that it's the descendant of T1.

My first instinct for a quick fix was to delete the above-quoted section
of HeapTupleSatisfiesVacuum.  This would ensure that we never remove
a tuple that some other tuple might still be pointing at, unless we are
going to remove the referencing tuple as well.  But this only fixes the
problem for VACUUM FULL, which exclusive-locks the whole table.  With
lazy VACUUM, concurrent transactions can see the intermediate state
with T2 gone and T1 not.  Thus they could write some new tuple into T2's
slot and then make the mistaken update before VACUUM gets a chance to
remove T1.

The only solution I can see (short of abandoning lazy VACUUM) is that
we have to make the code that follows t_ctid chains more wary.  That
code is already aware (at least in the places I looked at) that a t_ctid
link might lead to an empty slot, but if there is a tuple in the slot
it just assumes that that is really a descendant version of the tuple
pointing to it.  That won't do.  But I believe it would work if we also
test that the XMIN of the tuple in the slot equals the XMAX of the
referencing tuple.  If they are unequal, we can conclude that the
original child tuple is dead and has been removed, so there is no
current version of the referencing tuple.

It is clearly true that if the XMIN and XMAX are different, the putative
child tuple isn't really the child.  I believe this test is sufficient,
because VACUUM never removes rows generated by still-in-progress
transactions, and so it is not possible for a single transaction to
store different tuples into the same slot over its lifetime.

This is going to require a number of changes since there are several
places that follow t_ctid chains.  With the change in place, though,
I think it's OK to leave the xmin=xmax optimization in place in
HeapTupleSatisfiesVacuum.

Comments?  Anyone see any flaws in the reasoning?

                        regards, tom lane


Re: Found: some pretty ugly VACUUM bugs

Álvaro Herrera
On Thu, Aug 18, 2005 at 03:48:55PM -0400, Tom Lane wrote:

> The only solution I can see (short of abandoning lazy VACUUM) is that
> we have to make the code that follows t_ctid chains more wary.  That
> code is already aware (at least in the places I looked at) that a t_ctid
> link might lead to an empty slot, but if there is a tuple in the slot
> it just assumes that that is really a descendant version of the tuple
> pointing to it.  That won't do.  But I believe it would work if we also
> test that the XMIN of the tuple in the slot equals the XMAX of the
> referencing tuple.  If they are unequal, we can conclude that the
> original child tuple is dead and has been removed, so there is no
> current version of the referencing tuple.

Interesting failure mode.  While reading it I was suddenly struck by the
thought that overwriting storage managers may somehow be more resistant
to this kind of failure.  This may well be true, since they never need a
VACUUM process that could misjudge whether a tuple is truly dead; but in
the end, concurrent processes have to follow t_ctid chains anyway.

I also considered whether the correct test was xmin=xmax, or whether a
transaction-tree test was needed.  Then I realized that it's not
possible for a transaction to create a tuple chain crossing a
subtransaction boundary.  So the xmin=xmax test is correct.  I assume
you will make a note of this somewhere, just in case we forget later.

> This is going to require a number of changes since there are several
> places that follow t_ctid chains.

I wonder whether this should be refactored so all of them use a single
piece of code.

--
Alvaro Herrera (<alvherre[a]alvh.no-ip.org>)
"How we put our fingers into each other's clay. That is friendship: playing
potter and seeing what shapes can be drawn out of the other" (C. Halloway in
Something Wicked This Way Comes, R. Bradbury)


Re: Found: some pretty ugly VACUUM bugs

Tom Lane
Alvaro Herrera <[hidden email]> writes:
> Interesting failure mode.  While reading it I was suddenly struck by the
> thought that overwriting storage managers may somehow be more resistant
> to this kind of failure.  This may well be true, since they never need a
> VACUUM process that could misjudge whether a tuple is truly dead; but in
> the end, concurrent processes have to follow t_ctid chains anyway.

Yeah.  I think the Oracle style has got about exactly the same issues
if they try to reuse space in the rollback segment.

> I also considered whether the correct test was xmin=xmax, or whether a
> transaction-tree test was needed.  Then I realized that it's not
> possible for a transaction to create a tuple chain crossing a
> subtransaction boundary.  So the xmin=xmax test is correct.

Actually, I thought of a counterexample: consider a tuple updated twice
in the same xact:

        XMIN    XMAX    t_ctid
T1      X0      X1      -> T2
T2      X1      X1      -> T3
T3      X1      -       -> T3 (self)

If we remove T2 we'll be unable to chain from T1 to T3, which would
definitely be wrong.  So I'm now thinking that the special case in
HeapTupleSatisfiesVacuum has to go, too.

>> This is going to require a number of changes since there are several
>> places that follow t_ctid chains.

> I wonder whether this should be refactored so all of them use a single
> piece of code.

Most of the places end up feeding into EvalPlanQual, but passing down
the original tuple's XMAX to there will require changing the APIs of
heap_update, heap_delete, and heap_lock_tuple (sigh).

                        regards, tom lane
