new heapcheck contrib module


Re: new heapcheck contrib module

amul sul
Hi Mark,

I think new structures should be listed in src/tools/pgindent/typedefs.list,
otherwise, pgindent might disturb its indentation.

Regards,
Amul


On Tue, Jul 21, 2020 at 2:32 AM Mark Dilger
<[hidden email]> wrote:

>
>
>
> > On Jul 16, 2020, at 12:38 PM, Robert Haas <[hidden email]> wrote:
> >
> > On Mon, Jul 6, 2020 at 2:06 PM Mark Dilger <[hidden email]> wrote:
> >> The v10 patch without these ideas is here:
> >
> > Along the lines of what Alvaro was saying before, I think this
> > definitely needs to be split up into a series of patches. The commit
> > message for v10 describes it doing three pretty separate things, and I
> > think that argues for splitting it into a series of three patches. I'd
> > argue for this ordering:
> >
> > 0001 Refactoring existing amcheck btree checking functions to optionally
> > return corruption information rather than ereport'ing it.  This is
> > used by the new pg_amcheck command line tool for reporting back to
> > the caller.
> >
> > 0002 Adding new function verify_heapam for checking a heap relation and
> > associated toast relation, if any, to contrib/amcheck.
> >
> > 0003 Adding new contrib module pg_amcheck, which is a command line
> > interface for running amcheck's verifications against tables and
> > indexes.
> >
> > It's too hard to review things like this when it's all mixed together.
>
> The v11 patch series is broken up as you suggest.
>
> > +++ b/contrib/amcheck/t/skipping.pl
> >
> > The name of this file is inconsistent with the tree's usual
> > convention, which is all stuff like 001_whatever.pl, except for
> > src/test/modules/brin, which randomly decided to use two digits
> > instead of three. There's no precedent for a test file with no leading
> > numeric digits. Also, what does "skipping" even have to do with what
> > the test is checking? Maybe it's intended to refer to the new error
> > handling "skipping" the actual error in favor of just reporting it
> > without stopping, but that's not really what the word "skipping"
> > normally means. Finally, it seems a bit over-engineered: do we really
> > need 183 test cases to check that detecting a problem doesn't lead to
> > an abort? Like, if that's the purpose of the test, I'd expect it to
> > check one corrupt relation and one non-corrupt relation, each with and
> > without the no-error behavior. And that's about it. Or maybe it's
> > talking about skipping pages during the checks, because those pages
> > are all-visible or all-frozen? It's not very clear to me what's going
> > on here.
>
> The "skipping" did originally refer to testing verify_heapam()'s option to skip all-visible or all-frozen blocks.  I have renamed it 001_verify_heapam.pl, since it tests that function.
>
> >
> > + TransactionId nextKnownValidXid;
> > + TransactionId oldestValidXid;
> >
> > Please add explanatory comments indicating what these are intended to
> > mean.
>
> Done.
>
> > For most of the structure members, the brief comments
> > already present seem sufficient; but here, more explanation looks
> > necessary and less is provided. The "Values for returning tuples"
> > could possibly also use some more detail.
>
> Ok, I've expanded the comments for these.
>
> > +#define HEAPCHECK_RELATION_COLS 8
> >
> > I think this should really be at the top of the file someplace.
> > Sometimes people have adopted this style when the #define is only used
> > within the function that contains it, but that's not the case here.
>
> Done.
>
> >
> > + ereport(ERROR,
> > + (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> > + errmsg("unrecognized parameter for 'skip': %s", skip),
> > + errhint("please choose from 'all visible', 'all frozen', "
> > + "or NULL")));
> >
> > I think it would be better if we had three string values selecting the
> > different behaviors, and made the parameter NOT NULL but with a
> > default. It seems like that would be easier to understand. Right now,
> > I can tell that my options for what to skip are "all visible", "all
> > frozen", and, uh, some other thing that I don't know what it is. I'm
> > gonna guess the third option is to skip nothing, but it seems best to
> > make that explicit. Also, should we maybe consider spelling this
> > 'all-visible' and 'all-frozen' with dashes, instead of using spaces?
> > Spaces in an option value seems a little icky to me somehow.
>
> I've made the options 'all-visible', 'all-frozen', and 'none'.  It defaults to 'none'.  I did not mark the function as strict, as I think NULL is a reasonable value (and the default) for startblock and endblock.
>
> > + int64 startblock = -1;
> > + int64 endblock = -1;
> > ...
> > + if (!PG_ARGISNULL(3))
> > + startblock = PG_GETARG_INT64(3);
> > + if (!PG_ARGISNULL(4))
> > + endblock = PG_GETARG_INT64(4);
> > ...
> > + if (startblock < 0)
> > + startblock = 0;
> > + if (endblock < 0 || endblock > ctx.nblocks)
> > + endblock = ctx.nblocks;
> > +
> > + for (ctx.blkno = startblock; ctx.blkno < endblock; ctx.blkno++)
> >
> > So, the user can specify a negative value explicitly and it will be
> > treated as the default, and an endblock value that's larger than the
> > relation size will be treated as the relation size. The way pg_prewarm
> > does the corresponding checks seems superior: null indicates the
> > default value, and any non-null value must be within range or you get
> > an error. Also, you seem to be treating endblock as the first block
> > that should not be checked, whereas pg_prewarm takes what seems to me
> > to be the more natural interpretation: the end block is the last block
> > that IS checked. If you do it this way, then someone who specifies the
> > same start and end block will check no blocks -- silently, I think.
>
> Under that regime, for relations with one block of data, (startblock=0, endblock=0) means "check the zero'th block", and for relations with no blocks of data, specifying any non-null (startblock,endblock) pair raises an exception.  I don't like that too much, but I'm happy to defer to precedent.  Since you say pg_prewarm works this way (I did not check), I have changed verify_heapam to do likewise.
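> For concreteness, the pg_prewarm-style rule described above can be sketched in standalone C (simplified types; validate_block_range and its signature are my own illustration, not the patch's API):

```c
#include <stdbool.h>
#include <stdint.h>

typedef int64_t int64;

/*
 * Sketch of pg_prewarm-style block range validation.  NULL arguments are
 * modeled as *_null flags.  On success, *first..*last is the inclusive
 * range of blocks to check; a non-NULL bound outside 0..nblocks-1 is an
 * error (the real code would ereport rather than return false).
 */
static bool
validate_block_range(bool start_null, int64 startblock,
                     bool end_null, int64 endblock,
                     int64 nblocks, int64 *first, int64 *last)
{
    *first = start_null ? 0 : startblock;
    *last = end_null ? nblocks - 1 : endblock;
    if (!start_null && (startblock < 0 || startblock >= nblocks))
        return false;
    if (!end_null && (endblock < 0 || endblock >= nblocks))
        return false;
    return true;
}
```

> Note how, for a one-block relation, (startblock=0, endblock=0) checks exactly block zero, and for an empty relation any non-NULL bound fails.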
>
> > +               if (skip_all_frozen || skip_all_visible)
> >
> > Since you can't skip all frozen without skipping all visible, this
> > test could be simplified. Or you could introduce a three-valued enum
> > and test that skip_pages != SKIP_PAGES_NONE, which might be even
> > better.
>
> It works now with a three-valued enum.
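> For illustration, the enum and the option parsing might look like this (a standalone sketch; SkipPages and parse_skip_option are illustrative names, not necessarily the patch's):

```c
#include <string.h>

typedef enum SkipPages
{
    SKIP_PAGES_NONE,
    SKIP_PAGES_ALL_VISIBLE,
    SKIP_PAGES_ALL_FROZEN
} SkipPages;

/*
 * Map the text 'skip' option to the enum.  The real function would
 * ereport(ERROR) on an unrecognized value; here we signal it with -1.
 */
static int
parse_skip_option(const char *skip)
{
    if (strcmp(skip, "none") == 0)
        return SKIP_PAGES_NONE;
    if (strcmp(skip, "all-visible") == 0)
        return SKIP_PAGES_ALL_VISIBLE;
    if (strcmp(skip, "all-frozen") == 0)
        return SKIP_PAGES_ALL_FROZEN;
    return -1;                  /* unrecognized; caller reports error */
}
```

> With this, the page-skipping test reduces to skip_pages != SKIP_PAGES_NONE, since skipping all-frozen pages is subsumed by skipping all-visible ones.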
>
> > + /* We must unlock the page from the prior iteration, if any */
> > + Assert(ctx.blkno == InvalidBlockNumber || ctx.buffer != InvalidBuffer);
> >
> > I don't understand this assertion, and I don't understand the comment,
> > either. I think ctx.blkno can never be equal to InvalidBlockNumber
> > because we never set it to anything outside the range of 0..(endblock
> > - 1), and I think ctx.buffer must always be unequal to InvalidBuffer
> > because we just initialized it by calling ReadBufferExtended(). So I
> > think this assertion would still pass if we wrote && rather than ||.
> > But even then, I don't know what that has to do with the comment or
> > why it even makes sense to have an assertion for that in the first
> > place.
>
> Yes, it is vestigial.  Removed.
>
> > +       /*
> > +        * Open the relation.  We use ShareUpdateExclusive to prevent concurrent
> > +        * vacuums from changing the relfrozenxid, relminmxid, or advancing the
> > +        * global oldestXid to be newer than those.  This protection
> > saves us from
> > +        * having to reacquire the locks and recheck those minimums for every
> > +        * tuple, which would be expensive.
> > +        */
> > +       ctx.rel = relation_open(relid, ShareUpdateExclusiveLock);
> >
> > I don't think we'd need to recheck for every tuple, would we? Just for
> > cases where there's an apparent violation of the rules.
>
> It's a bit fuzzy what an "apparent violation" might be if both ends of the range of valid xids may be moving, and arbitrarily much.  It's also not clear how often to recheck, since you'd be dealing with a race condition no matter how often you check.  Perhaps the comments shouldn't mention how often you'd have to recheck, since there is no really defensible choice for that.  I removed the offending sentence.
>
> > I guess that
> > could still be expensive if there's a lot of them, but needing
> > ShareUpdateExclusiveLock rather than only AccessShareLock is a little
> > unfortunate.
>
> I welcome strategies that would allow for taking a lesser lock.
>
> > It's also unclear to me why this concerns itself with relfrozenxid and
> > the cluster-wide oldestXid value but not with datfrozenxid. It seems
> > like if we're going to sanity-check the relfrozenxid against the
> > cluster-wide value, we ought to also check it against the
> > database-wide value. Checking neither would also seem like a plausible
> > choice. But it seems very strange to only check against the
> > cluster-wide value.
>
> If the relation has a normal relfrozenxid, then the oldest valid xid we can encounter in the table is relfrozenxid.  Otherwise, each row needs to be compared against some other minimum xid value.
>
> Logically, that other minimum xid value should be the oldest valid xid for the database, which must logically be at least as old as any valid row in the table and no older than the oldest valid xid for the cluster.
>
> Unfortunately, if the comments in commands/vacuum.c circa line 1572 can be believed, and if I am reading them correctly, the stored value for the oldest valid xid in the database has been known to be corrupted by bugs in pg_upgrade.  This is awful.  If I compare the xid of a row in a table against the oldest xid value for the database, and the xid of the row is older, what can I do?  I don't have a principled basis for determining which one of them is wrong.
>
> The logic in verify_heapam is conservative; it makes no guarantees about finding and reporting all corruption, but if it does report a row as corrupt, you can bank on that, bugs in verify_heapam itself notwithstanding.  I think this is a good choice; a tool with only false negatives is much more useful than one with both false positives and false negatives.
>
> I have added a comment about my reasoning to verify_heapam.c.  I'm happy to be convinced of a better strategy for handling this situation.
>
> >
> > +               StaticAssertStmt(InvalidOffsetNumber + 1 == FirstOffsetNumber,
> > +                                                "InvalidOffsetNumber
> > increments to FirstOffsetNumber");
> >
> > If you are going to rely on this property, I agree that it is good to
> > check it. But it would be better to NOT rely on this property, and I
> > suspect the code can be written quite cleanly without relying on it.
> > And actually, that's what you did, because you first set ctx.offnum =
> > InvalidOffsetNumber but then just after that you set ctx.offnum = 0 in
> > the loop initializer. So AFAICS the first initializer, and the static
> > assert, are pointless.
>
> Ah, right you are.  Removed.
>
> >
> > +                       if (ItemIdIsRedirected(ctx.itemid))
> > +                       {
> > +                               uint16 redirect = ItemIdGetRedirect(ctx.itemid);
> > +                               if (redirect <= SizeOfPageHeaderData
> > || redirect >= ph->pd_lower)
> > ...
> > +                               if ((redirect - SizeOfPageHeaderData)
> > % sizeof(uint16))
> >
> > I think that ItemIdGetRedirect() returns an offset, not a byte
> > position. So the expectation that I would have is that it would be any
> > integer >= 0 and <= maxoff. Am I confused?
>
> I think you are right about it returning an offset, which should be between FirstOffsetNumber and maxoff, inclusive.  I have updated the checks.
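> In standalone form, the corrected check is just this (a sketch with pared-down typedefs, not the patch code itself):

```c
#include <stdbool.h>

typedef unsigned short OffsetNumber;

#define InvalidOffsetNumber ((OffsetNumber) 0)
#define FirstOffsetNumber   ((OffsetNumber) 1)

/*
 * A redirect line pointer stores an offset number, not a byte position,
 * so the valid range is FirstOffsetNumber..maxoff inclusive.
 */
static bool
redirect_offnum_valid(OffsetNumber redirect, OffsetNumber maxoff)
{
    return redirect >= FirstOffsetNumber && redirect <= maxoff;
}
```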
>
> > BTW, it seems like it might
> > be good to complain if the item to which it points is LP_UNUSED...
> > AFAIK that shouldn't happen.
>
> Thanks for mentioning that.  It now checks for that.
>
> > +                                errmsg("\"%s\" is not a heap AM",
> >
> > I think the correct wording would be just "is not a heap." The "heap
> > AM" is the thing in pg_am, not a specific table.
>
> Fixed.
>
> > +confess(HeapCheckContext * ctx, char *msg)
> > +TransactionIdValidInRel(TransactionId xid, HeapCheckContext * ctx)
> > +check_tuphdr_xids(HeapTupleHeader tuphdr, HeapCheckContext * ctx)
> >
> > This is what happens when you pgindent without adding all the right
> > things to typedefs.list first ... or when you don't pgindent and have
> > odd ideas about how to indent things.
>
> Hmm.  I don't see the three lines of code you are quoting.  Which patch is that from?
>
> >
> > +       /*
> > +        * In principle, there is nothing to prevent a scan over a large, highly
> > +        * corrupted table from using workmem worth of memory building up the
> > +        * tuplestore.  Don't leak the msg argument memory.
> > +        */
> > +       pfree(msg);
> >
> > Maybe change the second sentence to something like: "That should be
> > OK, else the user can lower work_mem, but we'd better not leak any
> > additional memory."
>
> It may be a little wordy, but I went with
>
>     /*
>      * In principle, there is nothing to prevent a scan over a large, highly
>      * corrupted table from using workmem worth of memory building up the
>      * tuplestore.  That's ok, but if we also leak the msg argument memory
>      * until the end of the query, we could exceed workmem by more than a
>      * trivial amount.  Therefore, free the msg argument each time we are
>      * called rather than waiting for our current memory context to be freed.
>      */
>
> > +/*
> > + * check_tuphdr_xids
> > + *
> > + *     Determine whether tuples are visible for verification.  Similar to
> > + *  HeapTupleSatisfiesVacuum, but with critical differences.
> > + *
> > + *  1) Does not touch hint bits.  It seems imprudent to write hint bits
> > + *     to a table during a corruption check.
> > + *  2) Only makes a boolean determination of whether verification should
> > + *     see the tuple, rather than doing extra work for vacuum-related
> > + *     categorization.
> > + *
> > + *  The caller should already have checked that xmin and xmax are not out of
> > + *  bounds for the relation.
> > + */
> >
> > First, check_tuphdr_xids() doesn't seem like a very good name. If you
> > have a function with that name and, like this one, it returns Boolean,
> > what does true mean? What does false mean? Kinda hard to tell. And
> > also, check the tuple header XIDs *for what*? If you called it, say,
> > tuple_is_visible(), that would be self-evident.
>
> Changed.
>
> > Second, consider that we hold at least AccessShareLock on the relation
> > - actually, ATM we hold ShareUpdateExclusiveLock. Either way, there
> > cannot be a concurrent modification to the tuple descriptor in
> > progress. Therefore, I think that only a HEAPTUPLE_DEAD tuple is
> > potentially using a non-current schema. If the tuple is
> > HEAPTUPLE_INSERT_IN_PROGRESS, there's either no ADD COLUMN in the
> > inserting transaction, or that transaction committed before we got our
> > lock. Similarly if it's HEAPTUPLE_DELETE_IN_PROGRESS or
> > HEAPTUPLE_RECENTLY_DEAD, the original inserter must've committed
> > before we got our lock. Or if it's both inserted and deleted in the
> > same transaction, say, then that transaction committed before we got
> > our lock or else contains no relevant DDL. IOW, I think you can check
> > everything but dead tuples here.
>
> Ok, I have changed tuple_is_visible to return true rather than false for those other cases.
>
> > Capitalization and punctuation for messages complaining about problems
> > need to be consistent. verify_heapam() has "Invalid redirect line
> > pointer offset %u out of bounds" which starts with a capital letter,
> > but check_tuphdr_xids() has "heap tuple with XMAX_IS_MULTI is neither
> > LOCKED_ONLY nor has a valid xmax" which does not. I vote for lower
> > case, but in any event it should be the same.
>
> I standardized on all lowercase text, though I left embedded symbols and constants such as LOCKED_ONLY alone.
>
> > Also,
> > check_tuphdr_xids() has "tuple xvac = %u invalid" which is either a
> > debugging leftover or a very unclear complaint.
>
> Right.  That has been changed to "old-style VACUUM FULL transaction ID %u is invalid in this relation".
>
> > I think some real work
> > needs to be put into the phrasing of these messages so that it's more
> > clear exactly what is going on and why it's bad. For example the first
> > example in this paragraph is clearly a problem of some kind, but it's
> > not very clear exactly what is happening: is %u the offset of the
> > invalid line redirect or the value to which it points? I don't think
> > the phrasing is very grammatical, which makes it hard to tell which is
> > meant, and I actually think it would be a good idea to include both
> > things.
>
> Beware that every row returned from amcheck has more fields than just the error message.
>
>     blkno OUT bigint,
>     offnum OUT integer,
>     lp_off OUT smallint,
>     lp_flags OUT smallint,
>     lp_len OUT smallint,
>     attnum OUT integer,
>     chunk OUT integer,
>     msg OUT text
>
> Rather than including blkno, offnum, lp_off, lp_flags, lp_len, attnum, or chunk in the message, it would be better to remove these things from messages that include them.  For the specific message under consideration, I've converted the text to "line pointer redirection to item at offset number %u is outside valid bounds %u .. %u".  That avoids duplicating the offset information of the referring item, while reporting the offset of the referred item.
>
> > Project policy is generally against splitting a string across multiple
> > lines to fit within 80 characters. We like to fit within 80
> > characters, but we like to be able to grep for strings more, and
> > breaking them up like this makes that harder.
>
> Thanks for clarifying the project policy.  I joined these message strings back together.
>
> > +               confess(ctx,
> > +                               pstrdup("corrupt toast chunk va_header"));
> >
> > This is another message that I don't think is very clear. There's two
> > elements to that. One is that the phrasing is not very good, and the
> > other is that there are no % escapes
>
> Changed to "corrupt extended toast chunk with sequence number %d has invalid varlena header %0x".  I think all the other information about where the corruption was found is already present in the other returned columns.
>
> > What's somebody going to do when
> > they see this message? First, they're probably going to have to look
> > at the code to figure out in which circumstances it gets generated;
> > that's a sign that the message isn't phrased clearly enough. That will
> > tell them that an unexpected bit pattern has been found, but not what
> > that unexpected bit pattern actually was. So then, they're going to
> > have to try to find the relevant va_header by some other means and
> > fish out the relevant bit so that they can see what actually went
> > wrong.
>
> Right.
>
> >
> > + *   Checks the current attribute as tracked in ctx for corruption.  Records
> > + *   any corruption found in ctx->corruption.
> > + *
> > + *
> >
> > Extra blank line.
>
> Fixed.
>
> > +       Form_pg_attribute thisatt = TupleDescAttr(RelationGetDescr(ctx->rel),
> > +
> >                   ctx->attnum);
> >
> > Maybe you could avoid the line wrap by declaring this without
> > initializing it, and then initializing it as a separate statement.
>
> Yes, I like that better.  I did not need to do the same with infomask, but it looks better to me to break the declaration and initialization for both, so I did that.
>
> >
> > +               confess(ctx, psprintf("t_hoff + offset > lp_len (%u + %u > %u)",
> > +
> > ctx->tuphdr->t_hoff, ctx->offset,
> > +                                                         ctx->lp_len));
> >
> > Uggh! This isn't even remotely an English sentence. I don't think
> > formulas are the way to go here, but I like the idea of formulas in
> > some places and written-out messages in others even less. I guess the
> > complaint here in English is something like "tuple attribute %d should
> > start at offset %u, but tuple length is only %u" or something of that
> > sort. Also, it seems like this complaint really ought to have been
> > reported on the *preceding* loop iteration, either complaining that
> > (1) the fixed length attribute is more than the number of remaining
> > bytes in the tuple or (2) the varlena header for the tuple specifies
> > an excessively high length. It seems like you're blaming the wrong
> > attribute for the problem.
>
> Yeah, and it wouldn't complain if the final attribute of a tuple was overlong, as there wouldn't be a next attribute to blame it on.  I've changed it to report as you suggest, although it also still complains if the first attribute starts outside the bounds of the tuple.  The two error messages now read as "tuple attribute should start at offset %u, but tuple length is only %u" and "tuple attribute of length %u ends at offset %u, but tuple length is only %u".
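> The overflow-safe form of those two bounds checks can be sketched as follows (standalone C; the function names are mine, for illustration only):

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Check that an attribute starting at 'offset' bytes into the tuple
 * begins within a tuple of 'lp_len' bytes: "tuple attribute should
 * start at offset %u, but tuple length is only %u".
 */
static bool
attribute_start_ok(uint32_t offset, uint32_t lp_len)
{
    return offset <= lp_len;
}

/*
 * Check that an attribute of 'attlen' bytes also ends within the tuple:
 * "tuple attribute of length %u ends at offset %u, but tuple length is
 * only %u".  Compare against the remaining bytes rather than computing
 * offset + attlen, which could overflow.
 */
static bool
attribute_fits(uint32_t offset, uint32_t attlen, uint32_t lp_len)
{
    return offset <= lp_len && attlen <= lp_len - offset;
}
```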
>
> > BTW, the header comments for this function (check_tuple_attribute)
> > neglect to document the meaning of the return value.
>
> Fixed.
>
> > +                       confess(ctx, psprintf("tuple xmax = %u
> > precedes relation "
> > +
> > "relfrozenxid = %u",
> >
> > This is another example of these messages needing  work. The
> > corresponding message from heap_prepare_freeze_tuple() is "found
> > update xid %u from before relfrozenxid %u". That's better, because we
> > don't normally include equals signs in our messages like this, and
> > also because "relation relfrozenxid" is redundant. I think this should
> > say something like "tuple xmax %u precedes relfrozenxid %u".
> >
> > +                       confess(ctx, psprintf("tuple xmax = %u is in
> > the future",
> > +                                                                 xmax));
> >
> > And then this could be something like "tuple xmax %u follows
> > last-assigned xid %u". That would be more symmetric and more
> > informative.
>
> Both of these have been changed.
>
> > +               if (SizeofHeapTupleHeader + BITMAPLEN(ctx->natts) >
> > ctx->tuphdr->t_hoff)
> >
> > I think we should be able to predict the exact value of t_hoff and
> > complain if it isn't precisely equal to the expected value. Or is that
> > not possible for some reason?
>
> That is possible, and I've updated the error message to match.  There are cases where you can't know if the HEAP_HASNULL bit is wrong or if the t_hoff value is wrong, but I've changed the code to just compute the length based on the HEAP_HASNULL setting and use that as the expected value, and complain when the actual value does not match the expected.  That sidesteps the problem of not knowing exactly which value to blame.
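> Concretely, the expected value can be computed as below (a standalone sketch; the 23-byte SizeofHeapTupleHeader and 8-byte MAXALIGN are assumptions matching typical builds, not values the patch hard-codes):

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SIZEOF_HEAP_TUPLE_HEADER 23   /* assumed; matches stock builds */
#define BITMAPLEN(natts) (((natts) + 7) / 8)
#define MAXALIGN(len) (((uintptr_t) (len) + 7) & ~((uintptr_t) 7))

/*
 * Expected t_hoff: the fixed header size, plus the null bitmap when
 * HEAP_HASNULL is set, rounded up to the platform alignment.  Any
 * other value in the tuple header indicates corruption.
 */
static size_t
expected_hoff(int natts, bool has_nulls)
{
    size_t len = SIZEOF_HEAP_TUPLE_HEADER;

    if (has_nulls)
        len += BITMAPLEN(natts);
    return MAXALIGN(len);
}
```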
>
> > Is there some place that's checking that lp_len >=
> > SizeOfHeapTupleHeader before check_tuple() goes and starts poking into
> > the header? If not, there should be.
>
> Good catch.  check_tuple() now does that before reading the header.
>
> > +$node->command_ok(
> >
> > +       [
> > +               'pg_amcheck', '-p', $port, 'postgres'
> > +       ],
> > +       'pg_amcheck all schemas and tables implicitly');
> > +
> > +$node->command_ok(
> > +       [
> > +               'pg_amcheck', '-i', '-p', $port, 'postgres'
> > +       ],
> > +       'pg_amcheck all schemas, tables and indexes');
> >
> > I haven't really looked through the btree-checking and pg_amcheck
> > parts of this much yet, but this caught my eye. Why would the default
> > be to check tables but not indexes? I think the default ought to be to
> > check everything we know how to check.
>
> I have changed the default to match your expectations.
>
>
>
> —
> Mark Dilger
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
>
>
>


Re: new heapcheck contrib module

amul sul
On Tue, Jul 21, 2020 at 10:58 AM Amul Sul <[hidden email]> wrote:

> >
> > Unfortunately, if the comments in commands/vacuum.c circa line 1572 can be believed, and if I am reading them correctly, the stored value for the oldest valid xid in the database has been known to be corrupted by bugs in pg_upgrade.  This is awful.  If I compare the xid of a row in a table against the oldest xid value for the database, and the xid of the row is older, what can I do?  I don't have a principled basis for determining which one of them is wrong.
> >
> > The logic in verify_heapam is conservative; it makes no guarantees about finding and reporting all corruption, but if it does report a row as corrupt, you can bank on that, bugs in verify_heapam itself notwithstanding.  I think this is a good choice; a tool that produces only false negatives is much more useful than one that produces both false positives and false negatives.
> >
> > I have added a comment about my reasoning to verify_heapam.c.  I'm happy to be convinced of a better strategy for handling this situation.
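
The conservative policy — report a row as corrupt only when its xid is provably out of range — can be sketched as a simple bound check. This simplified version uses plain integer comparisons and ignores xid wraparound, which the real amcheck code must handle with PostgreSQL's TransactionIdPrecedes() machinery; the typedef is a stand-in.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;   /* stand-in for PostgreSQL's typedef */

/*
 * A tuple xid is reported as corrupt only when it falls outside
 * [relfrozenxid, next_xid).  Anything inside the range is presumed
 * valid, guaranteeing no false positives at the cost of possible
 * false negatives.
 */
static bool
xid_in_valid_range(TransactionId xid,
                   TransactionId relfrozenxid,
                   TransactionId next_xid)
{
    return xid >= relfrozenxid && xid < next_xid;
}
```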
> >
> > >
> > > +               StaticAssertStmt(InvalidOffsetNumber + 1 == FirstOffsetNumber,
> > > +                                                "InvalidOffsetNumber
> > > increments to FirstOffsetNumber");
> > >
> > > If you are going to rely on this property, I agree that it is good to
> > > check it. But it would be better to NOT rely on this property, and I
> > > suspect the code can be written quite cleanly without relying on it.
> > > And actually, that's what you did, because you first set ctx.offnum =
> > > InvalidOffsetNumber but then just after that you set ctx.offnum = 0 in
> > > the loop initializer. So AFAICS the first initializer, and the static
> > > assert, are pointless.
> >
> > Ah, right you are.  Removed.
> >
> > >
> > > +                       if (ItemIdIsRedirected(ctx.itemid))
> > > +                       {
> > > +                               uint16 redirect = ItemIdGetRedirect(ctx.itemid);
> > > +                               if (redirect <= SizeOfPageHeaderData
> > > || redirect >= ph->pd_lower)
> > > ...
> > > +                               if ((redirect - SizeOfPageHeaderData)
> > > % sizeof(uint16))
> > >
> > > I think that ItemIdGetRedirect() returns an offset, not a byte
> > > position. So the expectation that I would have is that it would be any
> > > integer >= 0 and <= maxoff. Am I confused?
> >
> > I think you are right about it returning an offset, which should be between FirstOffsetNumber and maxoff, inclusive.  I have updated the checks.
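
A sketch of the corrected bounds check, treating the redirect value as an offset number rather than a byte position. It assumes PostgreSQL's FirstOffsetNumber of 1; the function name is illustrative.

```c
#include <assert.h>
#include <stdbool.h>

#define FIRST_OFFSET_NUMBER 1   /* PostgreSQL's FirstOffsetNumber */

/*
 * A redirect line pointer stores an offset number, so the valid range
 * is FirstOffsetNumber..maxoff inclusive, not a position between
 * SizeOfPageHeaderData and pd_lower.
 */
static bool
redirect_target_in_bounds(unsigned redirect, unsigned maxoff)
{
    return redirect >= FIRST_OFFSET_NUMBER && redirect <= maxoff;
}
```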
> >
> > > BTW, it seems like it might
> > > be good to complain if the item to which it points is LP_UNUSED...
> > > AFAIK that shouldn't happen.
> >
> > Thanks for mentioning that.  It now checks for that.
> >
> > > +                                errmsg("\"%s\" is not a heap AM",
> > >
> > > I think the correct wording would be just "is not a heap." The "heap
> > > AM" is the thing in pg_am, not a specific table.
> >
> > Fixed.
> >
> > > +confess(HeapCheckContext * ctx, char *msg)
> > > +TransactionIdValidInRel(TransactionId xid, HeapCheckContext * ctx)
> > > +check_tuphdr_xids(HeapTupleHeader tuphdr, HeapCheckContext * ctx)
> > >
> > > This is what happens when you pgindent without adding all the right
> > > things to typedefs.list first ... or when you don't pgindent and have
> > > odd ideas about how to indent things.
> >
> > Hmm.  I don't see the three lines of code you are quoting.  Which patch is that from?
> >
> > >
> > > +       /*
> > > +        * In principle, there is nothing to prevent a scan over a large, highly
> > > +        * corrupted table from using workmem worth of memory building up the
> > > +        * tuplestore.  Don't leak the msg argument memory.
> > > +        */
> > > +       pfree(msg);
> > >
> > > Maybe change the second sentence to something like: "That should be
> > > OK, else the user can lower work_mem, but we'd better not leak any
> > > additional memory."
> >
> > It may be a little wordy, but I went with
> >
> >     /*
> >      * In principle, there is nothing to prevent a scan over a large, highly
> >      * corrupted table from using workmem worth of memory building up the
> >      * tuplestore.  That's ok, but if we also leak the msg argument memory
> >      * until the end of the query, we could exceed workmem by more than a
> >      * trivial amount.  Therefore, free the msg argument each time we are
> >      * called rather than waiting for our current memory context to be freed.
> >      */
> >
> > > +/*
> > > + * check_tuphdr_xids
> > > + *
> > > + *     Determine whether tuples are visible for verification.  Similar to
> > > + *  HeapTupleSatisfiesVacuum, but with critical differences.
> > > + *
> > > + *  1) Does not touch hint bits.  It seems imprudent to write hint bits
> > > + *     to a table during a corruption check.
> > > + *  2) Only makes a boolean determination of whether verification should
> > > + *     see the tuple, rather than doing extra work for vacuum-related
> > > + *     categorization.
> > > + *
> > > + *  The caller should already have checked that xmin and xmax are not out of
> > > + *  bounds for the relation.
> > > + */
> > >
> > > First, check_tuphdr_xids() doesn't seem like a very good name. If you
> > > have a function with that name and, like this one, it returns Boolean,
> > > what does true mean? What does false mean? Kinda hard to tell. And
> > > also, check the tuple header XIDs *for what*? If you called it, say,
> > > tuple_is_visible(), that would be self-evident.
> >
> > Changed.
> >
> > > Second, consider that we hold at least AccessShareLock on the relation
> > > - actually, ATM we hold ShareUpdateExclusiveLock. Either way, there
> > > cannot be a concurrent modification to the tuple descriptor in
> > > progress. Therefore, I think that only a HEAPTUPLE_DEAD tuple is
> > > potentially using a non-current schema. If the tuple is
> > > HEAPTUPLE_INSERT_IN_PROGRESS, there's either no ADD COLUMN in the
> > > inserting transaction, or that transaction committed before we got our
> > > lock. Similarly if it's HEAPTUPLE_DELETE_IN_PROGRESS or
> > > HEAPTUPLE_RECENTLY_DEAD, the original inserter must've committed
> > > before we got our lock. Or if it's both inserted and deleted in the
> > > same transaction, say, then that transaction committed before we got
> > > our lock or else contains no relevant DDL. IOW, I think you can check
> > > everything but dead tuples here.
> >
> > Ok, I have changed tuple_is_visible to return true rather than false for those other cases.
> >
> > > Capitalization and punctuation for messages complaining about problems
> > > need to be consistent. verify_heapam() has "Invalid redirect line
> > > pointer offset %u out of bounds" which starts with a capital letter,
> > > but check_tuphdr_xids() has "heap tuple with XMAX_IS_MULTI is neither
> > > LOCKED_ONLY nor has a valid xmax" which does not. I vote for lower
> > > case, but in any event it should be the same.
> >
> > I standardized on all lowercase text, though I left embedded symbols and constants such as LOCKED_ONLY alone.
> >
> > > Also,
> > > check_tuphdr_xids() has "tuple xvac = %u invalid" which is either a
> > > debugging leftover or a very unclear complaint.
> >
> > Right.  That has been changed to "old-style VACUUM FULL transaction ID %u is invalid in this relation".
> >
> > > I think some real work
> > > needs to be put into the phrasing of these messages so that it's more
> > > clear exactly what is going on and why it's bad. For example the first
> > > example in this paragraph is clearly a problem of some kind, but it's
> > > not very clear exactly what is happening: is %u the offset of the
> > > invalid line redirect or the value to which it points? I don't think
> > > the phrasing is very grammatical, which makes it hard to tell which is
> > > meant, and I actually think it would be a good idea to include both
> > > things.
> >
> > Beware that every row returned from amcheck has more fields than just the error message.
> >
> >     blkno OUT bigint,
> >     offnum OUT integer,
> >     lp_off OUT smallint,
> >     lp_flags OUT smallint,
> >     lp_len OUT smallint,
> >     attnum OUT integer,
> >     chunk OUT integer,
> >     msg OUT text
> >
> > Rather than including blkno, offnum, lp_off, lp_flags, lp_len, attnum, or chunk in the message, it would be better to remove these things from messages that include them.  For the specific message under consideration, I've converted the text to "line pointer redirection to item at offset number %u is outside valid bounds %u .. %u".  That avoids duplicating the offset information of the referring item, while reporting the offset of the referred item.
> >
> > > Project policy is generally against splitting a string across multiple
> > > lines to fit within 80 characters. We like to fit within 80
> > > characters, but we like to be able to grep for strings more, and
> > > breaking them up like this makes that harder.
> >
> > Thanks for clarifying the project policy.  I joined these message strings back together.

In v11-0001 and v11-0002 patches, there are still a few more errmsg that need to
be joined.

e.g:

+ /* check to see if caller supports us returning a tuplestore */
+ if (rsinfo == NULL || !IsA(rsinfo, ReturnSetInfo))
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("set-valued function called in context that cannot "
+ "accept a set")));
+ if (!(rsinfo->allowedModes & SFRM_Materialize))
+ ereport(ERROR,
+ (errcode(ERRCODE_SYNTAX_ERROR),
+ errmsg("materialize mode required, but it is not allowed "
+ "in this context")));

> >
> > > +               confess(ctx,
> > > +                               pstrdup("corrupt toast chunk va_header"));
> > >
> > > This is another message that I don't think is very clear. There's two
> > > elements to that. One is that the phrasing is not very good, and the
> > > other is that there are no % escapes
> >
> > Changed to "corrupt extended toast chunk with sequence number %d has invalid varlena header %0x".  I think all the other information about where the corruption was found is already present in the other returned columns.
> >
> > > What's somebody going to do when
> > > they see this message? First, they're probably going to have to look
> > > at the code to figure out in which circumstances it gets generated;
> > > that's a sign that the message isn't phrased clearly enough. That will
> > > tell them that an unexpected bit pattern has been found, but not what
> > > that unexpected bit pattern actually was. So then, they're going to
> > > have to try to find the relevant va_header by some other means and
> > > fish out the relevant bit so that they can see what actually went
> > > wrong.
> >
> > Right.
> >
> > >
> > > + *   Checks the current attribute as tracked in ctx for corruption.  Records
> > > + *   any corruption found in ctx->corruption.
> > > + *
> > > + *
> > >
> > > Extra blank line.
> >
> > Fixed.
> >
> > > +       Form_pg_attribute thisatt = TupleDescAttr(RelationGetDescr(ctx->rel),
> > > +
> > >                   ctx->attnum);
> > >
> > > Maybe you could avoid the line wrap by declaring this without
> > > initializing it, and then initializing it as a separate statement.
> >
> > Yes, I like that better.  I did not need to do the same with infomask, but it looks better to me to break the declaration and initialization for both, so I did that.
> >
> > >
> > > +               confess(ctx, psprintf("t_hoff + offset > lp_len (%u + %u > %u)",
> > > +
> > > ctx->tuphdr->t_hoff, ctx->offset,
> > > +                                                         ctx->lp_len));
> > >
> > > Uggh! This isn't even remotely an English sentence. I don't think
> > > formulas are the way to go here, but I like the idea of formulas in
> > > some places and written-out messages in others even less. I guess the
> > > complaint here in English is something like "tuple attribute %d should
> > > start at offset %u, but tuple length is only %u" or something of that
> > > sort. Also, it seems like this complaint really ought to have been
> > > reported on the *preceding* loop iteration, either complaining that
> > > (1) the fixed length attribute is more than the number of remaining
> > > bytes in the tuple or (2) the varlena header for the tuple specifies
> > > an excessively high length. It seems like you're blaming the wrong
> > > attribute for the problem.
> >
> > Yeah, and it wouldn't complain if the final attribute of a tuple was overlong, as there wouldn't be a next attribute to blame it on.  I've changed it to report as you suggest, although it also still complains if the first attribute starts outside the bounds of the tuple.  The two error messages now read as "tuple attribute should start at offset %u, but tuple length is only %u" and "tuple attribute of length %u ends at offset %u, but tuple length is only %u".
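
The two complaints can be expressed as bounds predicates on the current attribute; the names and parameters here are illustrative, not the patch's.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Blame the current attribute rather than the next one: the attribute
 * must start inside the tuple, and its end (start plus length) must
 * not run past the tuple's length.
 */
static bool
attribute_start_ok(unsigned att_start, unsigned lp_len)
{
    return att_start < lp_len;
}

static bool
attribute_end_ok(unsigned att_start, unsigned att_len, unsigned lp_len)
{
    return att_start + att_len <= lp_len;
}
```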
> >
> > > BTW, the header comments for this function (check_tuple_attribute)
> > > neglect to document the meaning of the return value.
> >
> > Fixed.
> >
> > > +                       confess(ctx, psprintf("tuple xmax = %u
> > > precedes relation "
> > > +
> > > "relfrozenxid = %u",
> > >
> > > This is another example of these messages needing  work. The
> > > corresponding message from heap_prepare_freeze_tuple() is "found
> > > update xid %u from before relfrozenxid %u". That's better, because we
> > > don't normally include equals signs in our messages like this, and
> > > also because "relation relfrozenxid" is redundant. I think this should
> > > say something like "tuple xmax %u precedes relfrozenxid %u".
> > >
> > > +                       confess(ctx, psprintf("tuple xmax = %u is in
> > > the future",
> > > +                                                                 xmax));
> > >
> > > And then this could be something like "tuple xmax %u follows
> > > last-assigned xid %u". That would be more symmetric and more
> > > informative.
> >
> > Both of these have been changed.
> >
> > > +               if (SizeofHeapTupleHeader + BITMAPLEN(ctx->natts) >
> > > ctx->tuphdr->t_hoff)
> > >
> > > I think we should be able to predict the exact value of t_hoff and
> > > complain if it isn't precisely equal to the expected value. Or is that
> > > not possible for some reason?
> >
> > That is possible, and I've updated the error message to match.  There are cases where you can't know if the HEAP_HASNULL bit is wrong or if the t_hoff value is wrong, but I've changed the code to just compute the length based on the HEAP_HASNULL setting and use that as the expected value, and complain when the actual value does not match the expected.  That sidesteps the problem of not knowing exactly which value to blame.
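
The expected-header-size computation can be sketched as follows. The constants and macros are illustrative stand-ins for PostgreSQL's SizeofHeapTupleHeader, MAXALIGN, and BITMAPLEN; real code uses the definitions from the server headers.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative stand-ins; real values come from PostgreSQL's headers. */
#define SIZEOF_HEAP_TUPLE_HEADER 23   /* offsetof(HeapTupleHeaderData, t_bits) */
#define MAXIMUM_ALIGNOF 8
#define MAXALIGN(LEN) (((LEN) + (MAXIMUM_ALIGNOF - 1)) & ~(MAXIMUM_ALIGNOF - 1))
#define BITMAPLEN(NATTS) (((NATTS) + 7) / 8)

/*
 * Compute the expected t_hoff from the HEAP_HASNULL setting, as the
 * revised check does: the fixed header, plus the null bitmap when
 * present, rounded up to the platform alignment.
 */
static unsigned
expected_hoff(int natts, bool has_nulls)
{
    unsigned len = SIZEOF_HEAP_TUPLE_HEADER;

    if (has_nulls)
        len += (unsigned) BITMAPLEN(natts);
    return MAXALIGN(len);
}
```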
> >
> > > Is there some place that's checking that lp_len >=
> > > SizeOfHeapTupleHeader before check_tuple() goes and starts poking into
> > > the header? If not, there should be.
> >
> > Good catch.  check_tuple() now does that before reading the header.
> >
> > > +$node->command_ok(
> > >
> > > +       [
> > > +               'pg_amcheck', '-p', $port, 'postgres'
> > > +       ],
> > > +       'pg_amcheck all schemas and tables implicitly');
> > > +
> > > +$node->command_ok(
> > > +       [
> > > +               'pg_amcheck', '-i', '-p', $port, 'postgres'
> > > +       ],
> > > +       'pg_amcheck all schemas, tables and indexes');
> > >
> > > I haven't really looked through the btree-checking and pg_amcheck
> > > parts of this much yet, but this caught my eye. Why would the default
> > > be to check tables but not indexes? I think the default ought to be to
> > > check everything we know how to check.
> >
> > I have changed the default to match your expectations.
> >
> >
> >
> > —
> > Mark Dilger
> > EnterpriseDB: http://www.enterprisedb.com
> > The Enterprise PostgreSQL Company
> >
> >
> >



Re: new heapcheck contrib module

Mark Dilger-5


> On Jul 20, 2020, at 11:50 PM, Amul Sul <[hidden email]> wrote:
>
> On Tue, Jul 21, 2020 at 10:58 AM Amul Sul <[hidden email]> wrote:
>>
>> Hi Mark,
>>
>> I think new structures should be listed in src/tools/pgindent/typedefs.list,
>> otherwise, pgindent might disturb its indentation.
>>

<snip>

>
> In v11-0001 and v11-0002 patches, there are still a few more errmsg that need to
> be joined.
>
> e.g:
>
> + /* check to see if caller supports us returning a tuplestore */
> + if (rsinfo == NULL || !IsA(rsinfo, ReturnSetInfo))
> + ereport(ERROR,
> + (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
> + errmsg("set-valued function called in context that cannot "
> + "accept a set")));
> + if (!(rsinfo->allowedModes & SFRM_Materialize))
> + ereport(ERROR,
> + (errcode(ERRCODE_SYNTAX_ERROR),
> + errmsg("materialize mode required, but it is not allowed "
> + "in this context")));
Thanks for the review!

I believe these v12 patches resolve the two issues you raised.





Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




v12-0001-Adding-function-verify_btreeam-and-bumping-versi.patch (92K)
v12-0002-Adding-function-verify_heapam-to-amcheck-module.patch (82K)
v12-0003-Adding-contrib-module-pg_amcheck.patch (79K)

Re: new heapcheck contrib module

amul sul
In reply to this post by Mark Dilger-5
On Tue, Jul 21, 2020 at 2:32 AM Mark Dilger
<[hidden email]> wrote:

> [....]
> >
> > +               StaticAssertStmt(InvalidOffsetNumber + 1 == FirstOffsetNumber,
> > +                                                "InvalidOffsetNumber
> > increments to FirstOffsetNumber");
> >
> > If you are going to rely on this property, I agree that it is good to
> > check it. But it would be better to NOT rely on this property, and I
> > suspect the code can be written quite cleanly without relying on it.
> > And actually, that's what you did, because you first set ctx.offnum =
> > InvalidOffsetNumber but then just after that you set ctx.offnum = 0 in
> > the loop initializer. So AFAICS the first initializer, and the static
> > assert, are pointless.
>
> Ah, right you are.  Removed.
>

I can see the same assert and the unnecessary assignment in v12-0002 -- is that
the same thing that was supposed to be removed, or am I missing something?

> [....]
> > +confess(HeapCheckContext * ctx, char *msg)
> > +TransactionIdValidInRel(TransactionId xid, HeapCheckContext * ctx)
> > +check_tuphdr_xids(HeapTupleHeader tuphdr, HeapCheckContext * ctx)
> >
> > This is what happens when you pgindent without adding all the right
> > things to typedefs.list first ... or when you don't pgindent and have
> > odd ideas about how to indent things.
>
> Hmm.  I don't see the three lines of code you are quoting.  Which patch is that from?
>

I think it is the same thing as my previous suggestion to list the new
structures in typedefs.list.  V12 lists the new structures, but I think there
are still some adjustments needed in the code, e.g. the space between
HeapCheckContext and the * (asterisk) needs to be fixed. I am not sure whether
pgindent will do that or not.

Here are a few more minor comments for the v12-0002 patch & some of them
apply to other patches as well:

 #include "utils/snapmgr.h"
-
+#include "amcheck.h"

This doesn't seem to be in the correct place -- it needs to be in sorted order.


+ if (!PG_ARGISNULL(3))
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("starting block " INT64_FORMAT
+ " is out of bounds for relation with no blocks",
+ PG_GETARG_INT64(3))));
+ if (!PG_ARGISNULL(4))
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("ending block " INT64_FORMAT
+ " is out of bounds for relation with no blocks",
+ PG_GETARG_INT64(4))));

I think these errmsg() strings also should be in one line.


+ if (fatal)
+ {
+ if (ctx.toast_indexes)
+ toast_close_indexes(ctx.toast_indexes, ctx.num_toast_indexes,
+ ShareUpdateExclusiveLock);
+ if (ctx.toastrel)
+ table_close(ctx.toastrel, ShareUpdateExclusiveLock);

Toast index and rel closing block style is not the same as at the ending of
verify_heapam().


+ /* If we get this far, we know the relation has at least one block */
+ startblock = PG_ARGISNULL(3) ? 0 : PG_GETARG_INT64(3);
+ endblock = PG_ARGISNULL(4) ? ((int64) ctx.nblocks) - 1 : PG_GETARG_INT64(4);
+ if (startblock < 0 || endblock >= ctx.nblocks || startblock > endblock)
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("block range " INT64_FORMAT " .. " INT64_FORMAT
+ " is out of bounds for relation with block count %u",
+ startblock, endblock, ctx.nblocks)));
+
...
...
+ if (startblock < 0)
+ startblock = 0;
+ if (endblock < 0 || endblock > ctx.nblocks)
+ endblock = ctx.nblocks;

Other than the endblock < 0 case, do we really need that?  I think, due to the above
error check, the rest of the cases will never reach this place.


+ confess(ctx, psprintf(
+   "tuple xmax %u follows last assigned xid %u",
+   xmax, ctx->nextKnownValidXid));
+ fatal = true;
+ }
+ }
+
+ /* Check for tuple header corruption */
+ if (ctx->tuphdr->t_hoff < SizeofHeapTupleHeader)
+ {
+ confess(ctx,
+ psprintf("tuple's header size is %u bytes which is less than the %u
byte minimum valid header size",
+ ctx->tuphdr->t_hoff,
+ (unsigned) SizeofHeapTupleHeader));

The confess() calls use two different code styles: in the first, psprintf()'s only
argument gets its own line, and in the second, psprintf() has its own line along with
its argument. I think the second style is the one we usually follow and is correct, not
the former.


+ if (rel->rd_rel->relam != HEAP_TABLE_AM_OID)
+ ereport(ERROR,
+ (errcode(ERRCODE_WRONG_OBJECT_TYPE),
+ errmsg("\"%s\" is not a heap",
+ RelationGetRelationName(rel))));

Like elsewhere, can we have the errmsg be "only heap AM is supported" and the
error code be ERRCODE_FEATURE_NOT_SUPPORTED?


That's all for now; apologies for the multiple review emails.

Regards,
Amul



Re: new heapcheck contrib module

Mark Dilger-5


> On Jul 26, 2020, at 9:27 PM, Amul Sul <[hidden email]> wrote:
>
> On Tue, Jul 21, 2020 at 2:32 AM Mark Dilger
> <[hidden email]> wrote:
>> [....]
>>>
>>> +               StaticAssertStmt(InvalidOffsetNumber + 1 == FirstOffsetNumber,
>>> +                                                "InvalidOffsetNumber
>>> increments to FirstOffsetNumber");
>>>
>>> If you are going to rely on this property, I agree that it is good to
>>> check it. But it would be better to NOT rely on this property, and I
>>> suspect the code can be written quite cleanly without relying on it.
>>> And actually, that's what you did, because you first set ctx.offnum =
>>> InvalidOffsetNumber but then just after that you set ctx.offnum = 0 in
>>> the loop initializer. So AFAICS the first initializer, and the static
>>> assert, are pointless.
>>
>> Ah, right you are.  Removed.
>>
>
> I can see the same assert and the unnecessary assignment in v12-0002 -- is that
> the same thing that was supposed to be removed, or am I missing something?
That's the same thing.  I removed it, but obviously I somehow removed the removal prior to making the patch.  My best guess is that I reverted some set of changes that unintentionally included this one.

>
>> [....]
>>> +confess(HeapCheckContext * ctx, char *msg)
>>> +TransactionIdValidInRel(TransactionId xid, HeapCheckContext * ctx)
>>> +check_tuphdr_xids(HeapTupleHeader tuphdr, HeapCheckContext * ctx)
>>>
>>> This is what happens when you pgindent without adding all the right
>>> things to typedefs.list first ... or when you don't pgindent and have
>>> odd ideas about how to indent things.
>>
>> Hmm.  I don't see the three lines of code you are quoting.  Which patch is that from?
>>
>
> I think it was the same thing related to my previous suggestion to list new
> structures to typedefs.list.  V12 has listed new structures but I think there
> are still some more adjustments needed in the code e.g. see space between
> HeapCheckContext and * (asterisk) that need to be fixed. I am not sure if the
> pgindent will do that or not.
Hmm.  I'm not seeing an example of HeapCheckContext with wrong spacing.  Can you provide a file and line number?  There was a problem with enum SkipPages.  I've added that to the typedefs.list and rerun pgindent.

While looking at that, I noticed that the function and variable naming conventions in this patch were irregular, with names like TransactionIdValidInRel (init-caps) and tuple_is_visible (underscores), so I spent some time cleaning that up for v13.

> Here are a few more minor comments for the v12-0002 patch & some of them
> apply to other patches as well:
>
> #include "utils/snapmgr.h"
> -
> +#include "amcheck.h"
>
> This doesn't seem to be in the correct place -- it needs to be in sorted order.

Fixed.

> + if (!PG_ARGISNULL(3))
> + ereport(ERROR,
> + (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> + errmsg("starting block " INT64_FORMAT
> + " is out of bounds for relation with no blocks",
> + PG_GETARG_INT64(3))));
> + if (!PG_ARGISNULL(4))
> + ereport(ERROR,
> + (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> + errmsg("ending block " INT64_FORMAT
> + " is out of bounds for relation with no blocks",
> + PG_GETARG_INT64(4))));
>
> I think these errmsg() strings also should be in one line.
I chose not to do so, because the INT64_FORMAT bit breaks up the text even if placed all on one line.  I don't feel strongly about that, though, so I'll join them for v13.

> + if (fatal)
> + {
> + if (ctx.toast_indexes)
> + toast_close_indexes(ctx.toast_indexes, ctx.num_toast_indexes,
> + ShareUpdateExclusiveLock);
> + if (ctx.toastrel)
> + table_close(ctx.toastrel, ShareUpdateExclusiveLock);
>
> Toast index and rel closing block style is not the same as at the ending of
> verify_heapam().
I've harmonized the two.  Thanks for noticing.

> + /* If we get this far, we know the relation has at least one block */
> + startblock = PG_ARGISNULL(3) ? 0 : PG_GETARG_INT64(3);
> + endblock = PG_ARGISNULL(4) ? ((int64) ctx.nblocks) - 1 : PG_GETARG_INT64(4);
> + if (startblock < 0 || endblock >= ctx.nblocks || startblock > endblock)
> + ereport(ERROR,
> + (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> + errmsg("block range " INT64_FORMAT " .. " INT64_FORMAT
> + " is out of bounds for relation with block count %u",
> + startblock, endblock, ctx.nblocks)));
> +
> ...
> ...
> + if (startblock < 0)
> + startblock = 0;
> + if (endblock < 0 || endblock > ctx.nblocks)
> + endblock = ctx.nblocks;
>
> Other than endblock < 0 case
This case does not need special checking, either.  The combination of checking that startblock >= 0 and that startblock <= endblock already handles it.

> , do we really need that?  I think due to the above
> error check the rest of the cases will not reach this place.

We don't need any of that.  Removed in v13.
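
The single up-front validation that makes the later clamping code unnecessary can be sketched as one predicate; once it holds, startblock and endblock are already within 0..nblocks-1 and correctly ordered. The function name is illustrative.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Validate a user-supplied block range against the relation size.
 * Covers startblock < 0, endblock < 0 (implied by the ordering check),
 * endblock >= nblocks, and startblock > endblock in one expression.
 */
static bool
block_range_valid(int64_t startblock, int64_t endblock, uint32_t nblocks)
{
    return startblock >= 0 &&
           startblock <= endblock &&
           endblock < (int64_t) nblocks;
}
```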

> + confess(ctx, psprintf(
> +   "tuple xmax %u follows last assigned xid %u",
> +   xmax, ctx->nextKnownValidXid));
> + fatal = true;
> + }
> + }
> +
> + /* Check for tuple header corruption */
> + if (ctx->tuphdr->t_hoff < SizeofHeapTupleHeader)
> + {
> + confess(ctx,
> + psprintf("tuple's header size is %u bytes which is less than the %u
> byte minimum valid header size",
> + ctx->tuphdr->t_hoff,
> + (unsigned) SizeofHeapTupleHeader));
>
> The confess() calls use two different code styles: in the first, psprintf()'s only
> argument gets its own line, and in the second, psprintf() has its own line along with
> its argument. I think the second style is the one we usually follow and is correct, not
> the former.
Ok, standardized in v13.

> + if (rel->rd_rel->relam != HEAP_TABLE_AM_OID)
> + ereport(ERROR,
> + (errcode(ERRCODE_WRONG_OBJECT_TYPE),
> + errmsg("\"%s\" is not a heap",
> + RelationGetRelationName(rel))));
>
> Like elsewhere, can we have the errmsg be "only heap AM is supported" and the
> error code be ERRCODE_FEATURE_NOT_SUPPORTED?

I'm indifferent about that change.  Done for v13.

> That's all for now; apologies for the multiple review emails.

Not at all!  I appreciate all the reviews.





Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




v13-0001-Adding-function-verify_btreeam-and-bumping-versi.patch (92K)
v13-0002-Adding-function-verify_heapam-to-amcheck-module.patch (81K)
v13-0003-Adding-contrib-module-pg_amcheck.patch (79K)

Re: new heapcheck contrib module

Robert Haas
In reply to this post by Mark Dilger-5
On Mon, Jul 20, 2020 at 5:02 PM Mark Dilger
<[hidden email]> wrote:
> I've made the options 'all-visible', 'all-frozen', and 'none'.  It defaults to 'none'.

That looks nice.

> > I guess that
> > could still be expensive if there's a lot of them, but needing
> > ShareUpdateExclusiveLock rather than only AccessShareLock is a little
> > unfortunate.
>
> I welcome strategies that would allow for taking a lesser lock.

I guess I'm not seeing why you need any particular strategy here. Say
that at the beginning you note the starting relfrozenxid of the table
-- I think I would lean toward just ignoring datfrozenxid and the
cluster-wide value completely. You also note the current value of the
transaction ID counter. Those are the two ends of the acceptable
range.

Let's first consider the oldest acceptable XID, bounded by
relfrozenxid. If you see a value that is older than the relfrozenxid
value that you noted at the start, it is definitely invalid. If you
see a newer value, it could still be older than the table's current
relfrozenxid, but that doesn't seem very worrisome. If the user
vacuumed the table while they were running this tool, they can always
run the tool again afterward if they wish. Forcing the vacuum to wait
by taking ShareUpdateExclusiveLock doesn't actually solve anything
anyway: you STILL won't notice any problems the vacuum introduces, and
in fact you are now GUARANTEED not to notice them, plus now the vacuum
happens later.

Now let's consider the newest acceptable XID, bounded by the value of
the transaction ID counter. Any time you see a newer XID than the last
value of the transaction ID counter that you observed, you go observe
it again. If the value from the table still looks invalid, then you
complain about it. Either way, you remember the new observation and
check future tuples against that value. I think the patch is already
doing this anyway; if it weren't, you'd need an even stronger lock,
one sufficient to prevent any insert/update/delete activity on the
table altogether.

Maybe I'm just being dense here -- exactly what problem are you worried about?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: new heapcheck contrib module

Robert Haas
In reply to this post by Mark Dilger-5
On Mon, Jul 27, 2020 at 1:02 PM Mark Dilger
<[hidden email]> wrote:
> Not at all!  I appreciate all the reviews.

Reviewing 0002, reading through verify_heapam.c:

+typedef enum SkipPages
+{
+ SKIP_ALL_FROZEN_PAGES,
+ SKIP_ALL_VISIBLE_PAGES,
+ SKIP_PAGES_NONE
+} SkipPages;

This looks inconsistent. Maybe just start them all with SKIP_PAGES_.

+ if (PG_ARGISNULL(0))
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("missing required parameter for 'rel'")));

This doesn't look much like other error messages in the code. Do
something like git grep -A4 PG_ARGISNULL | grep -A3 ereport and study
the comparables.

+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("unrecognized parameter for 'skip': %s", skip),
+ errhint("please choose from 'all-visible', 'all-frozen', or 'none'")));

Same problem. Check pg_prewarm's handling of the prewarm type, or
EXPLAIN's handling of the FORMAT option, or similar examples. Read the
message style guidelines concerning punctuation of hint and detail
messages.

+ * Bugs in pg_upgrade are reported (see commands/vacuum.c circa line 1572)
+ * to have sometimes rendered the oldest xid value for a database invalid.
+ * It seems unwise to report rows as corrupt for failing to be newer than
+ * a value which itself may be corrupt.  We instead use the oldest xid for
+ * the entire cluster, which must be at least as old as the oldest xid for
+ * our database.

This kind of reference to another comment will not age well; line
numbers and files change a lot. But I think the right thing to do here
is just rely on relfrozenxid and relminmxid. If the table is
inconsistent with those, then something needs fixing. datfrozenxid and
the cluster-wide value can look out for themselves. The corruption
detector shouldn't be trying to work around any bugs in setting
relfrozenxid itself; such problems are arguably precisely what we're
here to find.

+/*
+ * confess
+ *
+ *   Return a message about corruption, including information
+ *   about where in the relation the corruption was found.
+ *
+ *   The msg argument is pfree'd by this function.
+ */
+static void
+confess(HeapCheckContext *ctx, char *msg)

Contrary to what the comments say, the function doesn't return a
message about corruption or anything else. It returns void.

I don't really like the name, either. I get that it's probably
inspired by Perl, but I think it should be given a less-clever name
like report_corruption() or something.

+ * corrupted table from using workmem worth of memory building up the

This kind of thing destroys grep-ability. If you're going to refer to
work_mem, you gotta spell it the same way we do everywhere else.

+ * Helper function to construct the TupleDesc needed by verify_heapam.

Instead of saying it's the TupleDesc somebody needs, how about saying
that it's the TupleDesc that we'll use to report problems that we find
while scanning the heap, or something like that?

+ * Given a TransactionId, attempt to interpret it as a valid
+ * FullTransactionId, neither in the future nor overlong in
+ * the past.  Stores the inferred FullTransactionId in *fxid.

It really doesn't, because there's no such thing as 'fxid' referenced
anywhere here. You should really make the effort to proofread your
patches before posting, and adjust comments and so on as you go.
Otherwise reviewing takes longer, and if you keep introducing new
stuff like this as you fix other stuff, you can fail to ever produce a
committable patch.

+ * Determine whether tuples are visible for verification.  Similar to
+ *  HeapTupleSatisfiesVacuum, but with critical differences.

Not accurate, because it also reports problems, which is not mentioned
anywhere in the function header comment that purports to be a detailed
description of what the function does.

+ else if (TransactionIdIsCurrentTransactionId(raw_xmin))
+ return true; /* insert or delete in progress */
+ else if (TransactionIdIsInProgress(raw_xmin))
+ return true; /* HEAPTUPLE_INSERT_IN_PROGRESS */
+ else if (!TransactionIdDidCommit(raw_xmin))
+ {
+ return false; /* HEAPTUPLE_DEAD */
+ }

One of these cases is not punctuated like the others.

+ pstrdup("heap tuple with XMAX_IS_MULTI is neither LOCKED_ONLY nor
has a valid xmax"));

1. I don't think that's very grammatical.

2. Why abbreviate HEAP_XMAX_IS_MULTI to XMAX_IS_MULTI and
HEAP_XMAX_IS_LOCKED_ONLY to LOCKED_ONLY? I don't even think you should
be referencing C constant names here at all, and if you are I don't
think you should abbreviate, and if you do abbreviate I don't think
you should omit different numbers of words depending on which constant
it is.

I wonder what the intended division of responsibility is here,
exactly. It seems like you've ended up with some sanity checks in
check_tuple() before tuple_is_visible() is called, and others in
tuple_is_visible() proper. As far as I can see the comments don't
really discuss the logic behind the split, but there's clearly a close
relationship between the two sets of checks, even to the point where
you have "heap tuple with XMAX_IS_MULTI is neither LOCKED_ONLY nor has
a valid xmax" in tuple_is_visible() and "tuple xmax marked
incompatibly as keys updated and locked only" in check_tuple(). Now,
those are not the same check, but they seem like closely related
things, so it's not ideal that they happen in different functions with
differently-formatted messages to report problems and no explanation
of why it's different.

I think it might make sense here to see whether you could either move
more stuff out of tuple_is_visible(), so that it really just checks
whether the tuple is visible, or move more stuff into it, so that it
has the job not only of checking whether we should continue with
checks on the tuple contents but also complaining about any other
visibility problems. Or if neither of those make sense then there
should be a stronger attempt to rationalize in the comments what
checks are going where and for what reason, and also a stronger
attempt to rationalize the message wording.

+ curchunk = DatumGetInt32(fastgetattr(toasttup, 2,
+ ctx->toast_rel->rd_att, &isnull));

Should we be worrying about the possibility of fastgetattr crapping
out if the TOAST tuple is corrupted?

+ if (ctx->tuphdr->t_hoff + ctx->offset > ctx->lp_len)
+ {
+ confess(ctx,
+ psprintf("tuple attribute should start at offset %u, but tuple
length is only %u",
+ ctx->tuphdr->t_hoff + ctx->offset, ctx->lp_len));
+ return false;
+ }
+
+ /* Skip null values */
+ if (infomask & HEAP_HASNULL && att_isnull(ctx->attnum, ctx->tuphdr->t_bits))
+ return true;
+
+ /* Skip non-varlena values, but update offset first */
+ if (thisatt->attlen != -1)
+ {
+ ctx->offset = att_align_nominal(ctx->offset, thisatt->attalign);
+ ctx->offset = att_addlength_pointer(ctx->offset, thisatt->attlen,
+ tp + ctx->offset);
+ return true;
+ }

This looks like it's not going to complain about a fixed-length
attribute that overruns the tuple length. There's code further down
that handles that case for a varlena attribute, but there's nothing
comparable for the fixed-length case.

+ confess(ctx,
+ psprintf("%s toast at offset %u is unexpected",
+ va_tag == VARTAG_INDIRECT ? "indirect" :
+ va_tag == VARTAG_EXPANDED_RO ? "expanded" :
+ va_tag == VARTAG_EXPANDED_RW ? "expanded" :
+ "unexpected",
+ ctx->tuphdr->t_hoff + ctx->offset));

I suggest "unexpected TOAST tag %d", without trying to convert to a
string. Such a conversion will likely fail in the case of genuine
corruption, and isn't meaningful even if it works.

Again, let's try to standardize terminology here: most of the messages
in this function are now of the form "tuple attribute %d has some
problem" or "attribute %d has some problem", but some have neither.
Since we're separately returning attnum I don't see why it should be
in the message, and if we weren't separately returning attnum then it
ought to be in the message the same way all the time, rather than
sometimes writing "attribute" and other times "tuple attribute".

+ /* Check relminmxid against mxid, if any */
+ xmax = HeapTupleHeaderGetRawXmax(ctx->tuphdr);
+ if (infomask & HEAP_XMAX_IS_MULTI &&
+ MultiXactIdPrecedes(xmax, ctx->relminmxid))
+ {
+ confess(ctx,
+ psprintf("tuple xmax %u precedes relminmxid %u",
+ xmax, ctx->relminmxid));
+ fatal = true;
+ }

There are checks that an XID is neither too old nor too new, and
presumably something similar could be done for MultiXactIds, but here
you only check one end of the range. Seems like you should check both.

+ /* Check xmin against relfrozenxid */
+ xmin = HeapTupleHeaderGetXmin(ctx->tuphdr);
+ if (TransactionIdIsNormal(ctx->relfrozenxid) &&
+ TransactionIdIsNormal(xmin))
+ {
+ if (TransactionIdPrecedes(xmin, ctx->relfrozenxid))
+ {
+ confess(ctx,
+ psprintf("tuple xmin %u precedes relfrozenxid %u",
+ xmin, ctx->relfrozenxid));
+ fatal = true;
+ }
+ else if (!xid_valid_in_rel(xmin, ctx))
+ {
+ confess(ctx,
+ psprintf("tuple xmin %u follows last assigned xid %u",
+ xmin, ctx->next_valid_xid));
+ fatal = true;
+ }
+ }

Here you do check both ends of the range, but the comment claims
otherwise. Again, please proof-read for this kind of stuff.

+ /* Check xmax against relfrozenxid */

Ditto here.

+ psprintf("tuple's header size is %u bytes which is less than the %u
byte minimum valid header size",

I suggest: tuple data begins at byte %u, but the tuple header must be
at least %u bytes

+ psprintf("tuple's %u byte header size exceeds the %u byte length of
the entire tuple",

I suggest: tuple data begins at byte %u, but the entire tuple length
is only %u bytes

+ psprintf("tuple's user data offset %u not maximally aligned to %u",

I suggest: tuple data begins at byte %u, but that is not maximally aligned
Or: tuple data begins at byte %u, which is not a multiple of %u

That makes the messages look much more similar to each other
grammatically and is more consistent about calling things by the same
names.

+ psprintf("tuple with null values has user data offset %u rather than
the expected offset %u",
+ psprintf("tuple without null values has user data offset %u rather
than the expected offset %u",

I suggest merging these: tuple data offset %u, but expected offset %u
(%u attributes, %s)
where %s is either "has nulls" or "no nulls"

In fact, aren't several of the above checks redundant with this one?
Like, why check for a value less than SizeofHeapTupleHeader or that's
not properly aligned first? Just check this straightaway and call it
good.

+ * If we get this far, the tuple is visible to us, so it must not be
+ * incompatible with our relDesc.  The natts field could be legitimately
+ * shorter than rel's natts, but it cannot be longer than rel's natts.

This is yet another case where you didn't update the comments.
tuple_is_visible() now checks whether the tuple is visible to anyone,
not whether it's visible to us, but the comment doesn't agree. In some
sense I think this comment is redundant with the previous one anyway,
because that one already talks about the tuple being visible. Maybe
just write: The tuple is visible, so it must be compatible with the
current version of the relation descriptor. It might have fewer
columns than are present in the relation descriptor, but it cannot
have more.

+ psprintf("tuple has %u attributes in relation with only %u attributes",
+ ctx->natts,
+ RelationGetDescr(ctx->rel)->natts));

I suggest: tuple has %u attributes, but relation has only %u attributes

+ /*
+ * Iterate over the attributes looking for broken toast values. This
+ * roughly follows the logic of heap_deform_tuple, except that it doesn't
+ * bother building up isnull[] and values[] arrays, since nobody wants
+ * them, and it unrolls anything that might trip over an Assert when
+ * processing corrupt data.
+ */
+ ctx->offset = 0;
+ for (ctx->attnum = 0; ctx->attnum < ctx->natts; ctx->attnum++)
+ {
+ if (!check_tuple_attribute(ctx))
+ break;
+ }

I think this comment is too wordy. This text belongs in the header
comment of check_tuple_attribute(), not at the place where it gets
called. Otherwise, as you update what check_tuple_attribute() does,
you have to remember to come find this comment and fix it to match,
and you might forget to do that. In fact... looks like that already
happened, because check_tuple_attribute() now checks more than broken
TOAST attributes. Seems like you could just simplify this down to
something like "Now check each attribute." Also, you could lose the
extra braces.

- bt_index_check |             relname             | relpages
+ bt_index_check |             relname             | relpages

Don't include unrelated changes in the patch.

I'm not really sure that the list of fields you're displaying for each
reported problem really makes sense. I think the theory here should be
that we want to report the information that the user needs to localize
the problem but not everything that they could find out from
inspecting the page, and not things that are too specific to
particular classes of errors. So I would vote for keeping blkno,
offnum, and attnum, but I would lose lp_flags, lp_len, and chunk.
lp_off feels like it's a more arguable case: technically, it's a
locator for the problem, because it gives you the byte offset within
the page, but normally we reference tuples by TID, i.e. (blkno,
offset), not byte offset. On balance I'd be inclined to omit it.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: new heapcheck contrib module

Mark Dilger-5
In reply to this post by Robert Haas


> On Jul 29, 2020, at 12:52 PM, Robert Haas <[hidden email]> wrote:
>
> On Mon, Jul 20, 2020 at 5:02 PM Mark Dilger
> <[hidden email]> wrote:
>> I've made the options 'all-visible', 'all-frozen', and 'none'.  It defaults to 'none'.
>
> That looks nice.
>
>>> I guess that
>>> could still be expensive if there's a lot of them, but needing
>>> ShareUpdateExclusiveLock rather than only AccessShareLock is a little
>>> unfortunate.
>>
>> I welcome strategies that would allow for taking a lesser lock.
>
> I guess I'm not seeing why you need any particular strategy here. Say
> that at the beginning you note the starting relfrozenxid of the table
> -- I think I would lean toward just ignoring datfrozenxid and the
> cluster-wide value completely. You also note the current value of the
> transaction ID counter. Those are the two ends of the acceptable
> range.
>
> Let's first consider the oldest acceptable XID, bounded by
> relfrozenxid. If you see a value that is older than the relfrozenxid
> value that you noted at the start, it is definitely invalid. If you
> see a newer value, it could still be older than the table's current
> relfrozenxid, but that doesn't seem very worrisome. If the user
> vacuumed the table while they were running this tool, they can always
> run the tool again afterward if they wish. Forcing the vacuum to wait
> by taking ShareUpdateExclusiveLock doesn't actually solve anything
> anyway: you STILL won't notice any problems the vacuum introduces, and
> in fact you are now GUARANTEED not to notice them, plus now the vacuum
> happens later.
>
> Now let's consider the newest acceptable XID, bounded by the value of
> the transaction ID counter. Any time you see a newer XID than the last
> value of the transaction ID counter that you observed, you go observe
> it again. If the value from the table still looks invalid, then you
> complain about it. Either way, you remember the new observation and
> check future tuples against that value. I think the patch is already
> doing this anyway; if it weren't, you'd need an even stronger lock,
> one sufficient to prevent any insert/update/delete activity on the
> table altogether.
>
> Maybe I'm just being dense here -- exactly what problem are you worried about?

Per tuple, tuple_is_visible() potentially checks whether the xmin or xmax committed via TransactionIdDidCommit.  I am worried about concurrent truncation of clog entries causing I/O errors on SLRU lookup when performing that check.  The three strategies I had for dealing with that were taking the XactTruncationLock (formerly known as CLogTruncationLock, for those reading this thread from the beginning), locking out vacuum, and the idea upthread from Andres about setting PROC_IN_VACUUM and such.  Maybe I'm being dense and don't need to worry about this.  But I haven't convinced myself of that, yet.


Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company





Re: new heapcheck contrib module

Andres Freund
Hi,

On 2020-07-30 13:18:01 -0700, Mark Dilger wrote:
> Per tuple, tuple_is_visible() potentially checks whether the xmin or xmax committed via TransactionIdDidCommit.  I am worried about concurrent truncation of clog entries causing I/O errors on SLRU lookup when performing that check.  The three strategies I had for dealing with that were taking the XactTruncationLock (formerly known as CLogTruncationLock, for those reading this thread from the beginning), locking out vacuum, and the idea upthread from Andres about setting PROC_IN_VACUUM and such.  Maybe I'm being dense and don't need to worry about this.  But I haven't convinced myself of that, yet.

I think it's not at all ok to look in the procarray or clog for xids
that are older than what you're announcing you may read. IOW I don't
think it's OK to just ignore the problem, or try to work around it by
holding XactTruncationLock.

Greetings,

Andres Freund


Re: new heapcheck contrib module

Robert Haas
In reply to this post by Mark Dilger-5
On Thu, Jul 30, 2020 at 4:18 PM Mark Dilger
<[hidden email]> wrote:
> > Maybe I'm just being dense here -- exactly what problem are you worried about?
>
> Per tuple, tuple_is_visible() potentially checks whether the xmin or xmax committed via TransactionIdDidCommit.  I am worried about concurrent truncation of clog entries causing I/O errors on SLRU lookup when performing that check.  The three strategies I had for dealing with that were taking the XactTruncationLock (formerly known as CLogTruncationLock, for those reading this thread from the beginning), locking out vacuum, and the idea upthread from Andres about setting PROC_IN_VACUUM and such.  Maybe I'm being dense and don't need to worry about this.  But I haven't convinced myself of that, yet.

I don't get it. If you've already checked that the XIDs are >=
relfrozenxid and <= ReadNewFullTransactionId(), then this shouldn't be
a problem. It could be, if CLOG is hosed, which is possible, because
if the table is corrupted, why shouldn't CLOG also be corrupted? But
I'm not sure that's what your concern is here.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: new heapcheck contrib module

Mark Dilger-5


> On Jul 30, 2020, at 2:00 PM, Robert Haas <[hidden email]> wrote:
>
> On Thu, Jul 30, 2020 at 4:18 PM Mark Dilger
> <[hidden email]> wrote:
>>> Maybe I'm just being dense here -- exactly what problem are you worried about?
>>
>> Per tuple, tuple_is_visible() potentially checks whether the xmin or xmax committed via TransactionIdDidCommit.  I am worried about concurrent truncation of clog entries causing I/O errors on SLRU lookup when performing that check.  The three strategies I had for dealing with that were taking the XactTruncationLock (formerly known as CLogTruncationLock, for those reading this thread from the beginning), locking out vacuum, and the idea upthread from Andres about setting PROC_IN_VACUUM and such.  Maybe I'm being dense and don't need to worry about this.  But I haven't convinced myself of that, yet.
>
> I don't get it. If you've already checked that the XIDs are >=
> relfrozenxid and <= ReadNewFullTransactionId(), then this shouldn't be
> a problem. It could be, if CLOG is hosed, which is possible, because
> if the table is corrupted, why shouldn't CLOG also be corrupted? But
> I'm not sure that's what your concern is here.

No, that wasn't my concern.  I was thinking about CLOG entries disappearing during the scan as a consequence of concurrent vacuums, and the effect that would have on the validity of the cached [relfrozenxid..next_valid_xid] range.  In the absence of corruption, I don't immediately see how this would cause any problems.  But for a corrupt table, I'm less certain how it would play out.

The kind of scenario I'm worried about may not be possible in practice.  I think it would depend on how vacuum behaves when scanning a corrupt table that is corrupt in some way that vacuum doesn't notice, and whether vacuum could finish scanning the table with the false belief that it has frozen all tuples with xids less than some cutoff.

I thought it would be safer if that kind of thing were not happening during verify_heapam's scan of the table.  Even if a careful analysis proved it was not an issue with the current coding of vacuum, I don't think there is any coding convention requiring future versions of vacuum to be hardened against corruption, so I don't see how I can rely on vacuum not causing such problems.

I don't think this is necessarily a too-rare-to-care-about type concern, either.  If corruption across multiple tables prevents autovacuum from succeeding, and the DBA doesn't get involved in scanning tables for corruption until the lack of successful vacuums impacts the production system, I imagine you could end up with vacuums repeatedly happening (or trying to happen) around the time the DBA is trying to fix tables, or perhaps drop them, or whatever, using verify_heapam for guidance on which tables are corrupted.

Anyway, that's what I was thinking.  I was imagining that calling TransactionIdDidCommit might keep crashing the backend while the DBA is trying to find and fix corruption, and that could get really annoying.


Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company





Re: new heapcheck contrib module

Mark Dilger-5
In reply to this post by Andres Freund


> On Jul 30, 2020, at 1:47 PM, Andres Freund <[hidden email]> wrote:
>
> Hi,
>
> On 2020-07-30 13:18:01 -0700, Mark Dilger wrote:
>> Per tuple, tuple_is_visible() potentially checks whether the xmin or xmax committed via TransactionIdDidCommit.  I am worried about concurrent truncation of clog entries causing I/O errors on SLRU lookup when performing that check.  The three strategies I had for dealing with that were taking the XactTruncationLock (formerly known as CLogTruncationLock, for those reading this thread from the beginning), locking out vacuum, and the idea upthread from Andres about setting PROC_IN_VACUUM and such.  Maybe I'm being dense and don't need to worry about this.  But I haven't convinced myself of that, yet.
>
> I think it's not at all ok to look in the procarray or clog for xids
> that are older than what you're announcing you may read. IOW I don't
> think it's OK to just ignore the problem, or try to work around it by
> holding XactTruncationLock.

The current state of the patch is that concurrent vacuums are kept out of the table being checked by means of taking a ShareUpdateExclusive lock on the table being checked.  In response to Robert's review, I was contemplating whether that was necessary, but you raise the interesting question of whether it is even sufficient.  The logic in verify_heapam is currently relying on the ShareUpdateExclusive lock to prevent any of the xids in the range relfrozenxid..nextFullXid from being invalid arguments to TransactionIdDidCommit.  Ignoring whether that is a good choice vis-a-vis performance, is that even a valid strategy?  It sounds like you are saying it is not.


Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company





Re: new heapcheck contrib module

Robert Haas
In reply to this post by Mark Dilger-5
On Thu, Jul 30, 2020 at 6:10 PM Mark Dilger
<[hidden email]> wrote:
> No, that wasn't my concern.  I was thinking about CLOG entries disappearing during the scan as a consequence of concurrent vacuums, and the effect that would have on the validity of the cached [relfrozenxid..next_valid_xid] range.  In the absence of corruption, I don't immediately see how this would cause any problems.  But for a corrupt table, I'm less certain how it would play out.

Oh, hmm. I wasn't thinking about that problem. I think the only way
this can happen is if we read a page and then, before we try to look
up the CID, vacuum zooms past, finishes the whole table, and truncates
clog. But if that's possible, then it seems like it would be an issue
for SELECT as well, and it apparently isn't, or we would've done
something about it by now. I think the reason it's not possible is
because of the locking rules described in
src/backend/storage/buffer/README, which require that you hold a
buffer lock until you've determined that the tuple is visible. Since
you hold a share lock on the buffer, a VACUUM that hasn't already
processed that buffer cannot freeze the tuples in it; it would need an
exclusive lock on the buffer to do that. Therefore it can't finish and
truncate clog either.

Now, you raise the question of whether this is still true if the table
is corrupt, but I don't really see why that makes any difference.
VACUUM is supposed to freeze each page it encounters, to the extent
that such freezing is necessary, and with Andres's changes, it's
supposed to ERROR out if things are messed up. We can postulate a bug
in that logic, but inserting a VACUUM-blocking lock into this tool to
guard against a hypothetical vacuum bug seems strange to me. Why would
the right solution not be to fix such a bug if and when we find that
there is one?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: new heapcheck contrib module

Mark Dilger-5


> On Jul 30, 2020, at 5:53 PM, Robert Haas <[hidden email]> wrote:
>
> On Thu, Jul 30, 2020 at 6:10 PM Mark Dilger
> <[hidden email]> wrote:
>> No, that wasn't my concern.  I was thinking about CLOG entries disappearing during the scan as a consequence of concurrent vacuums, and the effect that would have on the validity of the cached [relfrozenxid..next_valid_xid] range.  In the absence of corruption, I don't immediately see how this would cause any problems.  But for a corrupt table, I'm less certain how it would play out.
>
> Oh, hmm. I wasn't thinking about that problem. I think the only way
> this can happen is if we read a page and then, before we try to look
> up the XID, vacuum zooms past, finishes the whole table, and truncates
> clog. But if that's possible, then it seems like it would be an issue
> for SELECT as well, and it apparently isn't, or we would've done
> something about it by now. I think the reason it's not possible is
> because of the locking rules described in
> src/backend/storage/buffer/README, which require that you hold a
> buffer lock until you've determined that the tuple is visible. Since
> you hold a share lock on the buffer, a VACUUM that hasn't already
> processed that buffer cannot freeze the tuples in it; it would need an
> exclusive lock on the buffer to do that. Therefore it can't finish and
> truncate clog either.
>
> Now, you raise the question of whether this is still true if the table
> is corrupt, but I don't really see why that makes any difference.
> VACUUM is supposed to freeze each page it encounters, to the extent
> that such freezing is necessary, and with Andres's changes, it's
> supposed to ERROR out if things are messed up. We can postulate a bug
> in that logic, but inserting a VACUUM-blocking lock into this tool to
> guard against a hypothetical vacuum bug seems strange to me. Why would
> the right solution not be to fix such a bug if and when we find that
> there is one?

Since I can't think of a plausible concrete example of corruption which would elicit the problem I was worrying about, I'll withdraw the argument.  But that leaves me wondering about a comment that Andres made upthread:

> On Apr 20, 2020, at 12:42 PM, Andres Freund <[hidden email]> wrote:

> I don't think random interspersed uses of CLogTruncationLock are a good
> idea. If you move to only checking visibility after tuple fits into
> [relfrozenxid, nextXid), then you don't need to take any locks here, as
> long as a lock against vacuum is taken (which I think this should do
> anyway).


Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company





Re: new heapcheck contrib module

Robert Haas
On Thu, Jul 30, 2020 at 9:38 PM Mark Dilger
<[hidden email]> wrote:

> > On Jul 30, 2020, at 5:53 PM, Robert Haas <[hidden email]> wrote:
> > On Thu, Jul 30, 2020 at 6:10 PM Mark Dilger
> Since I can't think of a plausible concrete example of corruption which would elicit the problem I was worrying about, I'll withdraw the argument.  But that leaves me wondering about a comment that Andres made upthread:
>
> > On Apr 20, 2020, at 12:42 PM, Andres Freund <[hidden email]> wrote:
>
> > I don't think random interspersed uses of CLogTruncationLock are a good
> > idea. If you move to only checking visibility after tuple fits into
> > [relfrozenxid, nextXid), then you don't need to take any locks here, as
> > long as a lock against vacuum is taken (which I think this should do
> > anyway).

The version of the patch I'm looking at doesn't seem to mention
CLogTruncationLock at all, so I'm confused about the comment. But what
is it that you are wondering about, exactly?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: new heapcheck contrib module

Mark Dilger-5


> On Jul 31, 2020, at 5:02 AM, Robert Haas <[hidden email]> wrote:
>
> On Thu, Jul 30, 2020 at 9:38 PM Mark Dilger
> <[hidden email]> wrote:
>>> On Jul 30, 2020, at 5:53 PM, Robert Haas <[hidden email]> wrote:
>>> On Thu, Jul 30, 2020 at 6:10 PM Mark Dilger
>> Since I can't think of a plausible concrete example of corruption which would elicit the problem I was worrying about, I'll withdraw the argument.  But that leaves me wondering about a comment that Andres made upthread:
>>
>>> On Apr 20, 2020, at 12:42 PM, Andres Freund <[hidden email]> wrote:
>>
>>> I don't think random interspersed uses of CLogTruncationLock are a good
>>> idea. If you move to only checking visibility after tuple fits into
>>> [relfrozenxid, nextXid), then you don't need to take any locks here, as
>>> long as a lock against vacuum is taken (which I think this should do
>>> anyway).
>
> The version of the patch I'm looking at doesn't seem to mention
> CLogTruncationLock at all, so I'm confused about the comment. But what
> is it that you are wondering about, exactly?

In earlier versions of the patch, I was guarding (perhaps unnecessarily) against clog truncation, (perhaps incorrectly) by taking the CLogTruncationLock (aka XactTruncationLock).  I thought Andres was arguing that such locks were not necessary "as long as a lock against vacuum is taken".  That's what motivated me to remove the clog locking business and put in the ShareUpdateExclusive lock.  I don't want to remove the ShareUpdateExclusive lock from the patch without a clarification from Andres on the subject.  His recent reply upthread seems to still support the idea that some kind of protection is required:

> I think it's not at all ok to look in the procarray or clog for xids
> that are older than what you're announcing you may read. IOW I don't
> think it's OK to just ignore the problem, or try to work around it by
> holding XactTruncationLock.

I don't understand that paragraph fully, in particular the part about "than what you're announcing you may read", since the cached value of relfrozenxid is not announced; we're just assuming that, as long as vacuum cannot advance it during our scan, we should be safe checking whether xids newer than that value (and not in the future) were committed.

Andres?


Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company






Re: new heapcheck contrib module

Andres Freund
Hi,

On 2020-07-31 08:51:50 -0700, Mark Dilger wrote:

> In earlier versions of the patch, I was guarding (perhaps
> unnecessarily) against clog truncation, (perhaps incorrectly) by
> taking the CLogTruncationLock (aka XactTruncationLock).  I thought
> Andres was arguing that such locks were not necessary "as long as a
> lock against vacuum is taken".  That's what motivated me to remove the
> clog locking business and put in the ShareUpdateExclusive lock.  I
> don't want to remove the ShareUpdateExclusive lock from the patch
> without perhaps a clarification from Andres on the subject.  His
> recent reply upthread seems to still support the idea that some kind
> of protection is required:

I'm not sure what I was thinking "back then", but right now I'd argue
that the best lock against vacuum isn't a SUE, but announcing the
correct ->xmin, so you can be sure that clog entries won't be yanked out
from under you. Potentially with the right flags set to avoid old-enough
tuples being pruned.


> > I think it's not at all ok to look in the procarray or clog for xids
> > that are older than what you're announcing you may read. IOW I don't
> > think it's OK to just ignore the problem, or try to work around it by
> > holding XactTruncationLock.
>
> I don't understand that paragraph fully, in particular the part about
> "than what you're announcing you may read", since the cached value of
> relfrozenxid is not announced; we're just assuming that as long as
> vacuum cannot advance it during our scan, we should be safe
> checking whether xids newer than that value (and not in the future)
> were committed.

With 'announcing' I mean using the normal mechanism for avoiding the
clog being truncated for values one might look up. Which is announcing
the oldest xid one may look up in PGXACT->xmin.
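Concretely, the mechanism being described would amount to something like the following pseudocode sketch (field names from the PG13-era procarray, where PGXACT still existed; this is illustrative only, not code from the patch):

```
/* pseudocode sketch -- not actual patch code */
LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
MyPgXact->xmin = relfrozenxid;    /* oldest XID we may look up */
LWLockRelease(ProcArrayLock);

/*
 * ... run the heap checks; while this xmin is advertised, vacuum's
 * computed cutoff cannot pass it, so clog for XIDs >= relfrozenxid
 * cannot be truncated out from under us ...
 */

MyPgXact->xmin = InvalidTransactionId;    /* done; stop advertising */
```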

Greetings,

Andres Freund



Re: new heapcheck contrib module

Robert Haas
On Fri, Jul 31, 2020 at 12:33 PM Andres Freund <[hidden email]> wrote:
> I'm not sure what I was thinking "back then", but right now I'd argue
> that the best lock against vacuum isn't a SUE, but announcing the
> correct ->xmin, so you can be sure that clog entries won't be yanked out
> from under you. Potentially with the right flags set to avoid old-enough
> tuples being pruned.

Suppose we don't even do anything special in terms of advertising
xmin. What can go wrong? To have a problem, we've got to be running
concurrently with a vacuum that truncates clog. The clog truncation
must happen before our XID lookups, but vacuum has to remove the XIDs
from the heap before it can truncate. So we have to observe the XIDs
before vacuum removes them, but then vacuum has to truncate before we
look them up. But since we observe them and look them up while holding
a ShareLock on the buffer, this seems impossible. What's the flaw in
this argument?

If we do need to do something special in terms of advertising xmin,
how would you do it? Normally it happens by registering a snapshot,
but here all we would have is an XID; specifically, the value of
relfrozenxid that we observed.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: new heapcheck contrib module

Andres Freund
Hi,

On 2020-07-31 12:42:51 -0400, Robert Haas wrote:

> On Fri, Jul 31, 2020 at 12:33 PM Andres Freund <[hidden email]> wrote:
> > I'm not sure what I was thinking "back then", but right now I'd argue
> > that the best lock against vacuum isn't a SUE, but announcing the
> > correct ->xmin, so you can be sure that clog entries won't be yanked out
> > from under you. Potentially with the right flags set to avoid old-enough
> > tuples being pruned.
>
> Suppose we don't even do anything special in terms of advertising
> xmin. What can go wrong? To have a problem, we've got to be running
> concurrently with a vacuum that truncates clog. The clog truncation
> must happen before our XID lookups, but vacuum has to remove the XIDs
> from the heap before it can truncate. So we have to observe the XIDs
> before vacuum removes them, but then vacuum has to truncate before we
> look them up. But since we observe them and look them up while holding
> a ShareLock on the buffer, this seems impossible. What's the flaw in
> this argument?

The page could have been wrongly marked all-frozen. There could be
problematic interactions between the heap and the toast table being
checked. Other bugs could apply as well, such as a broken HOT chain.


> If we do need to do something special in terms of advertising xmin,
> how would you do it? Normally it happens by registering a snapshot,
> but here all we would have is an XID; specifically, the value of
> relfrozenxid that we observed.

An appropriate procarray or snapmgr function would probably suffice?

Greetings,

Andres Freund



Re: new heapcheck contrib module

Robert Haas
On Fri, Jul 31, 2020 at 3:05 PM Andres Freund <[hidden email]> wrote:
> The page could have been wrongly marked all-frozen. There could be
> interactions between heap and toast table that are checked. Other bugs
> could apply, like a broken hot chain or such.

OK, at least the first two of these do sound like problems. Not sure
about the third one.

> > If we do need to do something special in terms of advertising xmin,
> > how would you do it? Normally it happens by registering a snapshot,
> > but here all we would have is an XID; specifically, the value of
> > relfrozenxid that we observed.
>
> An appropriate procarray or snapmgr function would probably suffice?

Not sure; I guess that'll need some investigation.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

