new heapcheck contrib module

new heapcheck contrib module

Mark Dilger-5
Hackers,

I have been talking with Robert about table corruption that occurs from time to time. The page checksum feature seems sufficient to detect most random corruption problems, but it can't detect "logical" corruption, where the page is valid but inconsistent with the rest of the database cluster. This can happen due to faulty or ill-conceived backup and restore tools, or bad storage, or user error, or bugs in the server itself. (Also, not everyone enables checksums.)

The attached module provides the means to scan a relation and sanity check it. Currently, it checks xmin and xmax values against relfrozenxid and relminmxid, and also validates TOAST pointers. If people like this, it could be expanded to perform additional checks.

A prior v1 patch was discussed off-list with Robert but not posted. Here is v2:





Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Attachment: v2-0001-Adding-heapcheck-contrib-module.patch (85K)

Re: new heapcheck contrib module

Peter Geoghegan-4
On Mon, Apr 20, 2020 at 10:59 AM Mark Dilger
<[hidden email]> wrote:
> The attached module provides the means to scan a relation and sanity check it. Currently, it checks xmin and xmax values against relfrozenxid and relminmxid, and also validates TOAST pointers. If people like this, it could be expanded to perform additional checks.

Cool. Why not make it part of contrib/amcheck?

We talked about the kinds of checks that we'd like to have for a tool
like this before:

https://postgr.es/m/20161017014605.GA1220186@...

--
Peter Geoghegan



Re: new heapcheck contrib module

Robert Haas
On Mon, Apr 20, 2020 at 2:09 PM Peter Geoghegan <[hidden email]> wrote:
> Cool. Why not make it part of contrib/amcheck?

I wondered if people would suggest that. Didn't take long.

The documentation would need some updating, but that's doable.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: new heapcheck contrib module

Peter Geoghegan-4
On Mon, Apr 20, 2020 at 11:19 AM Robert Haas <[hidden email]> wrote:
> I wondered if people would suggest that. Didn't take long.

You were the one who pointed out that my first version of
contrib/amcheck, which was called "contrib/btreecheck", should have
a more general name. And rightly so!

The basic interface used for the heap checker functions is very
similar to what amcheck already offers for B-Tree indexes, so it seems
natural to distribute them together.

IMV, the problem that we have with amcheck is that it's too hard to
use in a top down kind of way. Perhaps there is an opportunity to
provide a more top-down interface to an expanded version of amcheck
that does heap checking. Something with a high level practical focus,
in addition to the low level functions. I'm not saying that Mark
should be required to solve that problem, but it certainly seems worth
considering now.

> The documentation would need some updating, but that's doable.

It would also probably need a bit of renaming, so that analogous
function names are used.


--
Peter Geoghegan



Re: new heapcheck contrib module

Mark Dilger-5


> On Apr 20, 2020, at 11:31 AM, Peter Geoghegan <[hidden email]> wrote:
>
> IMV, the problem that we have with amcheck is that it's too hard to
> use in a top down kind of way. Perhaps there is an opportunity to
> provide a more top-down interface to an expanded version of amcheck
> that does heap checking. Something with a high level practical focus,
> in addition to the low level functions. I'm not saying that Mark
> should be required to solve that problem, but it certainly seems worth
> considering now.

Thanks for your quick response and interest in this submission!

Can you elaborate on "top-down"?  I'm not sure what that means in this context.

I don't mind going further with this project if I understand what you are suggesting.


Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company






Re: new heapcheck contrib module

Peter Geoghegan-4
I mean an interface that's friendly to DBAs, that verifies an entire database. No custom sql query required. Something that provides a reasonable mix of verification options based on high level directives. All verification methods can be combined in a granular, possibly randomized fashion. Maybe we can make this run in parallel. 

For example, maybe your heap checker code sometimes does index probes for a subset of indexes and heap tuples. It's not hard to combine it with the rootdescend stuff from amcheck. It should be composable. 

The interface you've chosen is a good starting point. But let's not miss an opportunity to make everything work together. 

Peter Geoghegan
(Sent from my phone)

Re: new heapcheck contrib module

Mark Dilger-5


> On Apr 20, 2020, at 12:37 PM, Peter Geoghegan <[hidden email]> wrote:
>
> I mean an interface that's friendly to DBAs, that verifies an entire database. No custom sql query required. Something that provides a reasonable mix of verification options based on high level directives. All verification methods can be combined in a granular, possibly randomized fashion. Maybe we can make this run in parallel.
>
> For example, maybe your heap checker code sometimes does index probes for a subset of indexes and heap tuples. It's not hard to combine it with the rootdescend stuff from amcheck. It should be composable.
>
> The interface you've chosen is a good starting point. But let's not miss an opportunity to make everything work together.

Ok, I'll work in that direction and repost when I have something along those lines.

Thanks again for your input.


Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company






Re: new heapcheck contrib module

Andres Freund
In reply to this post by Mark Dilger-5
Hi,

On 2020-04-20 10:59:28 -0700, Mark Dilger wrote:
> I have been talking with Robert about table corruption that occurs
> from time to time. The page checksum feature seems sufficient to
> detect most random corruption problems, but it can't detect "logical"
> corruption, where the page is valid but inconsistent with the rest of
> the database cluster. This can happen due to faulty or ill-conceived
> backup and restore tools, or bad storage, or user error, or bugs in
> the server itself. (Also, not everyone enables checksums.)

This is something we really really really need. I'm very excited to see
progress!


> From 2a1bc0bb9fa94bd929adc1a408900cb925ebcdd5 Mon Sep 17 00:00:00 2001
> From: Mark Dilger <[hidden email]>
> Date: Mon, 20 Apr 2020 08:05:58 -0700
> Subject: [PATCH v2] Adding heapcheck contrib module.
>
> The heapcheck module introduces a new function for checking a heap
> relation and associated toast relation, if any, for corruption.

Why not add it to amcheck?


I wonder if a mode where heapcheck optionally would only check
non-frozen (perhaps also non-all-visible) regions of a table would be a
good idea? Would make it a lot more viable to run this regularly on
bigger databases. Even if there's a window to not check some data
(because it's frozen before the next heapcheck run).
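The mode being floated here amounts to consulting per-block state (as the visibility map records all-visible and all-frozen bits) and skipping blocks that cannot contain unfrozen tuples. A simplified, self-contained sketch of the idea, with a plain byte array standing in for the visibility map (the names and bit values are stand-ins, not the server's):

```c
#include <assert.h>
#include <stdbool.h>

#define VM_ALL_VISIBLE 0x01     /* stand-ins for the visibility map bits */
#define VM_ALL_FROZEN  0x02

/*
 * Count how many blocks a check scan would visit, skipping blocks the
 * (stand-in) visibility map marks all-frozen -- and, in the more
 * aggressive mode, also all-visible ones.
 */
static int
blocks_to_check(const unsigned char *vm, int nblocks, bool skip_all_visible)
{
    int count = 0;

    for (int blkno = 0; blkno < nblocks; blkno++)
    {
        if (vm[blkno] & VM_ALL_FROZEN)
            continue;           /* nothing unfrozen here */
        if (skip_all_visible && (vm[blkno] & VM_ALL_VISIBLE))
            continue;
        count++;
    }
    return count;
}
```

On a mostly-frozen table this skips the bulk of the relation, at the cost of the window described above where frozen-but-corrupt data is never revisited.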


> The attached module provides the means to scan a relation and sanity
> check it. Currently, it checks xmin and xmax values against
> relfrozenxid and relminmxid, and also validates TOAST pointers. If
> people like this, it could be expanded to perform additional checks.


> The postgres backend already defends against certain forms of
> corruption, by checking the page header of each page before allowing
> it into the page cache, and by checking the page checksum, if enabled.
> Experience shows that broken or ill-conceived backup and restore
> mechanisms can result in a page, or an entire file, being overwritten
> with an earlier version of itself, restored from backup.  Pages thus
> overwritten will appear to have valid page headers and checksums,
> while potentially containing xmin, xmax, and toast pointers that are
> invalid.

We also had a *lot* of bugs that we'd have found a lot earlier, possibly
even during development, if we had a way to easily perform these checks.


> contrib/heapcheck introduces a function, heapcheck_relation, that
> takes a regclass argument, scans the given heap relation, and returns
> rows containing information about corruption found within the table.
> The main focus of the scan is to find invalid xmin, xmax, and toast
> pointer values.  It also checks for structural corruption within the
> page (such as invalid t_hoff values) that could lead to the backend
> aborting should the function blindly trust the data as it finds it.


> +typedef struct CorruptionInfo
> +{
> + BlockNumber blkno;
> + OffsetNumber offnum;
> + int16 lp_off;
> + int16 lp_flags;
> + int16 lp_len;
> + int32 attnum;
> + int32 chunk;
> + char   *msg;
> +} CorruptionInfo;

Adding a short comment explaining what this is for would be good.


> +/* Internal implementation */
> +void record_corruption(HeapCheckContext * ctx, char *msg);
> +TupleDesc heapcheck_relation_tupdesc(void);
> +
> +void beginRelBlockIteration(HeapCheckContext * ctx);
> +bool relBlockIteration_next(HeapCheckContext * ctx);
> +void endRelBlockIteration(HeapCheckContext * ctx);
> +
> +void beginPageTupleIteration(HeapCheckContext * ctx);
> +bool pageTupleIteration_next(HeapCheckContext * ctx);
> +void endPageTupleIteration(HeapCheckContext * ctx);
> +
> +void beginTupleAttributeIteration(HeapCheckContext * ctx);
> +bool tupleAttributeIteration_next(HeapCheckContext * ctx);
> +void endTupleAttributeIteration(HeapCheckContext * ctx);
> +
> +void beginToastTupleIteration(HeapCheckContext * ctx,
> + struct varatt_external *toast_pointer);
> +void endToastTupleIteration(HeapCheckContext * ctx);
> +bool toastTupleIteration_next(HeapCheckContext * ctx);
> +
> +bool TransactionIdStillValid(TransactionId xid, FullTransactionId *fxid);
> +bool HeapTupleIsVisible(HeapTupleHeader tuphdr, HeapCheckContext * ctx);
> +void check_toast_tuple(HeapCheckContext * ctx);
> +bool check_tuple_attribute(HeapCheckContext * ctx);
> +void check_tuple(HeapCheckContext * ctx);
> +
> +List   *check_relation(Oid relid);
> +void check_relation_relkind(Relation rel);

Why aren't these static?


> +/*
> + * record_corruption
> + *
> + *   Record a message about corruption, including information
> + *   about where in the relation the corruption was found.
> + */
> +void
> +record_corruption(HeapCheckContext * ctx, char *msg)
> +{

Given that you went through the trouble of adding prototypes for all of
these, I'd start with the most important functions, not the unimportant
details.


> +/*
> + * Helper function to construct the TupleDesc needed by heapcheck_relation.
> + */
> +TupleDesc
> +heapcheck_relation_tupdesc()

Missing (void) (it's our style, even though you could theoretically not
have it as long as you have a prototype).


> +{
> + TupleDesc tupdesc;
> + AttrNumber maxattr = 8;

This 8 is in multiple places, I'd add a define for it.

> + AttrNumber a = 0;
> +
> + tupdesc = CreateTemplateTupleDesc(maxattr);
> + TupleDescInitEntry(tupdesc, ++a, "blkno", INT8OID, -1, 0);
> + TupleDescInitEntry(tupdesc, ++a, "offnum", INT4OID, -1, 0);
> + TupleDescInitEntry(tupdesc, ++a, "lp_off", INT2OID, -1, 0);
> + TupleDescInitEntry(tupdesc, ++a, "lp_flags", INT2OID, -1, 0);
> + TupleDescInitEntry(tupdesc, ++a, "lp_len", INT2OID, -1, 0);
> + TupleDescInitEntry(tupdesc, ++a, "attnum", INT4OID, -1, 0);
> + TupleDescInitEntry(tupdesc, ++a, "chunk", INT4OID, -1, 0);
> + TupleDescInitEntry(tupdesc, ++a, "msg", TEXTOID, -1, 0);
> + Assert(a == maxattr);
> +
> + return BlessTupleDesc(tupdesc);
> +}


> +/*
> + * heapcheck_relation
> + *
> + *   Scan and report corruption in heap pages or in associated toast relation.
> + */
> +Datum
> +heapcheck_relation(PG_FUNCTION_ARGS)
> +{
> + FuncCallContext *funcctx;
> + CheckRelCtx *ctx;
> +
> + if (SRF_IS_FIRSTCALL())
> + {

I think it'd be good to have a version that just returned a boolean. For
one, in many cases that's all we care about when scripting things. But
also, on a large relation, there could be a lot of errors.


> + Oid relid = PG_GETARG_OID(0);
> + MemoryContext oldcontext;
> +
> + /*
> + * Scan the entire relation, building up a list of corruption found in
> + * ctx->corruption, for returning later.  The scan must be performed
> + * in a memory context that will survive until after all rows are
> + * returned.
> + */
> + funcctx = SRF_FIRSTCALL_INIT();
> + oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx);
> + funcctx->tuple_desc = heapcheck_relation_tupdesc();
> + ctx = (CheckRelCtx *) palloc0(sizeof(CheckRelCtx));
> + ctx->corruption = check_relation(relid);
> + ctx->idx = 0; /* start the iterator at the beginning */
> + funcctx->user_fctx = (void *) ctx;
> + MemoryContextSwitchTo(oldcontext);

Hm. This builds up all the errors in memory. Is that a good idea? I mean
for a large relation having one returned value for each tuple could be a
heck of a lot of data.

I think it'd be better to use the spilling SRF protocol here.  It's not
like you're benefitting from deferring the tuple construction to the
return currently.


> +/*
> + * beginRelBlockIteration
> + *
> + *   For the given heap relation being checked, as recorded in ctx, sets up
> + *   variables for iterating over the heap's pages.
> + *
> + *   The caller should have already opened the heap relation, ctx->rel
> + */
> +void
> +beginRelBlockIteration(HeapCheckContext * ctx)
> +{
> + ctx->nblocks = RelationGetNumberOfBlocks(ctx->rel);
> + ctx->blkno = InvalidBlockNumber;
> + ctx->bstrategy = GetAccessStrategy(BAS_BULKREAD);
> + ctx->buffer = InvalidBuffer;
> + ctx->page = NULL;
> +}
> +
> +/*
> + * endRelBlockIteration
> + *
> + *   Releases resources that were reserved by either beginRelBlockIteration or
> + *   relBlockIteration_next.
> + */
> +void
> +endRelBlockIteration(HeapCheckContext * ctx)
> +{
> + /*
> + * Clean up.  If the caller iterated to the end, the final call to
> + * relBlockIteration_next will already have released the buffer, but if
> + * the caller is bailing out early, we have to release it ourselves.
> + */
> + if (InvalidBuffer != ctx->buffer)
> + UnlockReleaseBuffer(ctx->buffer);
> +}

These seem mighty granular and generically named to me.


> + * pageTupleIteration_next
> + *
> + *   Advances the state tracked in ctx to the next tuple on the page.
> + *
> + *   Caller should have already set up the iteration via
> + *   beginPageTupleIteration, and should stop calling when this function
> + *   returns false.
> + */
> +bool
> +pageTupleIteration_next(HeapCheckContext * ctx)

I don't think this is a naming scheme we use anywhere in postgres. I
don't think it's a good idea to add yet more of those.


> +{
> + /*
> + * Iterate to the next interesting line pointer, if any. Unused, dead and
> + * redirect line pointers are of no interest.
> + */
> + do
> + {
> + ctx->offnum = OffsetNumberNext(ctx->offnum);
> + if (ctx->offnum > ctx->maxoff)
> + return false;
> + ctx->itemid = PageGetItemId(ctx->page, ctx->offnum);
> + } while (!ItemIdIsUsed(ctx->itemid) ||
> + ItemIdIsDead(ctx->itemid) ||
> + ItemIdIsRedirected(ctx->itemid));

This is an odd loop. Part of the test is in the body, part of in the
loop header.
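One way to keep the whole filter in one place is a for/continue loop. A simplified, self-contained sketch (an enum stands in for real line pointers and the ItemIdIs* macros):

```c
#include <assert.h>

/* Stand-in for line pointer state; the server encodes this in
 * ItemIdData and tests it with the ItemIdIs* macros. */
typedef enum { LP_UNUSED, LP_NORMAL, LP_REDIRECT, LP_DEAD } LpState;

/*
 * Advance *offnum to the next interesting line pointer (offsets start
 * at 1, as on a heap page) and return it, or return 0 when there are
 * no more.  Unused, dead and redirect line pointers are of no interest,
 * and the entire filter lives inside the loop body.
 */
static int
next_interesting_lp(const LpState *lps, int maxoff, int *offnum)
{
    for ((*offnum)++; *offnum <= maxoff; (*offnum)++)
    {
        if (lps[*offnum] != LP_NORMAL)
            continue;           /* unused, dead or redirect: skip */
        return *offnum;
    }
    return 0;
}
```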


> +/*
> + * Given a TransactionId, attempt to interpret it as a valid
> + * FullTransactionId, neither in the future nor overlong in
> + * the past.  Stores the inferred FullTransactionId in *fxid.
> + *
> + * Returns whether the xid is newer than the oldest clog xid.
> + */
> +bool
> +TransactionIdStillValid(TransactionId xid, FullTransactionId *fxid)

I don't at all like the naming of this function. This isn't a reliable
check. As before, it obviously also shouldn't be static.


> +{
> + FullTransactionId fnow;
> + uint32 epoch;
> +
> + /* Initialize fxid; we'll overwrite this later if needed */
> + *fxid = FullTransactionIdFromEpochAndXid(0, xid);

> + /* Special xids can quickly be turned into invalid fxids */
> + if (!TransactionIdIsValid(xid))
> + return false;
> + if (!TransactionIdIsNormal(xid))
> + return true;
> +
> + /*
> + * Charitably infer the full transaction id as being within one epoch ago
> + */
> + fnow = ReadNextFullTransactionId();
> + epoch = EpochFromFullTransactionId(fnow);
> + *fxid = FullTransactionIdFromEpochAndXid(epoch, xid);

So now you're overwriting the fxid value from above unconditionally?


> + if (!FullTransactionIdPrecedes(*fxid, fnow))
> + *fxid = FullTransactionIdFromEpochAndXid(epoch - 1, xid);


I think it'd be better to do the conversion the following way:

    *fxid = FullTransactionIdFromU64(U64FromFullTransactionId(fnow)
                                    + (int32) (xid - XidFromFullTransactionId(fnow)));
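The conversion treats the 32-bit xid as a signed circular distance from the next full transaction id, which is valid as long as the xid lies within 2^31 transactions of it. A self-contained sketch of that arithmetic, using plain integers as stand-ins for the server's FullTransactionId macros:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Widen a 32-bit xid to a 64-bit "full" xid, assuming it lies within
 * 2^31 transactions of next_fxid (the next full transaction id).
 * Stand-in for the FullTransactionIdFromU64 / XidFromFullTransactionId
 * macros quoted above.
 */
static uint64_t
widen_xid(uint64_t next_fxid, uint32_t xid)
{
    /* signed circular distance from next_fxid's low 32 bits */
    int32_t diff = (int32_t) (xid - (uint32_t) next_fxid);

    /* modular 64-bit addition sign-extends diff, so a recent xid in
     * the previous epoch correctly lands below next_fxid */
    return next_fxid + (uint64_t) (int64_t) diff;
}
```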


> + if (!FullTransactionIdPrecedes(*fxid, fnow))
> + return false;
> + /* The oldestClogXid is protected by CLogTruncationLock */
> + Assert(LWLockHeldByMe(CLogTruncationLock));
> + if (TransactionIdPrecedes(xid, ShmemVariableCache->oldestClogXid))
> + return false;
> + return true;
> +}

Why is this testing oldestClogXid instead of oldestXid?


> +/*
> + * HeapTupleIsVisible
> + *
> + * Determine whether tuples are visible for heapcheck.  Similar to
> + *  HeapTupleSatisfiesVacuum, but with critical differences.
> + *
> + *  1) Does not touch hint bits.  It seems imprudent to write hint bits
> + *     to a table during a corruption check.
> + *  2) Gracefully handles xids that are too old by calling
> + *     TransactionIdStillValid before TransactionLogFetch, thus avoiding
> + *     a backend abort.

I think it'd be better to protect against this by avoiding checks for
xids that are older than relfrozenxid. And ones that are newer than
ReadNextTransactionId().  But all of those cases should be errors
anyway, so it doesn't seem like that should be handled within the
visibility routine.


> + *  3) Only makes a boolean determination of whether heapcheck should
> + *     see the tuple, rather than doing extra work for vacuum-related
> + *     categorization.
> + */
> +bool
> +HeapTupleIsVisible(HeapTupleHeader tuphdr, HeapCheckContext * ctx)
> +{

> + FullTransactionId fxmin,
> + fxmax;
> + uint16 infomask = tuphdr->t_infomask;
> + TransactionId xmin = HeapTupleHeaderGetXmin(tuphdr);
> +
> + if (!HeapTupleHeaderXminCommitted(tuphdr))
> + {

Hm. I wonder if it'd be good to crosscheck the xid committed hint bits
with clog?


> + else if (!TransactionIdDidCommit(HeapTupleHeaderGetRawXmin(tuphdr)))
> + {
> + LWLockRelease(CLogTruncationLock);
> + return false; /* HEAPTUPLE_DEAD */
> + }

Note that this actually can error out, if xmin is a subtransaction xid,
because pg_subtrans is truncated a lot more aggressively than anything
else. I think you'd need to filter against subtransactions older than
RecentXmin before here, and treat that as an error.


> + if (!(infomask & HEAP_XMAX_INVALID) && !HEAP_XMAX_IS_LOCKED_ONLY(infomask))
> + {
> + if (infomask & HEAP_XMAX_IS_MULTI)
> + {
> + TransactionId xmax = HeapTupleGetUpdateXid(tuphdr);
> +
> + /* not LOCKED_ONLY, so it has to have an xmax */
> + if (!TransactionIdIsValid(xmax))
> + {
> + record_corruption(ctx, _("heap tuple with XMAX_IS_MULTI is "
> + "neither LOCKED_ONLY nor has a "
> + "valid xmax"));
> + return false;
> + }

I think it's bad to have code like this in a routine that's named like a
generic visibility check routine.


> + if (TransactionIdIsInProgress(xmax))
> + return false; /* HEAPTUPLE_DELETE_IN_PROGRESS */
> +
> + LWLockAcquire(CLogTruncationLock, LW_SHARED);
> + if (!TransactionIdStillValid(xmax, &fxmax))
> + {
> + LWLockRelease(CLogTruncationLock);
> + record_corruption(ctx, psprintf("tuple xmax = %u (interpreted "
> + "as " UINT64_FORMAT
> + ") not or no longer valid",
> + xmax, fxmax.value));
> + return false;
> + }
> + else if (TransactionIdDidCommit(xmax))
> + {
> + LWLockRelease(CLogTruncationLock);
> + return false; /* HEAPTUPLE_RECENTLY_DEAD or HEAPTUPLE_DEAD */
> + }
> + LWLockRelease(CLogTruncationLock);
> + /* Ok, the tuple is live */

I don't think random interspersed uses of CLogTruncationLock are a good
idea. If you move to only checking visibility after tuple fits into
[relfrozenxid, nextXid), then you don't need to take any locks here, as
long as a lock against vacuum is taken (which I think this should do
anyway).
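The suggested precondition, confirming a tuple xid falls in [relfrozenxid, nextXid) before consulting clog at all, can be sketched with modulo-2^32 comparisons in the style of TransactionIdPrecedes (stand-in code, not the server's routines):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define FIRST_NORMAL_XID 3      /* xids 0-2 are special in PostgreSQL */

/* Modulo-2^32 "a < b", valid when a and b are normal xids within
 * 2^31 transactions of each other (the TransactionIdPrecedes rule). */
static bool
xid_precedes(uint32_t a, uint32_t b)
{
    return (int32_t) (a - b) < 0;
}

/*
 * Does xid lie in [relfrozenxid, next_xid)?  A tuple header xid
 * outside this range should be reported as corruption rather than
 * passed on to clog lookups, which might error out or read garbage.
 */
static bool
xid_in_checkable_range(uint32_t xid, uint32_t relfrozenxid, uint32_t next_xid)
{
    if (xid < FIRST_NORMAL_XID)
        return false;           /* special xids are handled separately */
    return !xid_precedes(xid, relfrozenxid) && xid_precedes(xid, next_xid);
}
```

With that precheck in place, every xid that reaches the visibility logic is already known to be resolvable, so the per-tuple lock traffic discussed here disappears.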


> +/*
> + * check_tuple
> + *
> + *   Checks the current tuple as tracked in ctx for corruption.  Records any
> + *   corruption found in ctx->corruption.
> + *
> + *   The caller should have iterated to a tuple via pageTupleIteration_next.
> + */
> +void
> +check_tuple(HeapCheckContext * ctx)
> +{
> + bool fatal = false;

Wait, aren't some checks here duplicate with ones in
HeapTupleIsVisible()?


> + /* Check relminmxid against mxid, if any */
> + if (ctx->infomask & HEAP_XMAX_IS_MULTI &&
> + MultiXactIdPrecedes(ctx->xmax, ctx->relminmxid))
> + {
> + record_corruption(ctx, psprintf("tuple xmax = %u precedes relation "
> + "relminmxid = %u",
> + ctx->xmax, ctx->relminmxid));
> + }

It's pretty weird that the routines here access xmin/xmax/... via
HeapCheckContext, but HeapTupleIsVisible() doesn't.


> + /* Check xmin against relfrozenxid */
> + if (TransactionIdIsNormal(ctx->relfrozenxid) &&
> + TransactionIdIsNormal(ctx->xmin) &&
> + TransactionIdPrecedes(ctx->xmin, ctx->relfrozenxid))
> + {
> + record_corruption(ctx, psprintf("tuple xmin = %u precedes relation "
> + "relfrozenxid = %u",
> + ctx->xmin, ctx->relfrozenxid));
> + }
> +
> + /* Check xmax against relfrozenxid */
> + if (TransactionIdIsNormal(ctx->relfrozenxid) &&
> + TransactionIdIsNormal(ctx->xmax) &&
> + TransactionIdPrecedes(ctx->xmax, ctx->relfrozenxid))
> + {
> + record_corruption(ctx, psprintf("tuple xmax = %u precedes relation "
> + "relfrozenxid = %u",
> + ctx->xmax, ctx->relfrozenxid));
> + }

These all should be fatal. You definitely cannot just continue
afterwards, given the justification below:


> + /*
> + * Iterate over the attributes looking for broken toast values. This
> + * roughly follows the logic of heap_deform_tuple, except that it doesn't
> + * bother building up isnull[] and values[] arrays, since nobody wants
> + * them, and it unrolls anything that might trip over an Assert when
> + * processing corrupt data.
> + */
> + beginTupleAttributeIteration(ctx);
> + while (tupleAttributeIteration_next(ctx) &&
> +   check_tuple_attribute(ctx))
> + ;
> + endTupleAttributeIteration(ctx);
> +}

I really don't find these helpers helpful.


> +/*
> + * check_relation
> + *
> + *   Checks the relation given by relid for corruption, returning a list of all
> + *   it finds.
> + *
> + *   The caller should set up the memory context as desired before calling.
> + *   The returned list belongs to the caller.
> + */
> +List *
> +check_relation(Oid relid)
> +{
> + HeapCheckContext ctx;
> +
> + memset(&ctx, 0, sizeof(HeapCheckContext));
> +
> + /* Open the relation */
> + ctx.relid = relid;
> + ctx.corruption = NIL;
> + ctx.rel = relation_open(relid, AccessShareLock);

I think you need to protect at least against concurrent schema changes
given some of your checks. But I think it'd be better to also conflict
with vacuum here.


> + check_relation_relkind(ctx.rel);

I think you also need to ensure that the table is actually using heap
AM, not another tableam. Oh - you're doing that inside the check. But
that's confusing, because that's not 'relkind'.


> + ctx.relDesc = RelationGetDescr(ctx.rel);
> + ctx.rel_natts = RelationGetDescr(ctx.rel)->natts;
> + ctx.relfrozenxid = ctx.rel->rd_rel->relfrozenxid;
> + ctx.relminmxid = ctx.rel->rd_rel->relminmxid;

three naming schemes in three lines...



> + /* check all blocks of the relation */
> + beginRelBlockIteration(&ctx);
> + while (relBlockIteration_next(&ctx))
> + {
> + /* Perform tuple checks */
> + beginPageTupleIteration(&ctx);
> + while (pageTupleIteration_next(&ctx))
> + check_tuple(&ctx);
> + endPageTupleIteration(&ctx);
> + }
> + endRelBlockIteration(&ctx);

I again do not find this helper stuff helpful.


> + /* Close the associated toast table and indexes, if any. */
> + if (ctx.has_toastrel)
> + {
> + toast_close_indexes(ctx.toast_indexes, ctx.num_toast_indexes,
> + AccessShareLock);
> + table_close(ctx.toastrel, AccessShareLock);
> + }
> +
> + /* Close the main relation */
> + relation_close(ctx.rel, AccessShareLock);

Why the closing here?



> +# This regression test demonstrates that the heapcheck_relation() function
> +# supplied with this contrib module correctly identifies specific kinds of
> +# corruption within pages.  To test this, we need a mechanism to create corrupt
> +# pages with predictable, repeatable corruption.  The postgres backend cannot be
> +# expected to help us with this, as its design is not consistent with the goal
> +# of intentionally corrupting pages.
> +#
> +# Instead, we create a table to corrupt, and with careful consideration of how
> +# postgresql lays out heap pages, we seek to offsets within the page and
> +# overwrite deliberately chosen bytes with specific values calculated to
> +# corrupt the page in expected ways.  We then verify that heapcheck_relation
> +# reports the corruption, and that it runs without crashing.  Note that the
> +# backend cannot simply be started to run queries against the corrupt table, as
> +# the backend will crash, at least for some of the corruption types we
> +# generate.
> +#
> +# Autovacuum potentially touching the table in the background makes the exact
> +# behavior of this test harder to reason about.  We turn it off to keep things
> +# simpler.  We use a "belt and suspenders" approach, turning it off for the
> +# system generally in postgresql.conf, and turning it off specifically for the
> +# test table.
> +#
> +# This test depends on the table being written to the heap file exactly as we
> +# expect it to be, so we take care to arrange the columns of the table, and
> +# insert rows of the table, that give predictable sizes and locations within
> +# the table page.

I have a hard time believing this is going to be really
reliable. E.g. the alignment requirements will vary between platforms,
leading to different layouts. In particular, MAXALIGN differs between
platforms.

Also, it's supported to compile postgres with a different pagesize.
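The platform dependence is easy to demonstrate: the same attribute sequence lands at different offsets under 4-byte versus 8-byte MAXALIGN. A stand-in for the server's TYPEALIGN macro makes the point:

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for the server's TYPEALIGN macro: round len up to the
 * next multiple of alignval (a power of two). */
static uintptr_t
type_align(uintptr_t alignval, uintptr_t len)
{
    return (len + alignval - 1) & ~(alignval - 1);
}

/*
 * Offset of a maxaligned attribute that follows 9 bytes of preceding
 * tuple data: the result differs between platforms whose MAXALIGN is
 * 4 and those where it is 8, so byte offsets hardcoded in a test will
 * only match one of them.
 */
static uintptr_t
next_attr_offset(uintptr_t maxalign)
{
    return type_align(maxalign, 9);
}
```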


Greetings,

Andres Freund



Re: new heapcheck contrib module

Robert Haas
[ retrying from the email address I intended to use ]

On Mon, Apr 20, 2020 at 3:42 PM Andres Freund <[hidden email]> wrote:
> I don't think random interspersed uses of CLogTruncationLock are a good
> idea. If you move to only checking visibility after tuple fits into
> [relfrozenxid, nextXid), then you don't need to take any locks here, as
> long as a lock against vacuum is taken (which I think this should do
> anyway).

I think it would be *really* good to avoid ShareUpdateExclusiveLock
here. Running with only AccessShareLock would be a big advantage. I
agree that any use of CLogTruncationLock should not be "random", but I
don't see why the same method we use to make txid_status() safe to
expose to SQL shouldn't also be used here.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: new heapcheck contrib module

Andres Freund
In reply to this post by Andres Freund
Hi,

On 2020-04-20 15:59:49 -0400, Robert Haas wrote:

> On Mon, Apr 20, 2020 at 3:42 PM Andres Freund <[hidden email]> wrote:
> > I don't think random interspersed uses of CLogTruncationLock are a good
> > idea. If you move to only checking visibility after tuple fits into
> > [relfrozenxid, nextXid), then you don't need to take any locks here, as
> > long as a lock against vacuum is taken (which I think this should do
> > anyway).
>
> I think it would be *really* good to avoid ShareUpdateExclusiveLock
> here. Running with only AccessShareLock would be a big advantage. I
> agree that any use of CLogTruncationLock should not be "random", but I
> don't see why the same method we use to make txid_status() safe to
> expose to SQL shouldn't also be used here.

A few billion CLogTruncationLock acquisitions in short order will likely
have at least as big an impact as ShareUpdateExclusiveLock held for the
duration of the check. That's not really a relevant concern for
txid_status().  Per-tuple lock acquisitions aren't great.

I think it might be doable to not need either. E.g. we could set the
checking backend's xmin to relfrozenxid, and set something like
PROC_IN_VACUUM. That should, I think, prevent clog from being truncated
in a problematic way (clog truncations look at PROC_IN_VACUUM backends),
while not blocking vacuum.

The similar concern for ReadNewTransactionId() can probably more easily
be addressed, by only calling ReadNewTransactionId() when encountering
an xid that's newer than the last value read.
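The caching pattern being suggested only consults the expensive, lock-taking source when an xid is not already known to be below the cached horizon. A hedged sketch, with a stand-in reader in place of ReadNewTransactionId() so the call count is observable:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for ReadNewTransactionId(); counts calls so the caching
 * behavior can be checked. */
static uint32_t fake_next_xid = 1000;
static int      next_xid_reads = 0;

static uint32_t
read_next_xid(void)
{
    next_xid_reads++;
    return fake_next_xid;
}

/*
 * Return whether xid is older than the next xid, refreshing the cached
 * horizon only when xid is not already below it.  For a scan that
 * mostly sees old xids, this collapses per-tuple reads into a handful.
 */
static bool
xid_is_not_in_future(uint32_t xid, uint32_t *cached_next_xid)
{
    if ((int32_t) (xid - *cached_next_xid) < 0)
        return true;            /* below cached horizon; no refresh */
    *cached_next_xid = read_next_xid();
    return (int32_t) (xid - *cached_next_xid) < 0;
}
```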


I think it'd be good to set PROC_IN_VACUUM (or maybe a separate version
of it) while checking anyway. Reading the full relation can take quite a
while, and we shouldn't prevent hot pruning while doing so.


There are some things we'd need to figure out to be able to use
PROC_IN_VACUUM, as that's really only safe in some
circumstances. Possibly it'd be easiest to address that if we'd make the
check a procedure...

Greetings,

Andres Freund



Re: new heapcheck contrib module

Peter Geoghegan-4
In reply to this post by Andres Freund
On Mon, Apr 20, 2020 at 12:42 PM Andres Freund <[hidden email]> wrote:
> This is something we really really really need. I'm very excited to see
> progress!

+1

My experience with amcheck has been that the requirement that we document
and verify pretty much every invariant (the details of which differ
slightly based on the B-Tree version in use) has had intangible
benefits. It helped me come up with a simpler, better design in the
first place. Also, many of the benchmarks that I perform get to be a
stress-test of the feature itself. It saves quite a lot of testing
work in the long run.

> I wonder if a mode where heapcheck optionally would only check
> non-frozen (perhaps also non-all-visible) regions of a table would be a
> good idea? Would make it a lot more viable to run this regularly on
> bigger databases. Even if there's a window to not check some data
> (because it's frozen before the next heapcheck run).

That's a great idea. It could also make it practical to use the
rootdescend verification option to verify indexes selectively -- if
you don't have too many blocks to check on average, the overhead is
tolerable. This is the kind of thing that naturally belongs in the
higher level interface that I sketched already.

> We also had a *lot* of bugs that we'd have found a lot earlier, possibly
> even during development, if we had a way to easily perform these checks.

I can think of a case where it was quite unclear what the invariants
for the heap even were, at least temporarily. And this was in the
context of fixing a bug that was really quite nasty. Formally defining
the invariants in one place, and taking a position on exactly what
correct looks like seems like a very valuable exercise. Even without
the tool catching a single bug.

> I have a hard time believing this is going to be really
> reliable. E.g. the alignment requirements will vary between platforms,
> leading to different layouts. In particular, MAXALIGN differs between
> platforms.

Over on another thread, I suggested that Mark might want to have a
corruption test framework that exposes some of the bufpage.c routines.
The idea is that you can destructively manipulate a page using the
logical page interface. Something that works one level below the
access method, but one level above the raw page image. It probably
wouldn't test everything that Mark wants to test, but it would test
some things in a way that seems maintainable to me.

--
Peter Geoghegan



Re: new heapcheck contrib module

Robert Haas
In reply to this post by Andres Freund
On Mon, Apr 20, 2020 at 4:30 PM Andres Freund <[hidden email]> wrote:
> A few billion CLogTruncationLock acquisitions in short order will likely
> have at least as big an impact as ShareUpdateExclusiveLock held for the
> duration of the check. That's not really a relevant concern for
> txid_status().  Per-tuple lock acquisitions aren't great.

Yeah, that's true. Doing it for every tuple is going to be too much, I
think. I was hoping we could avoid that.

> I think it might be doable to not need either. E.g. we could set the
> checking backend's xmin to relfrozenxid, and set something like
> PROC_IN_VACUUM. That should, I think, prevent clog from being truncated
> in a problematic way (clog truncations look at PROC_IN_VACUUM backends),
> while not blocking vacuum.

Hmm, OK, I don't know if that would be OK or not.

> The similar concern for ReadNewTransactionId() can probably more easily
> be addressed, by only calling ReadNewTransactionId() when encountering
> an xid that's newer than the last value read.

Yeah, if we can cache some things to avoid repetitive calls, that would be good.

> I think it'd be good to set PROC_IN_VACUUM (or maybe a separate version
> of it) while checking anyway. Reading the full relation can take quite a
> while, and we shouldn't prevent hot pruning while doing so.
>
> There's some things we'd need to figure out to be able to use
> PROC_IN_VACUUM, as that's really only safe in some
> circumstances. Possibly it'd be easiest to address that if we'd make the
> check a procedure...

I think we sure want to set things up so that we do this check without
holding a snapshot, if we can. Not sure exactly how to get there.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: new heapcheck contrib module

Peter Geoghegan-4
In reply to this post by Mark Dilger-5
On Mon, Apr 20, 2020 at 12:40 PM Mark Dilger
<[hidden email]> wrote:
> Ok, I'll work in that direction and repost when I have something along those lines.

Great, thanks!

It also occurs to me that the B-Tree checks that amcheck already has
have one remaining blindspot: While the heapallindexed verification
option has the ability to detect the absence of an index tuple that
the dummy CREATE INDEX that we perform under the hood says should be
in the index, it cannot do the opposite: It cannot detect the presence
of a malformed tuple that shouldn't be there at all, unless the index
tuple itself is corrupt. That could miss an inconsistent page image
when a few tuples have been VACUUMed away, but still appear in the
index.

In order to do that, we'd have to have something a bit like the
validate_index() heap scan that CREATE INDEX CONCURRENTLY uses. We'd
have to get a list of heap TIDs that any index tuple might be pointing
to, and then make sure that there were no TIDs in the index that were
not in that list -- tuples that were pointing to nothing in the heap
at all. This could use the index_bulk_delete() interface. This is the
kind of verification option that I might work on for debugging
purposes, but not the kind of thing I could really recommend to
ordinary users outside of exceptional cases. This is the kind of thing
that argues for more or less providing all of the verification
functionality we have through both high level and low level
interfaces. This isn't likely to be all that valuable most of the
time, and users shouldn't have to figure that out for themselves the
hard way. (BTW, I think that this could be implemented in an
index-AM-agnostic way, so perhaps you can consider adding it too,
if you have time.)

One last thing for now: take a look at amcheck's
bt_tuple_present_callback() function. It has comments about HOT chain
corruption that you may find interesting. Note that this check played
a role in the "freeze the dead" corruption bug [1] -- it detected that
our initial fix for that was broken. It seems like it would be a good
idea to go back through the reproducers we've seen for some of the
more memorable corruption bugs, and actually make sure that your tool
detects them where that isn't clear. History doesn't repeat itself,
but it often rhymes.

[1] https://postgr.es/m/CAH2-Wznm4rCrhFAiwKPWTpEw2bXDtgROZK7jWWGucXeH3D1fmA@...
--
Peter Geoghegan



Re: new heapcheck contrib module

Peter Geoghegan-4
In reply to this post by Robert Haas
On Mon, Apr 20, 2020 at 1:40 PM Robert Haas <[hidden email]> wrote:
> On Mon, Apr 20, 2020 at 4:30 PM Andres Freund <[hidden email]> wrote:
> > A few billion CLogTruncationLock acquisitions in short order will likely
> > have at least as big an impact as ShareUpdateExclusiveLock held for the
> > duration of the check. That's not really a relevant concern for
> > txid_status().  Per-tuple lock acquisitions aren't great.
>
> Yeah, that's true. Doing it for every tuple is going to be too much, I
> think. I was hoping we could avoid that.

What about the visibility map? It would be nice if pg_visibility was
merged into amcheck, since it mostly provides integrity checking for
the visibility map. Maybe we could just merge the functions that
perform verification, and leave other functions (like
pg_truncate_visibility_map()) where they are. We could keep the
current interface for functions like pg_check_visible(), but also
allow the same verification to occur in passing, as part of a higher
level check.

It wouldn't be so bad if pg_visibility was an expert-only tool. But
ISTM that the verification performed by code like
collect_corrupt_items() could easily take place at the same time as
the new checks that Mark proposes. Possibly only some of the time. It
can work in a totally additive way. (Though like Andres I don't really
like the current "helper" functions used to iterate through a heap
relation; they seem like they'd make this harder.)

--
Peter Geoghegan



Re: new heapcheck contrib module

Mark Dilger-5
In reply to this post by Andres Freund


> On Apr 20, 2020, at 12:42 PM, Andres Freund <[hidden email]> wrote:
>
> Hi,
>
> On 2020-04-20 10:59:28 -0700, Mark Dilger wrote:
>> I have been talking with Robert about table corruption that occurs
>> from time to time. The page checksum feature seems sufficient to
>> detect most random corruption problems, but it can't detect "logical"
>> corruption, where the page is valid but inconsistent with the rest of
>> the database cluster. This can happen due to faulty or ill-conceived
>> backup and restore tools, or bad storage, or user error, or bugs in
>> the server itself. (Also, not everyone enables checksums.)
>
> This is something we really really really need. I'm very excited to see
> progress!
Thanks for the review!

>> From 2a1bc0bb9fa94bd929adc1a408900cb925ebcdd5 Mon Sep 17 00:00:00 2001
>> From: Mark Dilger <[hidden email]>
>> Date: Mon, 20 Apr 2020 08:05:58 -0700
>> Subject: [PATCH v2] Adding heapcheck contrib module.
>>
>> The heapcheck module introduces a new function for checking a heap
>> relation and associated toast relation, if any, for corruption.
>
> Why not add it to amcheck?

That seems to be the general consensus.  The functionality has been moved there, renamed as "verify_heapam", as that seems more in line with the "verify_nbtree" name already present in that module.  The docs have also been moved there, although not very gracefully.  It seems premature to polish the documentation given that the interface will likely change at least one more time, to incorporate more of Peter's suggestions.  There are still design differences between the two implementations that need to be harmonized.  The verify_heapam function returns rows detailing the corruption found, which is inconsistent with how verify_nbtree does things.

> I wonder if a mode where heapcheck optionally would only check
> non-frozen (perhaps also non-all-visible) regions of a table would be a
> good idea? Would make it a lot more viable to run this regularly on
> bigger databases. Even if there's a window to not check some data
> (because it's frozen before the next heapcheck run).

Perhaps we should come back to that.  Version 3 of this patch addresses concerns about the v2 patch without adding too many new features.

>> The attached module provides the means to scan a relation and sanity
>> check it. Currently, it checks xmin and xmax values against
>> relfrozenxid and relminmxid, and also validates TOAST pointers. If
>> people like this, it could be expanded to perform additional checks.
>
>
>> The postgres backend already defends against certain forms of
>> corruption, by checking the page header of each page before allowing
>> it into the page cache, and by checking the page checksum, if enabled.
>> Experience shows that broken or ill-conceived backup and restore
>> mechanisms can result in a page, or an entire file, being overwritten
>> with an earlier version of itself, restored from backup.  Pages thus
>> overwritten will appear to have valid page headers and checksums,
>> while potentially containing xmin, xmax, and toast pointers that are
>> invalid.
>
> We also had a *lot* of bugs that we'd have found a lot earlier, possibly
> even during development, if we had a way to easily perform these checks.
I certainly hope this is useful for testing.

>> contrib/heapcheck introduces a function, heapcheck_relation, that
>> takes a regclass argument, scans the given heap relation, and returns
>> rows containing information about corruption found within the table.
>> The main focus of the scan is to find invalid xmin, xmax, and toast
>> pointer values.  It also checks for structural corruption within the
>> page (such as invalid t_hoff values) that could lead to the backend
>> aborting should the function blindly trust the data as it finds it.
>
>
>> +typedef struct CorruptionInfo
>> +{
>> + BlockNumber blkno;
>> + OffsetNumber offnum;
>> + int16 lp_off;
>> + int16 lp_flags;
>> + int16 lp_len;
>> + int32 attnum;
>> + int32 chunk;
>> + char   *msg;
>> +} CorruptionInfo;
>
> Adding a short comment explaining what this is for would be good.
This struct has been removed.

>> +/* Internal implementation */
>> +void record_corruption(HeapCheckContext * ctx, char *msg);
>> +TupleDesc heapcheck_relation_tupdesc(void);
>> +
>> +void beginRelBlockIteration(HeapCheckContext * ctx);
>> +bool relBlockIteration_next(HeapCheckContext * ctx);
>> +void endRelBlockIteration(HeapCheckContext * ctx);
>> +
>> +void beginPageTupleIteration(HeapCheckContext * ctx);
>> +bool pageTupleIteration_next(HeapCheckContext * ctx);
>> +void endPageTupleIteration(HeapCheckContext * ctx);
>> +
>> +void beginTupleAttributeIteration(HeapCheckContext * ctx);
>> +bool tupleAttributeIteration_next(HeapCheckContext * ctx);
>> +void endTupleAttributeIteration(HeapCheckContext * ctx);
>> +
>> +void beginToastTupleIteration(HeapCheckContext * ctx,
>> + struct varatt_external *toast_pointer);
>> +void endToastTupleIteration(HeapCheckContext * ctx);
>> +bool toastTupleIteration_next(HeapCheckContext * ctx);
>> +
>> +bool TransactionIdStillValid(TransactionId xid, FullTransactionId *fxid);
>> +bool HeapTupleIsVisible(HeapTupleHeader tuphdr, HeapCheckContext * ctx);
>> +void check_toast_tuple(HeapCheckContext * ctx);
>> +bool check_tuple_attribute(HeapCheckContext * ctx);
>> +void check_tuple(HeapCheckContext * ctx);
>> +
>> +List   *check_relation(Oid relid);
>> +void check_relation_relkind(Relation rel);
>
> Why aren't these static?
They are now, except for the iterator style functions, which are gone.

>> +/*
>> + * record_corruption
>> + *
>> + *   Record a message about corruption, including information
>> + *   about where in the relation the corruption was found.
>> + */
>> +void
>> +record_corruption(HeapCheckContext * ctx, char *msg)
>> +{
>
> Given that you went through the trouble of adding prototypes for all of
> these, I'd start with the most important functions, not the unimportant
> details.
Yeah, good idea.  The most important functions are now at the top.

>> +/*
>> + * Helper function to construct the TupleDesc needed by heapcheck_relation.
>> + */
>> +TupleDesc
>> +heapcheck_relation_tupdesc()
>
> Missing (void) (it's our style, even though you could theoretically not
> have it as long as you have a prototype).

That was unintentional, and is now fixed.

>> +{
>> + TupleDesc tupdesc;
>> + AttrNumber maxattr = 8;
>
> This 8 is in multiple places, I'd add a define for it.

Done.

>> + AttrNumber a = 0;
>> +
>> + tupdesc = CreateTemplateTupleDesc(maxattr);
>> + TupleDescInitEntry(tupdesc, ++a, "blkno", INT8OID, -1, 0);
>> + TupleDescInitEntry(tupdesc, ++a, "offnum", INT4OID, -1, 0);
>> + TupleDescInitEntry(tupdesc, ++a, "lp_off", INT2OID, -1, 0);
>> + TupleDescInitEntry(tupdesc, ++a, "lp_flags", INT2OID, -1, 0);
>> + TupleDescInitEntry(tupdesc, ++a, "lp_len", INT2OID, -1, 0);
>> + TupleDescInitEntry(tupdesc, ++a, "attnum", INT4OID, -1, 0);
>> + TupleDescInitEntry(tupdesc, ++a, "chunk", INT4OID, -1, 0);
>> + TupleDescInitEntry(tupdesc, ++a, "msg", TEXTOID, -1, 0);
>> + Assert(a == maxattr);
>> +
>> + return BlessTupleDesc(tupdesc);
>> +}
>
>
>> +/*
>> + * heapcheck_relation
>> + *
>> + *   Scan and report corruption in heap pages or in associated toast relation.
>> + */
>> +Datum
>> +heapcheck_relation(PG_FUNCTION_ARGS)
>> +{
>> + FuncCallContext *funcctx;
>> + CheckRelCtx *ctx;
>> +
>> + if (SRF_IS_FIRSTCALL())
>> + {
>
> I think it'd be good to have a version that just returned a boolean. For
> one, in many cases that's all we care about when scripting things. But
> also, on a large relation, there could be a lot of errors.
There is now a second parameter to the function, "stop_on_error".  The function performs exactly the same checks, but when stop_on_error is true it returns after the first page that contains corruption.

>> + Oid relid = PG_GETARG_OID(0);
>> + MemoryContext oldcontext;
>> +
>> + /*
>> + * Scan the entire relation, building up a list of corruption found in
>> + * ctx->corruption, for returning later.  The scan must be performed
>> + * in a memory context that will survive until after all rows are
>> + * returned.
>> + */
>> + funcctx = SRF_FIRSTCALL_INIT();
>> + oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx);
>> + funcctx->tuple_desc = heapcheck_relation_tupdesc();
>> + ctx = (CheckRelCtx *) palloc0(sizeof(CheckRelCtx));
>> + ctx->corruption = check_relation(relid);
>> + ctx->idx = 0; /* start the iterator at the beginning */
>> + funcctx->user_fctx = (void *) ctx;
>> + MemoryContextSwitchTo(oldcontext);
>
> Hm. This builds up all the errors in memory. Is that a good idea? I mean
> for a large relation having one returned value for each tuple could be a
> heck of a lot of data.
>
> I think it'd be better to use the spilling SRF protocol here.  It's not
> like you're benefitting from deferring the tuple construction to the
> return currently.
Done.
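For readers unfamiliar with the spilling SRF protocol Andres refers to: in materialize mode the function writes each result row into a tuplestore as the scan proceeds, and the tuplestore transparently spills to disk once it grows past work_mem, so nothing accumulates in a backend-local list.  A rough sketch of the standard pattern in server-side C (a sketch only, not the actual v3 code; heapcheck_relation_tupdesc is the helper from the patch):

```c
/* Sketch only: the standard SFRM_Materialize pattern, simplified. */
Datum
heapcheck_relation(PG_FUNCTION_ARGS)
{
	ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
	Tuplestorestate *tupstore;
	TupleDesc	tupdesc;
	MemoryContext oldcontext;

	if (rsinfo == NULL || !IsA(rsinfo, ReturnSetInfo) ||
		!(rsinfo->allowedModes & SFRM_Materialize))
		ereport(ERROR,
				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
				 errmsg("set-valued function called in context that cannot accept a set")));

	tupdesc = heapcheck_relation_tupdesc();

	/* The tuplestore must live in the per-query memory context */
	oldcontext = MemoryContextSwitchTo(rsinfo->econtext->ecxt_per_query_memory);
	tupstore = tuplestore_begin_heap(true, false, work_mem);
	rsinfo->returnMode = SFRM_Materialize;
	rsinfo->setResult = tupstore;
	rsinfo->setDesc = tupdesc;
	MemoryContextSwitchTo(oldcontext);

	/*
	 * Scan the relation here; for each corruption found, fill values[]
	 * and nulls[] and append the row immediately:
	 *
	 *		tuplestore_putvalues(tupstore, tupdesc, values, nulls);
	 *
	 * Nothing is retained in backend-local memory across rows.
	 */

	return (Datum) 0;
}
```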

>> +/*
>> + * beginRelBlockIteration
>> + *
>> + *   For the given heap relation being checked, as recorded in ctx, sets up
>> + *   variables for iterating over the heap's pages.
>> + *
>> + *   The caller should have already opened the heap relation, ctx->rel
>> + */
>> +void
>> +beginRelBlockIteration(HeapCheckContext * ctx)
>> +{
>> + ctx->nblocks = RelationGetNumberOfBlocks(ctx->rel);
>> + ctx->blkno = InvalidBlockNumber;
>> + ctx->bstrategy = GetAccessStrategy(BAS_BULKREAD);
>> + ctx->buffer = InvalidBuffer;
>> + ctx->page = NULL;
>> +}
>> +
>> +/*
>> + * endRelBlockIteration
>> + *
>> + *   Releases resources that were reserved by either beginRelBlockIteration or
>> + *   relBlockIteration_next.
>> + */
>> +void
>> +endRelBlockIteration(HeapCheckContext * ctx)
>> +{
>> + /*
>> + * Clean up.  If the caller iterated to the end, the final call to
>> + * relBlockIteration_next will already have released the buffer, but if
>> + * the caller is bailing out early, we have to release it ourselves.
>> + */
>> + if (InvalidBuffer != ctx->buffer)
>> + UnlockReleaseBuffer(ctx->buffer);
>> +}
>
> These seem mighty granular and generically named to me.
Removed.

>> + * pageTupleIteration_next
>> + *
>> + *   Advances the state tracked in ctx to the next tuple on the page.
>> + *
>> + *   Caller should have already set up the iteration via
>> + *   beginPageTupleIteration, and should stop calling when this function
>> + *   returns false.
>> + */
>> +bool
>> +pageTupleIteration_next(HeapCheckContext * ctx)
>
> I don't think this is a naming scheme we use anywhere in postgres. I
> don't think it's a good idea to add yet more of those.
Removed.

>> +{
>> + /*
>> + * Iterate to the next interesting line pointer, if any. Unused, dead and
>> + * redirect line pointers are of no interest.
>> + */
>> + do
>> + {
>> + ctx->offnum = OffsetNumberNext(ctx->offnum);
>> + if (ctx->offnum > ctx->maxoff)
>> + return false;
>> + ctx->itemid = PageGetItemId(ctx->page, ctx->offnum);
>> + } while (!ItemIdIsUsed(ctx->itemid) ||
>> + ItemIdIsDead(ctx->itemid) ||
>> + ItemIdIsRedirected(ctx->itemid));
>
> This is an odd loop. Part of the test is in the body, part of in the
> loop header.
Refactored.
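For what it's worth, one common way to restructure such a loop is to put the whole iteration in a for-header and skip uninteresting line pointers with continue, e.g. (a sketch against the v2 code shown above, not necessarily how v3 does it):

```c
/* Sketch: iterate line pointers, skipping unused/dead/redirect ones */
for (ctx->offnum = FirstOffsetNumber;
	 ctx->offnum <= ctx->maxoff;
	 ctx->offnum = OffsetNumberNext(ctx->offnum))
{
	ctx->itemid = PageGetItemId(ctx->page, ctx->offnum);
	if (!ItemIdIsUsed(ctx->itemid) ||
		ItemIdIsDead(ctx->itemid) ||
		ItemIdIsRedirected(ctx->itemid))
		continue;
	check_tuple(ctx);
}
```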

>> +/*
>> + * Given a TransactionId, attempt to interpret it as a valid
>> + * FullTransactionId, neither in the future nor overlong in
>> + * the past.  Stores the inferred FullTransactionId in *fxid.
>> + *
>> + * Returns whether the xid is newer than the oldest clog xid.
>> + */
>> +bool
>> +TransactionIdStillValid(TransactionId xid, FullTransactionId *fxid)
>
> I don't at all like the naming of this function. This isn't a reliable
> check. As before, it obviously also shouldn't be static.
Renamed and refactored.

>> +{
>> + FullTransactionId fnow;
>> + uint32 epoch;
>> +
>> + /* Initialize fxid; we'll overwrite this later if needed */
>> + *fxid = FullTransactionIdFromEpochAndXid(0, xid);
>
>> + /* Special xids can quickly be turned into invalid fxids */
>> + if (!TransactionIdIsValid(xid))
>> + return false;
>> + if (!TransactionIdIsNormal(xid))
>> + return true;
>> +
>> + /*
>> + * Charitably infer the full transaction id as being within one epoch ago
>> + */
>> + fnow = ReadNextFullTransactionId();
>> + epoch = EpochFromFullTransactionId(fnow);
>> + *fxid = FullTransactionIdFromEpochAndXid(epoch, xid);
>
> So now you're overwriting the fxid value from above unconditionally?
>
>
>> + if (!FullTransactionIdPrecedes(*fxid, fnow))
>> + *fxid = FullTransactionIdFromEpochAndXid(epoch - 1, xid);
>
>
> I think it'd be better to do the conversion the following way:
>
>    *fxid = FullTransactionIdFromU64(U64FromFullTransactionId(fnow)
>                                    + (int32) (XidFromFullTransactionId(fnow) - xid));
This has been refactored to the point that these review comments cannot be directly replied to.
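The underlying idea is to widen a raw 32-bit xid to the 64-bit XID nearest the current next full transaction id, letting wraparound-aware signed subtraction pick the epoch.  A self-contained sketch of one such widening, under stand-in types (widen_xid is a hypothetical name, and a plain uint64_t stands in for the server's FullTransactionId struct):

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t TransactionId;
typedef uint64_t FullTransactionId;	/* stand-in for the struct wrapper */

/*
 * Interpret xid as the 64-bit XID nearest next_fxid.  The signed cast
 * maps xids up to ~2^31 behind next_fxid into the past, and the rest
 * into the (near) future.
 */
static FullTransactionId
widen_xid(TransactionId xid, FullTransactionId next_fxid)
{
	TransactionId next_xid = (TransactionId) next_fxid;
	int32_t		diff = (int32_t) (xid - next_xid);

	return next_fxid + (int64_t) diff;
}
```

With a next full XID of epoch 1, xid 100, an xid just behind the wrap point (e.g. 4294967290) resolves into epoch 0, while an xid slightly below 100 stays in epoch 1.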

>> + if (!FullTransactionIdPrecedes(*fxid, fnow))
>> + return false;
>> + /* The oldestClogXid is protected by CLogTruncationLock */
>> + Assert(LWLockHeldByMe(CLogTruncationLock));
>> + if (TransactionIdPrecedes(xid, ShmemVariableCache->oldestClogXid))
>> + return false;
>> + return true;
>> +}
>
> Why is this testing oldestClogXid instead of oldestXid?
References to clog have been refactored out of this module.

>> +/*
>> + * HeapTupleIsVisible
>> + *
>> + * Determine whether tuples are visible for heapcheck.  Similar to
>> + *  HeapTupleSatisfiesVacuum, but with critical differences.
>> + *
>> + *  1) Does not touch hint bits.  It seems imprudent to write hint bits
>> + *     to a table during a corruption check.
>> + *  2) Gracefully handles xids that are too old by calling
>> + *     TransactionIdStillValid before TransactionLogFetch, thus avoiding
>> + *     a backend abort.
>
> I think it'd be better to protect against this by avoiding checks for
> xids that are older than relfrozenxid. And ones that are newer than
> ReadNextTransactionId().  But all of those cases should be errors
> anyway, so it doesn't seem like that should be handled within the
> visibility routine.
The new implementation caches a range of expected xids.  With the relation locked against concurrent vacuum runs, it can trust that the old end of the range won't move during the course of the scan.  The newest end may move, but it only has to check for that when it encounters a newer than expected xid, and it updates the cache with the new maximum.
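A minimal self-contained sketch of that cached-range scheme (all names and values here are illustrative stand-ins, not the actual verify_heapam code):

```c
#include <assert.h>		/* used by the checks below */
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;

/* Circular comparison, as in the server's TransactionIdPrecedes() */
static bool
xid_precedes(TransactionId a, TransactionId b)
{
	return (int32_t) (a - b) < 0;
}

/*
 * Cached range of plausible xids.  With the relation locked against
 * concurrent vacuum, 'oldest' (relfrozenxid) cannot move, so only the
 * upper bound is ever refreshed, and only on a cache miss.
 */
typedef struct
{
	TransactionId oldest;	/* relfrozenxid: fixed for the scan */
	TransactionId next;		/* cached copy of the next XID */
} XidRange;

static bool
xid_in_range(XidRange *r, TransactionId xid, TransactionId fresh_next)
{
	if (xid_precedes(xid, r->oldest))
		return false;			/* precedes relfrozenxid: corrupt */
	if (!xid_precedes(xid, r->next))
	{
		r->next = fresh_next;	/* refresh the upper bound */
		if (!xid_precedes(xid, r->next))
			return false;		/* still in the future: corrupt */
	}
	return true;
}

/* A fixed scenario for demonstration purposes */
static XidRange demo_range = {100, 400};

static bool
check_xid(TransactionId xid)
{
	/* pretend a fresh ReadNextTransactionId() would return 500 */
	return xid_in_range(&demo_range, xid, 500);
}
```

An xid of 450 triggers one refresh of the cached bound and is then accepted; 50 (older than relfrozenxid) and 600 (in the future even after a refresh) are both rejected as corrupt.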

>
>> + *  3) Only makes a boolean determination of whether heapcheck should
>> + *     see the tuple, rather than doing extra work for vacuum-related
>> + *     categorization.
>> + */
>> +bool
>> +HeapTupleIsVisible(HeapTupleHeader tuphdr, HeapCheckContext * ctx)
>> +{
>
>> + FullTransactionId fxmin,
>> + fxmax;
>> + uint16 infomask = tuphdr->t_infomask;
>> + TransactionId xmin = HeapTupleHeaderGetXmin(tuphdr);
>> +
>> + if (!HeapTupleHeaderXminCommitted(tuphdr))
>> + {
>
> Hm. I wonder if it'd be good to crosscheck the xid committed hint bits
> with clog?
This is not done in v3, as it no longer checks clog.

>> + else if (!TransactionIdDidCommit(HeapTupleHeaderGetRawXmin(tuphdr)))
>> + {
>> + LWLockRelease(CLogTruncationLock);
>> + return false; /* HEAPTUPLE_DEAD */
>> + }
>
> Note that this actually can error out, if xmin is a subtransaction xid,
> because pg_subtrans is truncated a lot more aggressively than anything
> else. I think you'd need to filter against subtransactions older than
> RecentXmin before here, and treat that as an error.
Calls to TransactionIdDidCommit are now preceded by checks that the xid argument is not too old.

>> + if (!(infomask & HEAP_XMAX_INVALID) && !HEAP_XMAX_IS_LOCKED_ONLY(infomask))
>> + {
>> + if (infomask & HEAP_XMAX_IS_MULTI)
>> + {
>> + TransactionId xmax = HeapTupleGetUpdateXid(tuphdr);
>> +
>> + /* not LOCKED_ONLY, so it has to have an xmax */
>> + if (!TransactionIdIsValid(xmax))
>> + {
>> + record_corruption(ctx, _("heap tuple with XMAX_IS_MULTI is "
>> + "neither LOCKED_ONLY nor has a "
>> + "valid xmax"));
>> + return false;
>> + }
>
> I think it's bad to have code like this in a routine that's named like a
> generic visibility check routine.
Renamed.

>> + if (TransactionIdIsInProgress(xmax))
>> + return false; /* HEAPTUPLE_DELETE_IN_PROGRESS */
>> +
>> + LWLockAcquire(CLogTruncationLock, LW_SHARED);
>> + if (!TransactionIdStillValid(xmax, &fxmax))
>> + {
>> + LWLockRelease(CLogTruncationLock);
>> + record_corruption(ctx, psprintf("tuple xmax = %u (interpreted "
>> + "as " UINT64_FORMAT
>> + ") not or no longer valid",
>> + xmax, fxmax.value));
>> + return false;
>> + }
>> + else if (TransactionIdDidCommit(xmax))
>> + {
>> + LWLockRelease(CLogTruncationLock);
>> + return false; /* HEAPTUPLE_RECENTLY_DEAD or HEAPTUPLE_DEAD */
>> + }
>> + LWLockRelease(CLogTruncationLock);
>> + /* Ok, the tuple is live */
>
> I don't think random interspersed uses of CLogTruncationLock are a good
> idea. If you move to only checking visibility after tuple fits into
> [relfrozenxid, nextXid), then you don't need to take any locks here, as
> long as a lock against vacuum is taken (which I think this should do
> anyway).
Done.

>> +/*
>> + * check_tuple
>> + *
>> + *   Checks the current tuple as tracked in ctx for corruption.  Records any
>> + *   corruption found in ctx->corruption.
>> + *
>> + *   The caller should have iterated to a tuple via pageTupleIteration_next.
>> + */
>> +void
>> +check_tuple(HeapCheckContext * ctx)
>> +{
>> + bool fatal = false;
>
> Wait, aren't some checks here duplicate with ones in
> HeapTupleIsVisible()?
Yeah, there was some overlap.  That should be better now.

>> + /* Check relminmxid against mxid, if any */
>> + if (ctx->infomask & HEAP_XMAX_IS_MULTI &&
>> + MultiXactIdPrecedes(ctx->xmax, ctx->relminmxid))
>> + {
>> + record_corruption(ctx, psprintf("tuple xmax = %u precedes relation "
>> + "relminmxid = %u",
>> + ctx->xmax, ctx->relminmxid));
>> + }
>
> It's pretty weird that the routines here access xmin/xmax/... via
> HeapCheckContext, but HeapTupleIsVisible() doesn't.
Fair point.  HeapCheckContext no longer has fields for xmin/xmax after the refactoring.

>> + /* Check xmin against relfrozenxid */
>> + if (TransactionIdIsNormal(ctx->relfrozenxid) &&
>> + TransactionIdIsNormal(ctx->xmin) &&
>> + TransactionIdPrecedes(ctx->xmin, ctx->relfrozenxid))
>> + {
>> + record_corruption(ctx, psprintf("tuple xmin = %u precedes relation "
>> + "relfrozenxid = %u",
>> + ctx->xmin, ctx->relfrozenxid));
>> + }
>> +
>> + /* Check xmax against relfrozenxid */
>> + if (TransactionIdIsNormal(ctx->relfrozenxid) &&
>> + TransactionIdIsNormal(ctx->xmax) &&
>> + TransactionIdPrecedes(ctx->xmax, ctx->relfrozenxid))
>> + {
>> + record_corruption(ctx, psprintf("tuple xmax = %u precedes relation "
>> + "relfrozenxid = %u",
>> + ctx->xmax, ctx->relfrozenxid));
>> + }
>
> these all should be fatal. You definitely cannot just continue
> afterwards given the justification below:
They are now fatal.

>> + /*
>> + * Iterate over the attributes looking for broken toast values. This
>> + * roughly follows the logic of heap_deform_tuple, except that it doesn't
>> + * bother building up isnull[] and values[] arrays, since nobody wants
>> + * them, and it unrolls anything that might trip over an Assert when
>> + * processing corrupt data.
>> + */
>> + beginTupleAttributeIteration(ctx);
>> + while (tupleAttributeIteration_next(ctx) &&
>> +   check_tuple_attribute(ctx))
>> + ;
>> + endTupleAttributeIteration(ctx);
>> +}
>
> I really don't find these helpers helpful.
Removed.

>> +/*
>> + * check_relation
>> + *
>> + *   Checks the relation given by relid for corruption, returning a list of all
>> + *   it finds.
>> + *
>> + *   The caller should set up the memory context as desired before calling.
>> + *   The returned list belongs to the caller.
>> + */
>> +List *
>> +check_relation(Oid relid)
>> +{
>> + HeapCheckContext ctx;
>> +
>> + memset(&ctx, 0, sizeof(HeapCheckContext));
>> +
>> + /* Open the relation */
>> + ctx.relid = relid;
>> + ctx.corruption = NIL;
>> + ctx.rel = relation_open(relid, AccessShareLock);
>
> I think you need to protect at least against concurrent schema changes
> given some of your checks. But I think it'd be better to also conflict
> with vacuum here.
The relation is now opened with ShareUpdateExclusiveLock.

>
>> + check_relation_relkind(ctx.rel);
>
> I think you also need to ensure that the table is actually using heap
> AM, not another tableam. Oh - you're doing that inside the check. But
> that's confusing, because that's not 'relkind'.

It is checking both relkind and relam.  The function has been renamed to reflect that.

>> + ctx.relDesc = RelationGetDescr(ctx.rel);
>> + ctx.rel_natts = RelationGetDescr(ctx.rel)->natts;
>> + ctx.relfrozenxid = ctx.rel->rd_rel->relfrozenxid;
>> + ctx.relminmxid = ctx.rel->rd_rel->relminmxid;
>
> three naming schemes in three lines...

Fixed.

>> + /* check all blocks of the relation */
>> + beginRelBlockIteration(&ctx);
>> + while (relBlockIteration_next(&ctx))
>> + {
>> + /* Perform tuple checks */
>> + beginPageTupleIteration(&ctx);
>> + while (pageTupleIteration_next(&ctx))
>> + check_tuple(&ctx);
>> + endPageTupleIteration(&ctx);
>> + }
>> + endRelBlockIteration(&ctx);
>
> I again do not find this helper stuff helpful.
Removed.

>> + /* Close the associated toast table and indexes, if any. */
>> + if (ctx.has_toastrel)
>> + {
>> + toast_close_indexes(ctx.toast_indexes, ctx.num_toast_indexes,
>> + AccessShareLock);
>> + table_close(ctx.toastrel, AccessShareLock);
>> + }
>> +
>> + /* Close the main relation */
>> + relation_close(ctx.rel, AccessShareLock);
>
> Why the closing here?
As opposed to where...?  It seems fairly standard to close the relation in the function where it was opened.  Do you prefer that the relation not be closed?  Or that it be closed but the lock retained?

>> +# This regression test demonstrates that the heapcheck_relation() function
>> +# supplied with this contrib module correctly identifies specific kinds of
>> +# corruption within pages.  To test this, we need a mechanism to create corrupt
>> +# pages with predictable, repeatable corruption.  The postgres backend cannot be
>> +# expected to help us with this, as its design is not consistent with the goal
>> +# of intentionally corrupting pages.
>> +#
>> +# Instead, we create a table to corrupt, and with careful consideration of how
>> +# postgresql lays out heap pages, we seek to offsets within the page and
>> +# overwrite deliberately chosen bytes with specific values calculated to
>> +# corrupt the page in expected ways.  We then verify that heapcheck_relation
>> +# reports the corruption, and that it runs without crashing.  Note that the
>> +# backend cannot simply be started to run queries against the corrupt table, as
>> +# the backend will crash, at least for some of the corruption types we
>> +# generate.
>> +#
>> +# Autovacuum potentially touching the table in the background makes the exact
>> +# behavior of this test harder to reason about.  We turn it off to keep things
>> +# simpler.  We use a "belt and suspenders" approach, turning it off for the
>> +# system generally in postgresql.conf, and turning it off specifically for the
>> +# test table.
>> +#
>> +# This test depends on the table being written to the heap file exactly as we
>> +# expect it to be, so we take care to arrange the columns of the table, and
>> +# insert rows of the table, that give predictable sizes and locations within
>> +# the table page.
>
> I have a hard time believing this is going to be really
> reliable. E.g. the alignment requirements will vary between platforms,
> leading to different layouts. In particular, MAXALIGN differs between
> platforms.
>
> Also, it's supported to compile postgres with a different pagesize.
It's simple enough to extend the tap test a little to check for those things.  In v3, the tap test skips tests if the page size is not 8k, and also if the tuples do not fall on the page where expected (which would happen due to alignment issues, gremlins, or whatever).  There are other approaches, though.  The HeapFile/HeapPage/HeapTuple perl modules recently submitted on another thread *could* be used here, but only if those modules are likely to be committed.  This test *could* be extended to autodetect the page size and alignment issues and calculate at runtime where tuples will be on the page, but only if folks don't mind the test having that extra complexity in it.  (There is a school of thought that regression tests should avoid excess complexity.)  Do you have a recommendation about which way to go with this?

Here is the work thus far:




Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




v3-0001-Adding-verify_heapam-to-amcheck-contrib-module.patch (80K)

Re: new heapcheck contrib module

Mark Dilger-5

>> I wonder if a mode where heapcheck optionally would only check
>> non-frozen (perhaps also non-all-visible) regions of a table would be a
>> good idea?

Version 4 of this patch now includes boolean options skip_all_frozen and skip_all_visible.
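Conceptually, these options consult the visibility map's two per-block bits and skip reading a heap page when the relevant bit is set.  A toy sketch of the skip decision (the flag values mirror visibilitymap.h, but everything here is a self-contained stand-in, not the server code):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Per-block visibility map bits, as in PostgreSQL's visibilitymap.h */
#define VM_ALL_VISIBLE	0x01
#define VM_ALL_FROZEN	0x02

/*
 * Decide whether a heap block can be skipped.  Since all-frozen
 * implies all-visible, skip_all_visible skips strictly more blocks
 * than skip_all_frozen does.
 */
static bool
skip_block(uint8_t vmflags, bool skip_all_frozen, bool skip_all_visible)
{
	if (skip_all_visible && (vmflags & VM_ALL_VISIBLE) != 0)
		return true;
	if (skip_all_frozen && (vmflags & VM_ALL_FROZEN) != 0)
		return true;
	return false;
}
```

With skip_all_frozen alone, an all-visible-but-not-frozen block is still checked; with skip_all_visible, both kinds of marked blocks are skipped.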

>> Would make it a lot more viable to run this regularly on
>> bigger databases. Even if there's a window to not check some data
>> (because it's frozen before the next heapcheck run).

Do you think it would make sense for the amcheck contrib module to have, in addition to the SQL-queryable functions, a bgworker-based mode that periodically checks your database?  The work along those lines is not included in v4, but if it were part of v5, would you have specific design preferences?




Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




v4-0001-Adding-verify_heapam-to-amcheck-contrib-module.patch (85K)

Re: new heapcheck contrib module

Robert Haas
On Wed, Apr 29, 2020 at 12:30 PM Mark Dilger
<[hidden email]> wrote:
> Do you think it would make sense to have the amcheck contrib module provide, in addition to the SQL-queryable functions, a bgworker-based mode that periodically checks your database?  The work along those lines is not included in v4, but if it were part of v5, would you have specific design preferences?

-1 on that idea from me. That sounds like it's basically building
"cron" into PostgreSQL, but in a way that can only be used by amcheck.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: new heapcheck contrib module

Robert Haas
In reply to this post by Mark Dilger-5
On Wed, Apr 22, 2020 at 10:43 PM Mark Dilger
<[hidden email]> wrote:
> It's simple enough to extend the tap test a little to check for those things.  In v3, the tap test skips tests if the page size is not 8k, and also if the tuples do not fall on the page where expected (which could happen due to alignment issues, gremlins, or whatever).

Skipping the test if the tuple isn't in the expected location sounds
really bad. That will just lead to the tests passing without actually
doing anything. If the tuple isn't in the expected location, the tests
should fail.

> There are other approaches, though.  The HeapFile/HeapPage/HeapTuple perl modules recently submitted on another thread *could* be used here, but only if those modules are likely to be committed.

Yeah, I don't know if we want that stuff or not.

> This test *could* be extended to autodetect the page size and alignment issues and calculate at runtime where tuples will be on the page, but only if folks don't mind the test having that extra complexity in it.  (There is a school of thought that regression tests should avoid excess complexity.)  Do you have a recommendation about which way to go with this?

How much extra complexity are we talking about? It feels to me like
for a heap page, the only things that are going to affect the position
of the tuples on the page -- supposing we know the tuple size -- are
the page size and, I think, MAXALIGN, and that doesn't sound too bad.
Another possibility is to use pageinspect's heap_page_items() to
determine the position within the page (lp_off), which seems like it
might simplify things considerably. Then, we're entirely relying on
the backend to tell us where the tuples are, and we only need to worry
about the offsets relative to the start of the tuple.

I kind of like that approach, because it doesn't involve having Perl
code that knows how heap pages are laid out; we rely entirely on the C
code for that. I'm not sure if it'd be a problem to have a TAP test
for one contrib module that uses another contrib module, but maybe
there's some way to figure that problem out.
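
To make the "offsets relative to the start of the tuple" idea concrete: once heap_page_items() has reported a tuple's lp_off, a test only needs to decode the fixed-size prefix of the tuple header at that offset. The sketch below (illustration only, not part of the patch) assumes the standard HeapTupleHeaderData layout on a little-endian build; the layout constant is the assumption here.

```python
import struct

# Fixed-size prefix of HeapTupleHeaderData, little-endian build assumed:
# t_xmin (uint32), t_xmax (uint32), t_field3 (uint32), t_ctid (6 bytes),
# t_infomask2 (uint16), t_infomask (uint16), t_hoff (uint8)
HEADER = struct.Struct("<III6sHHB")

def decode_tuple_header(page: bytes, lp_off: int) -> dict:
    """Decode the header fields a corruption check would inspect,
    addressing everything relative to lp_off as reported by the backend."""
    xmin, xmax, _field3, _ctid, infomask2, infomask, hoff = \
        HEADER.unpack_from(page, lp_off)
    return {"t_xmin": xmin, "t_xmax": xmax,
            "t_infomask2": infomask2, "t_infomask": infomask,
            "t_hoff": hoff}
```

With this shape, the Perl (or other) test code never computes where tuples live on the page; it trusts lp_off from pageinspect and only knows the intra-tuple field offsets.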

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: new heapcheck contrib module

Robert Haas
In reply to this post by Mark Dilger-5
On Wed, Apr 29, 2020 at 12:30 PM Mark Dilger
<[hidden email]> wrote:
> Version 4 of this patch now includes boolean options skip_all_frozen and skip_all_visible.

I'm not so sure, but maybe there should just be one argument with
three possible values, because skip_all_frozen = true and
skip_all_visible = false seems nonsensical. On the other hand, if we
used a text argument with three possible values, I'm not sure what
we'd call the argument or what strings we'd use as the values.

Also, what do people -- either those who have already responded, or
others -- think about the idea of putting a command-line tool around
this? I know that there were some rumblings about this in respect to
pg_verifybackup, but I think a pg_amcheck binary would be
well-received. It could do some interesting things, too. For instance,
it could query pg_class for a list of relations that amcheck would
know how to check, and then issue a separate query for each relation,
which would avoid holding a snapshot or heavyweight locks across the
whole operation. It could do parallelism across relations by opening
multiple connections, or even within a single relation if -- as I
think would be a good idea -- we extended heapcheck to take a range of
block numbers after the style of pg_prewarm.

Apart from allowing for client-driven parallelism, accepting block
number ranges would have the advantage -- IMHO pretty significant --
of making it far easier to use this on a relation where some blocks
are entirely unreadable. You could specify ranges to check out the
remaining blocks.
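
As a sketch of what client-driven parallelism over block ranges could look like (all names here are hypothetical, not the patch's actual API): the command-line tool would read pg_class.relpages, partition the block space, and hand each range to a separate connection.

```python
def block_ranges(nblocks: int, nworkers: int):
    """Split the block space [0, nblocks) into up to nworkers contiguous,
    disjoint ranges.  Each worker connection would then run something like
        SELECT * FROM verify_heapam('tbl', startblock := lo, endblock := hi)
    (argument names invented for illustration)."""
    if nblocks == 0:
        return []
    nworkers = min(nworkers, nblocks)
    step, extra = divmod(nblocks, nworkers)
    ranges, lo = [], 0
    for i in range(nworkers):
        hi = lo + step + (1 if i < extra else 0)
        ranges.append((lo, hi - 1))  # inclusive block numbers
        lo = hi
    return ranges
```

The same partitioning also serves the unreadable-blocks use case: a user can simply omit the ranges known to be bad and check everything else.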

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: new heapcheck contrib module

Mark Dilger-5
In reply to this post by Robert Haas


> On Apr 29, 2020, at 11:41 AM, Robert Haas <[hidden email]> wrote:
>
> On Wed, Apr 22, 2020 at 10:43 PM Mark Dilger
> <[hidden email]> wrote:
>> It's simple enough to extend the tap test a little to check for those things.  In v3, the tap test skips tests if the page size is not 8k, and also if the tuples do not fall on the page where expected (which would happen due to alignment issues, gremlins, or whatever).
>
> Skipping the test if the tuple isn't in the expected location sounds
> really bad. That will just lead to the tests passing without actually
> doing anything. If the tuple isn't in the expected location, the tests
> should fail.
>
>> There are other approaches, though.  The HeapFile/HeapPage/HeapTuple perl modules recently submitted on another thread *could* be used here, but only if those modules are likely to be committed.
>
> Yeah, I don't know if we want that stuff or not.
>
>> This test *could* be extended to autodetect the page size and alignment issues and calculate at runtime where tuples will be on the page, but only if folks don't mind the test having that extra complexity in it.  (There is a school of thought that regression tests should avoid excess complexity.)  Do you have a recommendation about which way to go with this?
>
> How much extra complexity are we talking about?

The page size is easy to query, and the test already does so, skipping if the answer isn't 8k.  The test could recalculate offsets based on the pagesize rather than skipping the test easily enough, but the MAXALIGN stuff is a little harder.  I don't know (perhaps someone would share?) how to easily query that from within a perl test.  So the test could guess all possible alignments that occur in the real world, read from the page at the offset that alignment would create, and check if the expected datum is there.  The test would have to be careful to avoid false positives, by placing data before and after the datum being checked with bit patterns that cannot be misinterpreted as a match.  That level of complexity seems unappealing, at least to me.  It's not hard to write, but maintaining stuff like that is an unwelcome burden.
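
For reference, the MAXALIGN computation under discussion is nothing more than rounding a length up to the platform's maximum alignment (commonly 8 on 64-bit builds, 4 on some 32-bit ones); once the test knows the alignment value, it can mirror the macro in one line rather than guessing offsets. A sketch:

```python
def maxalign(length: int, align: int = 8) -> int:
    """Round length up to the next multiple of align, mirroring PostgreSQL's
    MAXALIGN macro.  align must be a power of two; 8 is only a typical value,
    not a guarantee for every platform."""
    return (length + align - 1) & ~(align - 1)
```

The hard part is not the arithmetic but discovering the build's actual alignment from a test script, which is what makes the guess-and-probe approach above unappealing.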

> It feels to me like
> for a heap page, the only things that are going to affect the position
> of the tuples on the page -- supposing we know the tuple size -- are
> the page size and, I think, MAXALIGN, and that doesn't sound too bad.
> Another possibility is to use pageinspect's heap_page_items() to
> determine the position within the page (lp_off), which seems like it
> might simplify things considerably. Then, we're entirely relying on
> the backend to tell us where the tuples are, and we only need to worry
> about the offsets relative to the start of the tuple.
>
> I kind of like that approach, because it doesn't involve having Perl
> code that knows how heap pages are laid out; we rely entirely on the C
> code for that. I'm not sure if it'd be a problem to have a TAP test
> for one contrib module that uses another contrib module, but maybe
> there's some way to figure that problem out.

Yeah, I'll give this a try.


Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




