recovering from "found xmin ... from before relfrozenxid ..."


Re: recovering from "found xmin ... from before relfrozenxid ..."

Ashutosh Sharma
Hi Beena,

Thanks for the review.

> 1. We would be marking buffer dirty and writing wal even if we have
> not done any changes (e.g. if we pass invalid/dead tids). Maybe we can
> handle this better?

Yeah, we can skip this when nothing has changed. Will take care of it in the next version of the patch.

> cosmetic changes
> 1. Maybe "HTupleSurgicalOption" instead of "HTupleForceOption" and
> also the variable names could use surgery instead.

I think that looks fine. I would rather prefer the word "Force" just because all the enum options contain the word "Force" in them.

> 2. extension comment pg_surgery.control "extension to perform surgery
> the damaged heap table" -> "extension to perform surgery on the
> damaged heap table"

Okay, will fix that typo.

--
With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com

Re: recovering from "found xmin ... from before relfrozenxid ..."

Thomas Munro-5
In reply to this post by Robert Haas
On Tue, Jul 14, 2020 at 9:12 AM Robert Haas <[hidden email]> wrote:
> A number of EDB customers have had this error crop up on their tables for
> reasons that we have usually not been able to determine. In many

<long-shot>Do you happen to know if they ever used the
snapshot-too-old feature?</long-shot>



Re: recovering from "found xmin ... from before relfrozenxid ..."

Robert Haas
On Wed, Jul 29, 2020 at 3:23 AM Thomas Munro <[hidden email]> wrote:
> On Tue, Jul 14, 2020 at 9:12 AM Robert Haas <[hidden email]> wrote:
> > A number of EDB customers have had this error crop up on their tables for
> > reasons that we have usually not been able to determine. In many
>
> <long-shot>Do you happen to know if they ever used the
> snapshot-too-old feature?</long-shot>

I don't have any reason to believe that they did. Why?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: recovering from "found xmin ... from before relfrozenxid ..."

Thomas Munro-5
On Thu, Jul 30, 2020 at 1:36 AM Robert Haas <[hidden email]> wrote:
> On Wed, Jul 29, 2020 at 3:23 AM Thomas Munro <[hidden email]> wrote:
> > On Tue, Jul 14, 2020 at 9:12 AM Robert Haas <[hidden email]> wrote:
> > > A number of EDB customers have had this error crop up on their tables for
> > > reasons that we have usually not been able to determine. In many
> >
> > <long-shot>Do you happen to know if they ever used the
> > snapshot-too-old feature?</long-shot>
>
> I don't have any reason to believe that they did. Why?

Nothing specific, I was just contemplating the problems with that
feature and the patches[1] proposed so far to fix some of them, and
what types of corruption might be possible due to that stuff, and it
occurred to me to ask if you'd thought about that in connection to
these reports.

[1] https://www.postgresql.org/message-id/flat/CA%2BTgmoY%3Daqf0zjTD%2B3dUWYkgMiNDegDLFjo%2B6ze%3DWtpik%2B3XqA%40mail.gmail.com



Re: recovering from "found xmin ... from before relfrozenxid ..."

Ashutosh Sharma
In reply to this post by Ashutosh Sharma

On Wed, Jul 29, 2020 at 9:58 AM Ashutosh Sharma <[hidden email]> wrote:

> > I think we should let VACUUM do that.
> Sometimes VACUUM will not get to these pages, because they are marked All Frozen.
> And possibly some tuples will get stale on this page again.

Hmm, okay, will have a look into this. Thanks.

I had a look over this and found that one can use the DISABLE_PAGE_SKIPPING option with VACUUM to disable all its page-skipping behavior.
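For reference, a hedged sketch of how that option might be used after the surgery functions have been run (the table name here is hypothetical):

```sql
-- Hypothetical example: after forcing tuple state changes on a damaged
-- table, run VACUUM with DISABLE_PAGE_SKIPPING so that even pages marked
-- All Frozen in the visibility map are revisited rather than skipped.
VACUUM (DISABLE_PAGE_SKIPPING, VERBOSE) damaged_table;
```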

--
With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com

Re: recovering from "found xmin ... from before relfrozenxid ..."

Ashutosh Sharma
In reply to this post by Ashutosh Sharma
Attached is the new version of patch that addresses the comments from Andrey and Beena.

On Wed, Jul 29, 2020 at 10:07 AM Ashutosh Sharma <[hidden email]> wrote:
> Hi Beena,
>
> Thanks for the review.
>
> > 1. We would be marking buffer dirty and writing wal even if we have
> > not done any changes (e.g. if we pass invalid/dead tids). Maybe we can
> > handle this better?
>
> Yeah, we can skip this when nothing has changed. Will take care of it in the next version of the patch.
>
> > cosmetic changes
> > 1. Maybe "HTupleSurgicalOption" instead of "HTupleForceOption" and
> > also the variable names could use surgery instead.
>
> I think that looks fine. I would rather prefer the word "Force" just because all the enum options contain the word "Force" in them.
>
> > 2. extension comment pg_surgery.control "extension to perform surgery
> > the damaged heap table" -> "extension to perform surgery on the
> > damaged heap table"
>
> Okay, will fix that typo.
>
> --
> With Regards,
> Ashutosh Sharma
> EnterpriseDB: http://www.enterprisedb.com

Attachment: v2-0001-Add-contrib-pg_surgery-to-perform-surgery-on-the-dam.patch (26K)

Re: recovering from "found xmin ... from before relfrozenxid ..."

Andrey M. Borodin
In reply to this post by Ashutosh Sharma


> 31 июля 2020 г., в 17:32, Ashutosh Sharma <[hidden email]> написал(а):
>
>
> On Wed, Jul 29, 2020 at 9:58 AM Ashutosh Sharma <[hidden email]> wrote:
>
> > I think we should let VACUUM do that.
> Sometimes VACUUM will not get to these pages, because they are marked All Frozen.
> And possibly some tuples will get stale on this page again.
>
> Hmm, okay, will have a look into this. Thanks.
>
> I had a look over this and found that one can use the DISABLE_PAGE_SKIPPING option with VACUUM to disable all its page-skipping behavior.

Oh, wow, I didn't know that. Thanks! This actually will do the trick.
I'll try to review your patch again next week.

Thanks!

Best regards, Andrey Borodin.


Re: recovering from "found xmin ... from before relfrozenxid ..."

Robert Haas
In reply to this post by Ashutosh Sharma
On Fri, Jul 31, 2020 at 8:52 AM Ashutosh Sharma <[hidden email]> wrote:
> Attached is the new version of patch that addresses the comments from Andrey and Beena.

+PGFILEDESC = "pg_surgery - perform surgery on the damaged heap table"

the -> a

I also suggest: heap table -> relation, because we might want to
extend this to other cases later.

+-- toast table.
+begin;
+create table ttab(a text);
+insert into ttab select string_agg(chr(floor(random() * 26)::int +
65), '') from generate_series(1,10000);
+select * from ttab where xmin = 2;
+ a
+---
+(0 rows)
+
+select heap_force_freeze('ttab'::regclass, ARRAY['(0, 1)']::tid[]);
+ heap_force_freeze
+-------------------
+
+(1 row)
+

I don't understand the point of this. You're not testing the function
on the TOAST table; you're testing it on the main table when there
happens to be a TOAST table that is probably getting used for
something. But that's not really relevant to what is being tested
here, so as written this seems redundant with the previous cases.

+-- test pg_surgery functions with the unsupported relations. Should fail.

Please name the specific functions being tested here in case we add
more in the future that are tested separately.

+++ b/contrib/pg_surgery/heap_surgery_funcs.c

I think we could drop _funcs from the file name.

+#ifdef PG_MODULE_MAGIC
+PG_MODULE_MAGIC;
+#endif

The #ifdef here is not required, and if you look at other contrib
modules you'll see that they don't have it.

I don't like all the macros at the top of the file much. CHECKARRVALID
is only used in one place, so it seems to me that you might as well
just inline it and lose the macro. Likewise for SORT and ARRISEMPTY.

Once you do that, heap_force_common() can just compute the number of
array elements once, instead of doing it once inside ARRISEMPTY, then
again inside SORT, and then a third time to initialize ntids. You can
just have a local variable in that function that is set once and then
used as needed. Then you are only doing ARRNELEMS once, so you can get
rid of that macro too. The same technique can be used to get rid of
ARRPTR. So then all the macros go away without introducing any code
duplication.

+/* Options to forcefully change the state of a heap tuple. */
+typedef enum HTupleForceOption
+{
+ FORCE_KILL,
+ FORCE_FREEZE
+} HTupleForceOption;

I suggest un-abbreviating HTuple to HeapTuple and un-abbreviating the
enum members to HEAP_FORCE_KILL and HEAP_FORCE_FREEZE. Also, how about
option -> operation?

+ return heap_force_common(fcinfo, FORCE_KILL);

I think it might be more idiomatic to use PG_RETURN_DATUM here.  I
know it's the same thing, though, and perhaps I'm even wrong about the
prevailing style.

+ Assert(force_opt == FORCE_KILL || force_opt == FORCE_FREEZE);

I think this is unnecessary. It's an enum with 2 values.

+ if (ARRISEMPTY(ta))
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("empty tid array")));

I don't see why this should be an error. Why can't we just continue
normally and if it does nothing, it does nothing? You'd need to change
the do..while loop to a while loop so that the end condition is tested
at the top, but that seems fine.

+ rel = relation_open(relid, AccessShareLock);

Maybe we should take RowExclusiveLock, since we are going to modify
rows. Not sure how much it matters, though.

+ if (!superuser() && GetUserId() != rel->rd_rel->relowner)
+ ereport(ERROR,
+ (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
+ errmsg("must be superuser or object owner to use %s.",
+ force_opt == FORCE_KILL ? "heap_force_kill()" :
+ "heap_force_freeze()")));

This is the wrong way to do a permissions check, and it's also the
wrong way to write an error message about having failed one. To see
the correct method, grep for pg_class_aclcheck. However, I think that
we shouldn't in general trust the object owner to do this, unless the
super-user gave permission. This is a data-corrupting operation, and
only the boss is allowed to authorize it. So I think you should also
add REVOKE EXECUTE FROM PUBLIC statements to the SQL file, and then
have this check as a backup. Then, the superuser is always allowed,
and if they choose to GRANT EXECUTE on this function to some users,
those users can do it for their own relations, but not others.
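That suggestion could look roughly like the following in the extension's SQL script. The function signatures here follow the heap_force_kill/heap_force_freeze calls shown elsewhere in this thread; treat this as a sketch, not the final script:

```sql
-- Sketch: lock the surgery functions down by default; the superuser
-- can then GRANT EXECUTE to trusted users explicitly, and those users
-- are still limited to relations they own by the in-code check.
REVOKE EXECUTE ON FUNCTION heap_force_kill(regclass, tid[]) FROM PUBLIC;
REVOKE EXECUTE ON FUNCTION heap_force_freeze(regclass, tid[]) FROM PUBLIC;
```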

+ if (rel->rd_rel->relam != HEAP_TABLE_AM_OID)
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("only heap AM is supported")));
+
+ check_relation_relkind(rel);

Seems like these checks are in the wrong order. Also, maybe you could
call the function something like check_relation_ok() and put the
permissions test, the relkind test, and the relam test all inside of
it, just to tighten up the code in this main function a bit.

+ if (noffs > maxoffset)
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("number of offsets specified for block %u exceeds the max
offset number %u",
+ blkno, maxoffset)));

Hmm, this doesn't seem quite right. The actual problem is if an
individual item pointer's offset number is greater than maxoffset,
which can be true even if the total number of offsets is less than
maxoffset. So I think you need to remove this check and add a check
inside the loop which follows that offnos[i] is in range.

The way you've structured that loop is actually problematic -- I don't
think we want to be calling elog() or ereport() inside a critical
section. You could fix the case that checks for an invalid force_opt
by just doing if (op == HEAP_FORCE_KILL) { ... } else { Assert(op ==
HEAP_FORCE_FREEZE); ... }, or by using a switch with no default. The
NOTICE case you have here is a bigger problem. You really cannot
modify the buffer like this and then decide, oops, never mind, I think
I won't mark it dirty or write WAL for the changes. If you do that,
the buffer is still in memory, but it's now been modified. A
subsequent operation that modifies it will start with the altered
state you created here, quite possibly leading to WAL that cannot be
correctly replayed on the standby. In other words, you've got to
decide for certain whether you want to proceed with the operation
*before* you enter the critical section. You also need to emit any
messages before or after the critical section.  So you could:

1. If you encounter a TID that's unused or dead, skip it silently.
-or-
2. Loop over offsets twice. The first time, ERROR if you find any one
that is unused or dead. Then start a critical section. Loop again and
do the real work.
-or-
3. Like #2, but emit a NOTICE about an unused or dead item rather than
an ERROR, and skip the critical section and the second loop if you did
that >0 times.
-or-
4. Like #3, but don't skip anything just because you emitted a NOTICE
about the page.

#3 is closest to the behavior you have now, but I'm not sure what else
it has going for it. It doesn't seem like particularly intuitive
behavior that finding a dead or unused TID should cause other item
TIDs on the same page not to get processed while still permitting TIDs
on other pages to get processed. I don't think that's the behavior
users will be expecting. I think my vote is for #4, which will emit a
NOTICE about any TID that is dead or unused -- and I guess also about
any TID whose offset number is out of range -- but won't actually skip
any operations that can be performed. But there are decent arguments
for #1 or #2 too.

+ (errmsg("skipping tid (%u, %u) because it is already marked %s",
+ blkno, offnos[i],
+ ItemIdIsDead(itemid) ? "dead" : "unused")));

I believe this violates our guidelines on message construction. Have
two completely separate messages -- and maybe lose the word "already":

"skipping tid (%u, %u) because it is dead"
"skipping tid (%u, %u) because it is unused"

The point of this is that it makes it easier for translators.

I see very little point in what verify_tid() is doing. Before using
each block number, we should check that it's less than or equal to a
cached value of RelationGetNumberOfBlocks(rel). That's necessary in
any case to avoid funny errors; and then the check here against
specifically InvalidBlockNumber is redundant. For the offset number,
same thing: we need to check each offset against the page's
PageGetMaxOffsetNumber(page); and if we do that then we don't need
these checks.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: recovering from "found xmin ... from before relfrozenxid ..."

Ashutosh Sharma
Hi Robert,

Thanks for the review.

I've gone through all your review comments and understood all of them except this one:

> You really cannot
> modify the buffer like this and then decide, oops, never mind, I think
> I won't mark it dirty or write WAL for the changes. If you do that,
> the buffer is still in memory, but it's now been modified. A
> subsequent operation that modifies it will start with the altered
> state you created here, quite possibly leading to WAL that cannot be
> correctly replayed on the standby. In other words, you've got to
> decide for certain whether you want to proceed with the operation
> *before* you enter the critical section.

Could you please explain this point once more in detail? I am not quite able to understand under what circumstances a buffer would be modified but not marked dirty, with no WAL written for it.

--
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com

Re: recovering from "found xmin ... from before relfrozenxid ..."

Robert Haas
On Mon, Aug 3, 2020 at 5:05 AM Ashutosh Sharma <[hidden email]> wrote:
> Could you please explain this point once more in detail? I am not quite able to understand under what circumstances a buffer would be modified but not marked dirty, with no WAL written for it.

Whenever this branch is taken:

+               if (nskippedItems == noffs)
+                       goto skip_wal;

At this point you have already modified the page, using ItemIdSetDead,
HeapTupleHeaderSet*, and/or directly adjusting htup->infomask. If this
branch is taken, then MarkBufferDirty() and log_newpage_buffer() are
skipped.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: recovering from "found xmin ... from before relfrozenxid ..."

Ashutosh Sharma
On Mon, Aug 3, 2020 at 7:06 PM Robert Haas <[hidden email]> wrote:
>
> On Mon, Aug 3, 2020 at 5:05 AM Ashutosh Sharma <[hidden email]> wrote:
> > Could you please explain this point once more in detail? I am not quite able to understand under what circumstances a buffer would be modified but not marked dirty, with no WAL written for it.
>
> Whenever this branch is taken:
>
> +               if (nskippedItems == noffs)
> +                       goto skip_wal;
>

If the above path is taken that means none of the items in the page
got changed. As per the following if-check whenever an item in the
offnos[] array is found dead or unused, it is skipped (due to continue
statement), which means the item is neither marked dead nor marked
frozen. Now, if this happens for all the items in a page, then
the above condition (nskippedItems == noffs) would be true and hence
the buffer would remain unchanged, so, we don't mark such a buffer as
dirty and neither do any WAL logging for it. This is my understanding,
please let me know if I am missing something here. Thank you.

if (!ItemIdIsUsed(itemid) || ItemIdIsDead(itemid))
{
    nskippedItems++;
    ereport(NOTICE,
            (errmsg("skipping tid (%u, %u) because it is already marked %s",
                    blkno, offnos[i],
                    ItemIdIsDead(itemid) ? "dead" : "unused")));
    continue;
}

> At this point you have already modified the page, using ItemIdSetDead,
> HeapTupleHeaderSet*, and/or directly adjusting htup->infomask. If this
> branch is taken, then MarkBufferDirty() and log_newpage_buffer() are
> skipped.
>

--
With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com



Re: recovering from "found xmin ... from before relfrozenxid ..."

Ashutosh Sharma
In reply to this post by Robert Haas
Hi Robert,

Thanks for the review. Please find my comments inline:

On Sat, Aug 1, 2020 at 12:18 AM Robert Haas <[hidden email]> wrote:

>
> On Fri, Jul 31, 2020 at 8:52 AM Ashutosh Sharma <[hidden email]> wrote:
> > Attached is the new version of patch that addresses the comments from Andrey and Beena.
>
> +PGFILEDESC = "pg_surgery - perform surgery on the damaged heap table"
>
> the -> a
>
> I also suggest: heap table -> relation, because we might want to
> extend this to other cases later.
>
Corrected.

> +-- toast table.
> +begin;
> +create table ttab(a text);
> +insert into ttab select string_agg(chr(floor(random() * 26)::int +
> 65), '') from generate_series(1,10000);
> +select * from ttab where xmin = 2;
> + a
> +---
> +(0 rows)
> +
> +select heap_force_freeze('ttab'::regclass, ARRAY['(0, 1)']::tid[]);
> + heap_force_freeze
> +-------------------
> +
> +(1 row)
> +
>
> I don't understand the point of this. You're not testing the function
> on the TOAST table; you're testing it on the main table when there
> happens to be a TOAST table that is probably getting used for
> something. But that's not really relevant to what is being tested
> here, so as written this seems redundant with the previous cases.
>
Yeah, it's being tested on the main table, not on a toast table. I've
removed this test-case and also restricted direct access to the toast
table using heap_force_kill/freeze functions. I think we shouldn't be
using these functions to do any changes in the toast table. We will
only use these functions with the main table and let VACUUM remove the
corresponding data chunks (pointed by the tuple that got removed from
the main table).

Another option would be to identify all the data chunks corresponding
to the tuple (ctid) being killed from the main table and remove them
one by one. We will only do this if the tuple from the main table that
has been marked as killed has external storage. We will have to add
a bunch of code for this otherwise we can let VACUUM do this for us.
Let me know your thoughts on this.

> +-- test pg_surgery functions with the unsupported relations. Should fail.
>
> Please name the specific functions being tested here in case we add
> more in the future that are tested separately.
>

Done.

> +++ b/contrib/pg_surgery/heap_surgery_funcs.c
>
> I think we could drop _funcs from the file name.
>

Done.

> +#ifdef PG_MODULE_MAGIC
> +PG_MODULE_MAGIC;
> +#endif
>
> The #ifdef here is not required, and if you look at other contrib
> modules you'll see that they don't have it.
>

Okay, done.

> I don't like all the macros at the top of the file much. CHECKARRVALID
> is only used in one place, so it seems to me that you might as well
> just inline it and lose the macro. Likewise for SORT and ARRISEMPTY.
>

Done.

> Once you do that, heap_force_common() can just compute the number of
> array elements once, instead of doing it once inside ARRISEMPTY, then
> again inside SORT, and then a third time to initialize ntids. You can
> just have a local variable in that function that is set once and then
> used as needed. Then you are only doing ARRNELEMS once, so you can get
> rid of that macro too. The same technique can be used to get rid of
> ARRPTR. So then all the macros go away without introducing any code
> duplication.
>

Done.

> +/* Options to forcefully change the state of a heap tuple. */
> +typedef enum HTupleForceOption
> +{
> + FORCE_KILL,
> + FORCE_FREEZE
> +} HTupleForceOption;
>
> I suggest un-abbreviating HTuple to HeapTuple and un-abbreviating the
> enum members to HEAP_FORCE_KILL and HEAP_FORCE_FREEZE.

Done.

Also, how about
> option -> operation?
>

I think both look okay to me.

> + return heap_force_common(fcinfo, FORCE_KILL);
>
> I think it might be more idiomatic to use PG_RETURN_DATUM here.  I
> know it's the same thing, though, and perhaps I'm even wrong about the
> prevailing style.
>

Done.

> + Assert(force_opt == FORCE_KILL || force_opt == FORCE_FREEZE);
>
> I think this is unnecessary. It's an enum with 2 values.
>

Removed.

> + if (ARRISEMPTY(ta))
> + ereport(ERROR,
> + (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> + errmsg("empty tid array")));
>
> I don't see why this should be an error. Why can't we just continue
> normally and if it does nothing, it does nothing? You'd need to change
> the do..while loop to a while loop so that the end condition is tested
> at the top, but that seems fine.
>
I think it's okay to have this check, so I've left it as-is. We do
have such checks in other contrib modules as well, wherever an array
is passed as input to a function.

> + rel = relation_open(relid, AccessShareLock);
>
> Maybe we should take RowExclusiveLock, since we are going to modify
> rows. Not sure how much it matters, though.
>

I don't know how much difference it makes, but, as you suggested, I've
replaced AccessShareLock with RowExclusiveLock.

> + if (!superuser() && GetUserId() != rel->rd_rel->relowner)
> + ereport(ERROR,
> + (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
> + errmsg("must be superuser or object owner to use %s.",
> + force_opt == FORCE_KILL ? "heap_force_kill()" :
> + "heap_force_freeze()")));
>
> This is the wrong way to do a permissions check, and it's also the
> wrong way to write an error message about having failed one. To see
> the correct method, grep for pg_class_aclcheck. However, I think that
> we shouldn't in general trust the object owner to do this, unless the
> super-user gave permission. This is a data-corrupting operation, and
> only the boss is allowed to authorize it. So I think you should also
> add REVOKE EXECUTE FROM PUBLIC statements to the SQL file, and then
> have this check as a backup. Then, the superuser is always allowed,
> and if they choose to GRANT EXECUTE on this function to some users,
> those users can do it for their own relations, but not others.
>
Done.

> + if (rel->rd_rel->relam != HEAP_TABLE_AM_OID)
> + ereport(ERROR,
> + (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
> + errmsg("only heap AM is supported")));
> +
> + check_relation_relkind(rel);
>
> Seems like these checks are in the wrong order.

I don't think there is anything wrong with the order. I can see the
same order in other contrib modules as well.

Also, maybe you could
> call the function something like check_relation_ok() and put the
> permissions test, the relkind test, and the relam test all inside of
> it, just to tighten up the code in this main function a bit.
>

Yeah, I've added a couple of functions named sanity_check_relation and
sanity_check_tid_array and shifted all the sanity checks there.

> + if (noffs > maxoffset)
> + ereport(ERROR,
> + (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> + errmsg("number of offsets specified for block %u exceeds the max
> offset number %u",
> + blkno, maxoffset)));
>
> Hmm, this doesn't seem quite right. The actual problem is if an
> individual item pointer's offset number is greater than maxoffset,
> which can be true even if the total number of offsets is less than
> maxoffset. So I think you need to remove this check and add a check
> inside the loop which follows that offnos[i] is in range.
>
Agreed and done.

> The way you've structured that loop is actually problematic -- I don't
> think we want to be calling elog() or ereport() inside a critical
> section. You could fix the case that checks for an invalid force_opt
> by just doing if (op == HEAP_FORCE_KILL) { ... } else { Assert(op ==
> HEAP_FORCE_FREEZE); ... }, or by using a switch with no default. The
> NOTICE case you have here is a bigger problem.

Done.

> You really cannot
> modify the buffer like this and then decide, oops, never mind, I think
> I won't mark it dirty or write WAL for the changes. If you do that,
> the buffer is still in memory, but it's now been modified. A
> subsequent operation that modifies it will start with the altered
> state you created here, quite possibly leading to WAL that cannot be
> correctly replayed on the standby. In other words, you've got to
> decide for certain whether you want to proceed with the operation
> *before* you enter the critical section. You also need to emit any
> messages before or after the critical section.  So you could:
>
This is still not clear. I think Robert needs to respond to my earlier comment.

> I believe this violates our guidelines on message construction. Have
> two completely separate messages -- and maybe lose the word "already":
>
> "skipping tid (%u, %u) because it is dead"
> "skipping tid (%u, %u) because it is unused"
>
> The point of this is that it makes it easier for translators.
>

Done.

> I see very little point in what verify_tid() is doing. Before using
> each block number, we should check that it's less than or equal to a
> cached value of RelationGetNumberOfBlocks(rel). That's necessary in
> any case to avoid funny errors; and then the check here against
> specifically InvalidBlockNumber is redundant. For the offset number,
> same thing: we need to check each offset against the page's
> PageGetMaxOffsetNumber(page); and if we do that then we don't need
> these checks.
>

Done.

Please check the attached patch for the changes.

--
With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com

Attachment: v3-0001-Add-contrib-pg_surgery-to-perform-surgery-on-a-damag.patch (24K)

Re: recovering from "found xmin ... from before relfrozenxid ..."

Robert Haas
In reply to this post by Ashutosh Sharma
On Mon, Aug 3, 2020 at 12:13 PM Ashutosh Sharma <[hidden email]> wrote:
> If the above path is taken that means none of the items in the page
> got changed.

Oops. I didn't realize that, sorry. Maybe it would be a little more
clear if instead of "int nSkippedItems" you had "bool
did_modify_page"? Then you could initialize it to false and set it to
true just before doing the page modifications.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: recovering from "found xmin ... from before relfrozenxid ..."

Robert Haas
In reply to this post by Ashutosh Sharma
On Wed, Aug 5, 2020 at 9:42 AM Ashutosh Sharma <[hidden email]> wrote:
> Yeah, it's being tested on the main table, not on a toast table. I've
> removed this test-case and also restricted direct access to the toast
> table using heap_force_kill/freeze functions. I think we shouldn't be
> using these functions to do any changes in the toast table. We will
> only use these functions with the main table and let VACUUM remove the
> corresponding data chunks (pointed by the tuple that got removed from
> the main table).

I agree with removing the test case, but I disagree with restricting
this from being used on the TOAST table. These are tools for experts,
who may use them as they see fit. It's unlikely that it would be
useful to use this on a TOAST table, I think, but not impossible.

> Another option would be to identify all the data chunks corresponding
> to the tuple (ctid) being killed from the main table and remove them
> one by one. We will only do this if the tuple from the main table that
> has been marked as killed has external storage. We will have to add
> a bunch of code for this otherwise we can let VACUUM do this for us.
> Let me know your thoughts on this.

I don't think VACUUM will do anything for us automatically -- it isn't
going to know that we force-killed the tuple in the main table.
Normally, a tuple delete would have to set xmax on the TOAST tuples
and then VACUUM would do its thing, but in this case that won't
happen. So if you force-kill a tuple in the main table you would end
up with a space leak in the TOAST table.

The problem here is that one reason you might force-kill a tuple in
the main table is that it's full of garbage. If so, trying to
decode the tuple so that you can find the TOAST pointers might crash
or error out, or maybe that part will work but then you'll error out
trying to look up the corresponding TOAST tuples, either because the
values are not valid or because the TOAST table itself is generally
hosed in some way. So I think it is probably best if we keep this tool
as simple as possible, with as few dependencies as we can, and
document the possible negative outcomes of using it. It's not
impossible to recover from a space-leak like this; you can always move
the data into a new table with CTAS and then drop the old one. Not
sure whether CLUSTER or VACUUM FULL would also be sufficient.

Separately, we might want to add a TOAST-checker to amcheck, or
enhance the heap-checker Mark is working on, and one of the things it
could do is check for TOAST entries to which nothing points. Then if
you force-kill tuples in the main table you could also use that tool
to look for things in the TOAST table that ought to also be
force-killed.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: recovering from "found xmin ... from before relfrozenxid ..."

Ashutosh Sharma
In reply to this post by Robert Haas
On Thu, Aug 6, 2020 at 1:04 AM Robert Haas <[hidden email]> wrote:

>
> On Mon, Aug 3, 2020 at 12:13 PM Ashutosh Sharma <[hidden email]> wrote:
> > If the above path is taken that means none of the items in the page
> > got changed.
>
> Oops. I didn't realize that, sorry. Maybe it would be a little more
> clear if instead of "int nSkippedItems" you had "bool
> did_modify_page"? Then you could initialize it to false and set it to
> true just before doing the page modifications.
>

Okay, np. In that case, as you suggested, I will replace "int
nSkippedItems" with "did_modify_page" to increase clarity. I will
make this change in the next version of the patch. Thanks.

--
With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com



Re: recovering from "found xmin ... from before relfrozenxid ..."

Ashutosh Sharma
In reply to this post by Robert Haas
On Thu, Aug 6, 2020 at 1:29 AM Robert Haas <[hidden email]> wrote:

>
> On Wed, Aug 5, 2020 at 9:42 AM Ashutosh Sharma <[hidden email]> wrote:
> > Yeah, it's being tested on the main table, not on a toast table. I've
> > removed this test-case and also restricted direct access to the toast
> > table using heap_force_kill/freeze functions. I think we shouldn't be
> > using these functions to do any changes in the toast table. We will
> > only use these functions with the main table and let VACUUM remove the
> > corresponding data chunks (pointed by the tuple that got removed from
> > the main table).
>
> I agree with removing the test case, but I disagree with restricting
> this from being used on the TOAST table. These are tools for experts,
> who may use them as they see fit. It's unlikely that it would be
> useful to use this on a TOAST table, I think, but not impossible.
>

Okay, if you want, I can remove the restriction on a toast table, but
then that means a user can directly remove the data chunks from the
toast table without changing anything in the main table. This means we
won't be able to query the main table, as it will fail with an error
like "ERROR:  unexpected chunk number ...". So, we will have to find
some way to identify the pointer to the chunks that got deleted from
the toast table and remove that pointer from the main table. We also
need to make sure that before we remove a tuple (pointer) from the
main table, we identify all the remaining data chunks pointed to by
this tuple and remove them completely; only then would the table be
considered to be in a good state. Now, I am not sure if we can
currently do all these things.

> > Another option would be to identify all the data chunks corresponding
> > to the tuple (ctid) being killed from the main table and remove them
> > one by one. We will only do this if the tuple from the main table that
> > has been marked as killed has an external storage. We will have to add
> > a bunch of code for this otherwise we can let VACUUM do this for us.
> > Let me know your thoughts on this.
>
> I don't think VACUUM will do anything for us automatically -- it isn't
> going to know that we force-killed the tuple in the main table.
> Normally, a tuple delete would have to set xmax on the TOAST tuples
> and then VACUUM would do its thing, but in this case that won't
> happen. So if you force-kill a tuple in the main table you would end
> up with a space leak in the TOAST table.
>
> The problem here is that one reason you might force-kill a tuple in
> the main table is that it's full of garbage. If so, trying to
> decode the tuple so that you can find the TOAST pointers might crash
> or error out, or maybe that part will work but then you'll error out
> trying to look up the corresponding TOAST tuples, either because the
> values are not valid or because the TOAST table itself is generally
> hosed in some way. So I think it is probably best if we keep this tool
> as simple as possible, with as few dependencies as we can, and
> document the possible negative outcomes of using it.

I completely agree with you.

It's not
> impossible to recover from a space-leak like this; you can always move
> the data into a new table with CTAS and then drop the old one. Not
> sure whether CLUSTER or VACUUM FULL would also be sufficient.
>

Yeah, I think, we can either use CTAS or VACUUM FULL, both look fine.

> Separately, we might want to add a TOAST-checker to amcheck, or
> enhance the heap-checker Mark is working on, and one of the things it
> could do is check for TOAST entries to which nothing points. Then if
> you force-kill tuples in the main table you could also use that tool
> to look for things in the TOAST table that ought to also be
> force-killed.
>

Okay, good to know that. Thanks for sharing this info.

--
With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com



Re: recovering from "found xmin ... from before relfrozenxid ..."

Masahiko Sawada-2
In reply to this post by Ashutosh Sharma
On Wed, 5 Aug 2020 at 22:42, Ashutosh Sharma <[hidden email]> wrote:

>
> Hi Robert,
>
> Thanks for the review. Please find my comments inline:
>
> On Sat, Aug 1, 2020 at 12:18 AM Robert Haas <[hidden email]> wrote:
> >
> > On Fri, Jul 31, 2020 at 8:52 AM Ashutosh Sharma <[hidden email]> wrote:
> > > Attached is the new version of patch that addresses the comments from Andrey and Beena.
> >
> > +PGFILEDESC = "pg_surgery - perform surgery on the damaged heap table"
> >
> > the -> a
> >
> > I also suggest: heap table -> relation, because we might want to
> > extend this to other cases later.
> >
>
> Corrected.
>
> > +-- toast table.
> > +begin;
> > +create table ttab(a text);
> > +insert into ttab select string_agg(chr(floor(random() * 26)::int +
> > 65), '') from generate_series(1,10000);
> > +select * from ttab where xmin = 2;
> > + a
> > +---
> > +(0 rows)
> > +
> > +select heap_force_freeze('ttab'::regclass, ARRAY['(0, 1)']::tid[]);
> > + heap_force_freeze
> > +-------------------
> > +
> > +(1 row)
> > +
> >
> > I don't understand the point of this. You're not testing the function
> > on the TOAST table; you're testing it on the main table when there
> > happens to be a TOAST table that is probably getting used for
> > something. But that's not really relevant to what is being tested
> > here, so as written this seems redundant with the previous cases.
> >
>
> Yeah, it's being tested on the main table, not on a toast table. I've
> removed this test-case and also restricted direct access to the toast
> table using heap_force_kill/freeze functions. I think we shouldn't be
> using these functions to do any changes in the toast table. We will
> only use these functions with the main table and let VACUUM remove the
> corresponding data chunks (pointed by the tuple that got removed from
> the main table).
>
> Another option would be to identify all the data chunks corresponding
> to the tuple (ctid) being killed from the main table and remove them
> one by one. We will only do this if the tuple from the main table that
> has been marked as killed has an external storage. We will have to add
> a bunch of code for this otherwise we can let VACUUM do this for us.
> Let me know your thoughts on this.
>
> > +-- test pg_surgery functions with the unsupported relations. Should fail.
> >
> > Please name the specific functions being tested here in case we add
> > more in the future that are tested separately.
> >
>
> Done.
>
> > +++ b/contrib/pg_surgery/heap_surgery_funcs.c
> >
> > I think we could drop _funcs from the file name.
> >
>
> Done.
>
> > +#ifdef PG_MODULE_MAGIC
> > +PG_MODULE_MAGIC;
> > +#endif
> >
> > The #ifdef here is not required, and if you look at other contrib
> > modules you'll see that they don't have it.
> >
>
> Okay, done.
>
> > I don't like all the macros at the top of the file much. CHECKARRVALID
> > is only used in one place, so it seems to me that you might as well
> > just inline it and lose the macro. Likewise for SORT and ARRISEMPTY.
> >
>
> Done.
>
> > Once you do that, heap_force_common() can just compute the number of
> > array elements once, instead of doing it once inside ARRISEMPTY, then
> > again inside SORT, and then a third time to initialize ntids. You can
> > just have a local variable in that function that is set once and then
> > used as needed. Then you are only doing ARRNELEMS once, so you can get
> > rid of that macro too. The same technique can be used to get rid of
> > ARRPTR. So then all the macros go away without introducing any code
> > duplication.
> >
>
> Done.
>
> > +/* Options to forcefully change the state of a heap tuple. */
> > +typedef enum HTupleForceOption
> > +{
> > + FORCE_KILL,
> > + FORCE_FREEZE
> > +} HTupleForceOption;
> >
> > I suggest un-abbreviating HTuple to HeapTuple and un-abbreviating the
> > enum members to HEAP_FORCE_KILL and HEAP_FORCE_FREE.
>
> Done.
>
> Also, how about
> > option -> operation?
> >
>
> I think both look okay to me.
>
> > + return heap_force_common(fcinfo, FORCE_KILL);
> >
> > I think it might be more idiomatic to use PG_RETURN_DATUM here.  I
> > know it's the same thing, though, and perhaps I'm even wrong about the
> > prevailing style.
> >
>
> Done.
>
> > + Assert(force_opt == FORCE_KILL || force_opt == FORCE_FREEZE);
> >
> > I think this is unnecessary. It's an enum with 2 values.
> >
>
> Removed.
>
> > + if (ARRISEMPTY(ta))
> > + ereport(ERROR,
> > + (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> > + errmsg("empty tid array")));
> >
> > I don't see why this should be an error. Why can't we just continue
> > normally and if it does nothing, it does nothing? You'd need to change
> > the do..while loop to a while loop so that the end condition is tested
> > at the top, but that seems fine.
> >
>
> I think it's okay to have this check. So, just left it as-is. We do
> have such checks in other contrib modules as well wherever the array
> is being passed as an input to the function.
>
> > + rel = relation_open(relid, AccessShareLock);
> >
> > Maybe we should take RowExclusiveLock, since we are going to modify
> > rows. Not sure how much it matters, though.
> >
>
> I don't know how it would make a difference, but still as you said
> replaced AccessShare with RowExclusive.
>
> > + if (!superuser() && GetUserId() != rel->rd_rel->relowner)
> > + ereport(ERROR,
> > + (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
> > + errmsg("must be superuser or object owner to use %s.",
> > + force_opt == FORCE_KILL ? "heap_force_kill()" :
> > + "heap_force_freeze()")));
> >
> > This is the wrong way to do a permissions check, and it's also the
> > wrong way to write an error message about having failed one. To see
> > the correct method, grep for pg_class_aclcheck. However, I think that
> > we shouldn't in general trust the object owner to do this, unless the
> > super-user gave permission. This is a data-corrupting operation, and
> > only the boss is allowed to authorize it. So I think you should also
> > add REVOKE EXECUTE FROM PUBLIC statements to the SQL file, and then
> > have this check as a backup. Then, the superuser is always allowed,
> > and if they choose to GRANT EXECUTE on this function to some users,
> > those users can do it for their own relations, but not others.
> >
>
> Done.
>
> > + if (rel->rd_rel->relam != HEAP_TABLE_AM_OID)
> > + ereport(ERROR,
> > + (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
> > + errmsg("only heap AM is supported")));
> > +
> > + check_relation_relkind(rel);
> >
> > Seems like these checks are in the wrong order.
>
> I don't think there is anything wrong with the order. I can see the
> same order in other contrib modules as well.
>
> Also, maybe you could
> > call the function something like check_relation_ok() and put the
> > permissions test, the relkind test, and the relam test all inside of
> > it, just to tighten up the code in this main function a bit.
> >
>
> Yeah, I've added a couple of functions named sanity_check_relation and
> sanity_check_tid_array and shifted all the sanity checks there.
>
> > + if (noffs > maxoffset)
> > + ereport(ERROR,
> > + (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> > + errmsg("number of offsets specified for block %u exceeds the max
> > offset number %u",
> > + blkno, maxoffset)));
> >
> > Hmm, this doesn't seem quite right. The actual problem is if an
> > individual item pointer's offset number is greater than maxoffset,
> > which can be true even if the total number of offsets is less than
> > maxoffset. So I think you need to remove this check and add a check
> > inside the loop which follows that offnos[i] is in range.
> >
>
> Agreed and done.
>
> > The way you've structured that loop is actually problematic -- I don't
> > think we want to be calling elog() or ereport() inside a critical
> > section. You could fix the case that checks for an invalid force_opt
> > by just doing if (op == HEAP_FORCE_KILL) { ... } else { Assert(op ==
> > HEAP_FORCE_FREEZE); ... }, or by using a switch with no default. The
> > NOTICE case you have here is a bigger problem.
>
> Done.
>
> You really cannot
> > modify the buffer like this and then decide, oops, never mind, I think
> > I won't mark it dirty or write WAL for the changes. If you do that,
> > the buffer is still in memory, but it's now been modified. A
> > subsequent operation that modifies it will start with the altered
> > state you created here, quite possibly leading to WAL that cannot be
> > correctly replayed on the standby. In other words, you've got to
> > decide for certain whether you want to proceed with the operation
> > *before* you enter the critical section. You also need to emit any
> > messages before or after the critical section.  So you could:
> >
>
> This is still not clear. I think Robert needs to respond to my earlier comment.
>
> > I believe this violates our guidelines on message construction. Have
> > two completely separate messages -- and maybe lose the word "already":
> >
> > "skipping tid (%u, %u) because it is dead"
> > "skipping tid (%u, %u) because it is unused"
> >
> > The point of this is that it makes it easier for translators.
> >
>
> Done.
>
> > I see very little point in what verify_tid() is doing. Before using
> > each block number, we should check that it's less than or equal to a
> > cached value of RelationGetNumberOfBlocks(rel). That's necessary in
> > any case to avoid funny errors; and then the check here against
> > specifically InvalidBlockNumber is redundant. For the offset number,
> > same thing: we need to check each offset against the page's
> > PageGetMaxOffsetNumber(page); and if we do that then we don't need
> > these checks.
> >
>
> Done.
>
> Please check the attached patch for the changes.

I also looked at this version patch and have some small comments:

+   Oid             relid = PG_GETARG_OID(0);
+   ArrayType      *ta = PG_GETARG_ARRAYTYPE_P_COPY(1);
+   ItemPointer     tids;
+   int             ntids;
+   Relation        rel;
+   Buffer          buf;
+   Page            page;
+   ItemId          itemid;
+   BlockNumber     blkno;
+   OffsetNumber   *offnos;
+   OffsetNumber    offno,
+                   noffs,
+                   curr_start_ptr,
+                   next_start_ptr,
+                   maxoffset;
+   int             i,
+                   nskippedItems,
+                   nblocks;

You declare all variables at the top of the heap_force_common()
function, but I think we can declare some variables, such as buf and
page, inside the do loop.

---
+           if (offnos[i] > maxoffset)
+           {
+               ereport(NOTICE,
+                        errmsg("skipping tid (%u, %u) because it
contains an invalid offset",
+                               blkno, offnos[i]));
+               continue;
+           }

If all tids on a page take the above path, we will end up logging an
FPI despite modifying nothing on the page.

---
+       /* XLOG stuff */
+       if (RelationNeedsWAL(rel))
+           log_newpage_buffer(buf, true);

I think we need to set the LSN returned by log_newpage_buffer() as the page LSN.

Regards,

--
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: recovering from "found xmin ... from before relfrozenxid ..."

rajkumar.raghuwanshi

I have been doing some user-level testing of this feature, apart from sanity tests for the extension and its functions.

I tried corrupting tuples and was then able to fix them using the heap_force_freeze/kill functions.


--corrupt relfrozenxid, causing vacuum to fail.
update pg_class set relfrozenxid = (relfrozenxid::text::integer + 10)::text::xid where relname = 'test_tbl';
UPDATE 1
insert into test_tbl values (2, 'BBB');

postgres=# vacuum test_tbl;
ERROR:  found xmin 507 from before relfrozenxid 516
CONTEXT:  while scanning block 0 of relation "public.test_tbl"

postgres=# select *, ctid, xmin, xmax from test_tbl;
 a |  b  | ctid  | xmin | xmax 
---+-----+-------+------+------
 1 | AAA | (0,1) |  505 |    0
 2 | BBB | (0,2) |  507 |    0
(2 rows)

--fixed using heap_force_freeze
postgres=# select heap_force_freeze('test_tbl'::regclass, ARRAY['(0,2)']::tid[]);
 heap_force_freeze 
-------------------
 
(1 row)

postgres=# vacuum test_tbl;
VACUUM
postgres=# select *, ctid, xmin, xmax from test_tbl;
 a |  b  | ctid  | xmin | xmax 
---+-----+-------+------+------
 1 | AAA | (0,1) |  505 |    0
 2 | BBB | (0,2) |    2 |    0
(2 rows)

--corrupt tuple headers in the table's base/<oid> file, causing table access to fail.
postgres=# select ctid, * from test_tbl;
ERROR:  could not access status of transaction 4294967295
DETAIL:  Could not open file "pg_xact/0FFF": No such file or directory.

--removed the corrupted tuple using heap_force_kill
postgres=# select heap_force_kill('test_tbl'::regclass, ARRAY['(0,2)']::tid[]);
 heap_force_kill 
-----------------
 
(1 row)

postgres=# select ctid, * from test_tbl;
 ctid  | a |  b  
-------+---+-----
 (0,1) | 1 | AAA
(1 row)


I will continue my testing with the latest patch and will update here if I find anything.


Thanks & Regards,
Rajkumar Raghuwanshi





Re: recovering from "found xmin ... from before relfrozenxid ..."

Ashutosh Sharma
In reply to this post by Masahiko Sawada-2
Hello Masahiko-san,

Thanks for looking into the patch. Please find my comments inline below:

On Thu, Aug 6, 2020 at 1:42 PM Masahiko Sawada
<[hidden email]> wrote:

>
> On Wed, 5 Aug 2020 at 22:42, Ashutosh Sharma <[hidden email]> wrote:
> > Please check the attached patch for the changes.
>
> I also looked at this version patch and have some small comments:
>
> +   Oid             relid = PG_GETARG_OID(0);
> +   ArrayType      *ta = PG_GETARG_ARRAYTYPE_P_COPY(1);
> +   ItemPointer     tids;
> +   int             ntids;
> +   Relation        rel;
> +   Buffer          buf;
> +   Page            page;
> +   ItemId          itemid;
> +   BlockNumber     blkno;
> +   OffsetNumber   *offnos;
> +   OffsetNumber    offno,
> +                   noffs,
> +                   curr_start_ptr,
> +                   next_start_ptr,
> +                   maxoffset;
> +   int             i,
> +                   nskippedItems,
> +                   nblocks;
>
> You declare all variables at the top of heap_force_common() function
> but I think we can declare some variables such as buf, page inside of
> the do loop.
>

Sure, I will do this in the next version of patch.

> ---
> +           if (offnos[i] > maxoffset)
> +           {
> +               ereport(NOTICE,
> +                        errmsg("skipping tid (%u, %u) because it
> contains an invalid offset",
> +                               blkno, offnos[i]));
> +               continue;
> +           }
>
> If all tids on a page take the above path, we will end up logging FPI
> in spite of modifying nothing on the page.
>

Yeah, that's right. I've taken care of this in the new version of the
patch, which I am yet to share.

> ---
> +       /* XLOG stuff */
> +       if (RelationNeedsWAL(rel))
> +           log_newpage_buffer(buf, true);
>
> I think we need to set the returned LSN by log_newpage_buffer() to the page lsn.
>

I think we are already setting the page LSN in log_newpage(), which
is called from log_newpage_buffer(). So, AFAIU, no change would
be required here. Please let me know if I am missing something.

--
With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com



Re: recovering from "found xmin ... from before relfrozenxid ..."

Masahiko Sawada-2
On Thu, 6 Aug 2020 at 18:05, Ashutosh Sharma <[hidden email]> wrote:

>
> Hello Masahiko-san,
>
> Thanks for looking into the patch. Please find my comments inline below:
>
> On Thu, Aug 6, 2020 at 1:42 PM Masahiko Sawada
> <[hidden email]> wrote:
> >
> > On Wed, 5 Aug 2020 at 22:42, Ashutosh Sharma <[hidden email]> wrote:
> > > Please check the attached patch for the changes.
> >
> > I also looked at this version patch and have some small comments:
> >
> > +   Oid             relid = PG_GETARG_OID(0);
> > +   ArrayType      *ta = PG_GETARG_ARRAYTYPE_P_COPY(1);
> > +   ItemPointer     tids;
> > +   int             ntids;
> > +   Relation        rel;
> > +   Buffer          buf;
> > +   Page            page;
> > +   ItemId          itemid;
> > +   BlockNumber     blkno;
> > +   OffsetNumber   *offnos;
> > +   OffsetNumber    offno,
> > +                   noffs,
> > +                   curr_start_ptr,
> > +                   next_start_ptr,
> > +                   maxoffset;
> > +   int             i,
> > +                   nskippedItems,
> > +                   nblocks;
> >
> > You declare all variables at the top of heap_force_common() function
> > but I think we can declare some variables such as buf, page inside of
> > the do loop.
> >
>
> Sure, I will do this in the next version of the patch.
>
> > ---
> > +           if (offnos[i] > maxoffset)
> > +           {
> > +               ereport(NOTICE,
> > +                        errmsg("skipping tid (%u, %u) because it
> > contains an invalid offset",
> > +                               blkno, offnos[i]));
> > +               continue;
> > +           }
> >
> > If all tids on a page take the above path, we will end up logging FPI
> > in spite of modifying nothing on the page.
> >
>
> Yeah, that's right. I've taken care of this in the new version of
> the patch, which I have yet to share.
>
> > ---
> > +       /* XLOG stuff */
> > +       if (RelationNeedsWAL(rel))
> > +           log_newpage_buffer(buf, true);
> >
> > I think we need to set the returned LSN by log_newpage_buffer() to the page lsn.
> >
>
> I think we are already setting the page LSN in log_newpage(), which is
> called from log_newpage_buffer(). So, AFAIU, no change should be
> required here. Please let me know if I am missing something.

You're right. I'd missed the comment on log_newpage_buffer(). Thanks!

Regards,

--
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

