recovering from "found xmin ... from before relfrozenxid ..."

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
65 messages Options
1234
Reply | Threaded
Open this post in threaded view
|

recovering from "found xmin ... from before relfrozenxid ..."

Robert Haas
Hi,

A number of EDB customers have had this error crop on their tables for
reasons that we have usually not been able to determine. In many
cases, it's probably down to things like running buggy old releases
for a long time before upgrading, or bad backup and recovery
procedures. It's more than possible that there are still-unfixed
server bugs, but I do not have any compelling evidence of such bugs at
this time. Unfortunately, once you're in this situation, it's kind of
hard to find your way out of it. There are a few problems:

1. There's nothing to identify the tuple that has the problem, and no
way to know how many more of them there might be. Back-patching
b61d161c146328ae6ba9ed937862d66e5c8b035a would help with the first
part of this.

2. In some other, similar situations, e.g. where the tuple data is
garbled, it's often possible to get out from under the problem by
deleting the tuple at issue. But I think that doesn't necessarily fix
anything in this case.

3. We've had some success with using a PL/plgsql loop with an
EXCEPTION block to extract all the accessible tuples from the table.
Then you can truncate the original table and reinsert the data. But
this is slow, so it stinks if the table is big, and it's not a viable
approach if the table in question is a system catalog table -- at
least if it's not if it's something critical like pg_class.

I realize somebody's probably going to say "well, you shouldn't try to
repair a database that's in this state, you shouldn't let it happen in
the first place, and if it does happen, you should track the root
cause to the ends of the earth." But I think that's a completely
impractical approach. I at least have no idea how I'm supposed to
figure out when and how a bad relfrozenxid ended up in the table, and
by the time the problem is discovered after an upgrade the problem
that caused it may be quite old. Moreover, not everyone is as
interested in an extended debugging exercise as they are in getting
the system working again, and VACUUM failing repeatedly is a pretty
serious problem.

Therefore, one of my colleagues has - at my request - created a couple
of functions called heap_force_kill() and heap_force_freeze() which
take an array of TIDs. The former truncates them all to dead line
pointers. The latter resets the infomask and xmin to make the xmin
frozen. (It should probably handle the xmax too; not sure that the
current version does that, but it's easily fixed if not.) The
intention is that you can use these to get either get rid of, or get
access to, tuples whose visibility information is corrupted for
whatever reason. These are pretty sharp tools; you could corrupt a
perfectly-good table by incautious use of them, or destroy a large
amount of data. You could, for example, force-freeze a tuple created
by a transaction which added a column, inserted data, and rolled back;
that would likely be disastrous. However, in the cases that I'm
thinking about, disaster has already struck, and something that you
can use to get things back to a saner state is better than just
leaving the table perpetually broken. Without something like this, the
backup plan is probably to shut down the server and try to edit the
pages using a perl script or something, but that seems clearly worse.

So I have these questions:

- Do people think it would me smart/good/useful to include something
like this in PostgreSQL?

- If so, how? I would propose a new contrib module that we back-patch
all the way, because the VACUUM errors were back-patched all the way,
and there seems to be no advantage in making people wait 5 years for a
new version that has some kind of tooling in this area.

- Any ideas for additional things we should include, or improvements
on the sketch above?

Thanks,

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Reply | Threaded
Open this post in threaded view
|

Re: recovering from "found xmin ... from before relfrozenxid ..."

Stephen Frost
Greetings,

* Robert Haas ([hidden email]) wrote:
> - Do people think it would me smart/good/useful to include something
> like this in PostgreSQL?

Absolutely, yes.

> - If so, how? I would propose a new contrib module that we back-patch
> all the way, because the VACUUM errors were back-patched all the way,
> and there seems to be no advantage in making people wait 5 years for a
> new version that has some kind of tooling in this area.

While I agree that this would be a good and useful new contrib module to
have, I don't think it would be appropriate to back-patch it into PG
formally.

Unfortunately, that gets into the discussion that's cropped up on a few
other threads of late- that we don't have a good place to put extensions
which are well maintained/recommended by core PG hackers, and which are
able to work with lots of different versions of PG, and are versioned
and released independently of PG (and, ideally, built for all the
versions of PG that we distribute through our packages).

Given the lack of such a place today, I'd at least suggest starting with
proposing it as a new contrib module for v14.

> - Any ideas for additional things we should include, or improvements
> on the sketch above?

Not right off-hand, but will think about it, there could certainly be a
lot of very interesting tools in such a toolbox.

Thanks!

Stephen

signature.asc (836 bytes) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: recovering from "found xmin ... from before relfrozenxid ..."

Tom Lane-2
Stephen Frost <[hidden email]> writes:
> * Robert Haas ([hidden email]) wrote:
>> - If so, how? I would propose a new contrib module that we back-patch
>> all the way, because the VACUUM errors were back-patched all the way,
>> and there seems to be no advantage in making people wait 5 years for a
>> new version that has some kind of tooling in this area.

> While I agree that this would be a good and useful new contrib module to
> have, I don't think it would be appropriate to back-patch it into PG
> formally.

Yeah, I don't care for that either.  That's a pretty huge violation of our
normal back-patching rules, and I'm not convinced that it's justified.

No objection to adding it as a new contrib module.

                        regards, tom lane


Reply | Threaded
Open this post in threaded view
|

Re: recovering from "found xmin ... from before relfrozenxid ..."

Peter Geoghegan-4
In reply to this post by Robert Haas
On Mon, Jul 13, 2020 at 2:12 PM Robert Haas <[hidden email]> wrote:
> 1. There's nothing to identify the tuple that has the problem, and no
> way to know how many more of them there might be. Back-patching
> b61d161c146328ae6ba9ed937862d66e5c8b035a would help with the first
> part of this.

I am in favor of backpatching such changes in cases where senior
community members feel that it could help with hypothetical
undiscovered data corruption issues -- if they're willing to take
responsibility for the change. It certainly wouldn't be the first
time. A "defense in depth" mindset seems like the right one when it
comes to data corruption bugs. Early detection is really important.

> Moreover, not everyone is as
> interested in an extended debugging exercise as they are in getting
> the system working again, and VACUUM failing repeatedly is a pretty
> serious problem.

That's absolutely consistent with my experience. Most users want to
get back to business as usual now, while letting somebody else do the
hard work of debugging.

> Therefore, one of my colleagues has - at my request - created a couple
> of functions called heap_force_kill() and heap_force_freeze() which
> take an array of TIDs.

> So I have these questions:
>
> - Do people think it would me smart/good/useful to include something
> like this in PostgreSQL?

I'm in favor of it.

> - If so, how? I would propose a new contrib module that we back-patch
> all the way, because the VACUUM errors were back-patched all the way,
> and there seems to be no advantage in making people wait 5 years for a
> new version that has some kind of tooling in this area.

I'm in favor of it being *possible* to backpatch tooling that is
clearly related to correctness in a fundamental way. Obviously this
would mean that we'd be revising our general position on backpatching
to allow some limited exceptions around corruption. I'm not sure that
this meets that standard, though. It's hardly something that we can
expect all that many users to be able to use effectively.

I may be biased, but I'd be inclined to permit it in the case of
something like amcheck, or pg_visibility, on the grounds that they're
more or less the same as the new VACUUM errcontext instrumentation you
mentioned. The same cannot be said of something like this new
heap_force_kill() stuff.

> - Any ideas for additional things we should include, or improvements
> on the sketch above?

Clearly you should work out a way of making it very hard to
accidentally (mis)use. For example, maybe you make the functions check
for the presence of a sentinel file in the data directory.


--
Peter Geoghegan


Reply | Threaded
Open this post in threaded view
|

Re: recovering from "found xmin ... from before relfrozenxid ..."

Andres Freund
In reply to this post by Robert Haas
Hi,

On 2020-07-13 17:12:18 -0400, Robert Haas wrote:
> 1. There's nothing to identify the tuple that has the problem, and no
> way to know how many more of them there might be. Back-patching
> b61d161c146328ae6ba9ed937862d66e5c8b035a would help with the first
> part of this.

Not fully, I'm afraid. Afaict it doesn't currently tell you the item
pointer offset, just the block numer, right? We probably should extend
it to also include the offset...


> 2. In some other, similar situations, e.g. where the tuple data is
> garbled, it's often possible to get out from under the problem by
> deleting the tuple at issue. But I think that doesn't necessarily fix
> anything in this case.

Huh, why not? That worked in the cases I saw.


> Therefore, one of my colleagues has - at my request - created a couple
> of functions called heap_force_kill() and heap_force_freeze() which
> take an array of TIDs. The former truncates them all to dead line
> pointers. The latter resets the infomask and xmin to make the xmin
> frozen. (It should probably handle the xmax too; not sure that the
> current version does that, but it's easily fixed if not.)

xmax is among the problematic cases IIRC, so yes, it'd be good to fix
that.

Greetings,

Andres Freund


Reply | Threaded
Open this post in threaded view
|

Re: recovering from "found xmin ... from before relfrozenxid ..."

Robert Haas
In reply to this post by Tom Lane-2
On Mon, Jul 13, 2020 at 6:15 PM Tom Lane <[hidden email]> wrote:
> Yeah, I don't care for that either.  That's a pretty huge violation of our
> normal back-patching rules, and I'm not convinced that it's justified.

I think that our normal back-patching rules are based primarily on the
risk of breaking things, and a new contrib module carries a pretty
negligible risk of breaking anything that works today. I wouldn't
propose to back-patch something on those grounds just as a way of
delivering a new feature more quickly, but that's not the intention
here. At least in my experience, un-VACUUM-able tables have gotten
several orders of magnitude more common since Andres put those changes
in. As far as I can recall, EDB has not had this many instances of
different customers reporting the same problem since the 9.3-era
multixact issues. So far, this does not rise to that level, but it is
by no means a negligible issue, either. I believe it deserves to be
taken quite seriously, especially because the existing options for
helping customers with this kind of problem are so limited.

Now, if this goes into v14, we can certainly stick it up on github, or
put it out there in some other way for users to download,
self-compile, and install, but that seems noticeably less convenient
for people who need it, and I'm not clear what the benefit to the
project is.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Reply | Threaded
Open this post in threaded view
|

Re: recovering from "found xmin ... from before relfrozenxid ..."

Robert Haas
In reply to this post by Andres Freund
On Mon, Jul 13, 2020 at 6:38 PM Andres Freund <[hidden email]> wrote:
> Not fully, I'm afraid. Afaict it doesn't currently tell you the item
> pointer offset, just the block numer, right? We probably should extend
> it to also include the offset...

Oh, I hadn't realized that limitation. That would be good to fix. It
would be even better, I think, if we could have VACUUM proceed with
the rest of vacuuming the table, emitting warnings about each
instance, instead of blowing up when it hits the first bad tuple, but
I think you may have told me sometime that doing so would be, uh, less
than straightforward. We probably should refuse to update
relfrozenxid/relminmxid when this is happening, but I *think* it would
be better to still proceed with dead tuple cleanup as far as we can,
or at least have an option to enable that behavior. I'm not positive
about that, but not being able to complete VACUUM at all is a FAR more
urgent problem than not being able to freeze, even though in the long
run the latter is more severe.

> > 2. In some other, similar situations, e.g. where the tuple data is
> > garbled, it's often possible to get out from under the problem by
> > deleting the tuple at issue. But I think that doesn't necessarily fix
> > anything in this case.
>
> Huh, why not? That worked in the cases I saw.

I'm not sure I've seen a case where that didn't work, but I don't see
a reason why it couldn't happen. Do you think the code is structured
in such a way that a deleted tuple is guaranteed to be pruned even if
the XID is old? What if clog has been truncated so that the xmin can't
be looked up?

> xmax is among the problematic cases IIRC, so yes, it'd be good to fix
> that.

Thanks for the input.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Reply | Threaded
Open this post in threaded view
|

Re: recovering from "found xmin ... from before relfrozenxid ..."

Tom Lane-2
Robert Haas <[hidden email]> writes:

> Oh, I hadn't realized that limitation. That would be good to fix. It
> would be even better, I think, if we could have VACUUM proceed with
> the rest of vacuuming the table, emitting warnings about each
> instance, instead of blowing up when it hits the first bad tuple, but
> I think you may have told me sometime that doing so would be, uh, less
> than straightforward. We probably should refuse to update
> relfrozenxid/relminmxid when this is happening, but I *think* it would
> be better to still proceed with dead tuple cleanup as far as we can,
> or at least have an option to enable that behavior. I'm not positive
> about that, but not being able to complete VACUUM at all is a FAR more
> urgent problem than not being able to freeze, even though in the long
> run the latter is more severe.

+1 for proceeding in this direction, rather than handing users tools
that they *will* hurt themselves with.

The more that I think about it, the more I think that the proposed
functions are tools for wizards only, and so I'm getting hesitant
about having them in contrib at all.  We lack a better place to
put them, but that doesn't mean they should be there.

                        regards, tom lane


Reply | Threaded
Open this post in threaded view
|

Re: recovering from "found xmin ... from before relfrozenxid ..."

Robert Haas
On Mon, Jul 13, 2020 at 8:58 PM Tom Lane <[hidden email]> wrote:

> Robert Haas <[hidden email]> writes:
> > Oh, I hadn't realized that limitation. That would be good to fix. It
> > would be even better, I think, if we could have VACUUM proceed with
> > the rest of vacuuming the table, emitting warnings about each
> > instance, instead of blowing up when it hits the first bad tuple, but
> > I think you may have told me sometime that doing so would be, uh, less
> > than straightforward. We probably should refuse to update
> > relfrozenxid/relminmxid when this is happening, but I *think* it would
> > be better to still proceed with dead tuple cleanup as far as we can,
> > or at least have an option to enable that behavior. I'm not positive
> > about that, but not being able to complete VACUUM at all is a FAR more
> > urgent problem than not being able to freeze, even though in the long
> > run the latter is more severe.
>
> +1 for proceeding in this direction, rather than handing users tools
> that they *will* hurt themselves with.
>
> The more that I think about it, the more I think that the proposed
> functions are tools for wizards only, and so I'm getting hesitant
> about having them in contrib at all.  We lack a better place to
> put them, but that doesn't mean they should be there.

It's not an either/or; it's a both/and. To recover from this problem,
you need to:

1. Be able to tell which tuples are affected.
2. Do something about it.

I think there are a number of strategies that we could pursue around
either of those things, and there are better and worse ways of
accomplishing them, but having one without the other isn't too great.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Reply | Threaded
Open this post in threaded view
|

Re: recovering from "found xmin ... from before relfrozenxid ..."

Andres Freund
In reply to this post by Robert Haas
Hi,

On 2020-07-13 20:47:10 -0400, Robert Haas wrote:
> On Mon, Jul 13, 2020 at 6:38 PM Andres Freund <[hidden email]> wrote:
> > Not fully, I'm afraid. Afaict it doesn't currently tell you the item
> > pointer offset, just the block numer, right? We probably should extend
> > it to also include the offset...
>
> Oh, I hadn't realized that limitation. That would be good to fix.

Yea. And it'd even be good if we were to to end up implementing your
suggestion below about continuing vacuuming other tuples.


> It would be even better, I think, if we could have VACUUM proceed with
> the rest of vacuuming the table, emitting warnings about each
> instance, instead of blowing up when it hits the first bad tuple, but
> I think you may have told me sometime that doing so would be, uh, less
> than straightforward.

Yea, it's not that simple to implement. Not impossible either.


> We probably should refuse to update relfrozenxid/relminmxid when this
> is happening, but I *think* it would be better to still proceed with
> dead tuple cleanup as far as we can, or at least have an option to
> enable that behavior. I'm not positive about that, but not being able
> to complete VACUUM at all is a FAR more urgent problem than not being
> able to freeze, even though in the long run the latter is more severe.

I'm hesitant to default to removing tuples once we've figured out that
something is seriously wrong. Could easy enough make us plow ahead and
delete valuable data on other tuples, even if we'd already detected
there's a problem. But I also see the problem you raise. That's not
academic, a number of multixact corruption issues the checks detected
IIRC weren't guaranteed to be caught.


> > > 2. In some other, similar situations, e.g. where the tuple data is
> > > garbled, it's often possible to get out from under the problem by
> > > deleting the tuple at issue. But I think that doesn't necessarily fix
> > > anything in this case.
> >
> > Huh, why not? That worked in the cases I saw.
>
> I'm not sure I've seen a case where that didn't work, but I don't see
> a reason why it couldn't happen. Do you think the code is structured
> in such a way that a deleted tuple is guaranteed to be pruned even if
> the XID is old?

I think so, leaving aside some temporary situations perhaps.


> What if clog has been truncated so that the xmin can't be looked up?

That's possible, but probably only in cases where xmin actually
committed.

Greetings,

Andres Freund


Reply | Threaded
Open this post in threaded view
|

Re: recovering from "found xmin ... from before relfrozenxid ..."

Robert Haas
In reply to this post by Tom Lane-2
On Mon, Jul 13, 2020 at 8:58 PM Tom Lane <[hidden email]> wrote:
> The more that I think about it, the more I think that the proposed
> functions are tools for wizards only, and so I'm getting hesitant
> about having them in contrib at all.  We lack a better place to
> put them, but that doesn't mean they should be there.

Also, I want to clarify that in a typical situation in which a
customer is facing this problem, I don't have any access to their
system. I basically never touch customer systems directly. Typically,
the customer sends us log files and a description of the problem and
their goals, and we send them back advice or instructions. So it's
impractical to imagine that this can be something where you have to
know the secret magic wizard password to get access to it. We'd just
have to give the customers who need to use this tool said password,
and then the jig is up - they can redistribute that password to all
the non-wizards on the Internet, if they so choose.

I understand that it's not too great when we give people access to
sharp tools and they hurt themselves with said tools. But this is open
source. That's how it goes.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Reply | Threaded
Open this post in threaded view
|

Re: recovering from "found xmin ... from before relfrozenxid ..."

Robert Haas
In reply to this post by Andres Freund
On Mon, Jul 13, 2020 at 9:10 PM Andres Freund <[hidden email]> wrote:
> > What if clog has been truncated so that the xmin can't be looked up?
>
> That's possible, but probably only in cases where xmin actually
> committed.

Isn't that the normal case? I'm imagining something like:

- Tuple gets inserted. Transaction commits.
- VACUUM processes table.
- Mischievous fairies mark page all-visible in the visibility map.
- VACUUM runs lots more times, relfrozenxid advances, but without ever
looking at the page in question, because it's all-visible.
- clog is truncated, rendering xmin no longer accessible.
- User runs VACUUM disabling page skipping, gets ERROR.
- User deletes offending tuple.
- At this point, I think the tuple is both invisible and unprunable?
- Fairies happy, user sad.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Reply | Threaded
Open this post in threaded view
|

Re: recovering from "found xmin ... from before relfrozenxid ..."

Tom Lane-2
In reply to this post by Robert Haas
Robert Haas <[hidden email]> writes:
> On Mon, Jul 13, 2020 at 8:58 PM Tom Lane <[hidden email]> wrote:
>> The more that I think about it, the more I think that the proposed
>> functions are tools for wizards only, and so I'm getting hesitant
>> about having them in contrib at all.  We lack a better place to
>> put them, but that doesn't mean they should be there.

> I understand that it's not too great when we give people access to
> sharp tools and they hurt themselves with said tools. But this is open
> source. That's how it goes.

I think you're attacking a straw man.  I'm well aware of how open source
works, thanks.  What I'm saying is that contrib is mostly seen to be
reasonably harmless stuff.  Sure, you can overwrite data you didn't want
to with adminpack's pg_file_write.  But that's the price of having such a
capability at all, and in general it's not hard for users to understand
both the uses and risks of that function.  That statement does not apply
to the functions being proposed here.  It doesn't seem like they could
possibly be safe to use without very specific expert advice --- and even
then, we're talking rather small values of "safe".  So I wish we had some
other way to distribute them than via contrib.

                        regards, tom lane


Reply | Threaded
Open this post in threaded view
|

Re: recovering from "found xmin ... from before relfrozenxid ..."

Fujii Masao-4
In reply to this post by Robert Haas


On 2020/07/14 9:41, Robert Haas wrote:

> On Mon, Jul 13, 2020 at 6:15 PM Tom Lane <[hidden email]> wrote:
>> Yeah, I don't care for that either.  That's a pretty huge violation of our
>> normal back-patching rules, and I'm not convinced that it's justified.
>
> I think that our normal back-patching rules are based primarily on the
> risk of breaking things, and a new contrib module carries a pretty
> negligible risk of breaking anything that works today. I wouldn't
> propose to back-patch something on those grounds just as a way of
> delivering a new feature more quickly, but that's not the intention
> here. At least in my experience, un-VACUUM-able tables have gotten
> several orders of magnitude more common since Andres put those changes
> in. As far as I can recall, EDB has not had this many instances of
> different customers reporting the same problem since the 9.3-era
> multixact issues. So far, this does not rise to that level, but it is
> by no means a negligible issue, either. I believe it deserves to be
> taken quite seriously, especially because the existing options for
> helping customers with this kind of problem are so limited.
>
> Now, if this goes into v14, we can certainly stick it up on github, or
> put it out there in some other way for users to download,
> self-compile, and install, but that seems noticeably less convenient
> for people who need it, and I'm not clear what the benefit to the
> project is.

But updating this tool can fit to the release schedule and
policy of PostgreSQL?

While investigating the problem by using this tool, we may want to
add new feature into the tool because it's necessary for the investigation.
But users would need to wait for next minor version release, to use this
new feature.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION


Reply | Threaded
Open this post in threaded view
|

Re: recovering from "found xmin ... from before relfrozenxid ..."

Andrey M. Borodin
In reply to this post by Robert Haas
Hi!

> 14 июля 2020 г., в 02:12, Robert Haas <[hidden email]> написал(а):
>
> So I have these questions:
>
> - Do people think it would me smart/good/useful to include something
> like this in PostgreSQL?
>
> - If so, how? I would propose a new contrib module that we back-patch
> all the way


My 0.05₽.

At Yandex we used to fix similar corruption things with our pg_dirty_hands extension [0].
But then we developed our internal pg_heapcheck module (unfortunately we did not publish it) and incorporated aggressive recovery into heapcheck.

Now when community has official heapcheck I think it worth to keep detection and fixing tools together.

Best regards, Andrey Borodin.

[0] https://github.com/dsarafan/pg_dirty_hands/blob/master/src/pg_dirty_hands.c



Reply | Threaded
Open this post in threaded view
|

Re: recovering from "found xmin ... from before relfrozenxid ..."

Peter Eisentraut-6
In reply to this post by Robert Haas
On 2020-07-14 02:41, Robert Haas wrote:
> I think that our normal back-patching rules are based primarily on the
> risk of breaking things, and a new contrib module carries a pretty
> negligible risk of breaking anything that works today.

I think that all feature code ought to go through a beta cycle.  So if
this code makes it to 14.0 or 14.1, then I'd consider backpatching it.

> Now, if this goes into v14, we can certainly stick it up on github, or
> put it out there in some other way for users to download,
> self-compile, and install, but that seems noticeably less convenient
> for people who need it, and I'm not clear what the benefit to the
> project is.

In the meantime, if you're wizard enough to deal with this kind of
thing, you could also clone the module from the PG14 tree and build it
against older versions manually.

--
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Reply | Threaded
Open this post in threaded view
|

Re: recovering from "found xmin ... from before relfrozenxid ..."

Magnus Hagander-2
In reply to this post by Tom Lane-2


On Tue, Jul 14, 2020 at 3:26 AM Tom Lane <[hidden email]> wrote:
Robert Haas <[hidden email]> writes:
> On Mon, Jul 13, 2020 at 8:58 PM Tom Lane <[hidden email]> wrote:
>> The more that I think about it, the more I think that the proposed
>> functions are tools for wizards only, and so I'm getting hesitant
>> about having them in contrib at all.  We lack a better place to
>> put them, but that doesn't mean they should be there.

> I understand that it's not too great when we give people access to
> sharp tools and they hurt themselves with said tools. But this is open
> source. That's how it goes.

I think you're attacking a straw man.  I'm well aware of how open source
works, thanks.  What I'm saying is that contrib is mostly seen to be
reasonably harmless stuff.  Sure, you can overwrite data you didn't want
to with adminpack's pg_file_write.  But that's the price of having such a
capability at all, and in general it's not hard for users to understand
both the uses and risks of that function.  That statement does not apply
to the functions being proposed here.  It doesn't seem like they could
possibly be safe to use without very specific expert advice --- and even
then, we're talking rather small values of "safe".  So I wish we had some
other way to distribute them than via contrib.

The countersable of this is pg_resetwal. The number of people who have broken their database with that tool is not small.

That said, we could have a separate "class" of extensions/tools in the distribution, and encourage packagers to pack them up as separate packages for example. Technically they don't have to be in the same source repository at all of course, but I have a feeling some of them might be a lot easier to maintain if they are. And then the user would just have to install something like "postgresql-14-wizardtools". They'd still be available to everybody, of course, but at least the knives would be in a closed drawer until intentionally picked up.
 
--
Reply | Threaded
Open this post in threaded view
|

Re: recovering from "found xmin ... from before relfrozenxid ..."

Jochem van Dieten
In reply to this post by Tom Lane-2
On Tue, Jul 14, 2020 at 3:26 AM Tom Lane <[hidden email]> wrote:
> I think you're attacking a straw man.  I'm well aware of how open source
> works, thanks.  What I'm saying is that contrib is mostly seen to be
> reasonably harmless stuff.  Sure, you can overwrite data you didn't want
> to with adminpack's pg_file_write.  But that's the price of having such a
> capability at all, and in general it's not hard for users to understand
> both the uses and risks of that function.  That statement does not apply
> to the functions being proposed here.  It doesn't seem like they could
> possibly be safe to use without very specific expert advice --- and even
> then, we're talking rather small values of "safe".

Would it be possible to make them safe(r)? For instance, truncate
only, don't freeze; only tuples whose visibility information is
corrupted; and only in non-catalog tables. What exactly is the risk in
that case? Foreign keys might not be satisfied, which might make it
impossible to restore a dump, but is that worse than what a DBA can do
anyway? I would think that it is not and would leave the database in a
state DBAs are much better equipped to deal with.
Or would it be possible to create a table like the original table
(minus any constraints) and copy all tuples with corrupted visibility
there before truncating to a dead line pointer?

Jochem


Reply | Threaded
Open this post in threaded view
|

Re: recovering from "found xmin ... from before relfrozenxid ..."

Robert Haas
In reply to this post by Peter Eisentraut-6
On Tue, Jul 14, 2020 at 3:08 AM Peter Eisentraut
<[hidden email]> wrote:
> In the meantime, if you're wizard enough to deal with this kind of
> thing, you could also clone the module from the PG14 tree and build it
> against older versions manually.

But what if you are NOT a wizard, and a wizard is giving you
directions? Then having to build from source is a real pain. And
that's normally the situation I'm in when a customer has this issue.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Reply | Threaded
Open this post in threaded view
|

Re: recovering from "found xmin ... from before relfrozenxid ..."

Robert Haas
In reply to this post by Magnus Hagander-2
On Tue, Jul 14, 2020 at 4:59 AM Magnus Hagander <[hidden email]> wrote:
> The countersable of this is pg_resetwal. The number of people who have broken their database with that tool is not small.

Very true.

> That said, we could have a separate "class" of extensions/tools in the distribution, and encourage packagers to pack them up as separate packages for example. Technically they don't have to be in the same source repository at all of course, but I have a feeling some of them might be a lot easier to maintain if they are. And then the user would just have to install something like "postgresql-14-wizardtools". They'd still be available to everybody, of course, but at least the knives would be in a closed drawer until intentionally picked up.

I don't think that does much to help with the immediate problem here,
because people are being bitten by this problem *now* and a packaging
change like this will take a long time to happen and become standard
out there, but I think it's a good idea.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


1234