recovering from "found xmin ... from before relfrozenxid ..."

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
142 messages Options
12345678
Reply | Threaded
Open this post in threaded view
|

Re: recovering from "found xmin ... from before relfrozenxid ..."

Ashutosh Sharma
Attached v4 patch fixes the latest comments from Robert and Masahiko-san.

Changes:
1) Let heap_force_kill and freeze functions to be used with toast tables.
2) Replace "int nskippedItems" with "bool did_modify_page" flag to
know if any modification happened in the page or not.
3) Declare some of the variables such as buf, page inside of the do
loop instead of declaring them at the top of heap_force_common
function.

--
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com

On Thu, Aug 6, 2020 at 2:49 PM Masahiko Sawada
<[hidden email]> wrote:

>
> On Thu, 6 Aug 2020 at 18:05, Ashutosh Sharma <[hidden email]> wrote:
> >
> > Hello Masahiko-san,
> >
> > Thanks for looking into the patch. Please find my comments inline below:
> >
> > On Thu, Aug 6, 2020 at 1:42 PM Masahiko Sawada
> > <[hidden email]> wrote:
> > >
> > > On Wed, 5 Aug 2020 at 22:42, Ashutosh Sharma <[hidden email]> wrote:
> > > > Please check the attached patch for the changes.
> > >
> > > I also looked at this version patch and have some small comments:
> > >
> > > +   Oid             relid = PG_GETARG_OID(0);
> > > +   ArrayType      *ta = PG_GETARG_ARRAYTYPE_P_COPY(1);
> > > +   ItemPointer     tids;
> > > +   int             ntids;
> > > +   Relation        rel;
> > > +   Buffer          buf;
> > > +   Page            page;
> > > +   ItemId          itemid;
> > > +   BlockNumber     blkno;
> > > +   OffsetNumber   *offnos;
> > > +   OffsetNumber    offno,
> > > +                   noffs,
> > > +                   curr_start_ptr,
> > > +                   next_start_ptr,
> > > +                   maxoffset;
> > > +   int             i,
> > > +                   nskippedItems,
> > > +                   nblocks;
> > >
> > > You declare all variables at the top of heap_force_common() function
> > > but I think we can declare some variables such as buf, page inside of
> > > the do loop.
> > >
> >
> > Sure, I will do this in the next version of patch.
> >
> > > ---
> > > +           if (offnos[i] > maxoffset)
> > > +           {
> > > +               ereport(NOTICE,
> > > +                        errmsg("skipping tid (%u, %u) because it
> > > contains an invalid offset",
> > > +                               blkno, offnos[i]));
> > > +               continue;
> > > +           }
> > >
> > > If all tids on a page take the above path, we will end up logging FPI
> > > in spite of modifying nothing on the page.
> > >
> >
> > Yeah, that's right. I've taken care of this in the new version of
> > patch which I am yet to share.
> >
> > > ---
> > > +       /* XLOG stuff */
> > > +       if (RelationNeedsWAL(rel))
> > > +           log_newpage_buffer(buf, true);
> > >
> > > I think we need to set the returned LSN by log_newpage_buffer() to the page lsn.
> > >
> >
> > I think we are already setting the page lsn in the log_newpage() which
> > is being called from log_newpage_buffer(). So, AFAIU, no change would
> > be required here. Please let me know if I am missing something.
>
> You're right. I'd missed the comment of log_newpage_buffer(). Thanks!
>
> Regards,
>
> --
> Masahiko Sawada            http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

v4-0001-Add-contrib-pg_surgery-to-perform-surgery-on-a-damag.patch (25K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: recovering from "found xmin ... from before relfrozenxid ..."

Robert Haas
In reply to this post by Ashutosh Sharma
On Thu, Aug 6, 2020 at 2:11 AM Ashutosh Sharma <[hidden email]> wrote:

> Okay, If you want I can remove the restriction on a toast table, but,
> then that means a user can directly remove the data chunks from the
> toast table without changing anything in the main table. This means we
> won't be able to query the main table as it will fail with an error
> like "ERROR:  unexpected chunk number ...". So, we will have to find
> some way to identify the pointer to the chunks that got deleted from
> the toast table and remove that pointer from the main table. We also
> need to make sure that before we remove a tuple (pointer) from the
> main table, we identify all the remaining data chunks pointed by this
> tuple and remove them completely only then that table would be
> considered to be in a good state. Now, I am not sure if we can
> currently do all these things.

That's the user's problem. If they don't have a plan for that, they
shouldn't use this tool. I don't think we can or should try to handle
it in this code.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Reply | Threaded
Open this post in threaded view
|

Re: recovering from "found xmin ... from before relfrozenxid ..."

Ashutosh Sharma
On Thu, Aug 6, 2020 at 9:19 PM Robert Haas <[hidden email]> wrote:

>
> On Thu, Aug 6, 2020 at 2:11 AM Ashutosh Sharma <[hidden email]> wrote:
> > Okay, If you want I can remove the restriction on a toast table, but,
> > then that means a user can directly remove the data chunks from the
> > toast table without changing anything in the main table. This means we
> > won't be able to query the main table as it will fail with an error
> > like "ERROR:  unexpected chunk number ...". So, we will have to find
> > some way to identify the pointer to the chunks that got deleted from
> > the toast table and remove that pointer from the main table. We also
> > need to make sure that before we remove a tuple (pointer) from the
> > main table, we identify all the remaining data chunks pointed by this
> > tuple and remove them completely only then that table would be
> > considered to be in a good state. Now, I am not sure if we can
> > currently do all these things.
>
> That's the user's problem. If they don't have a plan for that, they
> shouldn't use this tool. I don't think we can or should try to handle
> it in this code.
>

Okay, thanks.

--
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com


Reply | Threaded
Open this post in threaded view
|

Re: recovering from "found xmin ... from before relfrozenxid ..."

Ashutosh Sharma
In reply to this post by rajkumar.raghuwanshi
Thanks Rajkumar for testing the patch.

Here are some of the additional test-cases that I would suggest you to
execute, if possible:

1) You may try running the test-cases that you have executed so far
with SR setup and see if the changes are getting reflected on the
standby.

2) You may also try running some concurrent test-cases for e.g. try
running these functions with VACUUM or some other sql commands
(preferable DML commands) in parallel.

3) See what happens when you pass some invalid tids (containing
invalid block or offset number) to these functions. You may also try
running these functions on the same tuple repeatedly and see the
behaviour.

...

--
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com

On Thu, Aug 6, 2020 at 2:25 PM Rajkumar Raghuwanshi
<[hidden email]> wrote:

>
> I have been doing some user-level testing of this feature, apart from sanity test for extension and it's functions
>
> I have tried to corrupt tuples and then able to fix it using heap_force_freeze/kill functions.
>
>
> --corrupt relfrozenxid, cause vacuum failed.
>
> update pg_class set relfrozenxid = (relfrozenxid::text::integer + 10)::text::xid where relname = 'test_tbl';
>
> UPDATE 1
>
> insert into test_tbl values (2, 'BBB');
>
>
> postgres=# vacuum test_tbl;
>
> ERROR:  found xmin 507 from before relfrozenxid 516
>
> CONTEXT:  while scanning block 0 of relation "public.test_tbl"
>
>
> postgres=# select *, ctid, xmin, xmax from test_tbl;
>
>  a |  b  | ctid  | xmin | xmax
>
> ---+-----+-------+------+------
>
>  1 | AAA | (0,1) |  505 |    0
>
>  2 | BBB | (0,2) |  507 |    0
>
> (2 rows)
>
>
> --fixed using heap_force_freeze
>
> postgres=# select heap_force_freeze('test_tbl'::regclass, ARRAY['(0,2)']::tid[]);
>
>  heap_force_freeze
>
> -------------------
>
>
> postgres=# vacuum test_tbl;
>
> VACUUM
>
> postgres=# select *, ctid, xmin, xmax from test_tbl;
>
>  a |  b  | ctid  | xmin | xmax
>
> ---+-----+-------+------+------
>
>  1 | AAA | (0,1) |  505 |    0
>
>  2 | BBB | (0,2) |    2 |    0
>
> (2 rows)
>
>
> --corrupt table headers in base/oid. file, cause table access failed.
>
> postgres=# select ctid, * from test_tbl;
>
> ERROR:  could not access status of transaction 4294967295
>
> DETAIL:  Could not open file "pg_xact/0FFF": No such file or directory.
>
>
> --removed corrupted tuple using heap_force_kill
>
> postgres=# select heap_force_kill('test_tbl'::regclass, ARRAY['(0,2)']::tid[]);
>
>  heap_force_kill
>
> -----------------
>
>
>
> (1 row)
>
>
> postgres=# select ctid, * from test_tbl;
>
>  ctid  | a |  b
>
> -------+---+-----
>
>  (0,1) | 1 | AAA
>
> (1 row)
>
>
> I will be continuing with my testing with the latest patch and update here if found anything.
>
>
> Thanks & Regards,
> Rajkumar Raghuwanshi
>
>
> On Thu, Aug 6, 2020 at 1:42 PM Masahiko Sawada <[hidden email]> wrote:
>>
>> On Wed, 5 Aug 2020 at 22:42, Ashutosh Sharma <[hidden email]> wrote:
>> >
>> > Hi Robert,
>> >
>> > Thanks for the review. Please find my comments inline:
>> >
>> > On Sat, Aug 1, 2020 at 12:18 AM Robert Haas <[hidden email]> wrote:
>> > >
>> > > On Fri, Jul 31, 2020 at 8:52 AM Ashutosh Sharma <[hidden email]> wrote:
>> > > > Attached is the new version of patch that addresses the comments from Andrey and Beena.
>> > >
>> > > +PGFILEDESC = "pg_surgery - perform surgery on the damaged heap table"
>> > >
>> > > the -> a
>> > >
>> > > I also suggest: heap table -> relation, because we might want to
>> > > extend this to other cases later.
>> > >
>> >
>> > Corrected.
>> >
>> > > +-- toast table.
>> > > +begin;
>> > > +create table ttab(a text);
>> > > +insert into ttab select string_agg(chr(floor(random() * 26)::int +
>> > > 65), '') from generate_series(1,10000);
>> > > +select * from ttab where xmin = 2;
>> > > + a
>> > > +---
>> > > +(0 rows)
>> > > +
>> > > +select heap_force_freeze('ttab'::regclass, ARRAY['(0, 1)']::tid[]);
>> > > + heap_force_freeze
>> > > +-------------------
>> > > +
>> > > +(1 row)
>> > > +
>> > >
>> > > I don't understand the point of this. You're not testing the function
>> > > on the TOAST table; you're testing it on the main table when there
>> > > happens to be a TOAST table that is probably getting used for
>> > > something. But that's not really relevant to what is being tested
>> > > here, so as written this seems redundant with the previous cases.
>> > >
>> >
>> > Yeah, it's being tested on the main table, not on a toast table. I've
>> > removed this test-case and also restricted direct access to the toast
>> > table using heap_force_kill/freeze functions. I think we shouldn't be
>> > using these functions to do any changes in the toast table. We will
>> > only use these functions with the main table and let VACUUM remove the
>> > corresponding data chunks (pointed by the tuple that got removed from
>> > the main table).
>> >
>> > Another option would be to identify all the data chunks corresponding
>> > to the tuple (ctid) being killed from the main table and remove them
>> > one by one. We will only do this if the tuple from the main table that
>> > has been marked as killed has an external storage. We will have to add
>> > a bunch of code for this otherwise we can let VACUUM do this for us.
>> > Let me know your thoughts on this.
>> >
>> > > +-- test pg_surgery functions with the unsupported relations. Should fail.
>> > >
>> > > Please name the specific functions being tested here in case we add
>> > > more in the future that are tested separately.
>> > >
>> >
>> > Done.
>> >
>> > > +++ b/contrib/pg_surgery/heap_surgery_funcs.c
>> > >
>> > > I think we could drop _funcs from the file name.
>> > >
>> >
>> > Done.
>> >
>> > > +#ifdef PG_MODULE_MAGIC
>> > > +PG_MODULE_MAGIC;
>> > > +#endif
>> > >
>> > > The #ifdef here is not required, and if you look at other contrib
>> > > modules you'll see that they don't have it.
>> > >
>> >
>> > Okay, done.
>> >
>> > > I don't like all the macros at the top of the file much. CHECKARRVALID
>> > > is only used in one place, so it seems to me that you might as well
>> > > just inline it and lose the macro. Likewise for SORT and ARRISEMPTY.
>> > >
>> >
>> > Done.
>> >
>> > > Once you do that, heap_force_common() can just compute the number of
>> > > array elements once, instead of doing it once inside ARRISEMPTY, then
>> > > again inside SORT, and then a third time to initialize ntids. You can
>> > > just have a local variable in that function that is set once and then
>> > > used as needed. Then you are only doing ARRNELEMS once, so you can get
>> > > rid of that macro too. The same technique can be used to get rid of
>> > > ARRPTR. So then all the macros go away without introducing any code
>> > > duplication.
>> > >
>> >
>> > Done.
>> >
>> > > +/* Options to forcefully change the state of a heap tuple. */
>> > > +typedef enum HTupleForceOption
>> > > +{
>> > > + FORCE_KILL,
>> > > + FORCE_FREEZE
>> > > +} HTupleForceOption;
>> > >
>> > > I suggest un-abbreviating HTuple to HeapTuple and un-abbreviating the
>> > > enum members to HEAP_FORCE_KILL and HEAP_FORCE_FREE.
>> >
>> > Done.
>> >
>> > Also, how about
>> > > option -> operation?
>> > >
>> >
>> > I think both look okay to me.
>> >
>> > > + return heap_force_common(fcinfo, FORCE_KILL);
>> > >
>> > > I think it might be more idiomatic to use PG_RETURN_DATUM here.  I
>> > > know it's the same thing, though, and perhaps I'm even wrong about the
>> > > prevailing style.
>> > >
>> >
>> > Done.
>> >
>> > > + Assert(force_opt == FORCE_KILL || force_opt == FORCE_FREEZE);
>> > >
>> > > I think this is unnecessary. It's an enum with 2 values.
>> > >
>> >
>> > Removed.
>> >
>> > > + if (ARRISEMPTY(ta))
>> > > + ereport(ERROR,
>> > > + (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
>> > > + errmsg("empty tid array")));
>> > >
>> > > I don't see why this should be an error. Why can't we just continue
>> > > normally and if it does nothing, it does nothing? You'd need to change
>> > > the do..while loop to a while loop so that the end condition is tested
>> > > at the top, but that seems fine.
>> > >
>> >
>> > I think it's okay to have this check. So, just left it as-is. We do
>> > have such checks in other contrib modules as well wherever the array
>> > is being passed as an input to the function.
>> >
>> > > + rel = relation_open(relid, AccessShareLock);
>> > >
>> > > Maybe we should take RowExclusiveLock, since we are going to modify
>> > > rows. Not sure how much it matters, though.
>> > >
>> >
>> > I don't know how it would make a difference, but still as you said
>> > replaced AccessShare with RowExclusive.
>> >
>> > > + if (!superuser() && GetUserId() != rel->rd_rel->relowner)
>> > > + ereport(ERROR,
>> > > + (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
>> > > + errmsg("must be superuser or object owner to use %s.",
>> > > + force_opt == FORCE_KILL ? "heap_force_kill()" :
>> > > + "heap_force_freeze()")));
>> > >
>> > > This is the wrong way to do a permissions check, and it's also the
>> > > wrong way to write an error message about having failed one. To see
>> > > the correct method, grep for pg_class_aclcheck. However, I think that
>> > > we shouldn't in general trust the object owner to do this, unless the
>> > > super-user gave permission. This is a data-corrupting operation, and
>> > > only the boss is allowed to authorize it. So I think you should also
>> > > add REVOKE EXECUTE FROM PUBLIC statements to the SQL file, and then
>> > > have this check as a backup. Then, the superuser is always allowed,
>> > > and if they choose to GRANT EXECUTE on this function to some users,
>> > > those users can do it for their own relations, but not others.
>> > >
>> >
>> > Done.
>> >
>> > > + if (rel->rd_rel->relam != HEAP_TABLE_AM_OID)
>> > > + ereport(ERROR,
>> > > + (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
>> > > + errmsg("only heap AM is supported")));
>> > > +
>> > > + check_relation_relkind(rel);
>> > >
>> > > Seems like these checks are in the wrong order.
>> >
>> > I don't think there is anything wrong with the order. I can see the
>> > same order in other contrib modules as well.
>> >
>> > Also, maybe you could
>> > > call the function something like check_relation_ok() and put the
>> > > permissions test, the relkind test, and the relam test all inside of
>> > > it, just to tighten up the code in this main function a bit.
>> > >
>> >
>> > Yeah, I've added a couple of functions named sanity_check_relation and
>> > sanity_check_tid_array and shifted all the sanity checks there.
>> >
>> > > + if (noffs > maxoffset)
>> > > + ereport(ERROR,
>> > > + (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
>> > > + errmsg("number of offsets specified for block %u exceeds the max
>> > > offset number %u",
>> > > + blkno, maxoffset)));
>> > >
>> > > Hmm, this doesn't seem quite right. The actual problem is if an
>> > > individual item pointer's offset number is greater than maxoffset,
>> > > which can be true even if the total number of offsets is less than
>> > > maxoffset. So I think you need to remove this check and add a check
>> > > inside the loop which follows that offnos[i] is in range.
>> > >
>> >
>> > Agreed and done.
>> >
>> > > The way you've structured that loop is actually problematic -- I don't
>> > > think we want to be calling elog() or ereport() inside a critical
>> > > section. You could fix the case that checks for an invalid force_opt
>> > > by just doing if (op == HEAP_FORCE_KILL) { ... } else { Assert(op ==
>> > > HEAP_FORCE_FREEZE); ... }, or by using a switch with no default. The
>> > > NOTICE case you have here is a bigger problem.
>> >
>> > Done.
>> >
>> > You really cannot
>> > > modify the buffer like this and then decide, oops, never mind, I think
>> > > I won't mark it dirty or write WAL for the changes. If you do that,
>> > > the buffer is still in memory, but it's now been modified. A
>> > > subsequent operation that modifies it will start with the altered
>> > > state you created here, quite possibly leading to WAL that cannot be
>> > > correctly replayed on the standby. In other words, you've got to
>> > > decide for certain whether you want to proceed with the operation
>> > > *before* you enter the critical section. You also need to emit any
>> > > messages before or after the critical section.  So you could:
>> > >
>> >
>> > This is still not clear. I think Robert needs to respond to my earlier comment.
>> >
>> > > I believe this violates our guidelines on message construction. Have
>> > > two completely separate messages -- and maybe lose the word "already":
>> > >
>> > > "skipping tid (%u, %u) because it is dead"
>> > > "skipping tid (%u, %u) because it is unused"
>> > >
>> > > The point of this is that it makes it easier for translators.
>> > >
>> >
>> > Done.
>> >
>> > > I see very little point in what verify_tid() is doing. Before using
>> > > each block number, we should check that it's less than or equal to a
>> > > cached value of RelationGetNumberOfBlocks(rel). That's necessary in
>> > > any case to avoid funny errors; and then the check here against
>> > > specifically InvalidBlockNumber is redundant. For the offset number,
>> > > same thing: we need to check each offset against the page's
>> > > PageGetMaxOffsetNumber(page); and if we do that then we don't need
>> > > these checks.
>> > >
>> >
>> > Done.
>> >
>> > Please check the attached patch for the changes.
>>
>> I also looked at this version patch and have some small comments:
>>
>> +   Oid             relid = PG_GETARG_OID(0);
>> +   ArrayType      *ta = PG_GETARG_ARRAYTYPE_P_COPY(1);
>> +   ItemPointer     tids;
>> +   int             ntids;
>> +   Relation        rel;
>> +   Buffer          buf;
>> +   Page            page;
>> +   ItemId          itemid;
>> +   BlockNumber     blkno;
>> +   OffsetNumber   *offnos;
>> +   OffsetNumber    offno,
>> +                   noffs,
>> +                   curr_start_ptr,
>> +                   next_start_ptr,
>> +                   maxoffset;
>> +   int             i,
>> +                   nskippedItems,
>> +                   nblocks;
>>
>> You declare all variables at the top of heap_force_common() function
>> but I think we can declare some variables such as buf, page inside of
>> the do loop.
>>
>> ---
>> +           if (offnos[i] > maxoffset)
>> +           {
>> +               ereport(NOTICE,
>> +                        errmsg("skipping tid (%u, %u) because it
>> contains an invalid offset",
>> +                               blkno, offnos[i]));
>> +               continue;
>> +           }
>>
>> If all tids on a page take the above path, we will end up logging FPI
>> in spite of modifying nothing on the page.
>>
>> ---
>> +       /* XLOG stuff */
>> +       if (RelationNeedsWAL(rel))
>> +           log_newpage_buffer(buf, true);
>>
>> I think we need to set the returned LSN by log_newpage_buffer() to the page lsn.
>>
>> Regards,
>>
>> --
>> Masahiko Sawada            http://www.2ndQuadrant.com/
>> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>>
>>


Reply | Threaded
Open this post in threaded view
|

Re: recovering from "found xmin ... from before relfrozenxid ..."

Robert Haas
In reply to this post by Ashutosh Sharma
On Thu, Aug 6, 2020 at 9:23 AM Ashutosh Sharma <[hidden email]> wrote:
> Attached v4 patch fixes the latest comments from Robert and Masahiko-san.

Compiler warning:

heap_surgery.c:136:13: error: comparison of unsigned expression < 0 is
always false [-Werror,-Wtautological-compare]
                if (blkno < 0 || blkno >= nblocks)
                    ~~~~~ ^ ~

There's a certain inconsistency to these messages:

rhaas=# create table foo (a int);
CREATE TABLE
rhaas=# insert into foo values (1);
INSERT 0 1
rhaas=# select heap_force_kill('foo'::regclass, array['(0,2)'::tid]);
NOTICE:  skipping tid (0, 2) because it contains an invalid offset
 heap_force_kill
-----------------

(1 row)

rhaas=# select heap_force_kill('foo'::regclass, array['(1,0)'::tid]);
ERROR:  invalid item pointer
LOCATION:  tids_same_page_fetch_offnums, heap_surgery.c:347
rhaas=# select heap_force_kill('foo'::regclass, array['(1,1)'::tid]);
ERROR:  block number 1 is out of range for relation "foo"

From a user perspective it seems like I've made three very similar
mistakes: in the first case, the offset is too high, in the second
case it's too low, and in the third case the block number is out of
range. But in one case I get a NOTICE and in the other two cases I get
an ERROR. In one case I get the relation name and in the other two
cases I don't. The two complaints about an invalid offset are phrased
completely differently from each other. For example, suppose you do
this:

ERROR: tid (%u, %u) is invalid for relation "%s" because the block
number is out of range (%u..%u)
ERROR: tid (%u, %u) is invalid for relation "%s" because the item
number is out of range for this block (%u..%u)
ERROR: tid (%u, %u) is invalid for relation "%s" because the item is unused
ERROR: tid (%u, %u) is invalid for relation "%s" because the item is dead

I think I misled you when I said to use pg_class_aclcheck. I think it
should actually be pg_class_ownercheck.

I think the relkind sanity check should permit RELKIND_MATVIEW also.

It's unclear to me why the freeze logic here shouldn't do this part
what heap_prepare_freeze_tuple() does when freezing xmax:

        frz->t_infomask2 &= ~HEAP_HOT_UPDATED;
        frz->t_infomask2 &= ~HEAP_KEYS_UPDATED;

Likewise, why should we not freeze or invalidate xvac in the case
where tuple->t_infomask & HEAP_MOVED, as heap_prepare_freeze_tuple()
would do?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Reply | Threaded
Open this post in threaded view
|

Re: recovering from "found xmin ... from before relfrozenxid ..."

Ashutosh Sharma
On Fri, Aug 7, 2020 at 9:21 PM Robert Haas <[hidden email]> wrote:

>
> On Thu, Aug 6, 2020 at 9:23 AM Ashutosh Sharma <[hidden email]> wrote:
> There's a certain inconsistency to these messages:
>
> rhaas=# create table foo (a int);
> CREATE TABLE
> rhaas=# insert into foo values (1);
> INSERT 0 1
> rhaas=# select heap_force_kill('foo'::regclass, array['(0,2)'::tid]);
> NOTICE:  skipping tid (0, 2) because it contains an invalid offset
>  heap_force_kill
> -----------------
>
> (1 row)
>
> rhaas=# select heap_force_kill('foo'::regclass, array['(1,0)'::tid]);
> ERROR:  invalid item pointer
> LOCATION:  tids_same_page_fetch_offnums, heap_surgery.c:347
> rhaas=# select heap_force_kill('foo'::regclass, array['(1,1)'::tid]);
> ERROR:  block number 1 is out of range for relation "foo"
>
> From a user perspective it seems like I've made three very similar
> mistakes: in the first case, the offset is too high, in the second
> case it's too low, and in the third case the block number is out of
> range. But in one case I get a NOTICE and in the other two cases I get
> an ERROR. In one case I get the relation name and in the other two
> cases I don't. The two complaints about an invalid offset are phrased
> completely differently from each other. For example, suppose you do
> this:
>
> ERROR: tid (%u, %u) is invalid for relation "%s" because the block
> number is out of range (%u..%u)
> ERROR: tid (%u, %u) is invalid for relation "%s" because the item
> number is out of range for this block (%u..%u)
> ERROR: tid (%u, %u) is invalid for relation "%s" because the item is unused
> ERROR: tid (%u, %u) is invalid for relation "%s" because the item is dead
>

Thank you for your suggestions. To make this consistent, I am planning
to do the following changes:

Remove the error message to report "invalid item pointer" from
tids_same_page_fetch_offnums() and expand the if-check to detect any
invalid offset number in the CRITICAL section such that it not just
checks if the offset number is > maxoffset, but also checks if the
offset number is equal to 0. That way it would also do the job that
"if (!ItemPointerIsValid)" was doing for us.

Further, if any invalid block number is detected, then I am planning
to skip all the tids associated with this block and move to the next
block. Hence, instead of reporting the error I would report the NOTICE
message to the user.

The other two messages for reporting unused items and dead items
remain the same. Hence, with above change, we would be reporting the
following 4 messages:

NOTICE:  skipping all the tids in block %u for relation "%s" because
the block number is out of range

NOTICE:  skipping tid (%u, %u) for relation "%s" because the item
number is out of range for this block

NOTICE:  skipping tid (%u, %u) for relation "%s" because it is marked dead

NOTICE:  skipping tid (%u, %u) for relation "%s" because it is marked unused

Please let me know if you are okay with the above changes or not?

--
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com


Reply | Threaded
Open this post in threaded view
|

Re: recovering from "found xmin ... from before relfrozenxid ..."

Robert Haas
On Tue, Aug 11, 2020 at 3:39 AM Ashutosh Sharma <[hidden email]> wrote:

> The other two messages for reporting unused items and dead items
> remain the same. Hence, with above change, we would be reporting the
> following 4 messages:
>
> NOTICE:  skipping all the tids in block %u for relation "%s" because
> the block number is out of range
>
> NOTICE:  skipping tid (%u, %u) for relation "%s" because the item
> number is out of range for this block
>
> NOTICE:  skipping tid (%u, %u) for relation "%s" because it is marked dead
>
> NOTICE:  skipping tid (%u, %u) for relation "%s" because it is marked unused
>
> Please let me know if you are okay with the above changes or not?

That seems broadly reasonable, but I would suggest phrasing the first
message like this:

skipping block %u for relation "%s" because the block number is out of range

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Reply | Threaded
Open this post in threaded view
|

Re: recovering from "found xmin ... from before relfrozenxid ..."

Ashutosh Sharma
On Tue, Aug 11, 2020 at 7:33 PM Robert Haas <[hidden email]> wrote:

>
> On Tue, Aug 11, 2020 at 3:39 AM Ashutosh Sharma <[hidden email]> wrote:
> > The other two messages for reporting unused items and dead items
> > remain the same. Hence, with above change, we would be reporting the
> > following 4 messages:
> >
> > NOTICE:  skipping all the tids in block %u for relation "%s" because
> > the block number is out of range
> >
> > NOTICE:  skipping tid (%u, %u) for relation "%s" because the item
> > number is out of range for this block
> >
> > NOTICE:  skipping tid (%u, %u) for relation "%s" because it is marked dead
> >
> > NOTICE:  skipping tid (%u, %u) for relation "%s" because it is marked unused
> >
> > Please let me know if you are okay with the above changes or not?
>
> That seems broadly reasonable, but I would suggest phrasing the first
> message like this:
>
> skipping block %u for relation "%s" because the block number is out of range
>

Okay, thanks for the confirmation. I'll do that.

--
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com


Reply | Threaded
Open this post in threaded view
|

Re: recovering from "found xmin ... from before relfrozenxid ..."

Ashutosh Sharma
In reply to this post by Robert Haas
Thanks Robert for the review. Please find my comments inline below:

On Fri, Aug 7, 2020 at 9:21 PM Robert Haas <[hidden email]> wrote:

>
> On Thu, Aug 6, 2020 at 9:23 AM Ashutosh Sharma <[hidden email]> wrote:
> > Attached v4 patch fixes the latest comments from Robert and Masahiko-san.
>
> Compiler warning:
>
> heap_surgery.c:136:13: error: comparison of unsigned expression < 0 is
> always false [-Werror,-Wtautological-compare]
>                 if (blkno < 0 || blkno >= nblocks)
>                     ~~~~~ ^ ~
>
Fixed.

> There's a certain inconsistency to these messages:
>
> rhaas=# create table foo (a int);
> CREATE TABLE
> rhaas=# insert into foo values (1);
> INSERT 0 1
> rhaas=# select heap_force_kill('foo'::regclass, array['(0,2)'::tid]);
> NOTICE:  skipping tid (0, 2) because it contains an invalid offset
>  heap_force_kill
> -----------------
>
> (1 row)
>
> rhaas=# select heap_force_kill('foo'::regclass, array['(1,0)'::tid]);
> ERROR:  invalid item pointer
> LOCATION:  tids_same_page_fetch_offnums, heap_surgery.c:347
> rhaas=# select heap_force_kill('foo'::regclass, array['(1,1)'::tid]);
> ERROR:  block number 1 is out of range for relation "foo"
>
> From a user perspective it seems like I've made three very similar
> mistakes: in the first case, the offset is too high, in the second
> case it's too low, and in the third case the block number is out of
> range. But in one case I get a NOTICE and in the other two cases I get
> an ERROR. In one case I get the relation name and in the other two
> cases I don't. The two complaints about an invalid offset are phrased
> completely differently from each other. For example, suppose you do
> this:
>
> ERROR: tid (%u, %u) is invalid for relation "%s" because the block
> number is out of range (%u..%u)
> ERROR: tid (%u, %u) is invalid for relation "%s" because the item
> number is out of range for this block (%u..%u)
> ERROR: tid (%u, %u) is invalid for relation "%s" because the item is unused
> ERROR: tid (%u, %u) is invalid for relation "%s" because the item is dead
>
Corrected.

> I think I misled you when I said to use pg_class_aclcheck. I think it
> should actually be pg_class_ownercheck.
>

okay, I've changed it to pg_class_ownercheck.

> I think the relkind sanity check should permit RELKIND_MATVIEW also.
>

Yeah, actually we should allow MATVIEW, don't know why I thought of
blocking it earlier.

> It's unclear to me why the freeze logic here shouldn't do this part
> what heap_prepare_freeze_tuple() does when freezing xmax:
>
>         frz->t_infomask2 &= ~HEAP_HOT_UPDATED;
>         frz->t_infomask2 &= ~HEAP_KEYS_UPDATED;
>

Yeah, we should have these changes when freezing the xmax.

> Likewise, why should we not freeze or invalidate xvac in the case
> where tuple->t_infomask & HEAP_MOVED, as heap_prepare_freeze_tuple()
> would do?
>

Again, we should have this as well.

Apart from above, this time I've also added the documentation on
pg_surgery module and added a few more test-cases.

Attached patch with above changes.

--
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com

v5-0001-Add-contrib-pg_surgery-to-perform-surgery-on-a-damag.patch (34K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: recovering from "found xmin ... from before relfrozenxid ..."

Asim Praveen
Hi Ashutosh

I stumbled upon this thread today, went through your patch and it looks good.  A minor suggestion in sanity_check_relation():

        if (rel->rd_rel->relam != HEAP_TABLE_AM_OID)
                ereport(ERROR,
                                (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
                                 errmsg("only heap AM is supported")));

Instead of checking the access method OID, it seems better to check the handler OID like so:

        if (rel->rd_amhandler != HEAP_TABLE_AM_HANDLER_OID)

The reason is current version of sanity_check_relation() would emit error for the following case even when the table structure is actually heap.

        create access method myam type table handler heap_tableam_handler;
        create table mytable (…) using myam;

Asim
Reply | Threaded
Open this post in threaded view
|

Re: recovering from "found xmin ... from before relfrozenxid ..."

Ashutosh Sharma
Hi Asim,

Thanks for having a look into the patch and for sharing your feedback.
Please find my comments inline below:

On Thu, Aug 13, 2020 at 12:36 PM Asim Praveen <[hidden email]> wrote:

>
> Hi Ashutosh
>
> I stumbled upon this thread today, went through your patch and it looks good.  A minor suggestion in sanity_check_relation():
>
>         if (rel->rd_rel->relam != HEAP_TABLE_AM_OID)
>                 ereport(ERROR,
>                                 (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
>                                  errmsg("only heap AM is supported")));
>
> Instead of checking the access method OID, it seems better to check the handler OID like so:
>
>         if (rel->rd_amhandler != HEAP_TABLE_AM_HANDLER_OID)
>
> The reason is current version of sanity_check_relation() would emit error for the following case even when the table structure is actually heap.
>
>         create access method myam type table handler heap_tableam_handler;
>         create table mytable (…) using myam;
>

This looks like a very good suggestion to me. I will do this change in
the next version. Just wondering if we should be doing similar changes
in other contrib modules (like pgrowlocks, pageinspect and
pgstattuple) as well?

--
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com


Reply | Threaded
Open this post in threaded view
|

Re: recovering from "found xmin ... from before relfrozenxid ..."

Robert Haas
On Thu, Aug 13, 2020 at 3:52 AM Ashutosh Sharma <[hidden email]> wrote:
> This looks like a very good suggestion to me. I will do this change in
> the next version. Just wondering if we should be doing similar changes
> in other contrib modules (like pgrowlocks, pageinspect and
> pgstattuple) as well?

It seems like it should be consistent, but I'm not sure the proposed
change is really an improvement.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Reply | Threaded
Open this post in threaded view
|

Re: recovering from "found xmin ... from before relfrozenxid ..."

Masahiko Sawada-2
In reply to this post by Ashutosh Sharma
On Wed, 12 Aug 2020 at 22:27, Ashutosh Sharma <[hidden email]> wrote:

>
> Thanks Robert for the review. Please find my comments inline below:
>
> On Fri, Aug 7, 2020 at 9:21 PM Robert Haas <[hidden email]> wrote:
> >
> > On Thu, Aug 6, 2020 at 9:23 AM Ashutosh Sharma <[hidden email]> wrote:
> > > Attached v4 patch fixes the latest comments from Robert and Masahiko-san.
> >
> > Compiler warning:
> >
> > heap_surgery.c:136:13: error: comparison of unsigned expression < 0 is
> > always false [-Werror,-Wtautological-compare]
> >                 if (blkno < 0 || blkno >= nblocks)
> >                     ~~~~~ ^ ~
> >
>
> Fixed.
>
> > There's a certain inconsistency to these messages:
> >
> > rhaas=# create table foo (a int);
> > CREATE TABLE
> > rhaas=# insert into foo values (1);
> > INSERT 0 1
> > rhaas=# select heap_force_kill('foo'::regclass, array['(0,2)'::tid]);
> > NOTICE:  skipping tid (0, 2) because it contains an invalid offset
> >  heap_force_kill
> > -----------------
> >
> > (1 row)
> >
> > rhaas=# select heap_force_kill('foo'::regclass, array['(1,0)'::tid]);
> > ERROR:  invalid item pointer
> > LOCATION:  tids_same_page_fetch_offnums, heap_surgery.c:347
> > rhaas=# select heap_force_kill('foo'::regclass, array['(1,1)'::tid]);
> > ERROR:  block number 1 is out of range for relation "foo"
> >
> > From a user perspective it seems like I've made three very similar
> > mistakes: in the first case, the offset is too high, in the second
> > case it's too low, and in the third case the block number is out of
> > range. But in one case I get a NOTICE and in the other two cases I get
> > an ERROR. In one case I get the relation name and in the other two
> > cases I don't. The two complaints about an invalid offset are phrased
> > completely differently from each other. For example, suppose you do
> > this:
> >
> > ERROR: tid (%u, %u) is invalid for relation "%s" because the block
> > number is out of range (%u..%u)
> > ERROR: tid (%u, %u) is invalid for relation "%s" because the item
> > number is out of range for this block (%u..%u)
> > ERROR: tid (%u, %u) is invalid for relation "%s" because the item is unused
> > ERROR: tid (%u, %u) is invalid for relation "%s" because the item is dead
> >
>
> Corrected.
>
> > I think I misled you when I said to use pg_class_aclcheck. I think it
> > should actually be pg_class_ownercheck.
> >
>
> okay, I've changed it to pg_class_ownercheck.
>
> > I think the relkind sanity check should permit RELKIND_MATVIEW also.
> >
>
> Yeah, actually we should allow MATVIEW, don't know why I thought of
> blocking it earlier.
>
> > It's unclear to me why the freeze logic here shouldn't do this part
> > what heap_prepare_freeze_tuple() does when freezing xmax:
> >
> >         frz->t_infomask2 &= ~HEAP_HOT_UPDATED;
> >         frz->t_infomask2 &= ~HEAP_KEYS_UPDATED;
> >
>
> Yeah, we should have these changes when freezing the xmax.
>
> > Likewise, why should we not freeze or invalidate xvac in the case
> > where tuple->t_infomask & HEAP_MOVED, as heap_prepare_freeze_tuple()
> > would do?
> >
>
> Again, we should have this as well.
>
> Apart from above, this time I've also added the documentation on
> pg_surgery module and added a few more test-cases.
>
> Attached patch with above changes.
>

Thank you for updating the patch! Here are my comments on v5 patch:

--- a/contrib/Makefile
+++ b/contrib/Makefile
@@ -35,6 +35,7 @@ SUBDIRS = \
        pg_standby  \
        pg_stat_statements \
        pg_trgm     \
+       pg_surgery  \
        pgcrypto    \

I guess we use alphabetical order here. So pg_surgery should be placed
before pg_trgm.

---
+           if (heap_force_opt == HEAP_FORCE_KILL)
+               ItemIdSetDead(itemid);

I think that if the page is an all-visible page, we should clear an
all-visible bit on the visibility map corresponding to the page and
PD_ALL_VISIBLE on the page header. Otherwise, index only scan would
return the wrong results.

---
+       /*
+        * We do not mark the buffer dirty or do WAL logging for unmodifed
+        * pages.
+        */
+       if (!did_modify_page)
+           goto skip_wal;
+
+       /* Mark buffer dirty before we write WAL. */
+       MarkBufferDirty(buf);
+
+       /* XLOG stuff */
+       if (RelationNeedsWAL(rel))
+           log_newpage_buffer(buf, true);
+
+skip_wal:
+       END_CRIT_SECTION();
+

s/unmodifed/unmodified/

Do we really need to use goto? I think we can modify it like follows:

    if (did_modity_page)
    {
       /* Mark buffer dirty before we write WAL. */
       MarkBufferDirty(buf);

       /* XLOG stuff */
       if (RelationNeedsWAL(rel))
           log_newpage_buffer(buf, true);
    }

    END_CRIT_SECTION();

---
pg_force_freeze() can revival a tuple that is already deleted but not
vacuumed yet. Therefore, the user might need to reindex indexes after
using that function. For instance, with the following script, the last
two queries: index scan and seq scan, will return different results.

set enable_seqscan to off;
set enable_bitmapscan to off;
set enable_indexonlyscan to off;
create table tbl (a int primary key);
insert into tbl values (1);

update tbl set a = a + 100 where a = 1;

explain analyze select * from tbl where a < 200;

-- revive deleted tuple on heap
select heap_force_freeze('tbl', array['(0,1)'::tid]);

-- index scan returns 2 tuples
explain analyze select * from tbl where a < 200;

-- seq scan returns 1 tuple
set enable_seqscan to on;
explain analyze select * from tbl;

Also, if a tuple updated and moved to another partition is revived by
heap_force_freeze(), its ctid still has special values:
MovedPartitionsOffsetNumber and MovedPartitionsBlockNumber. I don't
see a problem yet caused by a visible tuple having the special ctid
value, but it might be worth considering either to reset ctid value as
well or to not freezing already-deleted tuple.

Regards,

--
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Reply | Threaded
Open this post in threaded view
|

Re: recovering from "found xmin ... from before relfrozenxid ..."

Ashutosh Sharma
Hello Masahiko-san,

Thanks for the review. Please check the comments inline below:

On Fri, Aug 14, 2020 at 10:07 AM Masahiko Sawada
<[hidden email]> wrote:

> Thank you for updating the patch! Here are my comments on v5 patch:
>
> --- a/contrib/Makefile
> +++ b/contrib/Makefile
> @@ -35,6 +35,7 @@ SUBDIRS = \
>         pg_standby  \
>         pg_stat_statements \
>         pg_trgm     \
> +       pg_surgery  \
>         pgcrypto    \
>
> I guess we use alphabetical order here. So pg_surgery should be placed
> before pg_trgm.
>

Okay, will take care of this in the next version of patch.

> ---
> +           if (heap_force_opt == HEAP_FORCE_KILL)
> +               ItemIdSetDead(itemid);
>
> I think that if the page is an all-visible page, we should clear an
> all-visible bit on the visibility map corresponding to the page and
> PD_ALL_VISIBLE on the page header. Otherwise, index only scan would
> return the wrong results.
>

I think we should let VACUUM do that. Please note that this module is
intended to be used only on a damaged relation and should only be
operated on damaged tuples of such relations. And the execution of any
of the functions provided by this module on a damaged relation must be
followed by VACUUM with DISABLE_PAGE_SKIPPING option on that relation.
This is necessary to bring back a damaged relation to the sane state
once a surgery is performed on it. I will try to add this note in the
documentation for this module.

> ---
> +       /*
> +        * We do not mark the buffer dirty or do WAL logging for unmodifed
> +        * pages.
> +        */
> +       if (!did_modify_page)
> +           goto skip_wal;
> +
> +       /* Mark buffer dirty before we write WAL. */
> +       MarkBufferDirty(buf);
> +
> +       /* XLOG stuff */
> +       if (RelationNeedsWAL(rel))
> +           log_newpage_buffer(buf, true);
> +
> +skip_wal:
> +       END_CRIT_SECTION();
> +
>
> s/unmodifed/unmodified/
>

okay, will fix this typo.

> Do we really need to use goto? I think we can modify it like follows:
>
>     if (did_modity_page)
>     {
>        /* Mark buffer dirty before we write WAL. */
>        MarkBufferDirty(buf);
>
>        /* XLOG stuff */
>        if (RelationNeedsWAL(rel))
>            log_newpage_buffer(buf, true);
>     }
>
>     END_CRIT_SECTION();
>

No, we don't need it. We can achieve the same by checking the status
of did_modify_page flag as you suggested. I will do this change in the
next version.

> ---
> pg_force_freeze() can revival a tuple that is already deleted but not
> vacuumed yet. Therefore, the user might need to reindex indexes after
> using that function. For instance, with the following script, the last
> two queries: index scan and seq scan, will return different results.
>
> set enable_seqscan to off;
> set enable_bitmapscan to off;
> set enable_indexonlyscan to off;
> create table tbl (a int primary key);
> insert into tbl values (1);
>
> update tbl set a = a + 100 where a = 1;
>
> explain analyze select * from tbl where a < 200;
>
> -- revive deleted tuple on heap
> select heap_force_freeze('tbl', array['(0,1)'::tid]);
>
> -- index scan returns 2 tuples
> explain analyze select * from tbl where a < 200;
>
> -- seq scan returns 1 tuple
> set enable_seqscan to on;
> explain analyze select * from tbl;
>

I am not sure if this is the right use-case of pg_force_freeze
function. I think we should only be running pg_force_freeze function
on a tuple for which VACUUM reports "found xmin ABC from before
relfrozenxid PQR" sort of error otherwise it might worsen the things
instead of making it better. Now, the question is - can VACUUM report
this type of error for a deleted tuple or it would only report it for
a live tuple? AFAIU this won't be reported for the deleted tuples
because VACUUM wouldn't consider freezing a tuple that has been
deleted.

> Also, if a tuple updated and moved to another partition is revived by
> heap_force_freeze(), its ctid still has special values:
> MovedPartitionsOffsetNumber and MovedPartitionsBlockNumber. I don't
> see a problem yet caused by a visible tuple having the special ctid
> value, but it might be worth considering either to reset ctid value as
> well or to not freezing already-deleted tuple.
>

For this as well, the answer remains the same as above.

--
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com


Reply | Threaded
Open this post in threaded view
|

Re: recovering from "found xmin ... from before relfrozenxid ..."

Ashutosh Sharma
Hello Masahiko-san,

I've spent some more time trying to understand the code in
lazy_scan_heap function to know under what all circumstances a VACUUM
can fail with "found xmin ... before relfrozenxid ..." error for a
tuple whose xmin is behind relfrozenxid. Here are my observations:

1) It can fail with this error for a live tuple

OR,

2) It can also fail with this error if a tuple (that went through
update) is marked as HEAP_HOT_UPDATED or HEAP_ONLY_TUPLE.

OR,

3) If there are any concurrent transactions, then the tuple might be
marked as HEAPTUPLE_INSERT_IN_PROGRESS or HEAPTUPLE_DELETE_IN_PROGRESS
or HEAPTUPLE_RECENTLY_DEAD in which case also VACUUM can fail with
this error.

Now, AFAIU, as we will be dealing with a damaged table, the chances of
point #3 being the cause of this error looks impossible in our case
because I don't think we will be doing anything in parallel when
performing surgery on a damaged table, in fact we shouldn't be doing
any such things. However, it is quite possible that reason #2 could
cause VACUUM to fail with this sort of error, but, as we are already
skipping redirected item pointers in heap_force_common(), I think, we
would never be marking HEAP_HOT_UPDATED tuple as frozen and I don't
see any problem in marking HEAP_ONLY_TUPLE as frozen. So, probably, we
may not need to handle point #2 as well.

Further, I also don't see VACUUM reporting this error for a tuple that
has been moved from one partition to another. So, I think we might not
need to do any special handling for a tuple that got updated and its
new version was moved to another partition.

If you feel I am missing something here, please correct me. Thank you.

Moreover, while I was exploring on above, I noticed that in
lazy_scan_heap(), before we call HeapTupleSatisfiesVacuum() we check
for a redirected item pointers and if any redirected item pointer is
detected we do not call HeapTupleSatisfiesVacuum(). So, not sure how
HeapTupleSatisfiesVacuum would ever return a dead tuple that is marked
with HEAP_HOT_UPDATED. I am referring to the following code in
lazy_scan_heap().

        for (offnum = FirstOffsetNumber;
             offnum <= maxoff;
             offnum = OffsetNumberNext(offnum))
        {
            ItemId      itemid;

            itemid = PageGetItemId(page, offnum);

.............
.............


            /* Redirect items mustn't be touched */ <-- this check
would bypass the redirected item pointers from being checked for
HeapTupleSatisfiesVacuum.
            if (ItemIdIsRedirected(itemid))
            {
                hastup = true;  /* this page won't be truncatable */
                continue;
            }

..............
..............

            switch (HeapTupleSatisfiesVacuum(&tuple, OldestXmin, buf))
            {
                case HEAPTUPLE_DEAD:

                    if (HeapTupleIsHotUpdated(&tuple) ||
                        HeapTupleIsHeapOnly(&tuple) ||
                        params->index_cleanup == VACOPT_TERNARY_DISABLED)
                        nkeep += 1;
                    else
                        tupgone = true; /* we can delete the tuple */
..............
..............
             }


So, the point is, would HeapTupleIsHotUpdated(&tuple) ever be true?

--
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com

On Mon, Aug 17, 2020 at 11:35 AM Ashutosh Sharma <[hidden email]> wrote:

>
> Hello Masahiko-san,
>
> Thanks for the review. Please check the comments inline below:
>
> On Fri, Aug 14, 2020 at 10:07 AM Masahiko Sawada
> <[hidden email]> wrote:
>
> > Thank you for updating the patch! Here are my comments on v5 patch:
> >
> > --- a/contrib/Makefile
> > +++ b/contrib/Makefile
> > @@ -35,6 +35,7 @@ SUBDIRS = \
> >         pg_standby  \
> >         pg_stat_statements \
> >         pg_trgm     \
> > +       pg_surgery  \
> >         pgcrypto    \
> >
> > I guess we use alphabetical order here. So pg_surgery should be placed
> > before pg_trgm.
> >
>
> Okay, will take care of this in the next version of patch.
>
> > ---
> > +           if (heap_force_opt == HEAP_FORCE_KILL)
> > +               ItemIdSetDead(itemid);
> >
> > I think that if the page is an all-visible page, we should clear an
> > all-visible bit on the visibility map corresponding to the page and
> > PD_ALL_VISIBLE on the page header. Otherwise, index only scan would
> > return the wrong results.
> >
>
> I think we should let VACUUM do that. Please note that this module is
> intended to be used only on a damaged relation and should only be
> operated on damaged tuples of such relations. And the execution of any
> of the functions provided by this module on a damaged relation must be
> followed by VACUUM with DISABLE_PAGE_SKIPPING option on that relation.
> This is necessary to bring back a damaged relation to the sane state
> once a surgery is performed on it. I will try to add this note in the
> documentation for this module.
>
> > ---
> > +       /*
> > +        * We do not mark the buffer dirty or do WAL logging for unmodifed
> > +        * pages.
> > +        */
> > +       if (!did_modify_page)
> > +           goto skip_wal;
> > +
> > +       /* Mark buffer dirty before we write WAL. */
> > +       MarkBufferDirty(buf);
> > +
> > +       /* XLOG stuff */
> > +       if (RelationNeedsWAL(rel))
> > +           log_newpage_buffer(buf, true);
> > +
> > +skip_wal:
> > +       END_CRIT_SECTION();
> > +
> >
> > s/unmodifed/unmodified/
> >
>
> okay, will fix this typo.
>
> > Do we really need to use goto? I think we can modify it like follows:
> >
> >     if (did_modity_page)
> >     {
> >        /* Mark buffer dirty before we write WAL. */
> >        MarkBufferDirty(buf);
> >
> >        /* XLOG stuff */
> >        if (RelationNeedsWAL(rel))
> >            log_newpage_buffer(buf, true);
> >     }
> >
> >     END_CRIT_SECTION();
> >
>
> No, we don't need it. We can achieve the same by checking the status
> of did_modify_page flag as you suggested. I will do this change in the
> next version.
>
> > ---
> > pg_force_freeze() can revival a tuple that is already deleted but not
> > vacuumed yet. Therefore, the user might need to reindex indexes after
> > using that function. For instance, with the following script, the last
> > two queries: index scan and seq scan, will return different results.
> >
> > set enable_seqscan to off;
> > set enable_bitmapscan to off;
> > set enable_indexonlyscan to off;
> > create table tbl (a int primary key);
> > insert into tbl values (1);
> >
> > update tbl set a = a + 100 where a = 1;
> >
> > explain analyze select * from tbl where a < 200;
> >
> > -- revive deleted tuple on heap
> > select heap_force_freeze('tbl', array['(0,1)'::tid]);
> >
> > -- index scan returns 2 tuples
> > explain analyze select * from tbl where a < 200;
> >
> > -- seq scan returns 1 tuple
> > set enable_seqscan to on;
> > explain analyze select * from tbl;
> >
>
> I am not sure if this is the right use-case of pg_force_freeze
> function. I think we should only be running pg_force_freeze function
> on a tuple for which VACUUM reports "found xmin ABC from before
> relfrozenxid PQR" sort of error otherwise it might worsen the things
> instead of making it better. Now, the question is - can VACUUM report
> this type of error for a deleted tuple or it would only report it for
> a live tuple? AFAIU this won't be reported for the deleted tuples
> because VACUUM wouldn't consider freezing a tuple that has been
> deleted.
>
> > Also, if a tuple updated and moved to another partition is revived by
> > heap_force_freeze(), its ctid still has special values:
> > MovedPartitionsOffsetNumber and MovedPartitionsBlockNumber. I don't
> > see a problem yet caused by a visible tuple having the special ctid
> > value, but it might be worth considering either to reset ctid value as
> > well or to not freezing already-deleted tuple.
> >
>
> For this as well, the answer remains the same as above.
>
> --
> With Regards,
> Ashutosh Sharma
> EnterpriseDB:http://www.enterprisedb.com


Reply | Threaded
Open this post in threaded view
|

Re: recovering from "found xmin ... from before relfrozenxid ..."

Ashutosh Sharma
Attached is the new version of patch that addresses the comments from
Asim Praveen and Masahiko-san. It also improves the documentation to
some extent.


On Tue, Aug 18, 2020 at 1:46 PM Ashutosh Sharma <[hidden email]> wrote:

>
> Hello Masahiko-san,
>
> I've spent some more time trying to understand the code in
> lazy_scan_heap function to know under what all circumstances a VACUUM
> can fail with "found xmin ... before relfrozenxid ..." error for a
> tuple whose xmin is behind relfrozenxid. Here are my observations:
>
> 1) It can fail with this error for a live tuple
>
> OR,
>
> 2) It can also fail with this error if a tuple (that went through
> update) is marked as HEAP_HOT_UPDATED or HEAP_ONLY_TUPLE.
>
> OR,
>
> 3) If there are any concurrent transactions, then the tuple might be
> marked as HEAPTUPLE_INSERT_IN_PROGRESS or HEAPTUPLE_DELETE_IN_PROGRESS
> or HEAPTUPLE_RECENTLY_DEAD in which case also VACUUM can fail with
> this error.
>
> Now, AFAIU, as we will be dealing with a damaged table, the chances of
> point #3 being the cause of this error looks impossible in our case
> because I don't think we will be doing anything in parallel when
> performing surgery on a damaged table, in fact we shouldn't be doing
> any such things. However, it is quite possible that reason #2 could
> cause VACUUM to fail with this sort of error, but, as we are already
> skipping redirected item pointers in heap_force_common(), I think, we
> would never be marking HEAP_HOT_UPDATED tuple as frozen and I don't
> see any problem in marking HEAP_ONLY_TUPLE as frozen. So, probably, we
> may not need to handle point #2 as well.
>
> Further, I also don't see VACUUM reporting this error for a tuple that
> has been moved from one partition to another. So, I think we might not
> need to do any special handling for a tuple that got updated and its
> new version was moved to another partition.
>
> If you feel I am missing something here, please correct me. Thank you.
>
> Moreover, while I was exploring on above, I noticed that in
> lazy_scan_heap(), before we call HeapTupleSatisfiesVacuum() we check
> for a redirected item pointers and if any redirected item pointer is
> detected we do not call HeapTupleSatisfiesVacuum(). So, not sure how
> HeapTupleSatisfiesVacuum would ever return a dead tuple that is marked
> with HEAP_HOT_UPDATED. I am referring to the following code in
> lazy_scan_heap().
>
>         for (offnum = FirstOffsetNumber;
>              offnum <= maxoff;
>              offnum = OffsetNumberNext(offnum))
>         {
>             ItemId      itemid;
>
>             itemid = PageGetItemId(page, offnum);
>
> .............
> .............
>
>
>             /* Redirect items mustn't be touched */ <-- this check
> would bypass the redirected item pointers from being checked for
> HeapTupleSatisfiesVacuum.
>             if (ItemIdIsRedirected(itemid))
>             {
>                 hastup = true;  /* this page won't be truncatable */
>                 continue;
>             }
>
> ..............
> ..............
>
>             switch (HeapTupleSatisfiesVacuum(&tuple, OldestXmin, buf))
>             {
>                 case HEAPTUPLE_DEAD:
>
>                     if (HeapTupleIsHotUpdated(&tuple) ||
>                         HeapTupleIsHeapOnly(&tuple) ||
>                         params->index_cleanup == VACOPT_TERNARY_DISABLED)
>                         nkeep += 1;
>                     else
>                         tupgone = true; /* we can delete the tuple */
> ..............
> ..............
>              }
>
>
> So, the point is, would HeapTupleIsHotUpdated(&tuple) ever be true?
>
> --
> With Regards,
> Ashutosh Sharma
> EnterpriseDB:http://www.enterprisedb.com
>
> On Mon, Aug 17, 2020 at 11:35 AM Ashutosh Sharma <[hidden email]> wrote:
> >
> > Hello Masahiko-san,
> >
> > Thanks for the review. Please check the comments inline below:
> >
> > On Fri, Aug 14, 2020 at 10:07 AM Masahiko Sawada
> > <[hidden email]> wrote:
> >
> > > Thank you for updating the patch! Here are my comments on v5 patch:
> > >
> > > --- a/contrib/Makefile
> > > +++ b/contrib/Makefile
> > > @@ -35,6 +35,7 @@ SUBDIRS = \
> > >         pg_standby  \
> > >         pg_stat_statements \
> > >         pg_trgm     \
> > > +       pg_surgery  \
> > >         pgcrypto    \
> > >
> > > I guess we use alphabetical order here. So pg_surgery should be placed
> > > before pg_trgm.
> > >
> >
> > Okay, will take care of this in the next version of patch.
> >
> > > ---
> > > +           if (heap_force_opt == HEAP_FORCE_KILL)
> > > +               ItemIdSetDead(itemid);
> > >
> > > I think that if the page is an all-visible page, we should clear an
> > > all-visible bit on the visibility map corresponding to the page and
> > > PD_ALL_VISIBLE on the page header. Otherwise, index only scan would
> > > return the wrong results.
> > >
> >
> > I think we should let VACUUM do that. Please note that this module is
> > intended to be used only on a damaged relation and should only be
> > operated on damaged tuples of such relations. And the execution of any
> > of the functions provided by this module on a damaged relation must be
> > followed by VACUUM with DISABLE_PAGE_SKIPPING option on that relation.
> > This is necessary to bring back a damaged relation to the sane state
> > once a surgery is performed on it. I will try to add this note in the
> > documentation for this module.
> >
> > > ---
> > > +       /*
> > > +        * We do not mark the buffer dirty or do WAL logging for unmodifed
> > > +        * pages.
> > > +        */
> > > +       if (!did_modify_page)
> > > +           goto skip_wal;
> > > +
> > > +       /* Mark buffer dirty before we write WAL. */
> > > +       MarkBufferDirty(buf);
> > > +
> > > +       /* XLOG stuff */
> > > +       if (RelationNeedsWAL(rel))
> > > +           log_newpage_buffer(buf, true);
> > > +
> > > +skip_wal:
> > > +       END_CRIT_SECTION();
> > > +
> > >
> > > s/unmodifed/unmodified/
> > >
> >
> > okay, will fix this typo.
> >
> > > Do we really need to use goto? I think we can modify it like follows:
> > >
> > >     if (did_modity_page)
> > >     {
> > >        /* Mark buffer dirty before we write WAL. */
> > >        MarkBufferDirty(buf);
> > >
> > >        /* XLOG stuff */
> > >        if (RelationNeedsWAL(rel))
> > >            log_newpage_buffer(buf, true);
> > >     }
> > >
> > >     END_CRIT_SECTION();
> > >
> >
> > No, we don't need it. We can achieve the same by checking the status
> > of did_modify_page flag as you suggested. I will do this change in the
> > next version.
> >
> > > ---
> > > pg_force_freeze() can revival a tuple that is already deleted but not
> > > vacuumed yet. Therefore, the user might need to reindex indexes after
> > > using that function. For instance, with the following script, the last
> > > two queries: index scan and seq scan, will return different results.
> > >
> > > set enable_seqscan to off;
> > > set enable_bitmapscan to off;
> > > set enable_indexonlyscan to off;
> > > create table tbl (a int primary key);
> > > insert into tbl values (1);
> > >
> > > update tbl set a = a + 100 where a = 1;
> > >
> > > explain analyze select * from tbl where a < 200;
> > >
> > > -- revive deleted tuple on heap
> > > select heap_force_freeze('tbl', array['(0,1)'::tid]);
> > >
> > > -- index scan returns 2 tuples
> > > explain analyze select * from tbl where a < 200;
> > >
> > > -- seq scan returns 1 tuple
> > > set enable_seqscan to on;
> > > explain analyze select * from tbl;
> > >
> >
> > I am not sure if this is the right use-case of pg_force_freeze
> > function. I think we should only be running pg_force_freeze function
> > on a tuple for which VACUUM reports "found xmin ABC from before
> > relfrozenxid PQR" sort of error otherwise it might worsen the things
> > instead of making it better. Now, the question is - can VACUUM report
> > this type of error for a deleted tuple or it would only report it for
> > a live tuple? AFAIU this won't be reported for the deleted tuples
> > because VACUUM wouldn't consider freezing a tuple that has been
> > deleted.
> >
> > > Also, if a tuple updated and moved to another partition is revived by
> > > heap_force_freeze(), its ctid still has special values:
> > > MovedPartitionsOffsetNumber and MovedPartitionsBlockNumber. I don't
> > > see a problem yet caused by a visible tuple having the special ctid
> > > value, but it might be worth considering either to reset ctid value as
> > > well or to not freezing already-deleted tuple.
> > >
> >
> > For this as well, the answer remains the same as above.
> >
> > --
> > With Regards,
> > Ashutosh Sharma
> > EnterpriseDB:http://www.enterprisedb.com

v6-0001-Add-contrib-pg_surgery-to-perform-surgery-on-a-damag.patch (36K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: recovering from "found xmin ... from before relfrozenxid ..."

Chris Travers-7
In reply to this post by Peter Geoghegan-4


On Tue, Jul 14, 2020 at 12:28 AM Peter Geoghegan <[hidden email]> wrote:
On Mon, Jul 13, 2020 at 2:12 PM Robert Haas <[hidden email]> wrote:
> 1. There's nothing to identify the tuple that has the problem, and no
> way to know how many more of them there might be. Back-patching
> b61d161c146328ae6ba9ed937862d66e5c8b035a would help with the first
> part of this.

I am in favor of backpatching such changes in cases where senior
community members feel that it could help with hypothetical
undiscovered data corruption issues -- if they're willing to take
responsibility for the change. It certainly wouldn't be the first
time. A "defense in depth" mindset seems like the right one when it
comes to data corruption bugs. Early detection is really important.

> Moreover, not everyone is as
> interested in an extended debugging exercise as they are in getting
> the system working again, and VACUUM failing repeatedly is a pretty
> serious problem.

That's absolutely consistent with my experience. Most users want to
get back to business as usual now, while letting somebody else do the
hard work of debugging.

Also even if you do trace the problem you still have to recover.

And sometimes I have found latent corruption from times when dbs were running on older versions and older servers, making debugging largely a futile exercise.

> Therefore, one of my colleagues has - at my request - created a couple
> of functions called heap_force_kill() and heap_force_freeze() which
> take an array of TIDs.

> So I have these questions:
>
> - Do people think it would me smart/good/useful to include something
> like this in PostgreSQL?

I'm in favor of it.

+1

Would be worth extending it with some functions to grab rows that have various TOAST oids too.

> - If so, how? I would propose a new contrib module that we back-patch
> all the way, because the VACUUM errors were back-patched all the way,
> and there seems to be no advantage in making people wait 5 years for a
> new version that has some kind of tooling in this area.

I'm in favor of it being *possible* to backpatch tooling that is
clearly related to correctness in a fundamental way. Obviously this
would mean that we'd be revising our general position on backpatching
to allow some limited exceptions around corruption. I'm not sure that
this meets that standard, though. It's hardly something that we can
expect all that many users to be able to use effectively.

I may be biased, but I'd be inclined to permit it in the case of
something like amcheck, or pg_visibility, on the grounds that they're
more or less the same as the new VACUUM errcontext instrumentation you
mentioned. The same cannot be said of something like this new
heap_force_kill() stuff.

> - Any ideas for additional things we should include, or improvements
> on the sketch above?

Clearly you should work out a way of making it very hard to
accidentally (mis)use. For example, maybe you make the functions check
for the presence of a sentinel file in the data directory.

Agreed.


--
Peter Geoghegan




--
Best Regards,
Chris Travers
Head of Database

Tel: +49 162 9037 210 | Skype: einhverfr | www.adjust.com 
Saarbrücker Straße 37a, 10405 Berlin

Reply | Threaded
Open this post in threaded view
|

Re: recovering from "found xmin ... from before relfrozenxid ..."

rajkumar.raghuwanshi
Thanks for suggestion Ashutosh, I have done testing around these suggestion
and found no issues. I will continue testing same with updated patch posted
on this thread.

On Fri, Aug 7, 2020 at 12:45 PM Ashutosh Sharma <[hidden email]> wrote:
Thanks Rajkumar for testing the patch.

Here are some of the additional test-cases that I would suggest you to
execute, if possible:

1) You may try running the test-cases that you have executed so far
with SR setup and see if the changes are getting reflected on the
standby.

2) You may also try running some concurrent test-cases for e.g. try
running these functions with VACUUM or some other sql commands
(preferable DML commands) in parallel.

3) See what happens when you pass some invalid tids (containing
invalid block or offset number) to these functions. You may also try
running these functions on the same tuple repeatedly and see the
behaviour.

...

--
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com
 
Thanks & Regards,
Rajkumar Raghuwanshi



Reply | Threaded
Open this post in threaded view
|

Re: recovering from "found xmin ... from before relfrozenxid ..."

Alvaro Herrera-9
In reply to this post by Ashutosh Sharma
On 2020-Aug-17, Ashutosh Sharma wrote:

> > +           if (heap_force_opt == HEAP_FORCE_KILL)
> > +               ItemIdSetDead(itemid);
> >
> > I think that if the page is an all-visible page, we should clear an
> > all-visible bit on the visibility map corresponding to the page and
> > PD_ALL_VISIBLE on the page header. Otherwise, index only scan would
> > return the wrong results.
>
> I think we should let VACUUM do that. Please note that this module is
> intended to be used only on a damaged relation and should only be
> operated on damaged tuples of such relations. And the execution of any
> of the functions provided by this module on a damaged relation must be
> followed by VACUUM with DISABLE_PAGE_SKIPPING option on that relation.
> This is necessary to bring back a damaged relation to the sane state
> once a surgery is performed on it. I will try to add this note in the
> documentation for this module.

It makes sense to recommend VACUUM after fixing the page, but I agree
with Sawada-san that it would be sensible to reset the VM bit while
doing surgery, since that's the state that the page would be in.  We
should certainly *strongly recommend* to do VACUUM DISABLE_PAGE_SKIPPING,
but if users fail to do so, then leaving the VM bit set just means that
we know *for certain* that there will be further corruption as soon as
the XID counter advances sufficiently.

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Reply | Threaded
Open this post in threaded view
|

Re: recovering from "found xmin ... from before relfrozenxid ..."

Masahiko Sawada-2
In reply to this post by Ashutosh Sharma
On Mon, 17 Aug 2020 at 15:05, Ashutosh Sharma <[hidden email]> wrote:

>
> Hello Masahiko-san,
>
> Thanks for the review. Please check the comments inline below:
>
> On Fri, Aug 14, 2020 at 10:07 AM Masahiko Sawada
> <[hidden email]> wrote:
>
> > Thank you for updating the patch! Here are my comments on v5 patch:
> >
> > --- a/contrib/Makefile
> > +++ b/contrib/Makefile
> > @@ -35,6 +35,7 @@ SUBDIRS = \
> >         pg_standby  \
> >         pg_stat_statements \
> >         pg_trgm     \
> > +       pg_surgery  \
> >         pgcrypto    \
> >
> > I guess we use alphabetical order here. So pg_surgery should be placed
> > before pg_trgm.
> >
>
> Okay, will take care of this in the next version of patch.
>
> > ---
> > +           if (heap_force_opt == HEAP_FORCE_KILL)
> > +               ItemIdSetDead(itemid);
> >
> > I think that if the page is an all-visible page, we should clear an
> > all-visible bit on the visibility map corresponding to the page and
> > PD_ALL_VISIBLE on the page header. Otherwise, index only scan would
> > return the wrong results.
> >
>
> I think we should let VACUUM do that. Please note that this module is
> intended to be used only on a damaged relation and should only be
> operated on damaged tuples of such relations. And the execution of any
> of the functions provided by this module on a damaged relation must be
> followed by VACUUM with DISABLE_PAGE_SKIPPING option on that relation.
> This is necessary to bring back a damaged relation to the sane state
> once a surgery is performed on it. I will try to add this note in the
> documentation for this module.
>
> > ---
> > +       /*
> > +        * We do not mark the buffer dirty or do WAL logging for unmodifed
> > +        * pages.
> > +        */
> > +       if (!did_modify_page)
> > +           goto skip_wal;
> > +
> > +       /* Mark buffer dirty before we write WAL. */
> > +       MarkBufferDirty(buf);
> > +
> > +       /* XLOG stuff */
> > +       if (RelationNeedsWAL(rel))
> > +           log_newpage_buffer(buf, true);
> > +
> > +skip_wal:
> > +       END_CRIT_SECTION();
> > +
> >
> > s/unmodifed/unmodified/
> >
>
> okay, will fix this typo.
>
> > Do we really need to use goto? I think we can modify it like follows:
> >
> >     if (did_modity_page)
> >     {
> >        /* Mark buffer dirty before we write WAL. */
> >        MarkBufferDirty(buf);
> >
> >        /* XLOG stuff */
> >        if (RelationNeedsWAL(rel))
> >            log_newpage_buffer(buf, true);
> >     }
> >
> >     END_CRIT_SECTION();
> >
>
> No, we don't need it. We can achieve the same by checking the status
> of did_modify_page flag as you suggested. I will do this change in the
> next version.
>
> > ---
> > pg_force_freeze() can revival a tuple that is already deleted but not
> > vacuumed yet. Therefore, the user might need to reindex indexes after
> > using that function. For instance, with the following script, the last
> > two queries: index scan and seq scan, will return different results.
> >
> > set enable_seqscan to off;
> > set enable_bitmapscan to off;
> > set enable_indexonlyscan to off;
> > create table tbl (a int primary key);
> > insert into tbl values (1);
> >
> > update tbl set a = a + 100 where a = 1;
> >
> > explain analyze select * from tbl where a < 200;
> >
> > -- revive deleted tuple on heap
> > select heap_force_freeze('tbl', array['(0,1)'::tid]);
> >
> > -- index scan returns 2 tuples
> > explain analyze select * from tbl where a < 200;
> >
> > -- seq scan returns 1 tuple
> > set enable_seqscan to on;
> > explain analyze select * from tbl;
> >
>
> I am not sure if this is the right use-case of pg_force_freeze
> function. I think we should only be running pg_force_freeze function
> on a tuple for which VACUUM reports "found xmin ABC from before
> relfrozenxid PQR" sort of error otherwise it might worsen the things
> instead of making it better.

Should this also be documented? I think that it's hard to force the
user to always use this module in the right situation but we need to
show at least when to use.

> > Also, if a tuple updated and moved to another partition is revived by
> > heap_force_freeze(), its ctid still has special values:
> > MovedPartitionsOffsetNumber and MovedPartitionsBlockNumber. I don't
> > see a problem yet caused by a visible tuple having the special ctid
> > value, but it might be worth considering either to reset ctid value as
> > well or to not freezing already-deleted tuple.
> >
>
> For this as well, the answer remains the same as above.

Perhaps the same is true when a tuple header is corrupted including
xmin and ctid for some reason and the user wants to fix it? I'm
concerned that a live tuple having the wrong ctid will cause SEGV or
PANIC error in the future.

Regards,

--
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


12345678