RFC: Making TRUNCATE more "MVCC-safe"


Marti Raudsepp
Hi!

I've always been a little wary of using the TRUNCATE command due to
the warning in the documentation about it not being "MVCC-safe":
queries may silently give wrong results and it's hard to tell when
they are affected.

That got me thinking, why can't we handle this like a standby server
does -- if some query has data removed from underneath it, it aborts
with a serialization failure.

Does this solution sound like a good idea?

The attached patch is a lame attempt at implementing this. I added a
new pg_class.relvalidxmin attribute which tracks the Xid of the last
TRUNCATE (maybe it should be called reltruncatexid?). Whenever
starting a relation scan with a snapshot older than relvalidxmin, an
error is thrown. This seems to work out well since TRUNCATE updates
pg_class anyway, and anyone who sees the new relfilenode automatically
knows when it was truncated.
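
In rough outline, the check at scan start could look something like this
(just a sketch, not necessarily how the attached patch spells it; the
relvalidxmin relcache field and the helper name are assumptions):

/*
 * Sketch only: reject MVCC scans whose snapshot predates the last
 * TRUNCATE, assuming a new pg_class field "relvalidxmin" cached in the
 * relcache entry.  Header names are approximate.
 */
#include "postgres.h"

#include "access/transam.h"
#include "utils/rel.h"
#include "utils/tqual.h"

static void
CheckSnapshotValidForRelation(Relation rel, Snapshot snapshot)
{
    TransactionId relvalidxmin = rel->rd_rel->relvalidxmin;    /* assumed field */

    if (TransactionIdIsValid(relvalidxmin) &&
        IsMVCCSnapshot(snapshot) &&
        TransactionIdPrecedes(snapshot->xmin, relvalidxmin))
        ereport(ERROR,
                (errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
                 errmsg("canceling statement due to conflict with TRUNCATE TABLE on %s",
                        RelationGetRelationName(rel)),
                 errdetail("Rows visible to this transaction have been removed.")));
}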

Am I on the right track? Are there any better ways to attach this
information to a relation?
Should I also add another counter to pg_stat_database_conflicts?
Currently this table is only used on standby servers.

Since I wrote it just this afternoon, there are a few things still
wrong with the patch (it doesn't handle xid wraparound for one), so
don't be too picky about the code yet. :)

Example:
  CREATE TABLE foo (i int);
Session A:
  BEGIN ISOLATION LEVEL REPEATABLE READ;
  SELECT txid_current(); -- Force snapshot open
Session B:
  TRUNCATE TABLE foo;
Session A:
  SELECT * FROM foo;
ERROR:  canceling statement due to conflict with TRUNCATE TABLE on foo
DETAIL:  Rows visible to this transaction have been removed.


Patch also available in my github 'truncate' branch:
https://github.com/intgr/postgres/commits/truncate

Regards,
Marti



Attachment: safe-truncate.patch (19K)

Re: RFC: Making TRUNCATE more "MVCC-safe"

Noah Misch
On Thu, Feb 09, 2012 at 11:11:16PM +0200, Marti Raudsepp wrote:

> I've always been a little wary of using the TRUNCATE command due to
> the warning in the documentation about it not being "MVCC-safe":
> queries may silently give wrong results and it's hard to tell when
> they are affected.
>
> That got me thinking, why can't we handle this like a standby server
> does -- if some query has data removed from underneath it, it aborts
> with a serialization failure.
>
> Does this solution sound like a good idea?
>
> The attached patch is a lame attempt at implementing this. I added a
> new pg_class.relvalidxmin attribute which tracks the Xid of the last
> TRUNCATE (maybe it should be called reltruncatexid?). Whenever
> starting a relation scan with a snapshot older than relvalidxmin, an
> error is thrown. This seems to work out well since TRUNCATE updates
> pg_class anyway, and anyone who sees the new relfilenode automatically
> knows when it was truncated.

I like the design you have chosen.  It would find applications beyond
TRUNCATE, so your use of non-specific naming is sound.  For example, older
snapshots will see an empty table "t" after "CREATE TABLE t AS SELECT 1"
commits; that's a comparable MVCC anomaly.  Some of our non-MVCC-safe commands
should perhaps just become MVCC-safe, but there will always be use cases for
operations that shortcut MVCC.  When one truly does want that, your proposal
for keeping behavior consistent makes plenty of sense.

> Should I also add another counter to pg_stat_database_conflicts?
> Currently this table is only used on standby servers.

> ERROR:  canceling statement due to conflict with TRUNCATE TABLE on foo
> DETAIL:  Rows visible to this transaction have been removed.

My initial reaction is not to portray this like a recovery conflict, since
several aspects distinguish it from all recovery conflict types.

(I have not read your actual patch.)

Thanks,
nm


Re: RFC: Making TRUNCATE more "MVCC-safe"

Robert Haas
On Fri, Feb 10, 2012 at 6:42 AM, Noah Misch <[hidden email]> wrote:
> I like the design you have chosen.  It would find applications beyond
> TRUNCATE, so your use of non-specific naming is sound.  For example, older
> snapshots will see an empty table "t" after "CREATE TABLE t AS SELECT 1"
> commits; that's a comparable MVCC anomaly.  Some of our non-MVCC-safe commands
> should perhaps just become MVCC-safe, but there will always be use cases for
> operations that shortcut MVCC.  When one truly does want that, your proposal
> for keeping behavior consistent makes plenty of sense.

I guess I'm not particularly excited by the idea of trying to make
TRUNCATE MVCC-safe.  I notice that the example involves the REPEATABLE
READ isolation level, which is already known to be busted in a variety
of ways; that's why we now have SERIALIZABLE, and why most people use
READ COMMITTED.  Are there examples of this behavior at other
isolation levels?

But I have to admit I'm intrigued by the idea of extending this to
other cases, if there's a reasonable way to do that.  For example, if
we could fix things up so that we don't see a table at all if it was
created after we took our snapshot, that would remove one of the
obstacles to marking pages bulk-loaded into that table with FrozenXID
and PD_ALL_VISIBLE from the get-go.  I think a lot of people would be
mighty happy about that.

But the necessary semantics seem somewhat different.  For TRUNCATE,
you want to throw a serialization error; but is that also what you
want for CREATE TABLE?  Or do you just want it to appear empty?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: RFC: Making TRUNCATE more "MVCC-safe"

Noah Misch
On Fri, Feb 10, 2012 at 01:59:18PM -0500, Robert Haas wrote:

> On Fri, Feb 10, 2012 at 6:42 AM, Noah Misch <[hidden email]> wrote:
> > I like the design you have chosen.  It would find applications beyond
> > TRUNCATE, so your use of non-specific naming is sound.  For example, older
> > snapshots will see an empty table "t" after "CREATE TABLE t AS SELECT 1"
> > commits; that's a comparable MVCC anomaly.  Some of our non-MVCC-safe commands
> > should perhaps just become MVCC-safe, but there will always be use cases for
> > operations that shortcut MVCC.  When one truly does want that, your proposal
> > for keeping behavior consistent makes plenty of sense.
>
> I guess I'm not particularly excited by the idea of trying to make
> TRUNCATE MVCC-safe.  I notice that the example involves the REPEATABLE
> READ isolation level, which is already known to be busted in a variety
> of ways; that's why we now have SERIALIZABLE, and why most people use
> READ COMMITTED.  Are there examples of this behavior at other
> isolation levels?

I've yet to see an MVCC anomaly that one can reproduce at REPEATABLE READ and
not at READ COMMITTED.  They tend to be narrow race conditions at READ
COMMITTED, yet easy to demonstrate at REPEATABLE READ.  Related:
http://archives.postgresql.org/pgsql-performance/2011-02/msg00451.php

Incidentally, people use READ COMMITTED because they don't question the
default, not because they know hazards of REPEATABLE READ.  I don't know the
bustedness you speak of; could we improve the documentation to inform folks?

> But I have to admit I'm intrigued by the idea of extending this to
> other cases, if there's a reasonable way to do that.  For example, if
> we could fix things up so that we don't see a table at all if it was
> created after we took our snapshot, that would remove one of the
> obstacles to marking pages bulk-loaded into that table with FrozenXID
> and PD_ALL_VISIBLE from the get-go.  I think a lot of people would be
> mighty happy about that.

Exactly.

> But the necessary semantics seem somewhat different.  For TRUNCATE,
> you want to throw a serialization error; but is that also what you
> want for CREATE TABLE?  Or do you just want it to appear empty?

I think an error helps just as much there.  If you create a table and populate
it with data in the same transaction, letting some snapshot see an empty table
is an atomicity failure.

Your comment illustrates a helpful point: this is just another kind of
ordinary serialization failure, one that can happen at any isolation level.
No serial transaction sequence can yield one of these errors.

Thanks,
nm


Re: RFC: Making TRUNCATE more "MVCC-safe"

Dan Ports
In reply to this post by Robert Haas
On Fri, Feb 10, 2012 at 01:59:18PM -0500, Robert Haas wrote:
> I guess I'm not particularly excited by the idea of trying to make
> TRUNCATE MVCC-safe.  I notice that the example involves the REPEATABLE
> READ isolation level, which is already known to be busted in a variety
> of ways; that's why we now have SERIALIZABLE, and why most people use
> READ COMMITTED.  Are there examples of this behavior at other
> isolation levels?

Marti's example works for SERIALIZABLE isolation too. In general, when
DDL operations weren't previously MVCC-safe under REPEATABLE READ, we
didn't change that in SERIALIZABLE.

There's some SSI code for TRUNCATE TABLE that tries to do something
reasonable, and it catches some (more subtle) anomalies involving
concurrent truncates -- but it can only do so much when TRUNCATE itself
isn't MVCC-safe. I expect that the combination of that code and this
patch would ensure full serializability for TRUNCATE operations.

Dan

--
Dan R. K. Ports              MIT CSAIL                http://drkp.net/


Re: RFC: Making TRUNCATE more "MVCC-safe"

Robert Haas
In reply to this post by Noah Misch
On Fri, Feb 10, 2012 at 11:46 PM, Noah Misch <[hidden email]> wrote:

> On Fri, Feb 10, 2012 at 01:59:18PM -0500, Robert Haas wrote:
>> On Fri, Feb 10, 2012 at 6:42 AM, Noah Misch <[hidden email]> wrote:
> > > I like the design you have chosen.  It would find applications beyond
> > > TRUNCATE, so your use of non-specific naming is sound.  For example, older
> > > snapshots will see an empty table "t" after "CREATE TABLE t AS SELECT 1"
> > > commits; that's a comparable MVCC anomaly.  Some of our non-MVCC-safe commands
> > > should perhaps just become MVCC-safe, but there will always be use cases for
> > > operations that shortcut MVCC.  When one truly does want that, your proposal
>> > for keeping behavior consistent makes plenty of sense.
>>
>> I guess I'm not particularly excited by the idea of trying to make
>> TRUNCATE MVCC-safe.  I notice that the example involves the REPEATABLE
>> READ isolation level, which is already known to be busted in a variety
>> of ways; that's why we now have SERIALIZABLE, and why most people use
>> READ COMMITTED.  Are there examples of this behavior at other
>> isolation levels?
>
> I've yet to see an MVCC anomaly that one can reproduce at REPEATABLE READ and
> not at READ COMMITTED.  They tend to be narrow race conditions at READ
> COMMITTED, yet easy to demonstrate at REPEATABLE READ.  Related:
> http://archives.postgresql.org/pgsql-performance/2011-02/msg00451.php

Yeah.  Well, that's actually an interesting example, because it
illustrates how general this problem is.  We could potentially get
ourselves into a situation where just about every system catalog table
needs an xmin field to store the point at which the object came into
existence - or for that matter, was updated.  But it's not quite the
same as the xmin of the row itself, because some updates might be
judged not to matter.  There could also be intermediate cases where
updates are invalidating for some purposes but not others.  I think
we'd better get our hands around more of the problem space before we
start trying to engineer solutions.

> Incidentally, people use READ COMMITTED because they don't question the
> default, not because they know hazards of REPEATABLE READ.  I don't know the
> bustedness you speak of; could we improve the documentation to inform folks?

The example that I remember was related to SELECT FOR UPDATE/SELECT
FOR SHARE.  The idea of those statements is that you want to prevent
the row from being updated or deleted until some other concurrent
action is complete; for example, in the case of a foreign key, we'd
like to prevent the referenced row from being deleted or updated in
the relevant columns until the inserting transaction is committed.
But it doesn't work, because when the updating or deleting process
gets done with the lock wait, they are still using the same snapshot
as before, and merrily do exactly the the thing that the lock-wait was
supposed to prevent.  If an actual UPDATE is used, it's safe (I
think): anyone who was going to UPDATE or DELETE the row will fail
with some kind of serialization error.  But a SELECT FOR UPDATE that
commits is treated more like an UPDATE that rolls back: it's as if the
lock never existed.  Someone (Florian?) proposed a patch to change
this, but it seemed problematic for reasons I no longer exactly
remember.

When using an actual foreign key, we work around this by taking a new
snapshot to cross-check that things haven't changed under us, but
user-level code can't do that.  At READ COMMITTED, depending on the
situation, either the fact that we take new snapshots pretty
frequently or the EPQ machinery sometimes make things work sensibly
anyway, and at SERIALIZABLE, SSI prevents these kinds of anomalies.
But REPEATABLE READ has no protection.  I wish I could find the thread
where we discussed this before.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: RFC: Making TRUNCATE more "MVCC-safe"

Kevin Grittner
Robert Haas <[hidden email]> wrote:
 

> The example that I remember was related to SELECT FOR
> UPDATE/SELECT FOR SHARE.  The idea of those statements is that you
> want to prevent the row from being updated or deleted until some
> other concurrent action is complete; for example, in the case of a
> foreign key, we'd like to prevent the referenced row from being
> deleted or updated in the relevant columns until the inserting
> transaction is committed.  But it doesn't work, because when the
> updating or deleting process gets done with the lock wait, they
> are still using the same snapshot as before, and merrily do
> exactly the thing that the lock-wait was supposed to prevent.
 
This issue is one which appears to be a problem for people trying to
migrate from Oracle, where a write conflict would be generated.
 
> If an actual UPDATE is used, it's safe (I think): anyone who was
> going to UPDATE or DELETE the row will fail with some kind of
> serialization error.
 
Right; a write conflict.
 
> But a SELECT FOR UPDATE that commits is treated more like an
> UPDATE that rolls back: it's as if the lock never existed.
> Someone (Florian?) proposed a patch to change this, but it seemed
> problematic for reasons I no longer exactly remember.
 
It had to do with only having one xmax and how that worked with
subtransactions.
 
Of course, besides the technical obstacles, such a semantic change
could break existing code for PostgreSQL users.  :-(
 
> When using an actual foreign key, we work around this by taking a
> new snapshot to cross-check that things haven't changed under us,
> but user-level code can't do that.  At READ COMMITTED, depending
> on the situation, either the fact that we take new snapshots
> pretty frequently or the EPQ machinery sometimes make things work
> sensibly anyway, and at SERIALIZABLE, SSI prevents these kinds of
> anomalies.  But REPEATABLE READ has no protection.
 
Well, personally I have a hard time calling READ COMMITTED behavior
sensible.  Consider this:
 
-- connection 1
test=# create table t (id int not null primary key);
NOTICE:  CREATE TABLE / PRIMARY KEY will create implicit index
"t_pkey" for table "t"
CREATE TABLE
test=# insert into t select generate_series(1, 10);
INSERT 0 10
 
-- connection 2
test=# begin;
BEGIN
test=# update t set id = id - 1;
UPDATE 10
 
-- connection 1
test=# select * from t where id = (select min(id) from t) for
update;
[blocks]
 
-- connection 2
test=# commit;
COMMIT
 
-- connection 1
[unblocks]
 id
----
(0 rows)
 
-Kevin


Re: RFC: Making TRUNCATE more "MVCC-safe"

Robert Haas
On Mon, Feb 13, 2012 at 10:48 AM, Kevin Grittner
<[hidden email]> wrote:
> Well, personally I have a hard time calling READ COMMITTED behavior
> sensible.  Consider this:
>
> [ gigantic pile of fail ]

Yeah, that's bad all right.  I think it's hard to argue that the
current behavior is sensible; the trick is to figure out something
that's better but doesn't impose too much additional overhead.  I
wonder if it's possible to use SSI (or some stripped-down mechanism
along similar lines) to track these kinds of write conflicts, rather
than cluttering the system catalogs with lots more TransactionId
fields.  Like, when we TRUNCATE a table, we could essentially make a
note in memory someplace recording the write conflict.

Unfortunately, the full-blown SSI machinery seems too expensive to use
for this, especially on all-read workloads where there are no actual
conflicts but lots of conflict checks.  But that could probably be
optimized.  The attraction of using something like an in-memory
conflict manager is that it would probably be quite general.  We could
fix problems of this type with no on-disk format changes whenever we
discover them (as we will certainly continue to do) just by adding the
appropriate flag-a-conflict calls to the right places in the code.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: RFC: Making TRUNCATE more "MVCC-safe"

Kevin Grittner
Robert Haas <[hidden email]> wrote:

> Kevin Grittner <[hidden email]> wrote:
>> Well, personally I have a hard time calling READ COMMITTED
>> behavior sensible.  Consider this:
>>
>> [ gigantic pile of fail ]
>
> Yeah, that's bad all right.  I think it's hard to argue that the
> current behavior is sensible; the trick is to figure out something
> that's better but doesn't impose too much additional overhead.  I
> wonder if it's possible to use SSI (or some stripped-down
> mechanism along similar lines) to track these kinds of write
> conflicts, rather than cluttering the system catalogs with lots
> more TransactionId fields.  Like, when we TRUNCATE a table, we
> could essentially make a note in memory someplace recording the
> write conflict.
 
Potential additional uses of the predicate locking developed for SSI
keep surfacing.  At some point we should probably pick a couple of
them and try to fashion an API for the non-blocking predicate
locking logic that serves them and SSI.  Since this predicate
locking system is explicitly interested only in read-write
conflicts, it does seem like it could work for SELECT FOR UPDATE
versus writes.
 
As mentioned once or twice before, it was pretty clear that while
predicate locking is required for SSI and is probably 80% of the C
code in the patch, it is really a separate thing -- we just didn't
want to try to create a "generalized" API based on the one initial
usage.  I think that an API based on registering and listening would
be the ticket.
 
> Unfortunately, the full-blown SSI machinery seems too expensive to
> use for this, especially on all-read workloads where there are no
> actual conflicts but lots of conflict checks.
 
In an all-read workload, if you actually declare the transactions to
be read-only SSI should not introduce much overhead.  If there's
much overhead showing up there at the moment, it would probably be
pretty easy to eliminate.  When there are any read-write
transactions active, it's a different story.
 
> But that could probably be optimized.
 
Undoubtedly.  It's disappointing that neither Dan nor I could find
the round tuits to make the kinds of changes in the SSI locking that
you made in some other areas for 9.2.  I'm not really sure how the
performance impact breaks down between predicate locking and SSI
proper, although I would tend to start from the assumption that,
like the lines of code, it's 80% in the predicate locking.
 
> The attraction of using something like an in-memory conflict
> manager is that it would probably be quite general.  We could fix
> problems of this type with no on-disk format changes whenever we
> discover them (as we will certainly continue to do) just by adding
> the appropriate flag-a-conflict calls to the right places in the
> code.
 
Assuming that the problems could be expressed in terms of read-write
conflicts, that's probably largely true.  I'm not sure that holds
for some of the funny READ COMMITTED behaviors, though.  I think the
only real "cure" there would be to make subtransactions cheap enough
that we could wrap execution of each SELECT and DML statement in a
subtransaction and roll back for another try with a new snapshot on
conflict.
 
If you want to track something other than read-write conflicts
and/or you want blocking when a conflict is found, you might be
better off looking to bend the heavyweight locks to your purposes.
 
-Kevin


Re: RFC: Making TRUNCATE more "MVCC-safe"

Simon Riggs
In reply to this post by Noah Misch
On Sat, Feb 11, 2012 at 4:46 AM, Noah Misch <[hidden email]> wrote:

>> But I have to admit I'm intrigued by the idea of extending this to
>> other cases, if there's a reasonable way to do that.  For example, if
>> we could fix things up so that we don't see a table at all if it was
>> created after we took our snapshot, that would remove one of the
>> obstacles to marking pages bulk-loaded into that table with FrozenXID
>> and PD_ALL_VISIBLE from the get-go.  I think a lot of people would be
>> mighty happy about that.
>
> Exactly.
>
>> But the necessary semantics seem somewhat different.  For TRUNCATE,
>> you want to throw a serialization error; but is that also what you
>> want for CREATE TABLE?  Or do you just want it to appear empty?
>
> I think an error helps just as much there.  If you create a table and populate
> it with data in the same transaction, letting some snapshot see an empty table
> is an atomicity failure.
>
> Your comment illustrates a helpful point: this is just another kind of
> ordinary serialization failure, one that can happen at any isolation level.
> No serial transaction sequence can yield one of these errors.

Thanks Noah for drawing attention to this thread. I hadn't been
watching. As you say, this work would allow me to freeze rows at load
time and avoid the overhead of hint bit setting, which avoids
performance issues from hint bit setting in checksum patch.

I've reviewed this patch and it seems OK to me. Good work Marti.

v2 patch attached, updated to latest HEAD. Patch adds
* a GUC called strict_mvcc to isolate the new behaviour if required
* relvalidxid is reset by VACUUM/ANALYZE to avoid wraparound failure

At present this lacks docs for strict_mvcc and doesn't attempt to
handle CREATE TABLE case yet, but should be straightforward to do so.

Hint bit setting is in separate patch on other thread.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Attachment: safe_truncate.v2.patch (23K)

Re: RFC: Making TRUNCATE more "MVCC-safe"

Marti Raudsepp
On Sat, Mar 3, 2012 at 14:53, Simon Riggs <[hidden email]> wrote:
> Thanks Noah for drawing attention to this thread. I hadn't been
> watching. As you say, this work would allow me to freeze rows at load
> time and avoid the overhead of hint bit setting, which avoids
> performance issues from hint bit setting in checksum patch.
>
> I've reviewed this patch and it seems OK to me. Good work Marti.

Thanks! This approach wasn't universally liked, but if it gets us
tangible benefits (COPY with frozenxid) then I guess it's a reason to
reconsider.

> v2 patch attached, updated to latest HEAD. Patch adds
> * relvalidxid is reset by VACUUM/ANALYZE to avoid wraparound failure

Personally I'd rather keep this out of ANALYZE -- since its purpose is
to collect stats; VACUUM is responsible for correctness (xid
wraparound etc). But I don't feel strongly about this.

A more important consideration is how this interacts with hot standby.
Currently you compare OldestXmin to relvalidxmin to decide when to
reset it. But the standby's OldestXmin may be older than the master's.
(If VACUUM removes rows then this causes a recovery conflict, but
AFAICT that won't happen if only relvalidxmin changes)

It might be more robust to wait until relfrozenxid exceeds
relvalidxmin -- by then, recovery conflict mechanisms will have taken
care of killing all older snapshots, or am I mistaken?
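
Concretely, the sort of reset I have in mind might look like this inside
vacuum's pg_class update (hand-wavy sketch; relvalidxmin is the patch's
field, the helper name is made up):

/*
 * Clear relvalidxmin only once relfrozenxid has overtaken it; by then no
 * snapshot old enough to be affected should still exist, on the master or
 * (via recovery conflicts) on standbys.
 */
static void
MaybeResetRelValidXmin(Form_pg_class pgcform, TransactionId newfrozenxid)
{
    if (TransactionIdIsValid(pgcform->relvalidxmin) &&
        TransactionIdPrecedes(pgcform->relvalidxmin, newfrozenxid))
        pgcform->relvalidxmin = InvalidTransactionId;
}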

And a few typos in the code...

+ gettext_noop("When enabled viewing a truncated or newly created table "
+ "will throw a serialization error to prevent MVCC "
+ "discrepancies that were possible priot to 9.2.")

"prior" not "priot"

+ * Reset relvalidxmin if its old enough

Should be "it's" in this context.

Regards,
Marti


Re: RFC: Making TRUNCATE more "MVCC-safe"

Simon Riggs
On Sun, Mar 4, 2012 at 2:28 AM, Marti Raudsepp <[hidden email]> wrote:

> On Sat, Mar 3, 2012 at 14:53, Simon Riggs <[hidden email]> wrote:
>> Thanks Noah for drawing attention to this thread. I hadn't been
>> watching. As you say, this work would allow me to freeze rows at load
>> time and avoid the overhead of hint bit setting, which avoids
>> performance issues from hint bit setting in checksum patch.
>>
>> I've reviewed this patch and it seems OK to me. Good work Marti.
>
> Thanks! This approach wasn't universally liked, but if it gets us
> tangible benefits (COPY with frozenxid) then I guess it's a reason to
> reconsider.

Comments I see support this idea. If we did this purely for truncate
correctness I probably wouldn't care either, which is why I didn't
read this thread in the first place.

>> v2 patch attached, updated to latest HEAD. Patch adds
>> * relvalidxid is reset by VACUUM/ANALYZE to avoid wraparound failure
>
> Personally I'd rather keep this out of ANALYZE -- since its purpose is
> to collect stats; VACUUM is responsible for correctness (xid
> wraparound etc). But I don't feel strongly about this.

If there were a reason to do it, then I would agree. Later you point
out a reason, so I will make this change.

> A more important consideration is how this interacts with hot standby.
> Currently you compare OldestXmin to relvalidxmin to decide when to
> reset it. But the standby's OldestXmin may be older than the master's.
> (If VACUUM removes rows then this causes a recovery conflict, but
> AFAICT that won't happen if only relvalidxmin changes)

As of 9.1, the standby's oldestxmin is incorporated into the master's
via hot_standby_feedback, so it wouldn't typically be a problem these
days. It's possible that the standby has this set off by choice, in
which case anomalies could exist in the case that a VACUUM doesn't
clean any rows, as you say.

So we'll use vacrelstats->latestRemovedXid instead of OldestXmin when
we call vac_update_relstats(), which means ANALYZE should always pass
InvalidTransactionId.

> It might be more robust to wait until relfrozenxid exceeds
> relvalidxmin -- by then, recovery conflict mechanisms will have taken
> care of killing all older snapshots, or am I mistaken?

It would be better to set it as early as possible to reduce the cost
of the test in heap_beginscan().

> And a few typos the code...
>
> + gettext_noop("When enabled viewing a truncated or newly created table "
> + "will throw a serialization error to prevent MVCC "
> + "discrepancies that were possible priot to 9.2.")
>
> "prior" not "priot"

Yep

> + * Reset relvalidxmin if its old enough
>
> Should be "it's" in this context.

Cool, thanks for the review.

v3 attached.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Attachment: safe_truncate.v3.patch (23K)

Re: RFC: Making TRUNCATE more "MVCC-safe"

Simon Riggs
On Sun, Mar 4, 2012 at 9:59 AM, Simon Riggs <[hidden email]> wrote:
> On Sun, Mar 4, 2012 at 2:28 AM, Marti Raudsepp <[hidden email]> wrote:
>> On Sat, Mar 3, 2012 at 14:53, Simon Riggs <[hidden email]> wrote:
>>> Thanks Noah for drawing attention to this thread. I hadn't been
>>> watching. As you say, this work would allow me to freeze rows at load
>>> time and avoid the overhead of hint bit setting, which avoids
>>> performance issues from hint bit setting in checksum patch.
>>>
>>> I've reviewed this patch and it seems OK to me. Good work Marti.

...

> v3 attached.

More detailed thoughts show that the test in heap_beginscan_internal()
is not right enough, i.e. wrong.

We need a specific XidInMVCCSnapshot test on the relvalidxid, so it
needs to be a specific xid, not an xmin; otherwise we can get
concurrent transactions failing, not just older transactions.
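
Roughly, the difference is this (sketch only; relvalidxid is the field
under discussion, and XidInMVCCSnapshot() is, I think, currently static
in tqual.c, so it might need exposing):

/*
 * Does this snapshot predate the TRUNCATE recorded in relvalidxid?
 *
 * Comparing against snapshot->xmin, as in
 *     TransactionIdPrecedes(snapshot->xmin, relvalidxid)
 * is too coarse: it also trips snapshots taken after the TRUNCATE
 * committed while some unrelated older transaction was still running.
 */
static bool
SnapshotPredatesTruncate(TransactionId relvalidxid, Snapshot snapshot)
{
    if (!TransactionIdIsValid(relvalidxid) || !IsMVCCSnapshot(snapshot))
        return false;

    /*
     * Intended test: complain only if the snapshot still regards the
     * truncating xid as in progress, i.e. cannot see its effects.
     */
    return XidInMVCCSnapshot(relvalidxid, snapshot);
}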

If we're going to freeze tuples on load this needs to be watertight, so
some minor rework is needed.

Of course, if we only have a valid xid on the class we might get new
columns added when we do repeated SELECT * statements using the same
snapshot while concurrent DDL occurs. That is impractical, so if we
define this as applying to rows it can work; if we want it to apply to
everything, it's getting more difficult.

Longer term we might fix this by making all catalog access use MVCC,
but that suffers the same problems with ALTER TABLEs that rewrite rows
to add columns. I don't see a neat future solution that is worth
waiting for.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: RFC: Making TRUNCATE more "MVCC-safe"

Simon Riggs
On Sun, Mar 4, 2012 at 1:02 PM, Simon Riggs <[hidden email]> wrote:

> On Sun, Mar 4, 2012 at 9:59 AM, Simon Riggs <[hidden email]> wrote:
>> On Sun, Mar 4, 2012 at 2:28 AM, Marti Raudsepp <[hidden email]> wrote:
>>> On Sat, Mar 3, 2012 at 14:53, Simon Riggs <[hidden email]> wrote:
>>>> Thanks Noah for drawing attention to this thread. I hadn't been
>>>> watching. As you say, this work would allow me to freeze rows at load
>>>> time and avoid the overhead of hint bit setting, which avoids
>>>> performance issues from hint bit setting in checksum patch.
>>>>
>>>> I've reviewed this patch and it seems OK to me. Good work Marti.
>
> ...
>
>> v3 attached.
>
> More detailed thoughts show that the test in heap_beginscan_internal()
> is not right enough, i.e. wrong.
>
> We need a specific XidInMVCCSnapshot test on the relvalidxid, so it
> needs to be a specific xid, not an xmin because otherwise we can get
> concurrent transactions failing, not just older transactions.

Marti, please review this latest version which has new isolation tests added.

This does both TRUNCATE and CREATE TABLE.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Attachment: strictmvcc.v1.patch (32K)

Re: RFC: Making TRUNCATE more "MVCC-safe"

Robert Haas
On Sun, Mar 4, 2012 at 11:39 AM, Simon Riggs <[hidden email]> wrote:
> Marti, please review this latest version which has new isolation tests added.
>
> This does both TRUNCATE and CREATE TABLE.

I don't see any need for a GUC to control this behavior.  The current
behavior is wrong, so if we're going to choose this path to fix it,
then we should just fix it, period.  The narrow set of circumstances
in which it might be beneficial to disable this behavior doesn't seem
to me to be sufficient to justify a behavior-changing GUC.

It does not seem right that the logic for detecting the serialization
error is in heap_beginscan_internal().  Surely this is just as much of
a problem for an index-scan or index-only-scan.  We don't want to
patch all those places individually, either: I think the check should
happen right around the time we initially lock the relation and build
its relcache entry.

The actual text of the error message could use some work.  Maybe
something like "could not serialize access due to concurrent DDL",
although I think we try to avoid using acronyms like DDL in
translatable strings.
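
For instance, something along these lines (placeholder wording only;
ERRCODE_T_R_SERIALIZATION_FAILURE is the existing serialization_failure
SQLSTATE, 40001):

ereport(ERROR,
        (errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
         errmsg("could not serialize access due to concurrent DDL"),
         errdetail("Relation \"%s\" was truncated or created after this snapshot was taken.",
                   RelationGetRelationName(rel))));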

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: RFC: Making TRUNCATE more "MVCC-safe"

Simon Riggs
On Mon, Mar 5, 2012 at 4:32 PM, Robert Haas <[hidden email]> wrote:

> On Sun, Mar 4, 2012 at 11:39 AM, Simon Riggs <[hidden email]> wrote:
>> Marti, please review this latest version which has new isolation tests added.
>>
>> This does both TRUNCATE and CREATE TABLE.
>
> I don't see any need for a GUC to control this behavior.  The current
> behavior is wrong, so if we're going to choose this path to fix it,
> then we should just fix it, period.  The narrow set of circumstances
> in which it might be beneficial to disable this behavior doesn't seem
> to me to be sufficient to justify a behavior-changing GUC.

I agree the behaviour is wrong; the only question is whether our users
rely in some way on that behaviour. Given the long discussion on that
point earlier I thought it best to add a GUC. Easy to remove, now or
later.

> It does not seem right that the logic for detecting the serialization
> error is in heap_beginscan_internal().  Surely this is just as much of
> a problem for an index-scan or index-only-scan.

err, very good point. Doh.

> We don't want to
> patch all those places individually, either: I think the check should
> happen right around the time we initially lock the relation and build
> its relcache entry.

OK, that makes sense and works if we need to rebuild relcache.

> The actual text of the error message could use some work.  Maybe
> something like "could not serialize access due to concurrent DDL",
> although I think we try to avoid using acronyms like DDL in
> translatable strings.

Yeh that was designed-to-be-replaced text. We do use DDL already
elsewhere without really explaining it; it's also one of those acronyms
that doesn't actually explain what it really means very well. So I
like the phrase you suggest.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: RFC: Making TRUNCATE more "MVCC-safe"

Simon Riggs
On Mon, Mar 5, 2012 at 4:46 PM, Simon Riggs <[hidden email]> wrote:

>> It does not seem right that the logic for detecting the serialization
>> error is in heap_beginscan_internal().  Surely this is just as much of
>> a problem for an index-scan or index-only-scan.
>
> err, very good point. Doh.
>
>> We don't want to
>> patch all those places individually, either: I think the check should
>> happen right around the time we initially lock the relation and build
>> its relcache entry.
>
> OK, that makes sense and works if we need to rebuild relcache.

Except the reason to do it at the start of the scan is that that is the
first time a specific snapshot has been associated with a relation, and
also the last point at which we can apply the check before the errant
behaviour occurs.

If we reject locks against tables that might be used against an
illegal snapshot then we could easily prevent valid snapshot use cases
when a transaction has multiple snapshots, one illegal, one not.

We can certainly have a looser test when we first get the lock and
then another test later, but I don't think we can avoid making all
scans apply this test. And while I'm there, we have to add tests for
things like index build scans.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: RFC: Making TRUNCATE more "MVCC-safe"

Robert Haas
In reply to this post by Simon Riggs
On Mon, Mar 5, 2012 at 11:46 AM, Simon Riggs <[hidden email]> wrote:
> I agree behaviour is wrong, the only question is whether our users
> rely in some way on that behaviour. Given the long discussion on that
> point earlier I thought it best to add a GUC. Easy to remove, now or
> later.

AFAICT, all the discussion upthread was about whether we ought to be
trying to implement this in some entirely-different way that doesn't
rely on storing XIDs in the catalog.  I have a feeling that there are
a whole lot more cases like this and that we're in for a lot of
unpleasant nastiness if we go very far down this route; pg_constraint
is another one, as SnapshotNow can see constraints that may not be
valid under the query's MVCC snapshot.  On the other hand, if someone
comes up with a better way, I suppose we can always rip this out.  In
any case, I don't remember anyone saying that this needed to be
configurable.

Speaking of that, though, I have one further thought on this: we need
to be absolutely certain that autovacuum is going to prevent this XID
value from wrapping around.  I suppose this is safe since, even if
autovacuum is turned off, we'll forcibly kick it off every so often to
advance relfrozenxid, and that will reset relvalidxid while it's
there.  But then again on second thought, what if relvalidxid lags
relfrozenxid?  Then the emergency autovacuum might not kick in until
relvalidxid has already wrapped around.  I think that could happen
after a TRUNCATE, perhaps, since I think that would leave relfrozenxid
alone while advancing relvalidxid.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: RFC: Making TRUNCATE more "MVCC-safe"

Robert Haas
In reply to this post by Simon Riggs
On Mon, Mar 5, 2012 at 12:42 PM, Simon Riggs <[hidden email]> wrote:

> On Mon, Mar 5, 2012 at 4:46 PM, Simon Riggs <[hidden email]> wrote:
>>> It does not seem right that the logic for detecting the serialization
>>> error is in heap_beginscan_internal().  Surely this is just as much of
>>> a problem for an index-scan or index-only-scan.
>>
>> err, very good point. Doh.
>>
>>> We don't want to
>>> patch all those places individually, either: I think the check should
>>> happen right around the time we initially lock the relation and build
>>> its relcache entry.
>>
>> OK, that makes sense and works if we need to rebuild relcache.
>
> Except the reason to do it at the start of the scan is that is the
> first time a specific snapshot has been associated with a relation and
> also the last point we can apply the check before the errant behaviour
> occurs.
>
> If we reject locks against tables that might be used against an
> illegal snapshot then we could easily prevent valid snapshot use cases
> when a transaction has multiple snapshots, one illegal, one not.
>
> We can certainly have a looser test when we first get the lock and
> then another test later, but I don't think we can avoid making all
> scans apply this test. And while I'm there, we have to add tests for
> things like index build scans.

Well, there's no point that I can see in having two checks.  I just
dislike the idea that we have to remember to add this check for every
method of accessing the relation - doesn't seem terribly future-proof.
 It gets even worse if you start adding checks to DDL code paths - if
we're going to do that, we really need to cover them all, and that
doesn't seem very practical if they're going to spread out all over
the place.

I don't understand your comment that a snapshot doesn't get associated
with a relation until scan time.  I believe we associate a snapshot
with each query before we even know what relations are involved; that
snapshot then gets passed down to all the individual scans.  The query
also opens and locks those relations.  We ought to be able to arrange
for the query snapshot to be cross-checked at that point, rather than
waiting until scan-start time.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: RFC: Making TRUNCATE more "MVCC-safe"

Noah Misch
In reply to this post by Robert Haas
On Mon, Feb 13, 2012 at 09:29:56AM -0500, Robert Haas wrote:

> On Fri, Feb 10, 2012 at 11:46 PM, Noah Misch <[hidden email]> wrote:
> > I've yet to see an MVCC anomaly that one can reproduce at REPEATABLE READ and
> > not at READ COMMITTED.  They tend to be narrow race conditions at READ
> > COMMITTED, yet easy to demonstrate at REPEATABLE READ.  Related:
> > http://archives.postgresql.org/pgsql-performance/2011-02/msg00451.php
>
> Yeah.  Well, that's actually an interesting example, because it
> illustrates how general this problem is.  We could potentially get
> ourselves into a situation where just about every system catalog table
> needs an xmin field to store the point at which the object came into
> existence - or for that matter, was updated.

I can see this strategy applying to many relation-pertinent system catalogs.
Do you foresee applications to non-relation catalogs?

In any event, I think a pg_class.relvalidxmin is the right starting point.
One might imagine a family of relvalidxmin, convalidxmin, indcheckxmin
(already exists), inhvalidxmin, and attvalidxmin.  relvalidxmin is like the
AccessExclusiveLock of that family; it necessarily blocks everything that
might impugn the others.  The value in extending this to more catalogs is the
ability to narrow the impact of failing the check.  A failed indcheckxmin
comparison merely excludes plans involving the index.  A failed inhvalidxmin
check might just skip recursion to the table in question.  Those are further
refinements, much like using weaker heavyweight lock types.

> But it's not quite the
> same as the xmin of the row itself, because some updates might be
> judged not to matter.  There could also be intermediate cases where
> updates are invalidating for some purposes but not others.  I think
> we'd better get our hands around more of the problem space before we
> start trying to engineer solutions.

I'm not seeing that problem.  Any operation that would update some xmin
horizon should set it to the greater of its current value and the value the
operation needs for its own correctness.  If you have something in mind that
needs more, could you elaborate?
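
In code terms the rule is just a monotonic advance, something like
(illustrative names only):

/*
 * Never move a validity horizon backwards: keep the later of the current
 * value and whatever the operation at hand requires.
 */
static TransactionId
AdvanceValidXmin(TransactionId current, TransactionId needed)
{
    if (!TransactionIdIsValid(current) ||
        TransactionIdPrecedes(current, needed))
        return needed;
    return current;
}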

> > Incidentally, people use READ COMMITTED because they don't question the
> > default, not because they know hazards of REPEATABLE READ.  I don't know the
> > bustedness you speak of; could we improve the documentation to inform folks?
>
> The example that I remember was related to SELECT FOR UPDATE/SELECT
> FOR SHARE.  The idea of those statements is that you want to prevent
> the row from being updated or deleted until some other concurrent
> action is complete; for example, in the case of a foreign key, we'd
> like to prevent the referenced row from being deleted or updated in
> the relevant columns until the inserting transaction is committed.
> But it doesn't work, because when the updating or deleting process
> gets done with the lock wait, they are still using the same snapshot
> as before, and merrily do exactly the thing that the lock-wait was
> supposed to prevent.  If an actual UPDATE is used, it's safe (I
> think): anyone who was going to UPDATE or DELETE the row will fail
> with some kind of serialization error.  But a SELECT FOR UPDATE that
> commits is treated more like an UPDATE that rolls back: it's as if the
> lock never existed.  Someone (Florian?) proposed a patch to change
> this, but it seemed problematic for reasons I no longer exactly
> remember.

Thanks.  I vaguely remember that thread.

nm
