ATTACH/DETACH PARTITION CONCURRENTLY

ATTACH/DETACH PARTITION CONCURRENTLY

David Rowley-3
Hi,

One of the downsides of declarative partitioning vs. old-school
inheritance partitioning is that a new partition cannot be added to
the partitioned table without taking an AccessExclusiveLock on the
partitioned table.  We've obviously got a bunch of features for
various other things where we work a bit harder to get around that
problem, e.g. creating indexes concurrently.

I've started working on allowing partitions to be attached and
detached with just a ShareUpdateExclusiveLock on the table. If I'm
correct, then we can do this in a similar, but simpler, way to how
CREATE INDEX CONCURRENTLY works. We just need to pencil in that the
new partition exists but is not yet valid, then wait for snapshots
older than our own to finish before marking the partition as valid.

One problem I had with doing this is that there was not really a good
place to store that "isvalid" flag for partitions. We have pg_index
for indexes, but partition details are just spread over pg_inherits
and pg_class. So step 1 was to move all that into a new table called
pg_partition. I think this is quite nice as it also gets relpartbound
out of pg_class. It's probably just a matter of time before someone
complains that they can't create some partition with a pretty large
bound Datum because it can't fit on a single heap page (pg_class has
no TOAST table).  I ended up getting rid of pg_class.relispartition,
replacing it with relpartitionparent, which is just InvalidOid when
the table or index is not a partition.  This allows various pieces of
code to be more efficient since we can look at the relcache instead of
scanning pg_inherits all the time. It's now also much faster to get a
partition's ancestors.
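
For illustration, the new catalog's row shape is roughly along these
lines. This is a sketch only, not the actual patch: apart from
partisvalid and the bound, the column names are placeholders, the real
declaration would go through the usual CATALOG()/genbki machinery, and
the parent is kept in pg_class.relpartitionparent as described above.

/* Sketch only -- illustrative, not the actual patch. */
typedef struct FormData_pg_partition
{
	Oid		partrelid;		/* OID of the partition itself */
	bool	partisvalid;	/* should new snapshots see this partition? */

	/*
	 * partbound (pg_node_tree) follows as a variable-length field; this
	 * is what used to live in pg_class.relpartbound, and in its own
	 * catalog it can be TOASTed when the bound is large.
	 */
} FormData_pg_partition;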

So, patch 0001 is just one I've already submitted for the July 'fest.
Nothing new; it was just required to start this work.

0002 migrates partitions out of pg_inherits into pg_partition. This
patch is at a stage where it appears to work, but is very unpolished
and requires me to stare at it much longer than I've done so far.
There's a bunch of code that gets repeated way too many times in
tablecmds.c, for example.

0003 does the same for partitioned indexes. The patch is in a similar,
maybe slightly worse state than 0002. Various comments will be out of
date.

0004 is the early workings of what I have in mind for the concurrent
ATTACH code. It's vastly incomplete. It does pass make check, but
really only because there are no tests doing any concurrent attaches.
There's a mountain of code still missing to make things ignore invalid
partitions; I just have a very simple case working. Partition-wise
joins will be very much broken by what I have so far, as will likely a
whole bunch of other stuff.

The extent of my testing so far is roughly the following:

--setup
create table listp (a int) partition by list(a);
create table listp1 partition of listp for values in(1);
create table listp2 (a int);
insert into listp1 values(1);
insert into listp2 values(2);

-- example 1.
start transaction isolation level repeatable read; -- Session 1
select * from listp; -- Session 1
 a
---
 1
(1 row)

alter table listp attach partition concurrently listp2 for values in
(2); -- Session 2 (waits for release of session 1's snapshot)
select * from listp; -- Session 1
 a
---
 1

commit; -- session 1 (session 2's alter table now finishes waiting)
select * from listp; -- Session 1 (new partition now valid)
 a
---
 1
 2
(2 rows)

-- example 2.
start transaction isolation level read committed; -- session 1;
select * from listp; -- session 1
 a
---
 1
(1 row)

alter table listp attach partition concurrently listp2 for values in
(2); -- Session 2 completes without waiting.

select * from listp; -- Session 1 (new partition visible while in transaction)
 a
---
 1
 2
(2 rows)

This basically works by:

1. Do all the normal partition attach partition validation.
2. Insert a record into pg_partition with partisvalid=false
3. Obtain a session-level ShareUpdateExclusiveLock on the partitioned table.
4. Obtain a session-level AccessExclusiveLock on the partition being attached.
5. Commit.
6. Start a new transaction.
7. Wait for snapshots older than our own to be released.
8. Mark the partition as valid
9. Invalidate relcache for the partitioned table.
10. release session-level locks.
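
To illustrate, the tail end of the command (steps 3-10 above) might
look roughly like the following, loosely following what CREATE INDEX
CONCURRENTLY does. This is a sketch only; WaitForOlderSnapshots() and
mark_partition_valid() are shorthand for code that doesn't exist yet,
not real functions.

/* Sketch only -- illustrative, not the actual patch. */
static void
attach_partition_concurrently_tail(LockRelId parentrelid,
								   LockRelId partrelid)
{
	/* steps 3 and 4: session-level locks survive the commit below */
	LockRelationIdForSession(&parentrelid, ShareUpdateExclusiveLock);
	LockRelationIdForSession(&partrelid, AccessExclusiveLock);

	/* steps 5 and 6: make the invalid pg_partition row visible */
	CommitTransactionCommand();
	StartTransactionCommand();

	/*
	 * step 7: wait out snapshots older than our own, much like CREATE
	 * INDEX CONCURRENTLY's GetCurrentVirtualXIDs()/VirtualXactLock()
	 * loop does.
	 */
	WaitForOlderSnapshots();				/* shorthand, see note above */

	/* steps 8 and 9: set partisvalid = true and force relcache rebuilds */
	mark_partition_valid(partrelid.relId);	/* shorthand, see note above */
	CacheInvalidateRelcacheByRelid(parentrelid.relId);

	/* step 10: release the session-level locks */
	UnlockRelationIdForSession(&partrelid, AccessExclusiveLock);
	UnlockRelationIdForSession(&parentrelid, ShareUpdateExclusiveLock);
}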

I've disallowed the feature when the partitioned table has a default
partition. I don't see how this can be made to work.

At the moment ALTER TABLE ... ATTACH PARTITION commands cannot contain
any other sub-commands in the ALTER TABLE, so performing the
additional transaction commit and begin inside the single sub-command
might be okay.  It does mean that 'rel', which is passed down to
ATExecAttachPartition(), must be closed and reopened again, which
results in the calling function having a pointer into a closed
Relation. I worked around this by changing the code so that it passes
a pointer to the Relation, and I've got ATExecAttachPartition()
updating that pointer before returning. It's not particularly pretty,
but I didn't really see how else this could be done.
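
In code form, the workaround looks roughly like this (a sketch only;
the return type and most of the body are elided, this is just to show
the pointer handling):

/* Sketch only -- illustrative, not the actual patch. */
static void
ATExecAttachPartition(List **wqueue, Relation *rel, PartitionCmd *cmd)
{
	Oid		relid = RelationGetRelid(*rel);

	/* ... validation, pg_partition insert, session-level locks ... */

	heap_close(*rel, NoLock);		/* the session-level lock is kept */
	CommitTransactionCommand();
	StartTransactionCommand();

	/* ... wait for older snapshots, then mark the partition valid ... */

	*rel = heap_open(relid, NoLock);	/* caller's pointer is usable again */
}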

I've not yet done anything about the DETACH CONCURRENTLY case. I think
it should just be roughly the same steps in reverse order.  We can
skip the waiting part if the partition being detached is still marked
as invalid from some failed concurrent ATTACH.

I've not thought much about pg_dump beyond just having it ignore
invalid partitions. I don't think it's very useful to support some
command that attaches an invalid partition since there will be no
command to revalidate an invalid partition. It's probably best to
resolve that with a DETACH followed by a new ATTACH, so pg_dump can
probably just do nothing for invalid partitions.

So anyway, my intention in posting this patch now, rather than when
it's closer to being finished, is to get design review. I'm interested
in hearing objections, comments, and constructive criticism for
patches 0002-0004. Patch 0001 comments can go to [1].

Are there any blockers on this that I've overlooked?

[1] https://www.postgresql.org/message-id/CAKJS1f81TpxZ8twugrWCo%3DVDHEkmagxRx7a%2B1z4aaMeQy%3DnA7w%40mail.gmail.com

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachments:
 v1-0001-Speed-up-INSERT-and-UPDATE-on-partitioned-tables.patch (77K)
 v1-0002-Store-partition-details-in-pg_partition-instead-o.patch (139K)
 v1-0003-Don-t-store-partition-index-details-in-pg_inherit.patch (67K)
 v1-0004-Allow-partitions-to-be-attached-without-blocking-.patch (50K)

Re: ATTACH/DETACH PARTITION CONCURRENTLY

David Rowley-3
On 3 August 2018 at 01:25, David Rowley <[hidden email]> wrote:

> 1. Do all the normal partition attach partition validation.
> 2. Insert a record into pg_partition with partisvalid=false
> 3. Obtain a session-level ShareUpdateExclusiveLock on the partitioned table.
> 4. Obtain a session-level AccessExclusiveLock on the partition being attached.
> 5. Commit.
> 6. Start a new transaction.
> 7. Wait for snapshots older than our own to be released.
> 8. Mark the partition as valid
> 9. Invalidate relcache for the partitioned table.
> 10. release session-level locks.

So I was thinking about this again and realised this logic is broken.
All it takes is a snapshot that starts after the ATTACH PARTITION
started and before it completed.  Such a snapshot could have the new
partition become attached while it's still open, which could lead to
non-repeatable reads in a REPEATABLE READ transaction.  The window for
this to occur is possibly quite large given that the ATTACH
CONCURRENTLY can wait a long time for older snapshots to finish.

Here's my updated thinking for an implementation which seems to get
around the above problem:

ATTACH PARTITION CONCURRENTLY:

1. Obtain a ShareUpdateExclusiveLock on the partitioned table rather
than an AccessExclusiveLock.
2. Do all the normal partition attach partition validation.
3. Insert pg_partition record with partvalid = true.
4. Invalidate relcache entry for the partitioned table
5. Any loops over a partitioned table's PartitionDesc must check
PartitionIsValid(). This will return true if the current snapshot
should see the partition or not. The partition is valid if partisvalid
= true and the xmin precedes or is equal to the current snapshot.

#define PartitionIsValid(pd, i) (((pd)->is_valid[(i)] \
&& TransactionIdPrecedesOrEquals((pd)->xmin[(i)], GetCurrentTransactionId())) \
|| (!(pd)->is_valid[(i)] \
&& TransactionIdPrecedesOrEquals(GetCurrentTransactionId(), (pd)->xmin[(i)])))
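
For example, a caller looping over a PartitionDesc would then do
roughly the following (a sketch, assuming the is_valid/xmin arrays
used by the macro above):

	int		i;

	for (i = 0; i < partdesc->nparts; i++)
	{
		if (!PartitionIsValid(partdesc, i))
			continue;	/* not visible to this backend yet, or detached */

		/* ... plan or scan the partition with OID partdesc->oids[i] ... */
	}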

DETACH PARTITION CONCURRENTLY:

1. Obtain ShareUpdateExclusiveLock on partition being detached
(instead of the AccessShareLock that non-concurrent detach uses)
2. Update the pg_partition record, set partvalid = false.
3. Commit
4. New transaction.
5. Wait for transactions which hold a snapshot older than the one held
when updating pg_partition to complete.
6. Delete the pg_partition record.
7. Perform other cleanup, relpartitionparent = 0, pg_depend etc.

Handling a DETACH PARTITION CONCURRENTLY that fails after step 3 (above):

1. Make vacuum of a partition check for pg_partition.partvalid =
false; if the xmin of the tuple is old enough, perform the partition
cleanup by doing steps 6+7 above.

A VACUUM FREEZE must run before transaction wraparound, so this means
a partition can never reattach itself when the transaction counter
wraps.
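
In rough code terms, the vacuum-time check is something like the
following (a sketch only; partform/parttuple stand for the
pg_partition row being examined, oldestXmin is the usual cutoff that
vacuum computes, and the two helpers are placeholders rather than real
functions):

	if (!partform->partvalid &&
		TransactionIdPrecedes(HeapTupleHeaderGetXmin(parttuple->t_data),
							  oldestXmin))
	{
		/* a failed DETACH CONCURRENTLY; finish steps 6 and 7 above */
		remove_pg_partition_entry(partform->partrelid);		/* placeholder */
		cleanup_detached_partition(partform->partrelid);	/* placeholder */
	}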

I believe I've got the attach and detach working correctly now and
also isolation tests that appear to prove it works. I've also written
the failed-detach cleanup code into vacuum. Unusually, since foreign
tables can also be partitions, this required teaching autovacuum to
look at foreign tables, but only in the sense of checking for failed
detached partitions. It also requires adding vacuum support for
foreign tables.  It feels a little bit weird to modify autovacuum to
look at foreign tables, but I really couldn't see another way to do
this.

I'm now considering whether this all holds together in the event that
the pg_partition tuple of an invalid partition becomes frozen. The
problem would be that PartitionIsValid() could return the wrong value
due to TransactionIdPrecedesOrEquals(GetCurrentTransactionId(),
(pd)->xmin[(i)]). This code is trying to keep the detached partition
visible to older snapshots, but if pd->xmin[i] becomes frozen, then
the partition would become invisible.  However, I think this won't be
a problem since a VACUUM FREEZE would only freeze tuples that are also
old enough to have had their failed detaches cleaned up earlier in the
vacuum process.

Also, we must disallow a DEFAULT partition from being attached to a
partitioned table with a failed DETACH CONCURRENTLY, as it wouldn't be
very clear what the default partition's partition qual should be,
since that is built based on the quals of all attached partitions.

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: ATTACH/DETACH PARTITION CONCURRENTLY

Andres Freund
Hi,

On 2018-08-08 00:40:12 +1200, David Rowley wrote:
> 1. Obtain a ShareUpdateExclusiveLock on the partitioned table rather
> than an AccessExclusiveLock.
> 2. Do all the normal partition attach partition validation.
> 3. Insert pg_partition record with partvalid = true.
> 4. Invalidate relcache entry for the partitioned table
> 5. Any loops over a partitioned table's PartitionDesc must check
> PartitionIsValid(). This will return true if the current snapshot
> should see the partition or not. The partition is valid if partisvalid
> = true and the xmin precedes or is equal to the current snapshot.

How does this protect against other sessions actively using the relcache
entry? Currently it is *NOT* safe to receive invalidations for
e.g. partitioning contents afaics.

- Andres


Re: ATTACH/DETACH PARTITION CONCURRENTLY

Simon Riggs
On 7 August 2018 at 13:47, Andres Freund <[hidden email]> wrote:

> Hi,
>
> On 2018-08-08 00:40:12 +1200, David Rowley wrote:
>> 1. Obtain a ShareUpdateExclusiveLock on the partitioned table rather
>> than an AccessExclusiveLock.
>> 2. Do all the normal partition attach partition validation.
>> 3. Insert pg_partition record with partvalid = true.
>> 4. Invalidate relcache entry for the partitioned table
>> 5. Any loops over a partitioned table's PartitionDesc must check
>> PartitionIsValid(). This will return true if the current snapshot
>> should see the partition or not. The partition is valid if partisvalid
>> = true and the xmin precedes or is equal to the current snapshot.
>
> How does this protect against other sessions actively using the relcache
> entry? Currently it is *NOT* safe to receive invalidations for
> e.g. partitioning contents afaics.

I think you may be right in the general case, but ISTM possible to
invalidate/refresh just the list of partitions.

If so, that idea would seem to require some new, as-yet not invented mechanism.

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: ATTACH/DETACH PARTITION CONCURRENTLY

David Rowley-3
On 8 August 2018 at 00:47, Andres Freund <[hidden email]> wrote:

> On 2018-08-08 00:40:12 +1200, David Rowley wrote:
>> 1. Obtain a ShareUpdateExclusiveLock on the partitioned table rather
>> than an AccessExclusiveLock.
>> 2. Do all the normal partition attach partition validation.
>> 3. Insert pg_partition record with partvalid = true.
>> 4. Invalidate relcache entry for the partitioned table
>> 5. Any loops over a partitioned table's PartitionDesc must check
>> PartitionIsValid(). This will return true if the current snapshot
>> should see the partition or not. The partition is valid if partisvalid
>> = true and the xmin precedes or is equal to the current snapshot.
>
> How does this protect against other sessions actively using the relcache
> entry? Currently it is *NOT* safe to receive invalidations for
> e.g. partitioning contents afaics.

I'm not proposing that sessions running older snapshots can't see that
there's a new partition. The code I have uses PartitionIsValid() to
test if the partition should be visible to the snapshot. The
PartitionDesc will always contain details for all partitions stored in
pg_partition whether they're valid to the current snapshot or not.  I
did it this way as there's no way to invalidate the relcache based on
a point in transaction, only a point in time.

I'm open to better ideas, of course.

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: ATTACH/DETACH PARTITION CONCURRENTLY

Andres Freund
On 2018-08-08 01:23:51 +1200, David Rowley wrote:

> On 8 August 2018 at 00:47, Andres Freund <[hidden email]> wrote:
> > On 2018-08-08 00:40:12 +1200, David Rowley wrote:
> >> 1. Obtain a ShareUpdateExclusiveLock on the partitioned table rather
> >> than an AccessExclusiveLock.
> >> 2. Do all the normal partition attach partition validation.
> >> 3. Insert pg_partition record with partvalid = true.
> >> 4. Invalidate relcache entry for the partitioned table
> >> 5. Any loops over a partitioned table's PartitionDesc must check
> >> PartitionIsValid(). This will return true if the current snapshot
> >> should see the partition or not. The partition is valid if partisvalid
> >> = true and the xmin precedes or is equal to the current snapshot.
> >
> > How does this protect against other sessions actively using the relcache
> > entry? Currently it is *NOT* safe to receive invalidations for
> > e.g. partitioning contents afaics.
>
> I'm not proposing that sessions running older snapshots can't see that
> there's a new partition. The code I have uses PartitionIsValid() to
> test if the partition should be visible to the snapshot. The
> PartitionDesc will always contain details for all partitions stored in
> pg_partition whether they're valid to the current snapshot or not.  I
> did it this way as there's no way to invalidate the relcache based on
> a point in transaction, only a point in time.

I don't think that solves the problem that an arriving relcache
invalidation would trigger a rebuild of rd_partdesc, while it actually
is referenced by running code.

You'd need to build infrastructure to prevent that.

One approach would be to make sure that everything relying on
rd_partdesc staying the same stores its value in a local variable, and
then *not* free the old version of rd_partdesc (etc) when the refcount >
0, but delay that to the RelationClose() that makes refcount reach
0. That'd be the start of a framework for more such concurrent
handling.
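
In rough code terms, something like this (rd_pdoldcxt is a made-up
field name, purely to illustrate the idea):

	/* during a relcache rebuild where the partition descriptor changed */
	if (relation->rd_refcnt > 0)
		relation->rd_pdoldcxt = oldPartdescCxt;	/* park it, still in use */
	else
		MemoryContextDelete(oldPartdescCxt);

	/* ... and in RelationClose(), once the refcount drops to zero ... */
	if (relation->rd_refcnt == 0 && relation->rd_pdoldcxt != NULL)
	{
		MemoryContextDelete(relation->rd_pdoldcxt);
		relation->rd_pdoldcxt = NULL;
	}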

Regards,

Andres Freund


Re: ATTACH/DETACH PARTITION CONCURRENTLY

David Rowley-3
On 8 August 2018 at 01:29, Andres Freund <[hidden email]> wrote:

> On 2018-08-08 01:23:51 +1200, David Rowley wrote:
>> I'm not proposing that sessions running older snapshots can't see that
>> there's a new partition. The code I have uses PartitionIsValid() to
>> test if the partition should be visible to the snapshot. The
>> PartitionDesc will always contain details for all partitions stored in
>> pg_partition whether they're valid to the current snapshot or not.  I
>> did it this way as there's no way to invalidate the relcache based on
>> a point in transaction, only a point in time.
>
> I don't think that solves the problem that an arriving relcache
> invalidation would trigger a rebuild of rd_partdesc, while it actually
> is referenced by running code.
>
> You'd need to build infrastructure to prevent that.
>
> One approach would be to make sure that everything relying on
> rd_partdesc staying the same stores its value in a local variable, and
> then *not* free the old version of rd_partdesc (etc) when the refcount >
> 0, but delay that to the RelationClose() that makes refcount reach
> 0. That'd be the start of a framework for more such concurrent
> handling.

I'm not so sure that not freeing the partdesc until the refcount
reaches 0 is safe. As you'd expect, we hold a lock on a partitioned
table between the planner and executor, but the relation has been
closed and the refcount has returned to 0, which means that when the
relation is first opened in the executor, the updated PartitionDesc is
obtained.  A non-concurrent attach would have been blocked in this
case due to the lock being held by the planner. Instead of using
refcount == 0, perhaps we can release the original partdesc only when
there are no locks held by us on the relation.

It's late here now, so I'll look at that tomorrow.

I've attached what I was playing around with. I think I'll also need
to change RelationGetPartitionDesc() to have it return the original
partdesc, if it's non-NULL.

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment: dont_destroy_original_partdesc_on_rel_inval.patch (5K)

Re: ATTACH/DETACH PARTITION CONCURRENTLY

Peter Eisentraut-6
On 07/08/2018 15:29, Andres Freund wrote:
> I don't think that solves the problem that an arriving relcache
> invalidation would trigger a rebuild of rd_partdesc, while it actually
> is referenced by running code.

The problem is more generally that a relcache invalidation changes all
pointers that might be in use.  So it's currently not safe to trigger a
relcache invalidation (on tables) without some kind of exclusive lock.
One possible solution to this is outlined here:
<https://www.postgresql.org/message-id/CA+TgmobtmFT5g-0dA=vEFFtogjRAuDHcYPw+qEdou5dZPnF=pg@...>

--
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: ATTACH/DETACH PARTITION CONCURRENTLY

Andres Freund
Hi,

On 2018-08-09 20:57:35 +0200, Peter Eisentraut wrote:
> On 07/08/2018 15:29, Andres Freund wrote:
> > I don't think that solves the problem that an arriving relcache
> > invalidation would trigger a rebuild of rd_partdesc, while it actually
> > is referenced by running code.
>
> The problem is more generally that a relcache invalidation changes all
> pointers that might be in use.

I don't think that's quite right. We'd already better be OK with
superfluous invals that do not change anything, because there are
already sources of those (just think of vacuum, analyze, relation
extension, whatnot).


> So it's currently not safe to trigger a relcache invalidation (on
> tables) without some kind of exclusive lock.

I don't think that's true in as general a sense as you're stating it.
It's not OK to send relcache invalidations for things that people rely
on and that cannot be updated in place, because of the dangling
pointer issue etc.

The fact that currently it is not safe to *change* partition related
stuff without an AEL and how to make it safe is precisely what I was
talking about in the thread. It won't be a general solution, but the
infrastructure I'm talking about should get us closer.


> One possible solution to this is outlined here:
> <https://www.postgresql.org/message-id/CA+TgmobtmFT5g-0dA=vEFFtogjRAuDHcYPw+qEdou5dZPnF=pg@...>

I don't see anything in here that addresses the issue structurally?

Greetings,

Andres Freund


Re: ATTACH/DETACH PARTITION CONCURRENTLY

David Rowley-3
On 8 August 2018 at 01:29, Andres Freund <[hidden email]> wrote:
> One approach would be to make sure that everything relying on
> rd_partdesc staying the same stores its value in a local variable, and
> then *not* free the old version of rd_partdesc (etc) when the refcount >
> 0, but delay that to the RelationClose() that makes refcount reach
> 0. That'd be the start of a framework for more such concurrent
> handling.

This is not a fully baked idea, but I'm wondering if a better way to
do this, instead of having this PartitionIsValid macro to determine if
the partition should be visible to the current transaction ID, we
could, when we invalidate a relcache entry, send along the transaction
ID that it's invalid from.  Other backends, when they process the
invalidation message, could wipe out the cache entry only if their
xid is >= the invalidation's "xmax" value. Otherwise, just tag the
xmax onto the cache somewhere and always check it before using the
cache (perhaps make it part of the RelationIsValid macro).  This would
also require that we move away from SnapshotAny type catalogue scans
in favour of MVCC scans so that backends populating their relcache
build it based on their current xid.  Unless I'm mistaken, it should
not make any difference for all DDL that takes an AEL on the relation,
since there can be no older transactions running when the catalogue is
modified, but for DDL that's not taking an AEL, we could effectively
have an MVCC relcache.
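
Roughly, processing such an invalidation message might look like this
(a sketch only; the invalFromXid field and RememberRelcacheXmax() are
invented for illustration):

	if (TransactionIdPrecedesOrEquals(msg->rc.invalFromXid,	/* invented */
									  GetCurrentTransactionId()))
		RelationCacheInvalidateEntry(msg->rc.relId);	/* wipe and rebuild */
	else
		RememberRelcacheXmax(msg->rc.relId,			/* invented: checked */
							 msg->rc.invalFromXid);	/* before cache use */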

It would need careful thought about how it might affect CREATE INDEX
CONCURRENTLY and all the other DDL that can be performed without an
AEL.

I'm unsure how this would work for the catcache as I've studied that
code in even less detail, but I'm throwing this out there in case
there is some major flaw in this idea, so that I don't go wasting time
looking into it further.

I think the PartitionIsValid idea was not that great as it really
complicates run-time partition pruning, which is quite dependent on
the partition indexes being the same between the planner and executor.

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: ATTACH/DETACH PARTITION CONCURRENTLY

Robert Haas
On Sun, Aug 12, 2018 at 9:05 AM, David Rowley
<[hidden email]> wrote:
> This is not a fully baked idea, but I'm wondering if a better way to
> do this, instead of having this PartitionIsValid macro to determine if
> the partition should be visible to the current transaction ID, we
> could, when we invalidate a relcache entry, send along the transaction
> ID that it's invalid from.  Other backends when they process the
> invalidation message they could wipe out the cache entry only if their
> xid is >= the invalidation's "xmax" value. Otherwise, just tag the
> xmax onto the cache somewhere and always check it before using the
> cache (perhaps make it part of the RelationIsValid macro).

Transactions don't necessarily commit in XID order, so this might be
an optimization to keep older transactions from having to do
unnecessary rebuilds -- which I actually doubt is a major problem, but
maybe I'm wrong -- but you can't rely solely on this as a way of
deciding which transactions will see the effects of some change.  If
transactions 100, 101, and 102 begin in that order, and transaction
101 commits, there's no principled justification for 102 seeing its
effects but 100 not seeing it.

> This would
> also require that we move away from SnapshotAny type catalogue scans
> in favour of MVCC scans so that backends populating their relcache
> build it based on their current xid.

I think this is a somewhat confused analysis.  We don't use
SnapshotAny for catalog scans, and we never have.  We used to use
SnapshotNow, and we now use a current MVCC snapshot.  What you're
talking about, I think, is possibly using the transaction snapshot
rather than a current MVCC snapshot for the catalog scans.

I've thought about similar things, but I think there's a pretty deep
can of worms.  For instance, if you built a relcache entry using the
transaction snapshot, you might end up building a seemingly-valid
relcache entry for a relation that has been dropped or rewritten.
When you try to access the relation data, you'll attempt to access
a relfilenode that's not there any more.  Similarly, if you use an
older snapshot to build a partition descriptor, you might think that
relation OID 12345 is still a partition of that table when in fact
it's been detached - and, maybe, altered in other ways, such as
changing column types.

It seems to me that overall you're not really focusing on the right
set of issues here.  I think the very first thing we need to worry
about is how we're going to keep the executor from following a bad
pointer and crashing.  Any time the partition descriptor changes, the
next relcache rebuild is going to replace rd_partdesc and free the old
one, but the executor may still have the old pointer cached in a
structure or local variable; the next attempt to dereference it will
be looking at freed memory, and kaboom.  Right now, we prevent this by
not allowing the partition descriptor to be modified while there are
any queries running against the partition, so while there may be a
rebuild, the old pointer will remain valid (cf. keep_partdesc).  I
think that whatever scheme you/we choose here should be tested with a
combination of CLOBBER_CACHE_ALWAYS and multiple concurrent sessions
-- one of them doing DDL on the table while the other runs a long
query.

Once we keep it from blowing up, the second question is what the
semantics are supposed to be.  It seems inconceivable to me that the
set of partitions that an in-progress query will scan can really be
changed on the fly.  I think we're going to have to rule that if you
add or remove partitions while a query is running, we're going to scan
exactly the set we had planned to scan at the beginning of the query;
anything else would require on-the-fly plan surgery to a degree that
seems unrealistic.  That means that when a new partition is attached,
already-running queries aren't going to scan it.  If they did, we'd
have big problems, because the transaction snapshot might see rows in
those tables from an earlier time period during which that table
wasn't attached.  There's no guarantee that the data at that time
conformed to the partition constraint, so it would be pretty
problematic to let users see it.  Conversely, when a partition is
detached, there may still be scans from existing queries hitting it
for a fairly arbitrary length of time afterwards.  That may be
surprising from a locking point of view or something, but it's correct
as far as MVCC goes.  Any changes made after the DETACH operation
can't be visible to the snapshot being used for the scan.

Now, what we could try to change on the fly is the set of partitions
that are used for tuple routing.  For instance, suppose we're
inserting a long stream of COPY data.  At some point, we attach a new
partition from another session.  If we then encounter a row that
doesn't route to any of the partitions that existed at the time the
query started, we could - instead of immediately failing - go and
reload the set of partitions that are available for tuple routing and
see if the new partition which was concurrently added happens to be
appropriate to the tuple we've got.  If so, we could route the tuple
to it.  But all of this looks optional.  If new partitions aren't
available for insert/update tuple routing until the start of the next
query, that's not a catastrophe.  The reverse direction might be more
problematic: if a partition is detached, I'm not sure how sensible it
is to keep routing tuples into it.  On the flip side, what would
break, really?

Given the foregoing, I don't see why you need something like
PartitionIsValid() at all, or why you need an algorithm similar to
CREATE INDEX CONCURRENTLY.  The problem seems mostly different.  In
the case of CREATE INDEX CONCURRENTLY, the issue is that any new
tuples that get inserted while the index creation is in progress need
to end up in the index, so you'd better not start building the index
on the existing tuples until everybody who might insert new tuples
knows about the index.  I don't see that we have the same kind of
problem in this case.  Each partition is basically a separate table
with its own set of indexes; as long as queries don't end up with one
notion of which tables are relevant and a different notion of which
indexes are relevant, we shouldn't end up with any table/index
inconsistencies.  And it's not clear to me what other problems we
actually have here.  To put it another way, if we've got the notion of
"isvalid" for a partition, what's the difference between a partition
that exists but is not yet valid and one that exists and is valid?  I
can't think of anything, and I have a feeling that you may therefore
be inventing a lot of infrastructure that is not necessary.

I'm inclined to think that we could drop the name CONCURRENTLY from
this feature altogether and recast it as work to reduce the lock level
associated with partition attach/detach.  As long as we have a
reasonable definition of what the semantics are for already-running
queries, and clear documentation to go with those semantics, that
seems fine.  If a particular user finds the concurrent behavior too
strange, they can always perform the DDL in a transaction that uses
LOCK TABLE first, removing the concurrency.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: ATTACH/DETACH PARTITION CONCURRENTLY

David Rowley-3
On 14 August 2018 at 04:00, Robert Haas <[hidden email]> wrote:

> I've thought about similar things, but I think there's a pretty deep
> can of worms.  For instance, if you built a relcache entry using the
> transaction snapshot, you might end up building a seemingly-valid
> relcache entry for a relation that has been dropped or rewritten.
> When you try to access the relation data, you'll attempt to access
> a relfilenode that's not there any more.  Similarly, if you use an
> older snapshot to build a partition descriptor, you might think that
> relation OID 12345 is still a partition of that table when in fact
> it's been detached - and, maybe, altered in other ways, such as
> changing column types.

Hmm, I guess for that to work correctly we'd need some way to allow
older snapshots to see the changes if they've not already taken a lock
on the table. If the lock had already been obtained then the ALTER
TABLE to change the type of the column would get blocked by the
existing lock. That kinda blows holes in the idea of only applying the
change to snapshots newer than the ATTACH/DETACH's.
> It seems to me that overall you're not really focusing on the right
> set of issues here.  I think the very first thing we need to worry
> about is how we're going to keep the executor from following a bad
> pointer and crashing.  Any time the partition descriptor changes, the
> next relcache rebuild is going to replace rd_partdesc and free the old
> one, but the executor may still have the old pointer cached in a
> structure or local variable; the next attempt to dereference it will
> be looking at freed memory, and kaboom.  Right now, we prevent this by
> not allowing the partition descriptor to be modified while there are
> any queries running against the partition, so while there may be a
> rebuild, the old pointer will remain valid (cf. keep_partdesc).  I
> think that whatever scheme you/we choose here should be tested with a
> combination of CLOBBER_CACHE_ALWAYS and multiple concurrent sessions
> -- one of them doing DDL on the table while the other runs a long
> query.

I did focus on that and did write a patch to solve the issue. After
writing that I discovered another problem where if the PartitionDesc
differed between planning and execution then run-time pruning did the
wrong thing (See find_matching_subplans_recurse). The
PartitionPruneInfo is built assuming the PartitionDesc matches between
planning and execution. I moved on from the dangling pointer issue
onto trying to figure out a way to ensure these are the same between
planning and execution.

> Once we keep it from blowing up, the second question is what the
> semantics are supposed to be.  It seems inconceivable to me that the
> set of partitions that an in-progress query will scan can really be
> changed on the fly.  I think we're going to have to rule that if you
> add or remove partitions while a query is running, we're going to scan
> exactly the set we had planned to scan at the beginning of the query;
> anything else would require on-the-fly plan surgery to a degree that
> seems unrealistic.

Trying to do that for in-progress queries would be pretty insane. I'm
not planning on doing anything there.

> That means that when a new partition is attached,
> already-running queries aren't going to scan it.  If they did, we'd
> have big problems, because the transaction snapshot might see rows in
> those tables from an earlier time period during which that table
> wasn't attached.  There's no guarantee that the data at that time
> conformed to the partition constraint, so it would be pretty
> problematic to let users see it.  Conversely, when a partition is
> detached, there may still be scans from existing queries hitting it
> for a fairly arbitrary length of time afterwards.  That may be
> surprising from a locking point of view or something, but it's correct
> as far as MVCC goes.  Any changes made after the DETACH operation
> can't be visible to the snapshot being used for the scan.
>
> Now, what we could try to change on the fly is the set of partitions
> that are used for tuple routing.  For instance, suppose we're
> inserting a long stream of COPY data.  At some point, we attach a new
> partition from another session.  If we then encounter a row that
> doesn't route to any of the partitions that existed at the time the
> query started, we could - instead of immediately failing - go and
> reload the set of partitions that are available for tuple routing and
> see if the new partition which was concurrently added happens to be
> appropriate to the tuple we've got.  If so, we could route the tuple
> to it.  But all of this looks optional.  If new partitions aren't
> available for insert/update tuple routing until the start of the next
> query, that's not a catastrophe.  The reverse direction might be more
> problematic: if a partition is detached, I'm not sure how sensible it
> is to keep routing tuples into it.  On the flip side, what would
> break, really?

I'm unsure about that; I don't really see what it would buy us, so
presumably you're just considering that this might not be a
roadblocking side-effect. However, I think the PartitionDesc needs to
not change between planning and execution due to run-time pruning
requirements, so if that's the case then what you're saying here is
probably not an issue we need to think about.

> Given the foregoing, I don't see why you need something like
> PartitionIsValid() at all, or why you need an algorithm similar to
> CREATE INDEX CONCURRENTLY.  The problem seems mostly different.  In
> the case of CREATE INDEX CONCURRENTLY, the issue is that any new
> tuples that get inserted while the index creation is in progress need
> to end up in the index, so you'd better not start building the index
> on the existing tuples until everybody who might insert new tuples
> knows about the index.  I don't see that we have the same kind of
> problem in this case.  Each partition is basically a separate table
> with its own set of indexes; as long as queries don't end up with one
> notion of which tables are relevant and a different notion of which
> indexes are relevant, we shouldn't end up with any table/index
> inconsistencies.  And it's not clear to me what other problems we
> actually have here.  To put it another way, if we've got the notion of
> "isvalid" for a partition, what's the difference between a partition
> that exists but is not yet valid and one that exists and is valid?  I
> can't think of anything, and I have a feeling that you may therefore
> be inventing a lot of infrastructure that is not necessary.

Well, the problem is that you want REPEATABLE READ transactions to be
exactly that. A concurrent attach/detach should not change the output
of a query. I don't know for sure that some isvalid flag is required,
but we do need something to ensure we don't change the results of
queries run inside a repeatable read transaction. I did try to start
moving away from the isvalid flag in favour of having a PartitionDesc
just not change within the same snapshot but you've pointed out a few
problems with what I tried to come up with for that.

> I'm inclined to think that we could drop the name CONCURRENTLY from
> this feature altogether and recast it as work to reduce the lock level
> associated with partition attach/detach.  As long as we have a
> reasonable definition of what the semantics are for already-running
> queries, and clear documentation to go with those semantics, that
> seems fine.  If a particular user finds the concurrent behavior too
> strange, they can always perform the DDL in a transaction that uses
> LOCK TABLE first, removing the concurrency.

I did have similar thoughts but that seems like something to think
about once the semantics are determined, not before.

Thanks for your input on this. I clearly don't have all the answers,
so your input and thoughts are very valuable.

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
