ATTACH/DETACH PARTITION CONCURRENTLY


ATTACH/DETACH PARTITION CONCURRENTLY

David Rowley-3
Hi,

One of the downsides of declarative partitioning vs old school
inheritance partitioning is that a new partition cannot be added to
the partitioned table without taking an AccessExclusiveLock on the
partitioned table.  We've obviously got a bunch of features for
various other things where we work a bit harder to get around that
problem, e.g creating indexes concurrently.

I've started working on allowing partitions to be attached and
detached with just a ShareUpdateExclusiveLock on the table. If I'm
correct, then we can do this in a similar, but simpler, way to how
CREATE INDEX CONCURRENTLY works. We just need to pencil in that the
new partition exists but is not yet valid, then wait for snapshots
older than our own to finish before marking the partition as valid.

One problem I had with doing this is that there was not really a good
place to store that "isvalid" flag for partitions. We have pg_index
for indexes, but partition details are just spread over pg_inherits
and pg_class. So step 1 was to move all that into a new table called
pg_partition. I think this is quite nice as it also gets relpartbound
out of pg_class. It's probably just a matter of time before someone
complains that they can't create some partition with some pretty
large Datum due to it not being able to fit on a single heap page
(pg_class has no TOAST table).  I ended up getting rid of
pg_class.relispartition, replacing it with relpartitionparent, which
is just InvalidOid when the table or index is not a partition.  This
allows various pieces of code to be more efficient since we can look
at the relcache instead of scanning pg_inherits all the time. It's now
also much faster to get a partition's ancestors.

So, patch 0001 is just one I've already submitted for the July
'fest. Nothing new. It was just required to start this work.

0002 migrates partitions out of pg_inherits into pg_partition. This
patch is at a stage where it appears to work, but is very unpolished
and requires me to stare at it much longer than I've done so far.
There's a bunch of code that gets repeated way too many times in
tablecmds.c, for example.

0003 does the same for partitioned indexes. The patch is in a similar,
maybe slightly worse state than 0002. Various comments will be out of
date.

0004 is the early workings of what I have in mind for the concurrent
ATTACH code. It's vastly incomplete. It does pass make check, but
really only because there are no tests doing any concurrent attaches.
There's a mountain of code still missing that would make everything
ignore invalid partitions; I just have a very simple case working.
Partition-wise joins will be very much broken by what I have so far,
and likely a whole bunch of other stuff.

The following is about the extent of my tests so far:

--setup
create table listp (a int) partition by list(a);
create table listp1 partition of listp for values in(1);
create table listp2 (a int);
insert into listp1 values(1);
insert into listp2 values(2);

-- example 1.
start transaction isolation level repeatable read; -- Session 1
select * from listp; -- Session 1
 a
---
 1
(1 row)

alter table listp attach partition concurrently listp2 for values in
(2); -- Session 2 (waits for release of session 1's snapshot)
select * from listp; -- Session 1
 a
---
 1

commit; -- session 1 (session 2's alter table now finishes waiting)
select * from listp; -- Session 1 (new partition now valid)
 a
---
 1
 2
(2 rows)

-- example 2.
start transaction isolation level read committed; -- session 1;
select * from listp; -- session 1
 a
---
 1
(1 row)

alter table listp attach partition concurrently listp2 for values in
(2); -- Session 2 completes without waiting.

select * from listp; -- Session 1 (new partition visible while in transaction)
 a
---
 1
 2
(2 rows)

This basically works by:

1. Do all the normal partition attach partition validation.
2. Insert a record into pg_partition with partisvalid=false
3. Obtain a session-level ShareUpdateExclusiveLock on the partitioned table.
4. Obtain a session-level AccessExclusiveLock on the partition being attached.
5. Commit.
6. Start a new transaction.
7. Wait for snapshots older than our own to be released (a rough sketch of this step follows the list).
8. Mark the partition as valid
9. Invalidate relcache for the partitioned table.
10. release session-level locks.
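
For illustration, step 7 could borrow the same trick CREATE INDEX
CONCURRENTLY uses: collect the virtual xids of backends whose xmin
precedes ours and sleep on each one's vxid lock.  A minimal sketch
(the function name is invented; the calls are the existing
procarray/lock-manager ones):

#include "postgres.h"
#include "storage/lmgr.h"
#include "storage/proc.h"
#include "storage/procarray.h"

/* Sketch only: block until all snapshots older than limitXmin are gone. */
static void
wait_for_older_snapshots(TransactionId limitXmin)
{
	VirtualTransactionId *vxids;
	int			nvxids;
	int			i;

	/* backends (other than vacuum workers) whose xmin precedes limitXmin */
	vxids = GetCurrentVirtualXIDs(limitXmin, true, false,
								  PROC_IN_VACUUM, &nvxids);

	for (i = 0; i < nvxids; i++)
	{
		if (!VirtualTransactionIdIsValid(vxids[i]))
			continue;
		/* sleeps until that transaction ends and releases its vxid lock */
		VirtualXactLock(vxids[i], true);
	}
}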

I've disallowed the feature when the partitioned table has a default
partition. I don't see how this can be made to work.

At the moment ALTER TABLE ... ATTACH PARTITION commands cannot contain
any other sub-commands in the ALTER TABLE, so performing the
additional transaction commit and begin inside the single sub-command
might be okay.  It does mean that 'rel' which is passed down to
ATExecAttachPartition() must be closed and reopened again which
results in the calling function having a pointer into a closed
Relation. I worked around this by changing the code so it passes a
pointer to the Relation, and I've got ATExecAttachPartition() updating
that pointer before returning. It's not particularly pretty, but I
didn't really see how else this can be done.

I've not yet done anything about the DETACH CONCURRENTLY case. I think
it should just be the same steps in some roughly reverse order.  We
can skip the waiting part if the partition being detached is still
marked as invalid from some failed concurrent ATTACH.

I've not thought much about pg_dump beyond just having it ignore invalid
partitions. I don't think it's very useful to support some command
that attaches an invalid partition since there will be no command to
revalidate an invalid partition. It's probably best to resolve that
with a DETACH followed by a new ATTACH. So probably pg_dump can just
do nothing for invalid partitions.

So anyway, my intention in posting these patches now, rather than when
they're closer to being finished, is to get design review. I'm
interested in hearing objections, comments and constructive criticism
for patches 0002-0004. Comments on patch 0001 can go to [1].

Are there any blockers on this that I've overlooked?

[1] https://www.postgresql.org/message-id/CAKJS1f81TpxZ8twugrWCo%3DVDHEkmagxRx7a%2B1z4aaMeQy%3DnA7w%40mail.gmail.com

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

v1-0001-Speed-up-INSERT-and-UPDATE-on-partitioned-tables.patch (77K)
v1-0002-Store-partition-details-in-pg_partition-instead-o.patch (139K)
v1-0003-Don-t-store-partition-index-details-in-pg_inherit.patch (67K)
v1-0004-Allow-partitions-to-be-attached-without-blocking-.patch (50K)

Re: ATTACH/DETACH PARTITION CONCURRENTLY

David Rowley-3
On 3 August 2018 at 01:25, David Rowley <[hidden email]> wrote:

> 1. Do all the normal partition attach partition validation.
> 2. Insert a record into pg_partition with partisvalid=false
> 3. Obtain a session-level ShareUpdateExclusiveLock on the partitioned table.
> 4. Obtain a session-level AccessExclusiveLock on the partition being attached.
> 5. Commit.
> 6. Start a new transaction.
> 7. Wait for snapshots older than our own to be released.
> 8. Mark the partition as valid
> 9. Invalidate relcache for the partitioned table.
> 10. release session-level locks.

So I was thinking about this again and realised this logic is broken.
All it takes is a snapshot that starts after the ATTACH PARTITION
started and before it completed.  This snapshot will have the new
partition attached while it's possibly still open which could lead to
non-repeatable reads in a repeatable read transaction.  The window for
this to occur is possibly quite large given that the ATTACH
CONCURRENTLY can wait a long time for older snapshots to finish.

Here's my updated thinking for an implementation which seems to get
around the above problem:

ATTACH PARTITION CONCURRENTLY:

1. Obtain a ShareUpdateExclusiveLock on the partitioned table rather
than an AccessExclusiveLock.
2. Do all the normal partition attach partition validation.
3. Insert pg_partition record with partvalid = true.
4. Invalidate relcache entry for the partitioned table
5. Any loops over a partitioned table's PartitionDesc must check
PartitionIsValid(). This will return true if the current snapshot
should see the partition or not. The partition is valid if partisvalid
= true and the xmin precedes or is equal to the current snapshot.

#define PartitionIsValid(pd, i) \
	(((pd)->is_valid[(i)] && \
	  TransactionIdPrecedesOrEquals((pd)->xmin[(i)], GetCurrentTransactionId())) || \
	 (!(pd)->is_valid[(i)] && \
	  TransactionIdPrecedesOrEquals(GetCurrentTransactionId(), (pd)->xmin[(i)])))
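
For example (just a sketch; nparts and oids are the existing
PartitionDesc fields, while is_valid[] and xmin[] are the proposed
ones), each such loop would end up looking roughly like:

	for (i = 0; i < partdesc->nparts; i++)
	{
		if (!PartitionIsValid(partdesc, i))
			continue;		/* not visible to us; pretend it doesn't exist */

		/* ... use partdesc->oids[i] as before ... */
	}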

DETACH PARTITION CONCURRENTLY:

1. Obtain ShareUpdateExclusiveLock on partition being detached
(instead of the AccessShareLock that non-concurrent detach uses)
2. Update the pg_partition record, set partvalid = false.
3. Commit
4. New transaction.
5. Wait for transactions which hold a snapshot older than the one held
when updating pg_partition to complete.
6. Delete the pg_partition record.
7. Perform other cleanup, relpartitionparent = 0, pg_depend etc.

Handling DETACH PARTITION CONCURRENTLY failure after step 3 (above):

1. Make vacuum of a partition check for pg_partition.partvalid =
false; if the xmin of the tuple is old enough, perform partition
cleanup by doing steps 6+7 above.

A VACUUM FREEZE must run before transaction wraparound, so this means
a partition can never reattach itself when the transaction counter
wraps.
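
The vacuum-time check in step 1 above might look roughly like this
(Form_pg_partition and the column/helper names are from the proposal
or simply invented here):

	if (!partform->partvalid &&
		TransactionIdPrecedes(HeapTupleHeaderGetXmin(parttuple->t_data),
							  GetOldestXmin(NULL, PROCARRAY_FLAGS_VACUUM)))
		cleanup_failed_detach(partform->partrelid);	/* steps 6 + 7 above */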

I believe I've got the attach and detach working correctly now and
also have isolation tests that appear to prove it works. I've also
written the failed-detach cleanup code into vacuum. Unusually, since
foreign tables can also be partitions, this required teaching
auto-vacuum to look at foreign tables, but only in the sense of
checking for failed detached partitions. It also requires adding
vacuum support for foreign tables.  It feels a little bit weird to
modify auto-vacuum to look at foreign tables, but I really couldn't
see another way to do this.

I'm now considering if this all holds together in the event the
pg_partition tuple of an invalid partition becomes frozen. The problem
would be that PartitionIsValid() could return the wrong value due to
TransactionIdPrecedesOrEquals(GetCurrentTransactionId(),
(pd)->xmin[(i)]). This code is trying to keep the detached partition
visible to older snapshots, but if pd->xmin[i] becomes frozen, then
the partition would become invisible.  However, I think this won't be
a problem since a VACUUM FREEZE would only freeze tuples that are also
old enough to have had failed detaches cleaned up earlier in the
vacuum process.

Also, we must disallow a DEFAULT partition from being attached to a
partitioned table with a failed DETACH CONCURRENTLY, as it wouldn't be
very clear what the default partition's partition qual should be, as
this is built based on the quals of all attached partitions.

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: ATTACH/DETACH PARTITION CONCURRENTLY

Andres Freund
Hi,

On 2018-08-08 00:40:12 +1200, David Rowley wrote:
> 1. Obtain a ShareUpdateExclusiveLock on the partitioned table rather
> than an AccessExclusiveLock.
> 2. Do all the normal partition attach partition validation.
> 3. Insert pg_partition record with partvalid = true.
> 4. Invalidate relcache entry for the partitioned table
> 5. Any loops over a partitioned table's PartitionDesc must check
> PartitionIsValid(). This will return true if the current snapshot
> should see the partition or not. The partition is valid if partisvalid
> = true and the xmin precedes or is equal to the current snapshot.

How does this protect against other sessions actively using the relcache
entry? Currently it is *NOT* safe to receive invalidations for
e.g. partitioning contents afaics.

- Andres


Re: ATTACH/DETACH PARTITION CONCURRENTLY

Simon Riggs
On 7 August 2018 at 13:47, Andres Freund <[hidden email]> wrote:

> Hi,
>
> On 2018-08-08 00:40:12 +1200, David Rowley wrote:
>> 1. Obtain a ShareUpdateExclusiveLock on the partitioned table rather
>> than an AccessExclusiveLock.
>> 2. Do all the normal partition attach partition validation.
>> 3. Insert pg_partition record with partvalid = true.
>> 4. Invalidate relcache entry for the partitioned table
>> 5. Any loops over a partitioned table's PartitionDesc must check
>> PartitionIsValid(). This will return true if the current snapshot
>> should see the partition or not. The partition is valid if partisvalid
>> = true and the xmin precedes or is equal to the current snapshot.
>
> How does this protect against other sessions actively using the relcache
> entry? Currently it is *NOT* safe to receive invalidations for
> e.g. partitioning contents afaics.

I think you may be right in the general case, but ISTM possible to
invalidate/refresh just the list of partitions.

If so, that idea would seem to require some new, as-yet not invented mechanism.

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: ATTACH/DETACH PARTITION CONCURRENTLY

David Rowley-3
In reply to this post by Andres Freund
On 8 August 2018 at 00:47, Andres Freund <[hidden email]> wrote:

> On 2018-08-08 00:40:12 +1200, David Rowley wrote:
>> 1. Obtain a ShareUpdateExclusiveLock on the partitioned table rather
>> than an AccessExclusiveLock.
>> 2. Do all the normal partition attach partition validation.
>> 3. Insert pg_partition record with partvalid = true.
>> 4. Invalidate relcache entry for the partitioned table
>> 5. Any loops over a partitioned table's PartitionDesc must check
>> PartitionIsValid(). This will return true if the current snapshot
>> should see the partition or not. The partition is valid if partisvalid
>> = true and the xmin precedes or is equal to the current snapshot.
>
> How does this protect against other sessions actively using the relcache
> entry? Currently it is *NOT* safe to receive invalidations for
> e.g. partitioning contents afaics.

I'm not proposing that sessions running older snapshots can't see that
there's a new partition. The code I have uses PartitionIsValid() to
test if the partition should be visible to the snapshot. The
PartitionDesc will always contain details for all partitions stored in
pg_partition whether they're valid to the current snapshot or not.  I
did it this way as there's no way to invalidate the relcache based on
a point in transaction, only a point in time.

I'm open to better ideas, of course.

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: ATTACH/DETACH PARTITION CONCURRENTLY

Andres Freund
On 2018-08-08 01:23:51 +1200, David Rowley wrote:

> On 8 August 2018 at 00:47, Andres Freund <[hidden email]> wrote:
> > On 2018-08-08 00:40:12 +1200, David Rowley wrote:
> >> 1. Obtain a ShareUpdateExclusiveLock on the partitioned table rather
> >> than an AccessExclusiveLock.
> >> 2. Do all the normal partition attach partition validation.
> >> 3. Insert pg_partition record with partvalid = true.
> >> 4. Invalidate relcache entry for the partitioned table
> >> 5. Any loops over a partitioned table's PartitionDesc must check
> >> PartitionIsValid(). This will return true if the current snapshot
> >> should see the partition or not. The partition is valid if partisvalid
> >> = true and the xmin precedes or is equal to the current snapshot.
> >
> > How does this protect against other sessions actively using the relcache
> > entry? Currently it is *NOT* safe to receive invalidations for
> > e.g. partitioning contents afaics.
>
> I'm not proposing that sessions running older snapshots can't see that
> there's a new partition. The code I have uses PartitionIsValid() to
> test if the partition should be visible to the snapshot. The
> PartitionDesc will always contain details for all partitions stored in
> pg_partition whether they're valid to the current snapshot or not.  I
> did it this way as there's no way to invalidate the relcache based on
> a point in transaction, only a point in time.

I don't think that solves the problem that an arriving relcache
invalidation would trigger a rebuild of rd_partdesc, while it actually
is referenced by running code.

You'd need to build infrastructure to prevent that.

One approach would be to make sure that everything relying on
rt_partdesc staying the same stores its value in a local variable, and
then *not* free the old version of rt_partdesc (etc) when the refcount >
0, but delay that to the RelationClose() that makes refcount reach
0. That'd be the start of a framework for more such concurrent
handling.
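
Very roughly, something like this (rd_oldpartdescs and the free
routine are invented; only rd_partdesc and the refcount macros exist
today):

	/* In the relcache rebuild path -- sketch only, not actual code */
	if (RelationHasReferenceCountZero(relation))
		free_partition_desc(relation->rd_partdesc);	/* hypothetical */
	else
		relation->rd_oldpartdescs = lappend(relation->rd_oldpartdescs,
											relation->rd_partdesc);
	relation->rd_partdesc = newpartdesc;

	/*
	 * ... and in RelationClose(), once the refcount drops to zero, free
	 * everything parked on rd_oldpartdescs.
	 */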

Regards,

Andres Freund


Re: ATTACH/DETACH PARTITION CONCURRENTLY

David Rowley-3
On 8 August 2018 at 01:29, Andres Freund <[hidden email]> wrote:

> On 2018-08-08 01:23:51 +1200, David Rowley wrote:
>> I'm not proposing that sessions running older snapshots can't see that
>> there's a new partition. The code I have uses PartitionIsValid() to
>> test if the partition should be visible to the snapshot. The
>> PartitionDesc will always contain details for all partitions stored in
>> pg_partition whether they're valid to the current snapshot or not.  I
>> did it this way as there's no way to invalidate the relcache based on
>> a point in transaction, only a point in time.
>
> I don't think that solves the problem that an arriving relcache
> invalidation would trigger a rebuild of rd_partdesc, while it actually
> is referenced by running code.
>
> You'd need to build infrastructure to prevent that.
>
> One approach would be to make sure that everything relying on
> rt_partdesc staying the same stores its value in a local variable, and
> then *not* free the old version of rt_partdesc (etc) when the refcount >
> 0, but delay that to the RelationClose() that makes refcount reach
> 0. That'd be the start of a framework for more such concurrent
> handling.
I'm not so sure not freeing the partdesc until the refcount reaches 0
is safe. As you'd expect, we hold a lock on a partitioned table
between the planner and executor, but the relation has been closed and
the refcount has returned to 0, which means that when the relation is
first opened in the executor, the updated PartitionDesc is obtained.
A non-concurrent attach would have been blocked in this case due to
the lock being held by the planner. Instead of using refcount == 0,
perhaps we can release the original partdesc only when there are no
locks held by us on the relation.

It's late here now, so I'll look at that tomorrow.

I've attached what I was playing around with. I think I'll also need
to change RelationGetPartitionDesc() to have it return the original
partdesc, if it's non-NULL.

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

dont_destroy_original_partdesc_on_rel_inval.patch (5K)

Re: ATTACH/DETACH PARTITION CONCURRENTLY

Peter Eisentraut-6
In reply to this post by Andres Freund
On 07/08/2018 15:29, Andres Freund wrote:
> I don't think that solves the problem that an arriving relcache
> invalidation would trigger a rebuild of rd_partdesc, while it actually
> is referenced by running code.

The problem is more generally that a relcache invalidation changes all
pointers that might be in use.  So it's currently not safe to trigger a
relcache invalidation (on tables) without some kind of exclusive lock.
One possible solution to this is outlined here:
<https://www.postgresql.org/message-id/CA+TgmobtmFT5g-0dA=vEFFtogjRAuDHcYPw+qEdou5dZPnF=pg@...>

--
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: ATTACH/DETACH PARTITION CONCURRENTLY

Andres Freund
Hi,

On 2018-08-09 20:57:35 +0200, Peter Eisentraut wrote:
> On 07/08/2018 15:29, Andres Freund wrote:
> > I don't think that solves the problem that an arriving relcache
> > invalidation would trigger a rebuild of rd_partdesc, while it actually
> > is referenced by running code.
>
> The problem is more generally that a relcache invalidation changes all
> pointers that might be in use.

I don't think that's quite right. We already better be OK with
superfluous invals that do not change anything, because there are
already sources of those (just think of vacuum, analyze, relation
extension, whatnot).


> So it's currently not safe to trigger a relcache invalidation (on
> tables) without some kind of exclusive lock.

I don't think that's true in as general a sense as you're stating it.
It's not OK to send relcache invalidations for things that people rely
on and that cannot be updated in-place, because of the dangling
pointer issue etc.

The fact that currently it is not safe to *change* partition-related
stuff without an AEL, and how to make it safe, is precisely what I was
talking about in the thread. It won't be a general solution, but the
infrastructure I'm talking about should get us closer.


> One possible solution to this is outlined here:
> <https://www.postgresql.org/message-id/CA+TgmobtmFT5g-0dA=vEFFtogjRAuDHcYPw+qEdou5dZPnF=pg@...>

I don't see anything in here that addresses the issue structurally?

Greetings,

Andres Freund


Re: ATTACH/DETACH PARTITION CONCURRENTLY

David Rowley-3
In reply to this post by Andres Freund
On 8 August 2018 at 01:29, Andres Freund <[hidden email]> wrote:
> One approach would be to make sure that everything relying on
> rt_partdesc staying the same stores its value in a local variable, and
> then *not* free the old version of rt_partdesc (etc) when the refcount >
> 0, but delay that to the RelationClose() that makes refcount reach
> 0. That'd be the start of a framework for more such concurrent
> handling.

This is not a fully baked idea, but I'm wondering if a better way to
do this, instead of having this PartitionIsValid macro to determine if
the partition should be visible to the current transaction ID, would
be to send along, when we invalidate a relcache entry, the transaction
ID that it's invalid from.  Other backends, when they process the
invalidation message, could wipe out the cache entry only if their
xid is >= the invalidation's "xmax" value. Otherwise, just tag the
xmax onto the cache somewhere and always check it before using the
cache (perhaps make it part of the RelationIsValid macro).  This would
also require that we move away from SnapshotAny type catalogue scans
in favour of MVCC scans so that backends populating their relcache
build it based on their current xid.  Unless I'm mistaken, it should
not make any difference for all DDL that takes an AEL on the relation,
since there can be no older transactions running when the catalogue is
modified, but for DDL that's not taking an AEL, we could effectively
have an MVCC relcache.
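
To be concrete about the shape of the thing, the invalidation message
might carry something like this (entirely hypothetical; no such struct
exists today):

	typedef struct RelcacheInvalWithXid
	{
		Oid				relid;				/* relation being invalidated */
		TransactionId	invalid_from_xid;	/* wipe the entry only if our
											 * xid is >= this value */
	} RelcacheInvalWithXid;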

It would need careful thought about how it might affect CREATE INDEX
CONCURRENTLY and all the other DDL that can be performed without an
AEL.

I'm unsure how this would work for the catcache as I've studied that
code in even less detail, but I'm throwing this out there in case
there is some major flaw in this idea, so that I don't go wasting time
looking into it further.

I think the PartitionIsValid idea was not that great as it really
complicates run-time partition pruning, which is quite sensitive to
partition indexes being the same between the planner and executor.

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: ATTACH/DETACH PARTITION CONCURRENTLY

Robert Haas
On Sun, Aug 12, 2018 at 9:05 AM, David Rowley
<[hidden email]> wrote:
> This is not a fully baked idea, but I'm wondering if a better way to
> do this, instead of having this PartitionIsValid macro to determine if
> the partition should be visible to the current transaction ID, we
> could, when we invalidate a relcache entry, send along the transaction
> ID that it's invalid from.  Other backends when they process the
> invalidation message they could wipe out the cache entry only if their
> xid is >= the invalidation's "xmax" value. Otherwise, just tag the
> xmax onto the cache somewhere and always check it before using the
> cache (perhaps make it part of the RelationIsValid macro).

Transactions don't necessarily commit in XID order, so this might be
an optimization to keep older transactions from having to do
unnecessary rebuilds -- which I actually doubt is a major problem, but
maybe I'm wrong -- but you can't rely solely on this as a way of
deciding which transactions will see the effects of some change.  If
transactions 100, 101, and 102 begin in that order, and transaction
101 commits, there's no principled justification for 102 seeing its
effects but 100 not seeing it.

> This would
> also require that we move away from SnapshotAny type catalogue scans
> in favour of MVCC scans so that backends populating their relcache
> build it based on their current xid.

I think this is a somewhat confused analysis.  We don't use
SnapshotAny for catalog scans, and we never have.  We used to use
SnapshotNow, and we now use a current MVCC snapshot.  What you're
talking about, I think, is possibly using the transaction snapshot
rather than a current MVCC snapshot for the catalog scans.

I've thought about similar things, but I think there's a pretty deep
can of worms.  For instance, if you built a relcache entry using the
transaction snapshot, you might end up building a seemingly-valid
relcache entry for a relation that has been dropped or rewritten.
When you try to access the relation data, you'll attempt to access
a relfilenode that's not there any more.  Similarly, if you use an
older snapshot to build a partition descriptor, you might think that
relation OID 12345 is still a partition of that table when in fact
it's been detached - and, maybe, altered in other ways, such as
changing column types.

It seems to me that overall you're not really focusing on the right
set of issues here.  I think the very first thing we need to worry
about is how we're going to keep the executor from following a bad
pointer and crashing.  Any time the partition descriptor changes, the
next relcache rebuild is going to replace rd_partdesc and free the old
one, but the executor may still have the old pointer cached in a
structure or local variable; the next attempt to dereference it will
be looking at freed memory, and kaboom.  Right now, we prevent this by
not allowing the partition descriptor to be modified while there are
any queries running against the partition, so while there may be a
rebuild, the old pointer will remain valid (cf. keep_partdesc).  I
think that whatever scheme you/we choose here should be tested with a
combination of CLOBBER_CACHE_ALWAYS and multiple concurrent sessions
-- one of them doing DDL on the table while the other runs a long
query.

Once we keep it from blowing up, the second question is what the
semantics are supposed to be.  It seems inconceivable to me that the
set of partitions that an in-progress query will scan can really be
changed on the fly.  I think we're going to have to rule that if you
add or remove partitions while a query is running, we're going to scan
exactly the set we had planned to scan at the beginning of the query;
anything else would require on-the-fly plan surgery to a degree that
seems unrealistic.  That means that when a new partition is attached,
already-running queries aren't going to scan it.  If they did, we'd
have big problems, because the transaction snapshot might see rows in
those tables from an earlier time period during which that table
wasn't attached.  There's no guarantee that the data at that time
conformed to the partition constraint, so it would be pretty
problematic to let users see it.  Conversely, when a partition is
detached, there may still be scans from existing queries hitting it
for a fairly arbitrary length of time afterwards.  That may be
surprising from a locking point of view or something, but it's correct
as far as MVCC goes.  Any changes made after the DETACH operation
can't be visible to the snapshot being used for the scan.

Now, what we could try to change on the fly is the set of partitions
that are used for tuple routing.  For instance, suppose we're
inserting a long stream of COPY data.  At some point, we attach a new
partition from another session.  If we then encounter a row that
doesn't route to any of the partitions that existed at the time the
query started, we could - instead of immediately failing - go and
reload the set of partitions that are available for tuple routing and
see if the new partition which was concurrently added happens to be
appropriate to the tuple we've got.  If so, we could route the tuple
to it.  But all of this looks optional.  If new partitions aren't
available for insert/update tuple routing until the start of the next
query, that's not a catastrophe.  The reverse direction might be more
problematic: if a partition is detached, I'm not sure how sensible it
is to keep routing tuples into it.  On the flip side, what would
break, really?
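
To sketch that retry idea (the helper names here are invented; this is
not how the current tuple-routing code is structured):

	part_index = routing_find_partition(routing, slot);
	if (part_index < 0)
	{
		/* pick up any concurrently attached partitions and retry once */
		routing_reload_partitions(routing);
		part_index = routing_find_partition(routing, slot);
		if (part_index < 0)
			ereport(ERROR,
					(errcode(ERRCODE_CHECK_VIOLATION),
					 errmsg("no partition of relation \"%s\" found for row",
							RelationGetRelationName(rel))));
	}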

Given the foregoing, I don't see why you need something like
PartitionIsValid() at all, or why you need an algorithm similar to
CREATE INDEX CONCURRENTLY.  The problem seems mostly different.  In
the case of CREATE INDEX CONCURRENTLY, the issue is that any new
tuples that get inserted while the index creation is in progress need
to end up in the index, so you'd better not start building the index
on the existing tuples until everybody who might insert new tuples
knows about the index.  I don't see that we have the same kind of
problem in this case.  Each partition is basically a separate table
with its own set of indexes; as long as queries don't end up with one
notion of which tables are relevant and a different notion of which
indexes are relevant, we shouldn't end up with any table/index
inconsistencies.  And it's not clear to me what other problems we
actually have here.  To put it another way, if we've got the notion of
"isvalid" for a partition, what's the difference between a partition
that exists but is not yet valid and one that exists and is valid?  I
can't think of anything, and I have a feeling that you may therefore
be inventing a lot of infrastructure that is not necessary.

I'm inclined to think that we could drop the name CONCURRENTLY from
this feature altogether and recast it as work to reduce the lock level
associated with partition attach/detach.  As long as we have a
reasonable definition of what the semantics are for already-running
queries, and clear documentation to go with those semantics, that
seems fine.  If a particular user finds the concurrent behavior too
strange, they can always perform the DDL in a transaction that uses
LOCK TABLE first, removing the concurrency.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: ATTACH/DETACH PARTITION CONCURRENTLY

David Rowley-3
On 14 August 2018 at 04:00, Robert Haas <[hidden email]> wrote:

> I've thought about similar things, but I think there's a pretty deep
> can of worms.  For instance, if you built a relcache entry using the
> transaction snapshot, you might end up building a seemingly-valid
> relcache entry for a relation that has been dropped or rewritten.
> When you try to access the relation data, you'll attempt to access
> a relfilenode that's not there any more.  Similarly, if you use an
> older snapshot to build a partition descriptor, you might think that
> relation OID 12345 is still a partition of that table when in fact
> it's been detached - and, maybe, altered in other ways, such as
> changing column types.

hmm, I guess for that to work correctly we'd need some way to allow
older snapshots to see the changes if they've not already taken a lock
on the table. If the lock had already been obtained then the ALTER
TABLE to change the type of the column would get blocked by the
existing lock. That kinda blows a hole in applying the change only to
snapshots newer than the ATTACH/DETACH's.

> It seems to me that overall you're not really focusing on the right
> set of issues here.  I think the very first thing we need to worry
> about is how we're going to keep the executor from following a bad
> pointer and crashing.  Any time the partition descriptor changes, the
> next relcache rebuild is going to replace rd_partdesc and free the old
> one, but the executor may still have the old pointer cached in a
> structure or local variable; the next attempt to dereference it will
> be looking at freed memory, and kaboom.  Right now, we prevent this by
> not allowing the partition descriptor to be modified while there are
> any queries running against the partition, so while there may be a
> rebuild, the old pointer will remain valid (cf. keep_partdesc).  I
> think that whatever scheme you/we choose here should be tested with a
> combination of CLOBBER_CACHE_ALWAYS and multiple concurrent sessions
> -- one of them doing DDL on the table while the other runs a long
> query.

I did focus on that and did write a patch to solve the issue. After
writing that I discovered another problem where if the PartitionDesc
differed between planning and execution then run-time pruning did the
wrong thing (See find_matching_subplans_recurse). The
PartitionPruneInfo is built assuming the PartitionDesc matches between
planning and execution. I moved on from the dangling pointer issue
onto trying to figure out a way to ensure these are the same between
planning and execution.

> Once we keep it from blowing up, the second question is what the
> semantics are supposed to be.  It seems inconceivable to me that the
> set of partitions that an in-progress query will scan can really be
> changed on the fly.  I think we're going to have to rule that if you
> add or remove partitions while a query is running, we're going to scan
> exactly the set we had planned to scan at the beginning of the query;
> anything else would require on-the-fly plan surgery to a degree that
> seems unrealistic.

Trying to do that for in-progress queries would be pretty insane. I'm
not planning on doing anything there.

> That means that when a new partition is attached,
> already-running queries aren't going to scan it.  If they did, we'd
> have big problems, because the transaction snapshot might see rows in
> those tables from an earlier time period during which that table
> wasn't attached.  There's no guarantee that the data at that time
> conformed to the partition constraint, so it would be pretty
> problematic to let users see it.  Conversely, when a partition is
> detached, there may still be scans from existing queries hitting it
> for a fairly arbitrary length of time afterwards.  That may be
> surprising from a locking point of view or something, but it's correct
> as far as MVCC goes.  Any changes made after the DETACH operation
> can't be visible to the snapshot being used for the scan.
>
> Now, what we could try to change on the fly is the set of partitions
> that are used for tuple routing.  For instance, suppose we're
> inserting a long stream of COPY data.  At some point, we attach a new
> partition from another session.  If we then encounter a row that
> doesn't route to any of the partitions that existed at the time the
> query started, we could - instead of immediately failing - go and
> reload the set of partitions that are available for tuple routing and
> see if the new partition which was concurrently added happens to be
> appropriate to the tuple we've got.  If so, we could route the tuple
> to it.  But all of this looks optional.  If new partitions aren't
> available for insert/update tuple routing until the start of the next
> query, that's not a catastrophe.  The reverse direction might be more
> problematic: if a partition is detached, I'm not sure how sensible it
> is to keep routing tuples into it.  On the flip side, what would
> break, really?

Unsure about that, I don't really see what it would buy us, so
presumably you're just considering that this might not be a
roadblocking side-effect. However, I think the PartitionDesc needs to
not change between planning and execution due to run-time pruning
requirements, so if that's the case then what you're saying here is
probably not an issue we need to think about.

> Given the foregoing, I don't see why you need something like
> PartitionIsValid() at all, or why you need an algorithm similar to
> CREATE INDEX CONCURRENTLY.  The problem seems mostly different.  In
> the case of CREATE INDEX CONCURRENTLY, the issue is that any new
> tuples that get inserted while the index creation is in progress need
> to end up in the index, so you'd better not start building the index
> on the existing tuples until everybody who might insert new tuples
> knows about the index.  I don't see that we have the same kind of
> problem in this case.  Each partition is basically a separate table
> with its own set of indexes; as long as queries don't end up with one
> notion of which tables are relevant and a different notion of which
> indexes are relevant, we shouldn't end up with any table/index
> inconsistencies.  And it's not clear to me what other problems we
> actually have here.  To put it another way, if we've got the notion of
> "isvalid" for a partition, what's the difference between a partition
> that exists but is not yet valid and one that exists and is valid?  I
> can't think of anything, and I have a feeling that you may therefore
> be inventing a lot of infrastructure that is not necessary.

Well, the problem is that you want REPEATABLE READ transactions to be
exactly that. A concurrent attach/detach should not change the output
of a query. I don't know for sure that some isvalid flag is required,
but we do need something to ensure we don't change the results of
queries run inside a repeatable read transaction. I did try to start
moving away from the isvalid flag in favour of having a PartitionDesc
just not change within the same snapshot but you've pointed out a few
problems with what I tried to come up with for that.

> I'm inclined to think that we could drop the name CONCURRENTLY from
> this feature altogether and recast it as work to reduce the lock level
> associated with partition attach/detach.  As long as we have a
> reasonable definition of what the semantics are for already-running
> queries, and clear documentation to go with those semantics, that
> seems fine.  If a particular user finds the concurrent behavior too
> strange, they can always perform the DDL in a transaction that uses
> LOCK TABLE first, removing the concurrency.

I did have similar thoughts but that seems like something to think
about once the semantics are determined, not before.

Thanks for your input on this. I clearly don't have all the answers on
this so your input and thoughts are very valuable.

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: ATTACH/DETACH PARTITION CONCURRENTLY

Alvaro Herrera-9
In reply to this post by Robert Haas
On 2018-Aug-13, Robert Haas wrote:

> I think this is a somewhat confused analysis.  We don't use
> SnapshotAny for catalog scans, and we never have.  We used to use
> SnapshotNow, and we now use a current MVCC snapshot.  What you're
> talking about, I think, is possibly using the transaction snapshot
> rather than a current MVCC snapshot for the catalog scans.
>
> I've thought about similar things, but I think there's a pretty deep
> can of worms.  For instance, if you built a relcache entry using the
> transaction snapshot, you might end up building a seemingly-valid
> relcache entry for a relation that has been dropped or rewritten.
> When you try to access the relation data, you'll attempt to access
> a relfilenode that's not there any more.  Similarly, if you use an
> older snapshot to build a partition descriptor, you might think that
> relation OID 12345 is still a partition of that table when in fact
> it's been detached - and, maybe, altered in other ways, such as
> changing column types.

I wonder if this all stems from a misunderstanding of what I suggested
to David offlist.  My suggestion was that the catalog scans would
continue to use the catalog MVCC snapshot, and that the relcache entries
would contain all the partitions that appear to the catalog; but each
partition's entry would carry the Xid of the creating transaction in a
field (say xpart), and that field is compared to the regular transaction
snapshot: if xpart is visible to the transaction snapshot, then the
partition is visible, otherwise not.  So you never try to access a
partition that doesn't exist, because those just don't appear at all in
the relcache entry.  But if you have an old transaction running with an
old snapshot, and the partitioned table just acquired a new partition,
then whether the partition will be returned as part of the partition
descriptor or not depends on the visibility of its entry.
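
In code, the visibility test would be something like this sketch
("xpart" is the hypothetical field; the rest are existing calls,
assuming XidInMVCCSnapshot() is usable here):

#include "postgres.h"
#include "access/transam.h"
#include "utils/snapshot.h"
#include "utils/tqual.h"

/* Sketch: should this partition appear in the partition descriptor? */
static bool
partition_visible_to_snapshot(TransactionId xpart, Snapshot query_snapshot)
{
	if (!TransactionIdIsValid(xpart))
		return true;			/* partition predates xid tracking */

	/* committed, and not still in-progress as far as our snapshot knows */
	return TransactionIdDidCommit(xpart) &&
		!XidInMVCCSnapshot(xpart, query_snapshot);
}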

I think that works fine for ATTACH without any further changes.  I'm not
so sure about DETACH, particularly when snapshots persist for a "long
time" (a repeatable-read transaction).  ISTM that in the above design,
the partition descriptor would lose the entry for the detached partition
ahead of time, which means queries would silently fail to see their data
(though they wouldn't crash).  I first thought this could be fixed by
waiting for those snapshots to finish, but then I realized that there's
no actual place where waiting achieves anything.  Certainly it's not
useful to wait before commit (because other snapshots are going to be
starting all the time), and it's not useful to start after the commit
(because by then the catalog tuple is already gone).  Maybe we need two
transactions: mark partition as removed with an xmax of sorts, commit,
wait for all snapshots, start transaction, remove partition catalog
tuple, commit.

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: ATTACH/DETACH PARTITION CONCURRENTLY

Robert Haas
On Mon, Aug 20, 2018 at 4:21 PM, Alvaro Herrera
<[hidden email]> wrote:

> I wonder if this all stems from a misunderstanding of what I suggested
> to David offlist.  My suggestion was that the catalog scans would
> continue to use the catalog MVCC snapshot, and that the relcache entries
> would contain all the partitions that appear to the catalog; but each
> partition's entry would carry the Xid of the creating transaction in a
> field (say xpart), and that field is compared to the regular transaction
> snapshot: if xpart is visible to the transaction snapshot, then the
> partition is visible, otherwise not.  So you never try to access a
> partition that doesn't exist, because those just don't appear at all in
> the relcache entry.  But if you have an old transaction running with an
> old snapshot, and the partitioned table just acquired a new partition,
> then whether the partition will be returned as part of the partition
> descriptor or not depends on the visibility of its entry.

Hmm.  One question is where you're going to get the XID of the
creating transaction.  If it's taken from the pg_class row or the
pg_inherits row or something of that sort, then you risk getting a
bogus value if something updates that row other than what you expect
-- and the consequences of that are pretty bad here; for this to work
as you intend, you need an exactly-correct value, not newer or older.
An alternative is to add an xid field that stores the value
explicitly, and that might work, but you'll have to arrange for that
value to be frozen at the appropriate time.

A further problem is that there could be multiple changes in quick
succession.  Suppose that a partition is attached, then detached
before the attach operation is all-visible, then reattached, perhaps
with different partition bounds.

> I think that works fine for ATTACH without any further changes.  I'm not
> so sure about DETACH, particularly when snapshots persist for a "long
> time" (a repeatable-read transaction).  ISTM that in the above design,
> the partition descriptor would lose the entry for the detached partition
> ahead of time, which means queries would silently fail to see their data
> (though they wouldn't crash).

I don't see why they wouldn't crash.  If the partition descriptor gets
rebuilt and some partitions disappear out from under you, the old
partition descriptor is going to get freed, and the executor has a
cached pointer to it, so it seems like you are in trouble.

> I first thought this could be fixed by
> waiting for those snapshots to finish, but then I realized that there's
> no actual place where waiting achieves anything.  Certainly it's not
> useful to wait before commit (because other snapshots are going to be
> starting all the time), and it's not useful to start after the commit
> (because by then the catalog tuple is already gone).  Maybe we need two
> transactions: mark partition as removed with an xmax of sorts, commit,
> wait for all snapshots, start transaction, remove partition catalog
> tuple, commit.

And what would that accomplish, exactly?  Waiting for all snapshots
would ensure that all still-running transactions see the xmax with
which the partition has been marked as removed, but what good does
that do?  In order to have a plausible algorithm, you have to
describe both what the ATTACH/DETACH operation does and what the other
concurrent transactions do and how those things interact.  Otherwise,
it's like saying that we're going to solve a problem with X and Y
overlapping by having X take a lock.  If Y doesn't take a conflicting
lock, this does nothing.

Generally, I think I see what you're aiming at: make ATTACH and DETACH
have MVCC-like semantics with respect to concurrent transactions.  I
don't think that's a dumb idea from a theoretical perspective, but in
practice I think it's going to be very difficult to implement.  We
have no other DDL that has such semantics, and there's no reason we
couldn't; for example, TRUNCATE could work with SUEL and transactions
that can't see the TRUNCATE as committed continue to operate on the
old heap.  While we could do such things, we don't.  If you decide to
do them here, you've probably got a lot of work ahead of you.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: ATTACH/DETACH PARTITION CONCURRENTLY

David Rowley-3
On 21 August 2018 at 13:59, Robert Haas <[hidden email]> wrote:

> On Mon, Aug 20, 2018 at 4:21 PM, Alvaro Herrera
> <[hidden email]> wrote:
>> I wonder if this all stems from a misunderstanding of what I suggested
>> to David offlist.  My suggestion was that the catalog scans would
>> continue to use the catalog MVCC snapshot, and that the relcache entries
>> would contain all the partitions that appear to the catalog; but each
>> partition's entry would carry the Xid of the creating transaction in a
>> field (say xpart), and that field is compared to the regular transaction
>> snapshot: if xpart is visible to the transaction snapshot, then the
>> partition is visible, otherwise not.  So you never try to access a
>> partition that doesn't exist, because those just don't appear at all in
>> the relcache entry.  But if you have an old transaction running with an
>> old snapshot, and the partitioned table just acquired a new partition,
>> then whether the partition will be returned as part of the partition
>> descriptor or not depends on the visibility of its entry.
>
> Hmm.  One question is where you're going to get the XID of the
> creating transaction.  If it's taken from the pg_class row or the
> pg_inherits row or something of that sort, then you risk getting a
> bogus value if something updates that row other than what you expect
> -- and the consequences of that are pretty bad here; for this to work
> as you intend, you need an exactly-correct value, not newer or older.
> An alternative is to add an xid field that stores the value
> explicitly, and that might work, but you'll have to arrange for that
> value to be frozen at the appropriate time.
>
> A further problem is that there could be multiple changes in quick
> succession.  Suppose that a partition is attached, then detached
> before the attach operation is all-visible, then reattached, perhaps
> with different partition bounds.

I should probably post the WIP I have here.  In those, I do have the
xmin array in the PartitionDesc. This gets taken from the new
pg_partition table, which I don't think suffers from the same issue as
taking it from pg_class, since nothing else will update the
pg_partition record.

However, I don't think the xmin array is going to work if we include
it in the PartitionDesc.  The problem, as I discovered from writing
the code, is that the PartitionDesc must remain exactly the same
between planning and execution. If there are any more or any fewer
partitions found during execution than what we saw in planning then
run-time pruning will access the wrong element in the
PartitionPruneInfo array, or perhaps access off the end of the array.
It might be possible to work around that by identifying partitions by
Oid rather than by PartitionDesc array index, but the run-time pruning
code is already pretty complex. I think coding it to work when the
PartitionDesc does not match between planning and execution is just
going to be too difficult to get right.  Tom is already unhappy with
the complexity of ExecFindInitialMatchingSubPlans().

I think the solution will require that the PartitionDesc does not:

a) Change between planning and execution.
b) Change during a snapshot after the partitioned table has been locked.

With b, it sounds like we'll need to take the most recent
PartitionDesc even if the transaction is older than the one that did
the ATTACH/DETACH operation, since if we use an old version then, as
Robert mentions, there's nothing to stop another transaction making
changes to the table that make it an incompatible partition, e.g. DROP
COLUMN.  This wouldn't be possible if we update the PartitionDesc
right after taking the first lock on the partitioned table, since any
transaction doing DROP COLUMN would be blocked until the other
snapshot gets released.

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: ATTACH/DETACH PARTITION CONCURRENTLY

Alvaro Herrera-9
In reply to this post by David Rowley-3
Hello

Here's my take on this feature, building on David Rowley's version.

Firstly, I took Robert's advice and removed the CONCURRENTLY keyword
from the syntax.  We just do it that way always.  When there's a default
partition, only that partition is locked with an AEL; all the rest is
locked with ShareUpdateExclusive only.

I added some isolation tests for it -- they all pass for me.

There are two main ideas supporting this patch:

1. The Partition descriptor cache module (partcache.c) now contains a
long-lived hash table that lists all the current partition descriptors;
when an invalidation message is received for a relation, we unlink the
partdesc from the hash table *but do not free it*.  The hash
table-linked partdesc is rebuilt again in the future, when requested, so
many copies might exist in memory for one partitioned table.

2. Snapshots have their own cache (hash table) of partition descriptors.
If a partdesc is requested and the snapshot has already obtained that
partdesc, the original one is returned -- we don't request a new one
from partcache.
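
The lookup is then conceptually just (the snapshot field and the entry
struct are named here only for illustration):

	/* Sketch of the per-snapshot lookup */
	entry = hash_search(snapshot->partdesc_cache, &RelationGetRelid(rel),
						HASH_ENTER, &found);
	if (!found)
		entry->partdesc = RelationGetPartitionDesc(rel);	/* from partcache */
	return entry->partdesc;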

Then there are a few other implementation details worth mentioning:

3. parallel query: when a worker starts on a snapshot that has a
partition descriptor cache, we need to transmit those partdescs from
leader via shmem ... but we cannot send the full struct, so we just send
the OID list of partitions, then rebuild the descriptor in the worker.
Side effect: if a partition is detached right between the leader taking
the partdesc and the worker starting, the partition loses its
relpartbound column, so it's not possible to reconstruct the partdesc.
In this case, we raise an error.  Hopefully this should be rare.

4. If a partitioned table is dropped, but was listed in a snapshot's
partdesc cache, and then parallel query starts, the worker will try to
restore the partdesc for that table, but there are no catalog rows for
it.  The implementation choice here is to ignore the table and move on.
I would like to just remove the partdesc from the snapshot, but that
would require a relcache inval callback, and a) it'd kill us to scan all
snapshots for every relation drop; b) it doesn't work anyway because we
don't have any way to distinguish invals arriving because of DROP from
invals arriving because of anything else, say ANALYZE.

5. snapshots are copied a lot.  Copies share the same hash table as the
"original", because surely all copies should see the same partition
descriptor.  This leads to the pinning/unpinning business you see for
the structs in snapmgr.c.

Some known defects:

6. this still leaks memory.  Not as terribly as my earlier prototypes,
but clearly it's something that I need to address.

7. I've considered the idea of tracking snapshot-partdescs in resowner.c
to prevent future memory leak mistakes.  Not done yet.  Closely related
to item 6.

8. Header changes may need some cleanup yet -- eg. I'm not sure
snapmgr.h compiles standalone.

9. David Rowley recently pointed out that we can modify
CREATE TABLE ..  PARTITION OF to likewise not obtain AEL anymore.
Apparently it just requires removal of three lines in MergeAttributes.

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

attach-concurrently.patch (67K)

Re: ATTACH/DETACH PARTITION CONCURRENTLY

Robert Haas
On Thu, Oct 25, 2018 at 4:26 PM Alvaro Herrera <[hidden email]> wrote:
> Firstly, I took Robert's advice and removed the CONCURRENTLY keyword
> from the syntax.  We just do it that way always.  When there's a default
> partition, only that partition is locked with an AEL; all the rest is
> locked with ShareUpdateExclusive only.

Check.

> Then there are a few other implementation details worth mentioning:
>
> 3. parallel query: when a worker starts on a snapshot that has a
> partition descriptor cache, we need to transmit those partdescs from
> leader via shmem ... but we cannot send the full struct, so we just send
> the OID list of partitions, then rebuild the descriptor in the worker.
> Side effect: if a partition is detached right between the leader taking
> the partdesc and the worker starting, the partition loses its
> relpartbound column, so it's not possible to reconstruct the partdesc.
> In this case, we raise an error.  Hopefully this should be rare.

I don't think it's a good idea for parallel query to just randomly
fail in cases where a non-parallel query would have worked.  I tried
pretty hard to avoid that while working on the feature, and it would
be a shame to see that work undone.

It strikes me that it would be a good idea to break this work into two
phases.  In phase 1, let's support ATTACH and CREATE TABLE ..
PARTITION OF without requiring AccessExclusiveLock.  In phase 2, think
about concurrency for DETACH (and possibly DROP).

I suspect phase 1 actually isn't that hard.  It seems to me that the
only thing we REALLY need to ensure is that the executor doesn't blow
up if a relcache reload occurs.  There are probably a few different
approaches to that problem, but I think it basically boils down to (1)
making sure that the executor is holding onto pointers to the exact
objects it wants to use and not re-finding them through the relcache
and (2) making sure that the relcache doesn't free and rebuild those
objects but rather holds onto the existing copies.  With this
approach, already-running queries won't take into account the fact
that new partitions have been added, but that seems at least tolerable
and perhaps desirable.

For phase 2, we're not just talking about adding stuff that need not
be used immediately, but about removing stuff which may already be in
use.  Your email doesn't seem to describe what we want the *behavior*
to be in that case.  Leave aside for a moment the issue of not
crashing: what are the desired semantics?  I think it would be pretty
strange if you had a COPY running targeting a partitioned table,
detached a partition, and the COPY continued to route tuples to the
detached partition even though it was now an independent table.  It
also seems pretty strange if the tuples just get thrown away.  If the
COPY isn't trying to send any tuples to the now-detached partition,
then it's fine, but if it is, then I have trouble seeing any behavior
other than an error as sane, unless perhaps a new partition has been
attached or created for that part of the key space.

If you adopt that proposal, then the problem of parallel query
behaving differently from non-parallel query goes away.  You just get
an error in both cases, probably to the effect that there is no
(longer) a partition matching the tuple you are trying to insert (or
update).

If you're not hacking on this patch set too actively right at the
moment, I'd like to spend some time hacking on the CREATE/ATTACH side
of things and see if I can produce something committable for that
portion of the problem.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: ATTACH/DETACH PARTITION CONCURRENTLY

Alvaro Herrera-9
On 2018-Nov-06, Robert Haas wrote:

> If you're not hacking on this patch set too actively right at the
> moment, I'd like to spend some time hacking on the CREATE/ATTACH side
> of things and see if I can produce something committable for that
> portion of the problem.

I'm not -- feel free to hack away.

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: ATTACH/DETACH PARTITION CONCURRENTLY

Simon Riggs
In reply to this post by Robert Haas
On Tue, 6 Nov 2018 at 10:10, Robert Haas <[hidden email]> wrote:

> With this approach, already-running queries won't take into account
> the fact that new partitions have been added, but that seems at least
> tolerable and perhaps desirable.

Desirable, imho. No data added after a query starts would be visible.

> If the COPY isn't trying to send any tuples to the now-detached
> partition, then it's fine, but if it is, then I have trouble seeing
> any behavior other than an error as sane, unless perhaps a new
> partition has been attached or created for that part of the key
> space.

Error in the COPY or in the DDL? COPY preferred. Somebody with insert
rights shouldn't be able to prevent a table-owner level action. People
normally drop partitions to save space, so it could be annoying if
that was interrupted.

Supporting parallel query shouldn't make other cases more difficult
from a behavioral perspective just to avoid the ERROR. The ERROR
sounds annoying, but not sure how annoying avoiding it would be.

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: ATTACH/DETACH PARTITION CONCURRENTLY

Robert Haas
On Tue, Nov 6, 2018 at 1:54 PM Simon Riggs <[hidden email]> wrote:
> Error in the COPY or in the DDL? COPY preferred. Somebody with insert
> rights shouldn't be able to prevent a table-owner level action. People
> normally drop partitions to save space, so it could be annoying if
> that was interrupted.

Yeah, the COPY.

> Supporting parallel query shouldn't make other cases more difficult
> from a behavioral perspective just to avoid the ERROR. The ERROR
> sounds annoying, but not sure how annoying avoiding it would be.

In my view, it's not just a question of it being annoying, but of
whether anything else is even sensible.  I mean, you can avoid an
error when a user types SELECT 1/0 by returning NULL or 42, but that's
not usually how we roll around here.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
