ATTACH/DETACH PARTITION CONCURRENTLY


Re: ATTACH/DETACH PARTITION CONCURRENTLY

Robert Haas
On Mon, Dec 17, 2018 at 6:44 PM Michael Paquier <[hidden email]> wrote:
> Agreed.  This patch has value, and somebody else could always take it
> from the point where you were.

OK.  I'll post what I have by the end of the week.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: ATTACH/DETACH PARTITION CONCURRENTLY

Michael Paquier
On Tue, Dec 18, 2018 at 01:41:06PM -0500, Robert Haas wrote:
> OK.  I'll post what I have by the end of the week.

Thanks, Robert.
--
Michael


Re: ATTACH/DETACH PARTITION CONCURRENTLY

Robert Haas
On Tue, Dec 18, 2018 at 8:04 PM Michael Paquier <[hidden email]> wrote:
> On Tue, Dec 18, 2018 at 01:41:06PM -0500, Robert Haas wrote:
> > OK.  I'll post what I have by the end of the week.
>
> Thanks, Robert.

OK, so I got slightly delayed here by utterly destroying my laptop,
but I've mostly reconstructed what I did.  I think there are some
remaining problems, but this seems like a good time to share what I've
got so far.  I'm attaching three patches.

0001 is one which I posted before.  It attempts to fix up
RelationBuildPartitionDesc() so that this function will always return
a partition descriptor based on a consistent snapshot of the catalogs.
Without this, I think there's nothing to prevent a commit which
happens while the function is running from causing the function to
fail or produce nonsense answers.

0002 introduces the concept of a partition directory.  The idea is
that the planner will create a partition directory, and so will the
executor, and all calls which occur in those places to
RelationGetPartitionDesc() will instead call
PartitionDirectoryLookup().  Every lookup for the same relation in the
same partition directory is guaranteed to produce the same answer.  I
believe this patch still has a number of weaknesses.  More on that
below.

0003 actually lowers the lock level.  The comment here might need some
more work.
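
To make the guarantee in 0002 concrete, here's a minimal sketch (the table
names are made up) of the kind of situation it is meant to cover: a single
query that causes more than one PartitionDesc lookup for the same relation,
where we want both lookups to agree even if an ATTACH commits partway
through planning:

create table meas (ts date, v int) partition by range (ts);
create table meas_2018 partition of meas
  for values from ('2018-01-01') to ('2019-01-01');
create table meas_2019 partition of meas
  for values from ('2019-01-01') to ('2020-01-01');

-- Both references to meas in this query should be expanded from the same
-- PartitionDesc, which is what the partition directory guarantees.
select count(*)
from meas a join meas b using (ts)
where a.ts >= date '2019-06-01';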

Here is a list of possible or definite problems that are known to me:

- I think we need a way to make partition directory lookups consistent
across backends in the case of parallel query.  I believe this can be
done with a dshash and some serialization and deserialization logic,
but I haven't attempted that yet.

- I refactored expand_inherited_rtentry() to drive partition expansion
entirely off of PartitionDescs. The reason why this is necessary is
that it clearly will not work to have find_all_inheritors() use a
current snapshot to decide what children we have and lock them, and
then consult a different source of truth to decide which relations to
open with NoLock.  There's nothing to keep the lists of partitions
from being different in the two cases, and that demonstrably causes
assertion failures if you SELECT with an ATTACH/DETACH loop running in
the background. However, it also changes the order in which tables get
locked.  Possibly that could be fixed by teaching
expand_partitioned_rtentry() to qsort() the OIDs the way
find_inheritance_children() does.  It also loses the infinite-loop
protection which find_all_inheritors() has.  Not sure what to do about
that.

- In order for the new PartitionDirectory machinery to avoid
use-after-free bugs, we have to either copy the PartitionDesc out of
the relcache into the partition directory or avoid freeing it while it
is still in use.  Copying it seems unappealing for performance
reasons, so I took the latter approach.  However, what I did here in
terms of reclaiming memory is just about the least aggressive strategy
short of leaking it altogether - it just keeps it around until the
next rebuild that occurs while the relcache entry is not in use.  We
might want to do better, e.g. freeing old copies immediately as soon
as the relcache reference count drops to 0. I just did it this way
because it was simple to code.

- I tried this with Alvaro's isolation tests and it fails some tests.
That's because Alvaro's tests expect that the list of accessible
partitions is based on the query snapshot.  For reasons I attempted to
explain in the comments in 0003, I think the idea that we can choose
the set of accessible partitions based on the query snapshot is very
wrong.  For example, suppose transaction 1 begins, reads an unrelated
table to establish a snapshot, and then goes idle.  Then transaction 2
comes along, detaches a partition from an important table, and then
does crazy stuff with that table -- changes the column list, drops it,
reattaches it with different bounds, whatever.  Then it commits.  If
transaction 1 now comes along and uses the query snapshot to decide
that the detached partition ought to still be seen as a partition of
that partitioned table, disaster will ensue.  (A concrete sketch of this
scenario follows this list.)

- I don't have any tests here, but I think it would be good to add
some, perhaps modeled on Alvaro's, and also some that involve multiple
ATTACH and DETACH operations mixed with other interesting kinds of
DDL.  I also didn't make any attempt to update the documentation,
which is another thing that will probably have to be done at some
point. Not sure how much documentation we have of any of this now.

- I am uncertain whether the logic that builds up the partition
constraint is really safe in the face of concurrent DDL.  I kinda
suspect there are some problems there, but maybe not.  Once you hold
ANY lock on a partition, the partition constraint can't concurrently
become stricter, because no ATTACH can be done without
AccessExclusiveLock on the partition being attached; but it could
concurrently become less strict, because you only need a lesser lock
for a detach.  Not sure if/how that could foul up this logic.

- I have not done anything about the fact that index detach takes
AccessExclusiveLock.
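
Going back to the query-snapshot point above, here's a minimal two-session
sketch of that scenario (the names are made up, and REPEATABLE READ is just
a convenient way to hold an old snapshot):

-- Session 1: establish a snapshot and go idle.
begin isolation level repeatable read;
select count(*) from some_unrelated_table;

-- Session 2: detach the partition and re-attach it with different bounds
-- (each statement commits on its own).
alter table important detach partition important_p1;
alter table important attach partition important_p1
  for values from (5000) to (6000);

-- Session 1: if the set of partitions were chosen based on the query
-- snapshot, important_p1 would still be treated as having its old bounds,
-- which no longer matches the committed catalog state.
select * from important where k = 10;
commit;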

Thoughts?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachments:
  0001-Ensure-that-RelationBuildPartitionDesc-sees-a-consis.patch (8K)
  0002-Introduce-the-concept-of-a-partition-directory.patch (31K)
  0003-Lower-the-lock-level-for-ALTER-TABLE-.-ATTACH-DETACH.patch (2K)

Re: ATTACH/DETACH PARTITION CONCURRENTLY

Alvaro Herrera
Thanks for this work!  I like the name "partition directory".

On 2018-Dec-20, Robert Haas wrote:

> 0002 introduces the concept of a partition directory.  The idea is
> that the planner will create a partition directory, and so will the
> executor, and all calls which occur in those places to
> RelationGetPartitionDesc() will instead call
> PartitionDirectoryLookup().  Every lookup for the same relation in the
> same partition directory is guaranteed to produce the same answer.  I
> believe this patch still has a number of weaknesses.  More on that
> below.

The commit message for this one also points out another potential
problem,

> Introduce the concept of a partition directory.
>
> Teach the optimizer and executor to use it, so that a single planning
> cycle or query execution gets the same PartitionDesc for the same table
> every time it looks it up.  This does not prevent changes between
> planning and execution, nor does it guarantee that all tables are
> expanded according to the same snapshot.

Namely: how does this handle the case of partition pruning structure
being passed from planner to executor, if an attach happens in the
middle of it and puts a partition in between existing partitions?  Array
indexes of any partitions that appear later in the partition descriptor
will change.

This is the reason I used the query snapshot rather than EState.
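
A concrete example of the kind of attach I mean (names invented): given

create table t (a int) partition by range (a);
create table t1 partition of t for values from (0) to (100);
create table t3 partition of t for values from (200) to (300);

the PartitionDesc lists the partitions in bound order, i.e. (t1, t3).  If
someone now runs

create table t2 (a int);
alter table t attach partition t2 for values from (100) to (200);

the new partition sorts between the existing ones, so t3 shifts from index
1 to index 2 in the PartitionDesc, and anything built against the old
indexes (such as the pruning structures passed from planner to executor)
no longer lines up.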

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: ATTACH/DETACH PARTITION CONCURRENTLY

Robert Haas
On Thu, Dec 20, 2018 at 3:58 PM Alvaro Herrera <[hidden email]> wrote:

> > Introduce the concept of a partition directory.
> >
> > Teach the optimizer and executor to use it, so that a single planning
> > cycle or query execution gets the same PartitionDesc for the same table
> > every time it looks it up.  This does not prevent changes between
> > planning and execution, nor does it guarantee that all tables are
> > expanded according to the same snapshot.
>
> Namely: how does this handle the case of partition pruning structure
> being passed from planner to executor, if an attach happens in the
> middle of it and puts a partition in between existing partitions?  Array
> indexes of any partitions that appear later in the partition descriptor
> will change.
>
> This is the reason I used the query snapshot rather than EState.

I didn't handle that.  If partition pruning relies on nothing changing
between planning and execution, isn't that broken regardless of any of
this?  It's true that with the simple query protocol we'll hold locks
continuously from planning into execution, and therefore with the
current locking regime we couldn't really have a problem.  But unless
I'm confused, with the extended query protocol it's quite possible to
generate a plan, release locks, and then reacquire locks at execution
time.  Unless we have some guarantee that a new plan will always be
generated if any DDL has happened in the middle, I think we've got
trouble, and I don't think that is guaranteed in all cases.

Maybe I'm all wet, though.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: ATTACH/DETACH PARTITION CONCURRENTLY

Alvaro Herrera
On 2018-Dec-20, Robert Haas wrote:

> I didn't handle that.  If partition pruning relies on nothing changing
> between planning and execution, isn't that broken regardless of any of
> this?  It's true that with the simple query protocol we'll hold locks
> continuously from planning into execution, and therefore with the
> current locking regime we couldn't really have a problem.  But unless
> I'm confused, with the extended query protocol it's quite possible to
> generate a plan, release locks, and then reacquire locks at execution
> time.  Unless we have some guarantee that a new plan will always be
> generated if any DDL has happened in the middle, I think we've got
> trouble, and I don't think that is guaranteed in all cases.

Oh, so maybe this case is already handled by plan invalidation -- I
mean, if we run DDL, the stored plan is thrown away and a new one
recomputed.  IOW this was already a solved problem and I didn't need to
spend effort on it. /me slaps own forehead

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: ATTACH/DETACH PARTITION CONCURRENTLY

Robert Haas
On Thu, Dec 20, 2018 at 4:11 PM Alvaro Herrera <[hidden email]> wrote:
> Oh, so maybe this case is already handled by plan invalidation -- I
> mean, if we run DDL, the stored plan is thrown away and a new one
> recomputed.  IOW this was already a solved problem and I didn't need to
> spend effort on it. /me slaps own forehead

I'm kinda saying the opposite - I'm not sure that it's safe even with
the higher lock levels.  If the plan is relying on the same partition
descriptor being in effect at plan time as at execution time, that
sounds kinda dangerous to me.

Lowering the lock level might also make something that was previously
safe into something unsafe, because now there's no longer a guarantee
that invalidation messages are received soon enough. With
AccessExclusiveLock, we'll send invalidation messages before releasing
the lock, and other processes will acquire the lock and then
AcceptInvalidationMessages().  But with ShareUpdateExclusiveLock the
locks can coexist, so now there might be trouble.  I think this is an
area where we need to do some more investigation.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: ATTACH/DETACH PARTITION CONCURRENTLY

Robert Haas
On Thu, Dec 20, 2018 at 4:38 PM Robert Haas <[hidden email]> wrote:
> Lowering the lock level might also make something that was previously
> safe into something unsafe, because now there's no longer a guarantee
> that invalidation messages are received soon enough. With
> AccessExclusiveLock, we'll send invalidation messages before releasing
> the lock, and other processes will acquire the lock and then
> AcceptInvalidationMessages().  But with ShareUpdateExclusiveLock the
> locks can coexist, so now there might be trouble.  I think this is an
> area where we need to do some more investigation.

So there are definitely problems here.  With my patch:

create table tab (a int, b text) partition by range (a);
create table tab1 partition of tab for values from (0) to (10);
prepare t as select * from tab;
begin;
explain execute t; -- seq scan on tab1
execute t; -- no rows

Then, in another session:

alter table tab detach partition tab1;
insert into tab1 values (300, 'oops');

Back to the first session:

execute t; -- shows (300, 'oops')
explain execute t; -- still planning to scan tab1
commit;
explain execute t; -- now it got the memo, and plans to scan nothing
execute t; -- no rows

Well, that's not good.  We're showing a value that was never within
the partition bounds of any partition of tab.  The problem is that,
since we already have locks on all relevant objects, nothing triggers
the second 'explain execute' to process invalidation messages, so we
don't update the plan.  Generally, any DDL with less than
AccessExclusiveLock has this issue.  On another thread, I was
discussing with Tom and Peter the possibility of trying to rejigger
things so that we always AcceptInvalidationMessages() at least once
per command, but I think that just turns this into a race: if a
concurrent commit happens after 'explain execute t' decides not to
re-plan but before it begins executing, we have the same problem.

This example doesn't involve partition pruning, and in general I don't
think that the problem is confined to partition pruning.  It's rather
that if there's no conflict between the lock that is needed to change
the set of partitions and the lock that is needed to run a query, then
there's no way to guarantee that the query runs with the same set of
partitions for which it was planned. Unless I'm mistaken, which I
might be, this is also a problem with your approach -- if you repeat
the same prepared query in the same transaction, the transaction
snapshot will be updated, and thus the PartitionDesc will be expanded
differently at execution time, but the plan will not have changed,
because invalidation messages have not been processed.

Anyway, I think the only fix here is likely to be making the executor
resilient against concurrent changes in the PartitionDesc. I don't
think there's going to be any easy  way to compensate for added
partitions without re-planning, but maybe we could find a way to flag
detached partitions so that they return no rows without actually
touching the underlying relation.

Thoughts?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: ATTACH/DETACH PARTITION CONCURRENTLY

David Rowley
In reply to this post by Robert Haas
On Fri, 21 Dec 2018 at 09:43, Robert Haas <[hidden email]> wrote:

> - I refactored expand_inherited_rtentry() to drive partition expansion
> entirely off of PartitionDescs. The reason why this is necessary is
> that it clearly will not work to have find_all_inheritors() use a
> current snapshot to decide what children we have and lock them, and
> then consult a different source of truth to decide which relations to
> open with NoLock.  There's nothing to keep the lists of partitions
> from being different in the two cases, and that demonstrably causes
> assertion failures if you SELECT with an ATTACH/DETACH loop running in
> the background. However, it also changes the order in which tables get
> locked.  Possibly that could be fixed by teaching
> expand_partitioned_rtentry() to qsort() the OIDs the way
> find_inheritance_children() does.  It also loses the infinite-loop
> protection which find_all_inheritors() has.  Not sure what to do about
> that.

I don't think you need to qsort() the Oids before locking. What the
qsort() does today is ensure we get a consistent locking order. Any
other order would surely do, providing we stick to it consistently. I
think PartitionDesc order is fine, as it's consistent.  Having it
locked in PartitionDesc order I think is what's needed for [1] anyway.
[2] proposes to relax the locking order taken during execution.

[1] https://commitfest.postgresql.org/21/1778/
[2] https://commitfest.postgresql.org/21/1887/

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: ATTACH/DETACH PARTITION CONCURRENTLY

David Rowley
In reply to this post by Robert Haas
On Fri, 21 Dec 2018 at 10:05, Robert Haas <[hidden email]> wrote:

> On Thu, Dec 20, 2018 at 3:58 PM Alvaro Herrera <[hidden email]> wrote:
> > Namely: how does this handle the case of partition pruning structure
> > being passed from planner to executor, if an attach happens in the
> > middle of it and puts a partition in between existing partitions?  Array
> > indexes of any partitions that appear later in the partition descriptor
> > will change.
> >
> > This is the reason I used the query snapshot rather than EState.
>
> I didn't handle that.  If partition pruning relies on nothing changing
> between planning and execution, isn't that broken regardless of any of
> this?  It's true that with the simple query protocol we'll hold locks
> continuously from planning into execution, and therefore with the
> current locking regime we couldn't really have a problem.  But unless
> I'm confused, with the extended query protocol it's quite possible to
> generate a plan, release locks, and then reacquire locks at execution
> time.  Unless we have some guarantee that a new plan will always be
> generated if any DDL has happened in the middle, I think we've got
> trouble, and I don't think that is guaranteed in all cases.

Today the plan would be invalidated if a partition was ATTACHED or
DETACHED. The newly built plan would get the updated list of
partitions.


--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: ATTACH/DETACH PARTITION CONCURRENTLY

Robert Haas
In reply to this post by David Rowley
On Fri, Dec 21, 2018 at 6:04 PM David Rowley
<[hidden email]> wrote:
> I don't think you need to qsort() the Oids before locking. What the
> qsort() does today is ensure we get a consistent locking order. Any
> other order would surely do, providing we stick to it consistently. I
> think PartitionDesc order is fine, as it's consistent.  Having it
> locked in PartitionDesc order I think is what's needed for [1] anyway.
> [2] proposes to relax the locking order taken during execution.

If queries take locks in one order and DDL takes them in some other
order, queries and DDL starting around the same time could deadlock.
Unless we convert the whole system to lock everything in PartitionDesc
order the issue doesn't go away completely. But maybe we just have to
live with that. Surely we're not going to pay the cost of locking
partitions that we don't otherwise need to avoid a deadlock-vs-DDL
risk, and once we've decided to assume that risk, I'm not sure a
qsort() here helps anything much.
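
To spell out the generic hazard, here's a deliberately artificial sketch
using explicit LOCK TABLE (real queries and DDL would take these locks
implicitly; p1 and p2 stand in for two partitions of the same table):

-- Session 1 (think: a query locking partitions in PartitionDesc order)
begin;
lock table p1 in access share mode;

-- Session 2 (think: DDL locking the same partitions in a different order)
begin;
lock table p2 in access exclusive mode;
lock table p1 in access exclusive mode;  -- blocks behind session 1

-- Session 1 again
lock table p2 in access share mode;      -- blocks behind session 2

At this point the deadlock detector will eventually abort one of the two
sessions with a deadlock error.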

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: ATTACH/DETACH PARTITION CONCURRENTLY

Robert Haas
In reply to this post by David Rowley
On Fri, Dec 21, 2018 at 6:06 PM David Rowley
<[hidden email]> wrote:

> > I didn't handle that.  If partition pruning relies on nothing changing
> > between planning and execution, isn't that broken regardless of any of
> > this?  It's true that with the simple query protocol we'll hold locks
> > continuously from planning into execution, and therefore with the
> > current locking regime we couldn't really have a problem.  But unless
> > I'm confused, with the extended query protocol it's quite possible to
> > generate a plan, release locks, and then reacquire locks at execution
> > time.  Unless we have some guarantee that a new plan will always be
> > generated if any DDL has happened in the middle, I think we've got
> > trouble, and I don't think that is guaranteed in all cases.
>
> Today the plan would be invalidated if a partition was ATTACHED or
> DETACHED. The newly built plan would get the updated list of
> partitions.

I think you're right, and that won't be true any more once we lower
the lock level, so it has to be handled somehow. The entire plan
invalidation mechanism seems to depend fundamentally on
AccessExclusiveLock being used everywhere, so this is likely to be an
ongoing issue every time we want to reduce a lock level anywhere.  I
wonder if there is any kind of systematic fix or if we are just going
to have to keep inventing ad-hoc solutions.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: ATTACH/DETACH PARTITION CONCURRENTLY

David Rowley
In reply to this post by Robert Haas
On Tue, 25 Dec 2018 at 08:15, Robert Haas <[hidden email]> wrote:

>
> On Fri, Dec 21, 2018 at 6:04 PM David Rowley
> <[hidden email]> wrote:
> > I don't think you need to qsort() the Oids before locking. What the
> > qsort() does today is ensure we get a consistent locking order. Any
> > other order would surely do, providing we stick to it consistently. I
> > think PartitionDesc order is fine, as it's consistent.  Having it
> > locked in PartitionDesc order I think is what's needed for [1] anyway.
> > [2] proposes to relax the locking order taken during execution.
>
> If queries take locks in one order and DDL takes them in some other
> order, queries and DDL starting around the same time could deadlock.
> Unless we convert the whole system to lock everything in PartitionDesc
> order the issue doesn't go away completely. But maybe we just have to
> live with that. Surely we're not going to pay the cost of locking
> partitions that we don't otherwise need to avoid a deadlock-vs-DDL
> risk, and once we've decided to assume that risk, I'm not sure a
> qsort() here helps anything much.

When I said "consistent" I meant consistent over all places where we
obtain locks on all partitions. My original v1-0002 patch attempted
something like this.

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: ATTACH/DETACH PARTITION CONCURRENTLY

Robert Haas
In reply to this post by Alvaro Herrera
On Thu, Dec 20, 2018 at 3:58 PM Alvaro Herrera <[hidden email]> wrote:
> Namely: how does this handle the case of partition pruning structure
> being passed from planner to executor, if an attach happens in the
> middle of it and puts a partition in between existing partitions?  Array
> indexes of any partitions that appear later in the partition descriptor
> will change.

I've finally gotten a little more time to work on this.  It took me a
while to understand that a PartitionedRelPruneInfo assumes that the
indexes of partitions in the PartitionDesc don't change between
planning and execution, because subplan_map[] and subpart_map[] are
indexed by PartitionDesc offset.  I suppose the reason for this is so
that we don't have to go to the expense of copying the partition
bounds from the PartitionDesc into the final plan, but it somehow
seems a little scary to me.  Perhaps I am too easily frightened, but
it's certainly a problem from the point of view of this project, which
wants to let the PartitionDesc change concurrently.

I wrote a little patch that stores the relation OIDs of the partitions
into the PartitionedRelPruneInfo and then, at execution time, does an
Assert() that what it gets matches what existed at plan time.  I
figured that a good start would be to find a test case where this
fails with concurrent DDL allowed, but I haven't so far succeeded in
devising one.  To make the Assert() fail, I need to come up with a
case where concurrent DDL has caused the PartitionDesc to be rebuilt
but without causing an update to the plan.  If I use prepared queries
inside of a transaction block, I can continue to run old plans after
concurrent DDL has changed things, but I can't actually make the
Assert() fail, because the queries continue to use the old plans
precisely because they haven't processed invalidation messages, and
therefore they also have the old PartitionDesc and everything works.
Maybe if I try it with CLOBBER_CACHE_ALWAYS...

I also had the idea of trying to use a cursor, because if I could
start execution of a query, then force a relcache rebuild, then
continue executing the query, maybe something would blow up somehow.
But that's not so easy, because I don't think we have any way in SQL
to declare a cursor for a prepared query, so I'd need to get a
query plan that involves run-time pruning without using parameters,
which I'm pretty sure is possible but which I haven't figured out yet.
And even there the PartitionDirectory concept might preserve us from
any damage if the change happens after the executor is initialized,
though I'm not sure if there are any cases where we don't do the first
PartitionDesc lookup for a particular table until mid-execution.
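
Coming back to the cursor idea, one possibility I haven't verified: a
comparison against an uncorrelated sub-select only gets resolved at
execution time, so something like the following might produce a plan that
uses run-time pruning without any parameters and can be wrapped in a
cursor (names made up):

begin;
declare c cursor for
select * from foo where a = (select max(a) from foo_defaults);
fetch 1 from c;
-- ... provoke a relcache rebuild of foo from another session here ...
fetch all from c;
commit;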

Anyway, I think this idea of passing a list of relation OIDs that we
saw at planning time through to the executor and cross-checking them
might have some legs.  If we only allowed concurrently *adding*
partitions and not concurrently *removing* them, then even if we find
the case(s) where the PartitionDesc can change under us, we can
probably just adjust subplan_map and subpart_map to compensate, since
we can iterate through the old and new arrays of relation OIDs and
just figure out which things have shifted to higher indexes in the
PartitionDesc.  This is all kind of hand-waving at the moment; tips
appreciated.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: ATTACH/DETACH PARTITION CONCURRENTLY

Alvaro Herrera
On 2019-Jan-25, Robert Haas wrote:

> I've finally gotten a little more time to work on this.  It took me a
> while to understand that a PartitionedRelPruneInfo assumes that the
> indexes of partitions in the PartitionDesc don't change between
> planning and execution, because subplan_map[] and subpart_map[] are
> indexed by PartitionDesc offset.

Right, the planner/executor "disconnect" is one of the challenges, and
why I was trying to keep the old copy of the PartitionDesc around
instead of building updated ones as needed.

> I suppose the reason for this is so
> that we don't have to go to the expense of copying the partition
> bounds from the PartitionDesc into the final plan, but it somehow
> seems a little scary to me.  Perhaps I am too easily frightened, but
> it's certainly a problem from the point of view of this project, which
> wants to let the PartitionDesc change concurrently.

Well, my definition of the problem started with the assumption that we
would keep the partition array indexes unchanged, so "change
concurrently" is what we needed to avoid.  Yes, I realize that you've
opted to change that definition.

I may have forgotten some of your earlier emails on this, but one aspect
(possibly a key one) is that I'm not sure we really need to cope, other
than with an ERROR, with queries that continue to run across an
attach/detach -- moreso in absurd scenarios such as the ones you
described where the detached table is later re-attached, possibly to a
different partitioned table.  I mean, if we can just detect the case and
raise an error, and this let us make it all work reasonably, that might
be better.

> I wrote a little patch that stores the relation OIDs of the partitions
> into the PartitionedRelPruneInfo and then, at execution time, does an
> Assert() that what it gets matches what existed at plan time.  I
> figured that a good start would be to find a test case where this
> fails with concurrent DDL allowed, but I haven't so far succeeded in
> devising one.  To make the Assert() fail, I need to come up with a
> case where concurrent DDL has caused the PartitionDesc to be rebuilt
> but without causing an update to the plan.  If I use prepared queries
> inside of a transaction block, [...]

> I also had the idea of trying to use a cursor, because if I could
> start execution of a query, [...]

Those are the ways I thought of, and the reason for the shape of some of
those .spec tests.  I wasn't able to hit the situation.

> Maybe if I try it with CLOBBER_CACHE_ALWAYS...

I didn't try this one.

> Anyway, I think this idea of passing a list of relation OIDs that we
> saw at planning time through to the executor and cross-checking them
> might have some legs.  If we only allowed concurrently *adding*
> partitions and not concurrently *removing* them, then even if we find
> the case(s) where the PartitionDesc can change under us, we can
> probably just adjust subplan_map and subpart_map to compensate, since
> we can iterate through the old and new arrays of relation OIDs and
> just figure out which things have shifted to higher indexes in the
> PartitionDesc.  This is all kind of hand-waving at the moment; tips
> appreciated.

I think detaching partitions concurrently is a necessary part of this
feature, so I would prefer not to go with a solution that works for
attaching partitions but not for detaching them.  That said, I don't see
why it's impossible to adjust the partition maps in both cases.  But I
don't have anything better than hand-waving ATM.

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: ATTACH/DETACH PARTITION CONCURRENTLY

Robert Haas
On Fri, Jan 25, 2019 at 4:18 PM Alvaro Herrera <[hidden email]> wrote:
> Right, the planner/executor "disconnect" is one of the challenges, and
> why I was trying to keep the old copy of the PartitionDesc around
> instead of building updated ones as needed.

I agree that would be simpler, but I just don't see how to make it
work.  For one thing, keeping the old copy around can't work in
parallel workers, which never had a copy in the first place.  For two
things, we don't have a really good mechanism to keep the
PartitionDesc that was used at plan time around until execution time.
Keeping the relation open would do it, but I'm pretty sure that causes
other problems; the system doesn't expect any residual references.

I know you had a solution to this problem, but I don't see how it can
work.  You said "Snapshots have their own cache (hash table) of
partition descriptors. If a partdesc is requested and the snapshot has
already obtained that partdesc, the original one is returned -- we
don't request a new one from partcache."  But presumably this means
when the last snapshot is unregistered, the cache is flushed
(otherwise, when?) and if that's so then this relies on the snapshot
that was used for planning still being around at execution time, which
I am pretty sure isn't guaranteed.

Also, and I think this point is worthy of some further discussion, the
thing that really seems broken to me about your design is the idea
that it's OK to use the query or transaction snapshot to decide which
partitions exist.  The problem with that is that some query or
transaction with an old snapshot might see as a partition some table
that has been dropped or radically altered - different column types,
attached to some other table now, attached to same table but with
different bounds, or just dropped.  And therefore it might try to
insert data into that table and fail in all kinds of crazy ways, about
the mildest of which is inserting data that doesn't match the current
partition constraint.  I'm willing to be told that I've misunderstood
the way it all works and this isn't really a problem for some reason,
but my current belief is that not only is it a problem with your
design, but that it's such a bad problem that there's really no way to
fix it and we have to abandon your entire approach and go a different
route.  If you don't think that's true, then perhaps we should discuss
it further.

> > I suppose the reason for this is so
> > that we don't have to go to the expense of copying the partition
> > bounds from the PartitionDesc into the final plan, but it somehow
> > seems a little scary to me.  Perhaps I am too easily frightened, but
> > it's certainly a problem from the point of view of this project, which
> > wants to let the PartitionDesc change concurrently.
>
> Well, my definition of the problem started with the assumption that we
> would keep the partition array indexes unchanged, so "change
> concurrently" is what we needed to avoid.  Yes, I realize that you've
> opted to change that definition.

I don't think I made a conscious decision to change this, and I'm kind
of wondering whether I have missed some better approach here.  I feel
like the direction I'm pursuing is an inevitable consequence of having
no good way to keep the PartitionDesc around from plan-time to
execution-time, which in turn feels like an inevitable consequence of
the two points I made above: there's no guarantee that the plan-time
snapshot is still registered anywhere by the time we get to execution
time, and even if there were, the associated PartitionDesc may point
to tables that have been drastically modified or don't exist any more.
But it's possible that my chain-of-inevitable-consequences has a weak
link, in which case I would surely like it if you (or someone else)
would point it out to me.

> I may have forgotten some of your earlier emails on this, but one aspect
> (possibly a key one) is that I'm not sure we really need to cope, other
> than with an ERROR, with queries that continue to run across an
> attach/detach -- moreso in absurd scenarios such as the ones you
> described where the detached table is later re-attached, possibly to a
> different partitioned table.  I mean, if we can just detect the case and
> raise an error, and this let us make it all work reasonably, that might
> be better.

Well, that's an interesting idea.  I assumed that users would hate
that kind of behavior with a fiery passion that could never be
quenched.  If not, then the problem changes from *coping* with
concurrent changes to *detecting* concurrent changes, which may be
easier, but see below.

> I think detaching partitions concurrently is a necessary part of this
> feature, so I would prefer not to go with a solution that works for
> attaching partitions but not for detaching them.  That said, I don't see
> why it's impossible to adjust the partition maps in both cases.  But I
> don't have anything better than hand-waving ATM.

The general problem here goes back to what I wrote in the third
paragraph of this email: a PartitionDesc that was built with a
particular snapshot can't be assumed to be usable after any subsequent
DDL has occurred that might affect the shape of the PartitionDesc.
For example, if somebody detaches a partition and reattaches it with
different partition bounds, and we use the old PartitionDesc for tuple
routing, we'll route tuples into that partition that do not satisfy
its partition constraint.  And we won't even get an ERROR, because the
system assumes that any tuple which arrives at a partition as a result
of tuple routing must necessarily satisfy the partition constraint.

If only concurrent CREATE/ATTACH operations are allowed and
DROP/DETACH is not, then that kind of thing isn't possible.  Any new
partitions which have shown up since the plan was created can just be
ignored, and the old ones must still have the same partition bounds
that they did before, and everything is fine.  Or, if we're OK with a
less-nice solution, we could just ERROR out when the number of
partitions has changed.  Some people will get errors they don't like,
but they won't end up with rows in their partitions that violate the
constraints.

But as soon as you allow concurrent DETACH, then things get really
crazy.  Even if, at execution time, there are the same number of
partitions as I had at plan time, and even if those partitions have
the same OIDs as what I had at plan time, and even if those OIDs are
in the same order in the PartitionDesc, it does not prove that things
are OK.  The partition could have been detached and reattached with a
narrower set of partition bounds.  And if so, then we might route a
tuple to it that doesn't fit that narrower set of bounds, and there
will be no error, just database corruption.

I suppose one idea for handling this is to stick a counter into
pg_class or something that gets incremented every time a partition is
detached.  At plan time, save the counter value; if it has changed at
execution time, ERROR.  If not, then you have only added partitions to
worry about, and that's an easier problem.

But I kind of wonder whether we're really gaining as much as you think
by trying to support concurrent DETACH in the first place.  If there
are queries running against the table, there's probably at least
AccessShareLock on the partition itself, not just the parent.  And
that means that reducing the necessary lock on the parent from
AccessExclusiveLock to something less doesn't really help that much,
because we're still going to block trying to acquire
AccessExclusiveLock on the child.  Now you might say: OK, well, just
reduce that lock level, too.

But that seems to me to be opening a giant can of worms which we are
unlikely to get resolved in time for this release.  The worst of those
problems is that if you also reduce the lock level on the partition
when attaching it, then you are adding a constraint while somebody
might at the exact same time be inserting a tuple that violates that
constraint.  Those two things have to be synchronized somehow.  You
could avoid that by reducing the lock level on the partition when
detaching and not when attaching.  But even then, detaching a
partition can involve performing a whole bunch of operations for which
we currently require AccessExclusiveLock. AlterTableGetLockLevel says:

                /*
                 * Removing constraints can affect SELECTs that have been
                 * optimised assuming the constraint holds true.
                 */
            case AT_DropConstraint: /* as DROP INDEX */
            case AT_DropNotNull:    /* may change some SQL plans */
                cmd_lockmode = AccessExclusiveLock;
                break;

Dropping a partition implicitly involves dropping a constraint.  We
could gamble that the above has no consequences that are really
serious enough to care about, and that none of the other subsidiary
objects that we adjust during detach (indexes, foreign keys, triggers)
really need AccessExclusiveLock now, and that none of the other kinds
of subsidiary objects that we might need to adjust in the future
during a detach will be changes that require AccessExclusiveLock
either, but that sounds awfully risky to me.  We have very little DDL
that runs with less than AccessExclusiveLock, and I've already found
lots of subtle problems that have to be patched up just for the
particular case of allowing attach/detach to take a lesser lock on the
parent table, and I bet that there are a whole bunch more similar
problems when you start talking about weakening the lock on the child
table, and I'm not convinced that there are any reasonable solutions
to some of those problems, let alone that we can come up with good
solutions to all of them in the very near future.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: ATTACH/DETACH PARTITION CONCURRENTLY

Simon Riggs
On Mon, 28 Jan 2019 at 20:15, Robert Haas <[hidden email]> wrote:
> [...]
>
> I suppose one idea for handling this is to stick a counter into
> pg_class or something that gets incremented every time a partition is
> detached.  At plan time, save the counter value; if it has changed at
> execution time, ERROR.  If not, then you have only added partitions to
> worry about, and that's an easier problem.

Yes, a version number would solve that issue.
 
> But I kind of wonder whether we're really gaining as much as you think
> by trying to support concurrent DETACH in the first place.  If there
> are queries running against the table, there's probably at least
> AccessShareLock on the partition itself, not just the parent.  And
> that means that reducing the necessary lock on the parent from
> AccessExclusiveLock to something less doesn't really help that much,
> because we're still going to block trying to acquire
> AccessExclusiveLock on the child.  Now you might say: OK, well, just
> reduce that lock level, too.
 
The whole point of CONCURRENT detach is that you're not removing it whilst people are still using it, you're just marking it for later disuse.

> But that seems to me to be opening a giant can of worms which we are
> unlikely to get resolved in time for this release.  The worst of those
> problems is that if you also reduce the lock level on the partition
> when attaching it, then you are adding a constraint while somebody
> might at the exact same time be inserting a tuple that violates that
> constraint.

Spurious.

This would only be true if we were adding a constraint that affected existing partitions.

The constraint being added affects the newly added partition, not existing ones.
 
> Those two things have to be synchronized somehow.  You
> could avoid that by reducing the lock level on the partition when
> detaching and not when attaching.  But even then, detaching a
> partition can involve performing a whole bunch of operations for which
> we currently require AccessExclusiveLock. AlterTableGetLockLevel says:
>
>                 /*
>                  * Removing constraints can affect SELECTs that have been
>                  * optimised assuming the constraint holds true.
>                  */
>             case AT_DropConstraint: /* as DROP INDEX */
>             case AT_DropNotNull:    /* may change some SQL plans */
>                 cmd_lockmode = AccessExclusiveLock;
>                 break;
>
> Dropping a partition implicitly involves dropping a constraint.  We
> could gamble that the above has no consequences that are really

It's not a gamble if you know that the constraints being dropped constrain only the object being dropped.
 
> serious enough to care about, and that none of the other subsidiary
> objects that we adjust during detach (indexes, foreign keys, triggers)
> really need AccessExclusiveLock now, and that none of the other kinds
> of subsidiary objects that we might need to adjust in the future
> during a detach will be changes that require AccessExclusiveLock
> either, but that sounds awfully risky to me.  We have very little DDL
> that runs with less than AccessExclusiveLock, and I've already found
> lots of subtle problems that have to be patched up just for the
> particular case of allowing attach/detach to take a lesser lock on the
> parent table, and I bet that there are a whole bunch more similar
> problems when you start talking about weakening the lock on the child
> table, and I'm not convinced that there are any reasonable solutions
> to some of those problems, let alone that we can come up with good
> solutions to all of them in the very near future.

 I've not read every argument on this thread, but many of the later points made here are spurious, by which I mean they sound like they could apply but in fact do not.

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: ATTACH/DETACH PARTITION CONCURRENTLY

Robert Haas
On Tue, Jan 29, 2019 at 12:29 AM Simon Riggs <[hidden email]> wrote:

>> But I kind of wonder whether we're really gaining as much as you think
>> by trying to support concurrent DETACH in the first place.  If there
>> are queries running against the table, there's probably at least
>> AccessShareLock on the partition itself, not just the parent.  And
>> that means that reducing the necessary lock on the parent from
>> AccessExclusiveLock to something less doesn't really help that much,
>> because we're still going to block trying to acquire
>> AccessExclusiveLock on the child.  Now you might say: OK, well, just
>> reduce that lock level, too.
>
> The whole point of CONCURRENT detach is that you're not removing it whilst people are still using it, you're just marking it for later disuse.

Well, I don't think that's the way any patch so far proposed actually works.

>> But that seems to me to be opening a giant can of worms which we are
>> unlikely to get resolved in time for this release.  The worst of those
>> problems is that if you also reduce the lock level on the partition
>> when attaching it, then you are adding a constraint while somebody
>> might at the exact same time be inserting a tuple that violates that
>> constraint.
>
> Spurious.
>
> This would only be true if we were adding a constraint that affected existing partitions.
>
> The constraint being added affects the newly added partition, not existing ones.

I agree that it affects the newly added partition, not existing ones.
But if you don't hold an AccessExclusiveLock on that partition while
you are adding that constraint to it, then somebody could be
concurrently inserting a tuple that violates that constraint.  This
would be an INSERT targeting the partition directly, not somebody
operating on the partitioning hierarchy to which it is being attached.

>> Those two things have to be synchronized somehow.  You
>> could avoid that by reducing the lock level on the partition when
>> detaching and not when attaching.  But even then, detaching a
>> partition can involve performing a whole bunch of operations for which
>> we currently require AccessExclusiveLock. AlterTableGetLockLevel says:
>>
>>                 /*
>>                  * Removing constraints can affect SELECTs that have been
>>                  * optimised assuming the constraint holds true.
>>                  */
>>             case AT_DropConstraint: /* as DROP INDEX */
>>             case AT_DropNotNull:    /* may change some SQL plans */
>>                 cmd_lockmode = AccessExclusiveLock;
>>                 break;
>>
>> Dropping a partition implicitly involves dropping a constraint.  We
>> could gamble that the above has no consequences that are really
>
> It's not a gamble if you know that the constraints being dropped constrain only the object being dropped.

That's not true, but I can't refute your argument any more than that
because you haven't made one.

>  I've not read every argument on this thread, but many of the later points made here are spurious, by which I mean they sound like they could apply but in fact do not.

I think they do apply, and until somebody explains convincingly why
they don't, I'm going to keep thinking that they do.   Telling me that
my points are wrong without making any kind of argument about why they
are wrong is not constructive.  I've put a lot of energy into
analyzing this topic, both recently and in previous release cycles,
and I'm not inclined to just say "OK, well, Simon says I'm wrong, so
that's the end of it."

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: ATTACH/DETACH PARTITION CONCURRENTLY

Robert Haas
In reply to this post by Alvaro Herrera
On Fri, Jan 25, 2019 at 4:18 PM Alvaro Herrera <[hidden email]> wrote:

> > I wrote a little patch that stores the relation OIDs of the partitions
> > into the PartitionedRelPruneInfo and then, at execution time, does an
> > Assert() that what it gets matches what existed at plan time.  I
> > figured that a good start would be to find a test case where this
> > fails with concurrent DDL allowed, but I haven't so far succeeded in
> > devising one.  To make the Assert() fail, I need to come up with a
> > case where concurrent DDL has caused the PartitionDesc to be rebuilt
> > but without causing an update to the plan.  If I use prepared queries
> > inside of a transaction block, [...]
>
> > I also had the idea of trying to use a cursor, because if I could
> > start execution of a query, [...]
>
> Those are the ways I thought of, and the reason for the shape of some of
> those .spec tests.  I wasn't able to hit the situation.

I've managed to come up with a test case that seems to hit this problem.

Preparation:

create table foo (a int, b text, primary key (a)) partition by range (a);
create table foo1 partition of foo for values from (0) to (1000);
create table foo2 partition of foo for values from (1000) to (2000);
insert into foo1 values (1, 'one');
insert into foo2 values (1001, 'two');
alter system set plan_cache_mode = force_generic_plan;
select pg_reload_conf();

$ cat >x
alter table foo detach partition foo2;
alter table foo attach partition foo2 for values from (1000) to (2000);
^D

Window #1:

prepare foo as select * from foo where a = $1;
explain execute foo(1500);
\watch 0.01

Window #2:

$ pgbench -n -f x -T 60

Boom:

TRAP: FailedAssertion("!(partdesc->nparts == pinfo->nparts)", File:
"execPartition.c", Line: 1631)

I don't know how to reduce this to something reliable enough to
include it in the regression tests, and maybe we don't really need
that, but it's good to know that this is not a purely theoretical
problem.  I think next I'll try to write some code to make
execPartition.c able to cope with the situation when it arises.

(My draft/WIP patches attached, if you're interested.)

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachments:
  0008-Drop-lock-level.patch (1K)
  0006-Adapt-the-executor-to-use-a-PartitionDirectory.patch (9K)
  0005-Adapt-the-optimizer-to-use-a-PartitionDirectory.patch (8K)
  0007-relid_map-crosschecks.patch (7K)
  0004-Postpone-old-context-removal.patch (2K)
  0003-Initial-cut-at-PartitionDirectory.patch (5K)
  0002-Ensure-that-RelationBuildPartitionDesc-sees-a-consis.patch (8K)
  0001-Move-code-for-managing-PartitionDescs-into-a-new-fil.patch (31K)

Re: ATTACH/DETACH PARTITION CONCURRENTLY

Robert Haas
On Tue, Jan 29, 2019 at 1:59 PM Robert Haas <[hidden email]> wrote:
> I don't know how to reduce this to something reliable enough to
> include it in the regression tests, and maybe we don't really need
> that, but it's good to know that this is not a purely theoretical
> problem.  I think next I'll try to write some code to make
> execPartition.c able to cope with the situation when it arises.

OK, that seems to be pretty easy.  New patch series attached.  The
patch with that new logic is 0004.  I've consolidated some of the
things I had as separate patches in my last post and rewritten the
commit messages to explain more clearly the purpose of each patch.

Open issues:

- For now, I haven't tried to handle the DETACH PARTITION case.  I
don't think there's anything preventing someone - possibly even me -
from implementing the counter-based approach that I described in the
previous message, but I think it would be good to have some more
discussion first on whether it's acceptable to make concurrent queries
error out.  I think any queries that were already up and running would
be fine, but any that were planned before the DETACH and tried to
execute afterwards would get an ERROR.  That's fairly low-probability,
because normally the shared invalidation machinery would cause
replanning, but there's a race condition, so we'd have to document
something like: if you use this feature, it'll probably just work, but
you might get some funny errors in other sessions if you're unlucky.
That kinda sucks but maybe we should just suck it up.  Possibly we
should consider making the concurrent behavior optional, so that if
you'd rather take blocking locks than risk errors, you have that
option.  Of course I guess you could also just let people do an
explicit LOCK TABLE if that's what they want.  Or we could try to
actually make it work in that case, I guess by ignoring the detached
partitions, but that seems a lot harder.

- 0003 doesn't have any handling for parallel query at this point, so
even though within a single backend a single query execution will
always get the same PartitionDesc for the same relation, the answers
might not be consistent across the parallel group.  I keep going back
and forth on whether this really matters.  It's too late to modify the
plan, so any relations attached since it was generated are not going
to get scanned.  As for detached relations, we're talking about making
them error out, so we don't have to worry about different backends
come to different conclusions about whether they should be scanned.
But maybe we should try to be smarter instead.  One concern is that
even relations that aren't scanned could still be affected because of
tuple routing, but right now parallel queries can't INSERT or UPDATE
rows anyway.  Then again, maybe we should try not to add more
obstacles in the way of lifting that restriction. Then again again,
AFAICT we wouldn't be able to test that the new code is actually
solving a problem right now today, and how much untested code do we
really want in the tree?  And then on the eleventh hand, maybe there
are other reasons why it's important to use the same PartitionDesc
across all parallel workers that I'm not thinking about at the moment.

- 0003 also changes the order in which locks are acquired.  I am not
sure whether we care about this, especially in view of other pending
changes.

If you know of other problems, have solutions to or opinions about
these, or think the whole approach is wrong, please speak up!

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachments:
  0002-Ensure-that-RelationBuildPartitionDesc-sees-a-consis.patch (8K)
  0001-Move-code-for-managing-PartitionDescs-into-a-new-fil.patch (31K)
  0005-Reduce-the-lock-level-required-to-attach-a-partition.patch (1K)
  0004-Teach-runtime-partition-pruning-to-cope-with-concurr.patch (11K)
  0003-Ensure-that-repeated-PartitionDesc-lookups-return-th.patch (25K)