ATTACH/DETACH PARTITION CONCURRENTLY


Re: ATTACH/DETACH PARTITION CONCURRENTLY

Simon Riggs
On Tue, 6 Nov 2018 at 10:56, Robert Haas <[hidden email]> wrote:
On Tue, Nov 6, 2018 at 1:54 PM Simon Riggs <[hidden email]> wrote:
> Error in the COPY or in the DDL? COPY preferred. Somebody with insert rights shouldn't be able to prevent a table-owner level action. People normally drop partitions to save space, so it could be annoying if that was interrupted.

Yeah, the COPY.

> Supporting parallel query shouldn't make other cases more difficult from a behavioral perspective just to avoid the ERROR. The ERROR sounds annoying, but not sure how annoying avoiding it would be.

In my view, it's not just a question of it being annoying, but of
whether anything else is even sensible.  I mean, you can avoid an
error when a user types SELECT 1/0 by returning NULL or 42, but that's
not usually how we roll around here.

If you can remove the ERROR without any other adverse effects, that sounds great.

Please let us know what, if any, adverse effects would be caused so we can discuss. Thanks

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: ATTACH/DETACH PARTITION CONCURRENTLY

Robert Haas
On Tue, Nov 6, 2018 at 2:01 PM Simon Riggs <[hidden email]> wrote:
> If you can remove the ERROR without any other adverse effects, that sounds great.
>
> Please let us know what, if any, adverse effects would be caused so we can discuss. Thanks

Well, I've already written about this in two previous emails on this
thread, so I'm not sure exactly what you think is missing.  But to
state the problem again:

If you don't throw an error when a partition is concurrently detached
and then someone routes a tuple to that portion of the key space, what
DO you do?  Continue inserting tuples into the table even though it's
no longer a partition?  Throw tuples destined for that partition away?
 You can make an argument for both of those behaviors, but they're
both pretty strange.  The first one means that for an arbitrarily long
period of time after detaching a partition, the partition may continue
to receive inserts that were destined for its former parent.  The
second one means that your data can disappear into the ether.  I don't
like either of those things.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: ATTACH/DETACH PARTITION CONCURRENTLY

Alvaro Herrera-9
On 2018-Nov-06, Robert Haas wrote:

> If you don't throw an error when a partition is concurrently detached
> and then someone routes a tuple to that portion of the key space, what
> DO you do?  Continue inserting tuples into the table even though it's
> no longer a partition?

Yes -- the table was a partition when the query started, so it's still
a partition from the point of view of that query's snapshot.

> Throw tuples destined for that partition away?

Surely not.  (/me doesn't beat straw men anyway.)

> You can make an argument for both of those behaviors, but they're
> both pretty strange.  The first one means that for an arbitrarily long
> period of time after detaching a partition, the partition may continue
> to receive inserts that were destined for its former parent.

Not arbitrarily long -- only as long as those old snapshots live.  I
don't find this at all surprising.


(I think DETACH is not related to DROP in any way.  My proposal is that
DETACH can work concurrently, and if people want to drop the partition
later they can wait until snapshots/queries that could see that
partition are gone.)

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: ATTACH/DETACH PARTITION CONCURRENTLY

Robert Haas
On Tue, Nov 6, 2018 at 2:10 PM Alvaro Herrera <[hidden email]> wrote:
> On 2018-Nov-06, Robert Haas wrote:
> > If you don't throw an error when a partition is concurrently detached
> > and then someone routes a tuple to that portion of the key space, what
> > DO you do?  Continue inserting tuples into the table even though it's
> > no longer a partition?
>
> Yes -- the table was a partition when the query started, so it's still
> a partition from the point of view of that query's snapshot.

I think it's important to point out that DDL does not in general
respect the query snapshot.  For example, you can query a table that
was created by a transaction not visible to your query snapshot.  You
cannot query a table that was dropped by a transaction not visible to
your query snapshot.  If someone runs ALTER FUNCTION on a function
your query uses, you get the latest committed version, not the version
that was current at the time your query snapshot was created.  So, if
we go with the semantics you are proposing here, we will be making
this DDL behave differently from pretty much all other DDL.

Possibly that's OK in this case, but it's easy to think of other cases
where it could cause problems.  To take an example that I believe was
discussed on-list a number of years ago, suppose that ADD CONSTRAINT
worked according to the model that you are proposing for ATTACH
PARTITION.  If it did, then one transaction could be concurrently
inserting a tuple while another transaction was adding a constraint
which the tuple fails to satisfy.  Once both transactions commit, you
have a table with a supposedly-valid constraint and a tuple inside of
it that doesn't satisfy that constraint.  Obviously, that's no good.

I'm not entirely sure whether there are any similar dangers in the
case of DETACH PARTITION.  I think it depends a lot on what can be
done with that detached partition while the overlapping transaction is
still active.  For instance, suppose you attached it to the original
table with a different set of partition bounds, or attached it to some
other table with a different set of partition bounds.  If you can do
that, then I think it effectively creates the problem described in the
previous paragraph with respect to the partition constraint.

IOW, we've got to somehow prevent this:

setup: partition is attached with bounds 1 to a million
S1: COPY begins
S2: partition is detached
S2: partition is reattached with bounds 1 to a thousand
S1: still-running copy inserts a tuple with value ten thousand

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: ATTACH/DETACH PARTITION CONCURRENTLY

Simon Riggs
In reply to this post by Robert Haas
On Tue, 6 Nov 2018 at 11:06, Robert Haas <[hidden email]> wrote:
On Tue, Nov 6, 2018 at 2:01 PM Simon Riggs <[hidden email]> wrote:
> If you can remove the ERROR without any other adverse effects, that sounds great.
>
> Please let us know what, if any, adverse effects would be caused so we can discuss. Thanks

Well, I've already written about this in two previous emails on this
thread, so I'm not sure exactly what you think is missing.  But to
state the problem again:

I was discussing the ERROR in relation to parallel query, not COPY.

I didn't understand how that would be achieved.

Thanks for working on this.

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: ATTACH/DETACH PARTITION CONCURRENTLY

Robert Haas
In reply to this post by Andres Freund
On Tue, Aug 7, 2018 at 9:29 AM Andres Freund <[hidden email]> wrote:
> One approach would be to make sure that everything relying on
> rt_partdesc staying the same stores its value in a local variable, and
> then *not* free the old version of rt_partdesc (etc) when the refcount >
> 0, but delay that to the RelationClose() that makes refcount reach
> 0. That'd be the start of a framework for more such concurrent
> handling.

Some analysis of possible trouble spots:

- get_partition_dispatch_recurse and ExecCreatePartitionPruneState
both call RelationGetPartitionDesc.  Presumably, this means that if
the partition descriptor gets updated on the fly, the tuple routing
and partition dispatch code could end up with different ideas about
which partitions exist.  I think this should be fixed somehow, so that
we only call RelationGetPartitionDesc once per query and use the
result for everything.

- expand_inherited_rtentry checks
RelationGetPartitionDesc(oldrelation) != NULL.  If so, it calls
expand_partitioned_rtentry which fetches the same PartitionDesc again.
We can probably just do this once in the caller and pass the result
down.

- set_relation_partition_info also calls RelationGetPartitionDesc.
Off-hand, I think this code runs after expand_inherited_rtentry.  Not
sure what to do about this.  I'm not sure what the consequences would
be if this function and that one had different ideas about the
partition descriptor.

- tablecmds.c is pretty free about calling RelationGetPartitionDesc
repeatedly, but it probably doesn't matter.  If we're doing some kind
of DDL that depends on the contents of the partition descriptor, we
*had better* be holding a lock strong enough to prevent the partition
descriptor from being changed by somebody else at the same time.
Allowing a partition to be added concurrently with DML is one thing;
allowing a partition to be added concurrently with adding another
partition is a whole different level of insanity.  I think we'd be
best advised not to go down that rathole - among other concerns, how
would you even guarantee that the partitions being added didn't
overlap?

Generally:

Is it really OK to postpone freeing the old partition descriptor until
the relation reference count goes to 0?  I wonder if there are cases
where this could lead to tons of copies of the partition descriptor
floating around at the same time, gobbling up precious cache memory.
My first thought was that this would be pretty easy: just create a lot
of new partitions one by one while some long-running transaction is
open.  But the actual result in that case depends on the behavior of
the backend running the transaction.  If it just ignores the new
partitions and sticks with the partition descriptor it has got, then
probably nothing else will request the new partition descriptor either
and there will be no accumulation of memory.  However, if it tries to
absorb the updated partition descriptor, but without being certain
that the old one can be freed, then we'd have a query-lifespan memory
leak which is quadratic in the number of new partitions.

Maybe even that would be OK -- we could suppose that the number of new
partitions probably wouldn't be all THAT crazy large, and the constant
factor not too bad, so maybe you'd leak a couple of MB for the length
of the query, but no more.  However, I wonder if it would be better to
give each PartitionDescData its own refcnt, so that it can be freed
immediately when the refcnt goes to zero.  That would oblige every
caller of RelationGetPartitionDesc() to later call something like
ReleasePartitionDesc().  We could catch failures to do that by keeping
all the PartitionDesc objects so far created in a list.  When the main
entry's refcnt goes to 0, cross-check that this list is empty; if not,
then the remaining entries have non-zero refcnts that were leaked.  We
could emit a WARNING as we do in similar cases.
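
For illustration, a minimal, self-contained sketch of the refcount
discipline being floated here -- none of these names exist in the tree as
written, the struct is reduced to a toy, and free() stands in for whatever
memory-context cleanup the real thing would do:

    #include <assert.h>
    #include <stdlib.h>

    /* Toy stand-in for the real PartitionDescData; interesting fields elided. */
    typedef struct PartitionDescData
    {
        int     refcnt;     /* callers currently holding this descriptor */
        int     nparts;
    } PartitionDescData;

    typedef PartitionDescData *PartitionDesc;

    /* Hand out the current descriptor, pinned; pair with ReleasePartitionDesc(). */
    static PartitionDesc
    GetPartitionDescPinned(PartitionDesc current)
    {
        current->refcnt++;
        return current;
    }

    /* Unpin; the last caller out frees the (possibly superseded) copy. */
    static void
    ReleasePartitionDesc(PartitionDesc pdesc)
    {
        assert(pdesc->refcnt > 0);
        if (--pdesc->refcnt == 0)
            free(pdesc);
    }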

In general, I think something along the lines you are suggesting here
is the right place to start attacking this problem.  Effectively, we
immunize the system against the possibility of new entries showing up
in the partition descriptor while concurrent DML is running; the
semantics are that the new partitions are ignored for the duration of
currently-running queries.  This seems to allow for painless creation
or addition of new partitions in normal cases, but not when a default
partition exists.  In that case, using the old PartitionDesc is
outright wrong, because adding a new toplevel partition changes the
default partition's partition constraint. We can't insert into the
default partition a tuple that under the updated table definition
needs to go someplace else.  It seems like the best way to account for
that is to reduce the lock level on the partitioned table to
ShareUpdateExclusiveLock, but leave the lock level on any default
partition as AccessExclusiveLock (because we are modifying a
constraint on it).  We would also need to leave the lock level on the
new partition as AccessExclusiveLock (because we are adding a
constraint on it).  Not perfect, for sure, but not bad for a first
patch, either; it would improve things for users in a bunch of
practical cases.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: ATTACH/DETACH PARTITION CONCURRENTLY

Alvaro Herrera-9
In reply to this post by Alvaro Herrera-9
On 2018-Nov-06, Alvaro Herrera wrote:

> On 2018-Nov-06, Robert Haas wrote:

> > Throw tuples destined for that partition away?
>
> Surely not.  (/me doesn't beat straw men anyway.)

Hmm, apparently this can indeed happen with my patch :-(

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: ATTACH/DETACH PARTITION CONCURRENTLY

Robert Haas
On Tue, Nov 6, 2018 at 10:18 PM Alvaro Herrera <[hidden email]> wrote:
> > > Throw tuples destined for that partition away?
> > Surely not.  (/me doesn't beat straw men anyway.)
>
> Hmm, apparently this can indeed happen with my patch :-(

D'oh.  This is a hard problem, especially the part of it that involves
handling detach, so I wouldn't feel too bad about that.  However, to
beat this possibly-dead horse a little more, I think you made the
error of writing a patch that (1) tried to solve too many problems at
once and (2) didn't seem to really have a clear, well-considered idea
about what the semantics ought to be.

This is not intended as an attack; I want to work with you to solve
the problem, not have a fight about it.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: ATTACH/DETACH PARTITION CONCURRENTLY

Robert Haas
In reply to this post by Robert Haas
On Tue, Nov 6, 2018 at 5:09 PM Robert Haas <[hidden email]> wrote:
> - get_partition_dispatch_recurse and ExecCreatePartitionPruneState
> both call RelationGetPartitionDesc.  Presumably, this means that if
> the partition descriptor gets updated on the fly, the tuple routing
> and partition dispatch code could end up with different ideas about
> which partitions exist.  I think this should be fixed somehow, so that
> we only call RelationGetPartitionDesc once per query and use the
> result for everything.

I think there is deeper trouble here.
ExecSetupPartitionTupleRouting() calls find_all_inheritors() to
acquire RowExclusiveLock on the whole partitioning hierarchy.  It then
calls RelationGetPartitionDispatchInfo (as a non-relcache function,
this seems poorly named) which calls get_partition_dispatch_recurse,
which does this:

            /*
             * We assume all tables in the partition tree were already locked
             * by the caller.
             */
            Relation    partrel = heap_open(partrelid, NoLock);

That seems OK at present, because no new partitions can have appeared
since ExecSetupPartitionTupleRouting() acquired locks.  But if we
allow new partitions to be added with only ShareUpdateExclusiveLock,
then I think there would be a problem.  If a new partition OID creeps
into the partition descriptor after find_all_inheritors() and before
we fetch its partition descriptor, then we wouldn't have previously
taken a lock on it and would still be attempting to open it without a
lock, which is bad (cf. b04aeb0a053e7cf7faad89f7d47844d8ba0dc839).

Admittedly, it might be a bit hard to provoke a failure here because
I'm not exactly sure how you could trigger a relcache reload in the
critical window, but I don't think we should rely on that.

More generally, it seems a bit strange that we take the approach of
locking the entire partitioning hierarchy here regardless of which
relations the query actually knows about.  If some relations have been
pruned, presumably we don't need to lock them; if/when we permit
concurrent partition addition, we don't need to lock any new ones that have
materialized.  We're just going to end up ignoring them anyway because
there's nothing to do with the information that they are or are not
excluded from the query when they don't appear in the query plan in
the first place.

Furthermore, this whole thing looks suspiciously like more of the sort
of redundant locking that f2343653f5b2aecfc759f36dbb3fd2a61f36853e
attempted to eliminate.  In light of that commit message, I'm
wondering whether the best approach would be to [1] get rid of the
find_all_inheritors call altogether and [2] somehow ensure that
get_partition_dispatch_recurse() doesn't open any tables that aren't
part of the query's range table.

Thoughts?  Comments?  Ideas?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: ATTACH/DETACH PARTITION CONCURRENTLY

Alvaro Herrera-9
In reply to this post by Robert Haas
On 2018-Nov-06, Robert Haas wrote:

> - get_partition_dispatch_recurse and ExecCreatePartitionPruneState
> both call RelationGetPartitionDesc.

My patch deals with this by caching descriptors in the active snapshot.
So those two things would get the same partition descriptor.  There's no
RelationGetPartitionDesc anymore, and SnapshotGetPartitionDesc takes its
place.

(I tried to use different scoping than the active snapshot; I first
tried the Portal, then I tried the resource owner.  But nothing seems to
fit as precisely as the active snapshot.)

> - expand_inherited_rtentry checks
> RelationGetPartitionDesc(oldrelation) != NULL.  If so, it calls
> expand_partitioned_rtentry which fetches the same PartitionDesc again.

This can be solved by changing the test to a relkind one, as my patch
does.

> - set_relation_partition_info also calls RelationGetPartitionDesc.
> Off-hand, I think this code runs after expand_inherited_rtentry.  Not
> sure what to do about this.  I'm not sure what the consequences would
> be if this function and that one had different ideas about the
> partition descriptor.

Snapshot caching, like in my patch, again solves this problem.

> - tablecmds.c is pretty free about calling RelationGetPartitionDesc
> repeatedly, but it probably doesn't matter.  If we're doing some kind
> of DDL that depends on the contents of the partition descriptor, we
> *had better* be holding a lock strong enough to prevent the partition
> descriptor from being changed by somebody else at the same time.

My patch deals with this by unlinking the partcache entry from the hash
table on relation invalidation, so DDL code would obtain a fresh copy
each time (lookup_partcache_entry).
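
For concreteness, invalidation-driven eviction of that kind can be sketched
with the existing relcache-callback and dynahash facilities.  The names below
(partcache_hash, PartCacheEntry) are illustrative stand-ins, not the patch's
actual code:

    typedef struct PartCacheEntry
    {
        Oid         reloid;         /* hash key; must be first field */
        /* ... cached partition descriptor would hang off here ... */
    } PartCacheEntry;

    static HTAB *partcache_hash = NULL;

    static void
    partcache_relcache_callback(Datum arg, Oid relid)
    {
        if (partcache_hash == NULL)
            return;

        if (OidIsValid(relid))
        {
            /* Forget this relation's entry; the next lookup rebuilds it. */
            (void) hash_search(partcache_hash, &relid, HASH_REMOVE, NULL);
        }
        else
        {
            /* InvalidOid means "flush everything". */
            HASH_SEQ_STATUS status;
            PartCacheEntry *entry;

            hash_seq_init(&status, partcache_hash);
            while ((entry = (PartCacheEntry *) hash_seq_search(&status)) != NULL)
                (void) hash_search(partcache_hash, &entry->reloid,
                                   HASH_REMOVE, NULL);
        }
    }

    /* registered once per backend, e.g. at first use:
     *     CacheRegisterRelcacheCallback(partcache_relcache_callback, (Datum) 0);
     */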

In other words, I already solved these problems you list.

Maybe you could give my patch a look.

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: ATTACH/DETACH PARTITION CONCURRENTLY

Robert Haas
On Wed, Nov 7, 2018 at 12:58 PM Alvaro Herrera <[hidden email]> wrote:

> On 2018-Nov-06, Robert Haas wrote:
> > - get_partition_dispatch_recurse and ExecCreatePartitionPruneState
> > both call RelationGetPartitionDesc.
>
> My patch deals with this by caching descriptors in the active snapshot.
> So those two things would get the same partition descriptor.  There's no
> RelationGetPartitionDesc anymore, and SnapshotGetPartitionDesc takes its
> place.
>
> (I tried to use different scoping than the active snapshot; I first
> tried the Portal, then I tried the resource owner.  But nothing seems to
> fit as precisely as the active snapshot.)
...
> In other words, I already solved these problems you list.
>
> Maybe you could give my patch a look.

I have, a bit.  One problem I'm having is that while you explained the
design you chose in a fair amount of detail, you didn't give a lot of
explanation (that I have seen) of the reasons why you chose that
design.  If there's a README or a particularly good comment someplace
that I should be reading to understand that better, please point me in
the right direction.

And also, I just don't really understand what all the problems are
yet.  I'm only starting to study this.

I am a bit skeptical of your approach, though.  Tying it to the active
snapshot seems like an awfully big hammer.  Snapshot manipulation can
be a performance bottleneck both in terms of actual performance and
also in terms of code complexity, and I don't really like the idea of
adding more code there.  It's not a sustainable pattern for making DDL
work concurrently, either -- I'm pretty sure we don't want to add new
code to things like GetLatestSnapshot() every time we want to make a
new kind of DDL concurrent.  Also, while hash table lookups are pretty
cheap, they're not free.  In my opinion, to the extent that we can, it
would be better to refactor things to avoid duplicate lookups of the
PartitionDesc rather than to install a new subsystem that tries to
make sure they always return the same answer.

Such an approach seems to have other possible advantages.  For
example, if a COPY is running and a new partition shows up, we might
actually want to allow tuples to be routed to it.  Maybe that's too
pie in the sky, but if we want to preserve the option to do such
things in the future, a hard-and-fast rule that the apparent partition
descriptor doesn't change unless the snapshot changes seems like it
might get in the way.  It seems better to me to have a system where
code that accesses the relcache has a choice, so that at its option it
can either hang on to the PartitionDesc it has or get a new one that
may be different.  If we can do things that way, it gives us the most
flexibility.

After the poking around I've done over the last 24 hours, I do see
that there are some non-trivial problems with making it that way, but
I'm not really ready to give up yet.

Does that make sense to you, or am I all wet here?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: ATTACH/DETACH PARTITION CONCURRENTLY

David Rowley-3
In reply to this post by Robert Haas
On 8 November 2018 at 05:05, Robert Haas <[hidden email]> wrote:

> That seems OK at present, because no new partitions can have appeared
> since ExecSetupPartitionTupleRouting() acquired locks.  But if we
> allow new partitions to be added with only ShareUpdateExclusiveLock,
> then I think there would be a problem.  If a new partition OID creeps
> into the partition descriptor after find_all_inheritors() and before
> we fetch its partition descriptor, then we wouldn't have previously
> taken a lock on it and would still be attempting to open it without a
> lock, which is bad (cf. b04aeb0a053e7cf7faad89f7d47844d8ba0dc839).
>
> Admittedly, it might be a bit hard to provoke a failure here because
> I'm not exactly sure how you could trigger a relcache reload in the
> critical window, but I don't think we should rely on that.
>
> More generally, it seems a bit strange that we take the approach of
> locking the entire partitioning hierarchy here regardless of which
> relations the query actually knows about.  If some relations have been
> pruned, presumably we don't need to lock them; if/when we permit
> concurrent partition addition, we don't need to lock any new ones that have
> materialized.  We're just going to end up ignoring them anyway because
> there's nothing to do with the information that they are or are not
> excluded from the query when they don't appear in the query plan in
> the first place.

While the find_all_inheritors() call is something I'd like to see
gone, I assume it was done that way since an UPDATE might route a
tuple to a partition that there is no subplan for and due to INSERT
with VALUES not having any RangeTblEntry for any of the partitions.
Simply, any partition which is a descendant of the target partition
table could receive the tuple regardless of what might have been
pruned.

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: ATTACH/DETACH PARTITION CONCURRENTLY

Robert Haas
On Wed, Nov 7, 2018 at 7:06 PM David Rowley
<[hidden email]> wrote:
> While the find_all_inheritors() call is something I'd like to see
> gone, I assume it was done that way since an UPDATE might route a
> tuple to a partition that there is no subplan for and due to INSERT
> with VALUES not having any RangeTblEntry for any of the partitions.
> Simply, any partition which is a descendant of the target partition
> table could receive the tuple regardless of what might have been
> pruned.

Thanks.  I had figured out since my email of earlier today that it was
needed in the INSERT case, but I had not thought of/discovered the
case of an UPDATE that routes a tuple to a pruned partition.  I think
that latter case may not be tested in our regression tests, which is
perhaps something we ought to change.

Honestly, I *think* that the reason that find_all_inheritors() call is
there is because I had the idea that it was important to try to lock
partition hierarchies in the same order in all cases so as to avoid
spurious deadlocks.  However, I don't think we're really achieving
that goal despite this code.  If we arrive at this point having
already locked some relations, and then lock some more, based on
whatever got pruned, we're clearly not using a deterministic locking
order.  So I think we could probably rip out the find_all_inheritors()
call here and change the NoLock in get_partition_dispatch_recurse() to
just take a lock.  That's probably a worthwhile simplification and a
slight optimization regardless of anything else.

But I really think it would be better if we could also jigger this to
avoid reopening relations which the executor has already opened and
locked elsewhere.  Unfortunately, I don't see a really simple way to
accomplish that.  We get the OIDs of the descendants and want to know
whether there is range table entry for that OID; but there's no data
structure which answers that question at present, I believe, and
introducing one just for this purpose seems like an awful lot of new
machinery.  Perhaps that new machinery would still have less
far-reaching consequences than the machinery Alvaro is proposing, but,
still, it's not very appealing.

Perhaps one idea is only open and lock partitions on demand - i.e. if
a tuple actually gets routed to them.  There are good reasons to do
that independently of reducing lock levels, and we certainly couldn't
do it without having some efficient way to check whether it had
already been done.  So then the mechanism wouldn't feel like so much
like a special-purpose hack just for concurrent ATTACH/DETACH.  (Was
Amit Langote already working on this, or was that some other kind of
on-demand locking?)
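
As a rough sketch of that idea (RoutingState and its fields are hypothetical
placeholders for the executor's tuple-routing state; heap_open and
RowExclusiveLock are existing backend facilities, and the choice of lock mode
here is an assumption):

    typedef struct RoutingState
    {
        int         nparts;
        Oid        *part_oids;      /* every partition, from the PartitionDesc */
        Relation   *part_rels;      /* NULL until a tuple is routed there */
    } RoutingState;

    static Relation
    get_routing_partition(RoutingState *state, int partidx)
    {
        if (state->part_rels[partidx] == NULL)
        {
            /* First tuple routed here: open and lock the partition now. */
            state->part_rels[partidx] =
                heap_open(state->part_oids[partidx], RowExclusiveLock);
        }
        return state->part_rels[partidx];
    }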

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: ATTACH/DETACH PARTITION CONCURRENTLY

David Rowley-3
On 8 November 2018 at 15:01, Robert Haas <[hidden email]> wrote:

> Honestly, I *think* that the reason that find_all_inheritors() call is
> there is because I had the idea that it was important to try to lock
> partition hierarchies in the same order in all cases so as to avoid
> spurious deadlocks.  However, I don't think we're really achieving
> that goal despite this code.  If we arrive at this point having
> already locked some relations, and then lock some more, based on
> whatever got pruned, we're clearly not using a deterministic locking
> order.  So I think we could probably rip out the find_all_inheritors()
> call here and change the NoLock in get_partition_dispatch_recurse() to
> just take a lock.  That's probably a worthwhile simplification and a
> slight optimization regardless of anything else.

I'd not thought of the locks taken elsewhere case. I guess it just
perhaps reduces the chances of a deadlock then.

A "slight optimization" is one way to categorise it. There are some
benchmarks you might find interesting in [1] and [2]. Patch 0002 does
just what you mention.

[1] https://www.postgresql.org/message-id/06524959-fda8-cff9-6151-728901897b79%40redhat.com
[2] https://www.postgresql.org/message-id/CAKJS1f_1RJyFquuCKRFHTdcXqoPX-PYqAd7nz%3DGVBwvGh4a6xA%40mail.gmail.com

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: ATTACH/DETACH PARTITION CONCURRENTLY

Amit Langote-2
In reply to this post by Robert Haas
On 2018/11/08 11:01, Robert Haas wrote:

> On Wed, Nov 7, 2018 at 7:06 PM David Rowley
> <[hidden email]> wrote:
>> While the find_all_inheritors() call is something I'd like to see
>> gone, I assume it was done that way since an UPDATE might route a
>> tuple to a partition that there is no subplan for and due to INSERT
>> with VALUES not having any RangeTblEntry for any of the partitions.
>> Simply, any partition which is a descendant of the target partition
>> table could receive the tuple regardless of what might have been
>> pruned.
>
> Thanks.  I had figured out since my email of earlier today that it was
> needed in the INSERT case, but I had not thought of/discovered the
> case of an UPDATE that routes a tuple to a pruned partition.  I think
> that latter case may not be tested in our regression tests, which is
> perhaps something we ought to change.
>
> Honestly, I *think* that the reason that find_all_inheritors() call is
> there is because I had the idea that it was important to try to lock
> partition hierarchies in the same order in all cases so as to avoid
> spurious deadlocks.  However, I don't think we're really achieving
> that goal despite this code.  If we arrive at this point having
> already locked some relations, and then lock some more, based on
> whatever got pruned, we're clearly not using a deterministic locking
> order.  So I think we could probably rip out the find_all_inheritors()
> call here and change the NoLock in get_partition_dispatch_recurse() to
> just take a lock.  That's probably a worthwhile simplification and a
> slight optimization regardless of anything else.

A patch that David and I have been working on over at:

https://commitfest.postgresql.org/20/1690/

does that.  With that patch, partitions (leaf or not) are locked and
opened only if a tuple is routed to them.  In edd44738bc (Be lazier about
partition tuple routing), we postponed the opening of leaf partitions, but
we still left the RelationGetPartitionDispatchInfo machinery, which
recursively creates PartitionDispatch structs for all partitioned tables
in a tree.  The patch mentioned above postpones even the initialization of
those partitioned (non-leaf) tables until a tuple is routed to them.

The patch doesn't yet eliminate the find_all_inheritors call from
ExecSetupPartitionTupleRouting.  But that's mostly because of the fear
that if we start becoming lazier about locking individual partitions too,
we'll get non-deterministic locking order on partitions that we might want
to avoid for deadlock fears.  Maybe, we don't need to be fearful though.

> But I really think it would be better if we could also jigger this to
> avoid reopening relations which the executor has already opened and
> locked elsewhere.  Unfortunately, I don't see a really simple way to
> accomplish that.  We get the OIDs of the descendents and want to know
> whether there is range table entry for that OID; but there's no data
> structure which answers that question at present, I believe, and
> introducing one just for this purpose seems like an awful lot of new
> machinery.  Perhaps that new machinery would still have less
> far-reaching consequences than the machinery Alvaro is proposing, but,
> still, it's not very appealing.

The newly added ExecGetRangeTableRelation opens (if not already done) and
returns a Relation pointer for tables that are present in the range table,
so it requires a valid RT index.  That works for tables that the
planner touched.  UPDATE tuple routing benefits from that in cases where
the routing target is already in the range table.

For the insert itself, the planner adds only the target partitioned table to the
range table.  Partitions that the inserted tuples will route to may be
present in the range table via some other plan node, but the insert's
execution state won't know about them, so it cannot use
ExecGetRangeTableRelation.

> Perhaps one idea is only open and lock partitions on demand - i.e. if
> a tuple actually gets routed to them.  There are good reasons to do
> that independently of reducing lock levels, and we certainly couldn't
> do it without having some efficient way to check whether it had
> already been done.  So then the mechanism wouldn't feel like so much
> like a special-purpose hack just for concurrent ATTACH/DETACH.  (Was
> Amit Langote already working on this, or was that some other kind of
> on-demand locking?)

I think the patch mentioned above gets us closer to that goal.

Thanks,
Amit



Re: ATTACH/DETACH PARTITION CONCURRENTLY

Robert Haas
In reply to this post by Robert Haas
On Wed, Nov 7, 2018 at 1:37 PM Robert Haas <[hidden email]> wrote:
> > Maybe you could give my patch a look.
>
> I have, a bit.

While thinking about this problem a bit more, I realized that what is
called RelationBuildPartitionDesc in master and BuildPartitionDesc in
Alvaro's patch has a synchronization problem as soon as we start to
reduce lock levels.  At some point, find_inheritance_children() gets
called to get a list of the OIDs of the partitions. Then, later,
SysCacheGetAttr(RELOID, ...) gets called for each one to get its
relpartbound value.  But since catalog lookups use the most current
snapshot, they might not see a compatible view of the catalogs.

That could manifest in a few different ways:

- We might see a newer version of relpartbound, where it's now null
because it's been detached.
- We might see a newer version of relpartbound where it now has an
unrelated value because it has been detached and then reattached to
some other partitioned table.
- We might see newer versions of relpartbound for some tables than
others.  For instance, suppose we had partition A for 1..200 and B for
201..300.  Then we realize that this is not what we actually wanted to
do, so we detach A and reattach it with a bound of 1..100 and detached
B and reattach it with a bound of 101..300.  If we perform the
syscache lookup for A before this happens and the syscache lookup for
B after this happens, we might see the old bound for A and the new
bound for B, and that would be sad, 'cuz they overlap.
- Seeing an older relpartbound for some other table is also a problem
for other reasons -- we will have the wrong idea about the bounds of
that partition and may put the wrong tuples into it.  Without
AccessExclusiveLock, I don't think there is anything that keeps us
from reading stale syscache entries.

Alvaro's patch defends against the first of these cases by throwing an
error, which, as I already said, I don't think is acceptable, but I
don't see any defense at all against the other cases. The root of the
problem is that the way catalog lookups work today - each individual
lookup uses the latest available snapshot, but there is zero guarantee
that consecutive lookups use the same snapshot.  Therefore, as soon as
you start lowering lock levels, you are at risk for inconsistent data.

I suspect the only good way of fixing this problem is using a single
snapshot to perform both the scan of pg_inherits and the subsequent
pg_class lookups.  That way, you know that you are seeing the state of
the whole partitioning hierarchy as it existed at some particular
point in time -- every commit is either fully reflected in the
constructed PartitionDesc or not reflected at all.  Unfortunately,
that would mean that we can't use the syscache to perform the lookups,
which might have unhappy performance consequences.

Note that this problem still exists even if you allow concurrent
attach but not concurrent detach, but it's not as bad, because when
you encounter a concurrently-attached partition, you know it hasn't
also been concurrently-detached from someplace else.  Presumably you
either see the latest value of the partition bound or the NULL value
which preceded it, but not anything else.  If that's so, then maybe
you could get by without using a consistent snapshot for all of your
information gathering: if you see NULL, you know that the partition
was concurrently added and you just ignore it. There's still no
guarantee that all parallel workers would come to the same conclusion,
though, which doesn't feel too good.

Personally, I don't think it's right to blame that problem on parallel
query.  The problem is more general than that: we assume that holding
any kind of a lock on a relation is enough to keep the important
details of the relation static, and therefore it's fine to do
staggered lookups within one backend, and it's also fine to do
staggered lookups across different backends.  When you remove the
basic assumption that any lock is enough to prevent concurrent DDL,
then the whole idea that you can do different lookups at different
times with different snapshots (possibly in different backends) and
get sane answers also ceases to be correct.  But the idea that you can
look up different bits of catalog data at whatever time is convenient
undergirds large amounts of our current machinery -- it's built into
relcache, syscache, sinval, ...

I think that things get even crazier if we postpone locking on
individual partitions until we need to do something with that
partition, as has been proposed elsewhere.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: ATTACH/DETACH PARTITION CONCURRENTLY

David Rowley-3
On 9 November 2018 at 05:34, Robert Haas <[hidden email]> wrote:
> I suspect the only good way of fixing this problem is using a single
> snapshot to perform both the scan of pg_inherits and the subsequent
> pg_class lookups.  That way, you know that you are seeing the state of
> the whole partitioning hierarchy as it existed at some particular
> point in time -- every commit is either fully reflected in the
> constructed PartitionDesc or not reflected at all.  Unfortunately,
> that would mean that we can't use the syscache to perform the lookups,
> which might have unhappy performance consequences.

I do have a patch sitting around that moves the relpartbound into a
new catalogue table named pg_partition. This gets rid of the usage of
pg_inherits for partitioned tables. I wonder if that problem is easier
to solve with that.  It also solves the issue with long partition keys
and lack of toast table on pg_class.

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: ATTACH/DETACH PARTITION CONCURRENTLY

Robert Haas
On Thu, Nov 8, 2018 at 3:59 PM David Rowley
<[hidden email]> wrote:

> On 9 November 2018 at 05:34, Robert Haas <[hidden email]> wrote:
> > I suspect the only good way of fixing this problem is using a single
> > snapshot to perform both the scan of pg_inherits and the subsequent
> > pg_class lookups.  That way, you know that you are seeing the state of
> > the whole partitioning hierarchy as it existed at some particular
> > point in time -- every commit is either fully reflected in the
> > constructed PartitionDesc or not reflected at all.  Unfortunately,
> > that would mean that we can't use the syscache to perform the lookups,
> > which might have unhappy performance consequences.
>
> I do have a patch sitting around that moves the relpartbound into a
> new catalogue table named pg_partition. This gets rid of the usage of
> pg_inherits for partitioned tables. I wonder if that problem is easier
> to solve with that.  It also solves the issue with long partition keys
> and lack of toast table on pg_class.

Yeah, I thought about that, and it does make some sense.  Not sure if
it would hurt performance to have to access another table, but maybe
it comes out in the wash if pg_inherits is gone?  Seems like a fair
amount of code rearrangement just to get around the lack of a TOAST
table on pg_class, but maybe it's worth it.

I had another idea, too.  I think we might be able to reuse the
technique Noah invented in 4240e429d0c2d889d0cda23c618f94e12c13ade7.
That is:

- make a note of SharedInvalidMessageCounter before doing any of the
relevant catalog lookups
- do them
- AcceptInvalidationMessages()
- if SharedInvalidMessageCounter has changed, discard all the data we
collected and retry from the top

I believe that is sufficient to guarantee that whatever we construct
will have a consistent view of the catalogs which is the most recent
available view as of the time we do the work.  And with this approach
I believe we can continue to use syscache lookups to get the data
rather than having to use actual index scans, which is nice.
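
To make the shape of that concrete, here is a sketch of the loop, modeled on
the way RangeVarGetRelidExtended() uses the same trick.
collect_partition_info() and free_partition_info() are hypothetical
placeholders for the pg_inherits scan plus the per-child syscache lookups;
the rest assumes normal backend context:

    PartitionDesc
    build_partdesc_with_retry(Relation rel)
    {
        for (;;)
        {
            uint64      inval_count = SharedInvalidMessageCounter;
            PartitionDesc pdesc;

            /* all the catalog/syscache lookups happen in here */
            pdesc = collect_partition_info(rel);

            /* absorb any invalidation messages sent while we were looking */
            AcceptInvalidationMessages();

            if (inval_count == SharedInvalidMessageCounter)
                return pdesc;       /* nothing changed underneath us */

            /* something did change; throw our work away and start over */
            free_partition_info(pdesc);
        }
    }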

Then again, with your approach I'm guessing that one index scan would
get us the list of children and their partition bound information.
That would be even better -- the syscache lookup per child goes away
altogether; it's just a question of deforming the pg_partition tuples.

Way back at the beginning of the partitioning work, I mulled over the
idea of storing the partition bound information in a new column in
pg_inherits rather than in pg_class.  I wonder why exactly I rejected
that idea, and whether I was wrong to do so.  One possible advantage
of that approach over a pg_partition table is that client code
which queries pg_inherits will have to be adjusted if we stop using
it, and some of those queries are going to get more complicated.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: ATTACH/DETACH PARTITION CONCURRENTLY

Robert Haas
On Fri, Nov 9, 2018 at 9:50 AM Robert Haas <[hidden email]> wrote:

> I had another idea, too.  I think we might be able to reuse the
> technique Noah invented in 4240e429d0c2d889d0cda23c618f94e12c13ade7.
> That is:
>
> - make a note of SharedInvalidMessageCounter before doing any of the
> relevant catalog lookups
> - do them
> - AcceptInvalidationMessages()
> - if SharedInvalidMessageCounter has changed, discard all the data we
> collected and retry from the top
>
> I believe that is sufficient to guarantee that whatever we construct
> will have a consistent view of the catalogs which is the most recent
> available view as of the time we do the work.  And with this approach
> I believe we can continue to use syscache lookups to get the data
> rather than having to use actual index scans, which is nice.
Here are a couple of patches to illustrate this approach to this part
of the overall problem.  0001 is, I think, a good cleanup that may as
well be applied in isolation; it makes the code in
RelationBuildPartitionDesc both cleaner and more efficient.  0002
adjusts things so that - I hope - the partition bounds we get for the
individual partitions have to be as of the same point in the commit
sequence as the list of children.  As I noted before, Alvaro's patch
doesn't seem to have tackled this part of the problem.

Another part of the problem is finding a way to make sure that if we
execute a query (or plan one), all parts of the executor (or planner)
see the same PartitionDesc for every relation.  In the case of
parallel query, I think it's important to try to get consistency not
only within a single backend but also across backends.  I'm thinking
about perhaps creating an object with a name like
PartitionDescDirectory which can optionally attach to dynamic shared
memory.  It would store an OID -> PartitionDesc mapping in local
memory, and optionally, an additional OID -> serialized-PartitionDesc
in DSA.  When given an OID, it would check the local hash table first,
and then if that doesn't find anything, check the shared hash table if
there is one.  If an entry is found there, deserialize and add to the
local hash table.  We'd then hang such a directory off of the EState
for the executor and the PlannerInfo for the planner.  As compared
with Alvaro's proposal, this approach has the advantage of not
treating parallel query as a second-class citizen, and also of keeping
partitioning considerations out of the snapshot handling, which as I
said before seems to me to be a good idea.
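
For what it's worth, the local side of such a directory might be sketched
along these lines.  The names, fields, and lookup function are assumptions
for discussion rather than an existing API, and the DSA-backed shared table
is elided:

    typedef struct PartitionDirectoryEntry
    {
        Oid             reloid;     /* hash key */
        PartitionDesc   pdesc;      /* the descriptor this query will use */
    } PartitionDirectoryEntry;

    typedef struct PartitionDirectory
    {
        HTAB       *local_map;      /* OID -> PartitionDirectoryEntry */
        /* dsa_area pointer and shared hash table would go here */
    } PartitionDirectory;

    PartitionDesc
    PartitionDirectoryLookup(PartitionDirectory *pdir, Relation rel)
    {
        Oid         reloid = RelationGetRelid(rel);
        PartitionDirectoryEntry *entry;
        bool        found;

        entry = (PartitionDirectoryEntry *)
            hash_search(pdir->local_map, &reloid, HASH_ENTER, &found);
        if (!found)
            entry->pdesc = RelationGetPartitionDesc(rel);   /* capture once */

        /* every later lookup for this OID gets the same answer */
        return entry->pdesc;
    }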

One thing which was vaguely on my mind in earlier emails but which I
think I can now articulate somewhat more clearly is this: In some
cases, a consistent but outdated view of the catalog state is just as
bad as an inconsistent view of the catalog state.  For example, it's
not OK to decide that a tuple can be placed in a certain partition
based on an outdated list of relation constraints, including the
partitioning constraint - nor is it OK to decide that a partition can
be pruned based on an old view of the partitioning constraint.  I
think this means that whenever we change a partition's partitioning
constraint, we MUST take AccessExclusiveLock on the partition.
Otherwise, a heap_insert() [or a partition pruning decision] can be in
progress on that relation in one backend at the same time that some
other backend is changing the partition constraint, which can't
possibly work.  The best we can reasonably do is to reduce the locking
level on the partitioned table itself.

A corollary is that holding the PartitionDescs that a particular query
sees for a particular relation fixed, whether by the method Alvaro
proposes or by what I am proposing here or by some other method is not
a panacea.  For example, the ExecPartitionCheck call in copy.c
sometimes gets skipped on the theory that if tuple routing has sent us
to partition X, then the tuple being routed must satisfy the partition
constraint for that partition.  But that's not true if we set up tuple
routing using one view of the catalogs, and then things changed
afterwards.  RelationBuildPartitionDesc doesn't lock the children
whose relpartbounds it is fetching (!), so unless we're guaranteed to
have already locked those children earlier for some other reason, we
could grab the partition bound at this point and then it could change
again before we get a lock on them.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachments:
0002-Ensure-that-RelationBuildPartitionDesc-sees-a-consis.patch (8K)
0001-Reduce-unnecessary-list-construction-in-RelationBuil.patch (7K)

Re: ATTACH/DETACH PARTITION CONCURRENTLY

Michael Paquier-2
On Wed, Nov 14, 2018 at 02:27:31PM -0500, Robert Haas wrote:
> Here are a couple of patches to illustrate this approach to this part
> of the overall problem.  0001 is, I think, a good cleanup that may as
> well be applied in isolation; it makes the code in
> RelationBuildPartitionDesc both cleaner and more efficient.  0002
> adjusts things so that - I hope - the partition bounds we get for the
> individual partitions have to be as of the same point in the commit
> sequence as the list of children.  As I noted before, Alvaro's patch
> doesn't seem to have tackled this part of the problem.

You may want to rebase these patches as of b52b7dc2, and change the
first argument of partition_bounds_create() so that a list is used as
input...
--
Michael
