ALTER TABLE .. DETACH PARTITION CONCURRENTLY

ALTER TABLE .. DETACH PARTITION CONCURRENTLY

Alvaro Herrera-9
I've been working on the ability to detach a partition from a
partitioned table, without causing blockages to concurrent activity.
I think this operation is critical for some use cases.

There was a lot of great discussion which ended up in Robert completing
a much sought implementation of non-blocking ATTACH.  DETACH was
discussed too because it was a goal initially, but eventually dropped
from that patch altogether. Nonetheless, that thread provided a lot of
useful input to this implementation.  Important ones:

[1] https://postgr.es/m/CA+TgmoYg4x7AH=_QSptvuBKf+3hUdiCa4frPkt+RvXZyjX1n=w@...
[2] https://postgr.es/m/CA+TgmoaAjkTibkEr=xJg3ndbRsHHSiYi2SJgX69MVosj=LJmug@...
[3] https://postgr.es/m/CA+TgmoY13KQZF-=HNTrt9UYWYx3_oYOQpu9ioNT49jGgiDpUEA@...

Attached is a patch that implements
ALTER TABLE ... DETACH PARTITION .. CONCURRENTLY.

In the previous thread we were able to implement the concurrent model
without the extra keyword.  For this one I think that won't work; my
implementation works in two transactions so there's a restriction that
you can't run it in a transaction block.  Also, there's a wait phase
that makes it slower than the non-concurrent one.  Those two drawbacks
make me think that it's better to keep both modes available, just like
we offer both CREATE INDEX and CREATE INDEX CONCURRENTLY.

Why two transactions?  The reason is that in order for this to work, we
make a catalog change (mark it detached), and commit so that all
concurrent transactions can see the change.  A second transaction waits
for anybody who holds any lock on the partitioned table and grabs Access
Exclusive on the partition (which now no one cares about, if they're
looking at the partitioned table), where the DDL action on the partition
can be completed.
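
To make that concrete, here is a minimal sketch of the intended usage
(table names are illustrative; the behavior is as described above, not
verified against the patch):

  ALTER TABLE parent DETACH PARTITION part CONCURRENTLY;
  -- txn 1: mark the partition detached in the catalog and commit
  -- txn 2: wait for lockers of "parent", take Access Exclusive on
  --        "part", and complete the DDL

  -- and, being two transactions, it cannot run in a transaction block:
  BEGIN;
  ALTER TABLE parent DETACH PARTITION part CONCURRENTLY;  -- expected to fail
  ROLLBACK;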

ALTER TABLE is normally unable to run in two transactions.  I hacked it
(0001) so that the relation can be closed and reopened in the Exec phase
(by having the rel as part of AlteredTableInfo: when ATRewriteCatalogs
returns, it uses that pointer to close the rel).  It turns out that this
is sufficient to make that work.  This means that ALTER TABLE DETACH
CONCURRENTLY cannot work as part of a multi-command ALTER TABLE, but
that's already enforced by the grammar anyway.

DETACH CONCURRENTLY doesn't work if a default partition exists.  It's
just too problematic a case; you would still need to have AEL on the
default partition.
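
With a default partition present, only the blocking form is expected to
work (a sketch; the error wording here is mine, not the patch's):

  ALTER TABLE parent DETACH PARTITION part CONCURRENTLY;
  -- expected: ERROR, cannot detach concurrently while a default partition exists
  ALTER TABLE parent DETACH PARTITION part;  -- plain form still works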


I haven't yet experimented with queries running in a standby in tandem
with a detach.

--
Álvaro Herrera

v1-0001-Let-ALTER-TABLE-exec-routines-deal-with-the-relat.patch (3K)
v1-0002-ALTER-TABLE-.-DETACH-CONCURRENTLY.patch (75K)

Re: ALTER TABLE .. DETACH PARTITION CONCURRENTLY

Alvaro Herrera-9
On 2020-Aug-03, Alvaro Herrera wrote:

> There was a lot of great discussion which ended up in Robert completing
> a much sought implementation of non-blocking ATTACH.  DETACH was
> discussed too because it was a goal initially, but eventually dropped
> from that patch altogether. Nonetheless, that thread provided a lot of
> useful input to this implementation.  Important ones:
>
> [1] https://postgr.es/m/CA+TgmoYg4x7AH=_QSptvuBKf+3hUdiCa4frPkt+RvXZyjX1n=w@...
> [2] https://postgr.es/m/CA+TgmoaAjkTibkEr=xJg3ndbRsHHSiYi2SJgX69MVosj=LJmug@...
> [3] https://postgr.es/m/CA+TgmoY13KQZF-=HNTrt9UYWYx3_oYOQpu9ioNT49jGgiDpUEA@...

There was some discussion about having a version number in the partition
descriptor somewhere as a means to implement this.  I couldn't figure
out how that would work, or what the version number would be attached
to.  Surely the idea wasn't to increment the version number of every
partition other than the one being detached?

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: ALTER TABLE .. DETACH PARTITION CONCURRENTLY

Alvaro Herrera-9
In reply to this post by Alvaro Herrera-9
On 2020-Aug-03, Alvaro Herrera wrote:

> Why two transactions?  The reason is that in order for this to work, we
> make a catalog change (mark it detached), and commit so that all
> concurrent transactions can see the change.  A second transaction waits
> for anybody who holds any lock on the partitioned table and grabs Access
> Exclusive on the partition (which now no one cares about, if they're
> looking at the partitioned table), where the DDL action on the partition
> can be completed.

I forgot to mention.  If for whatever reason the second transaction
fails (say the user aborts it, or there is a crash), then the partition
is still marked as detached, so no queries would see it; but all the
ancillary catalog data remains.  Just as a failed CREATE INDEX
CONCURRENTLY leaves behind an invalid index that must be dropped, here
too something is left behind; but the cleanup required is much more
extensive, so manual action is out of the question.  So there's another
DDL command to be invoked,

ALTER TABLE parent DETACH PARTITION part FINALIZE;

which will complete the detach action.

If we had UNDO then perhaps it would be possible to register an action
so that the detach is completed automatically.  But for now this seems
sufficient.
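
A sketch of the recovery sequence, with illustrative names:

  ALTER TABLE parent DETACH PARTITION part CONCURRENTLY;
  -- ... the second transaction is cancelled, or the server crashes ...
  -- "part" is gone from the partitioned table's point of view, but its
  -- catalog cleanup is incomplete; finish it with:
  ALTER TABLE parent DETACH PARTITION part FINALIZE;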


Another aspect worth mentioning is constraints.  In the patch, I create
a CHECK constraint to stand for the partition constraint that's going to
logically disappear.  This was mentioned as a potential problem in one
of Robert's emails (I didn't actually verify that this is a problem).
However, a funny thing is that if a constraint already exists, you get a
dupe, so after a few rounds of attach/detach you can see them pile up.
I'll have to fix this at some point.  But also, I need to think about
whether foreign keys have similar problems, since they are also used by
the optimizer.

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: ALTER TABLE .. DETACH PARTITION CONCURRENTLY

Robert Haas
In reply to this post by Alvaro Herrera-9
On Mon, Aug 3, 2020 at 7:49 PM Alvaro Herrera <[hidden email]> wrote:
> Why two transactions?  The reason is that in order for this to work, we
> make a catalog change (mark it detached), and commit so that all
> concurrent transactions can see the change.  A second transaction waits
> for anybody who holds any lock on the partitioned table and grabs Access
> Exclusive on the partition (which now no one cares about, if they're
> looking at the partitioned table), where the DDL action on the partition
> can be completed.

Is there a more detailed theory of operation of this patch somewhere?
What exactly do you mean by marking it detached? Committing the change
makes it possible for all concurrent transactions to see the change,
but does not guarantee that they will. If you can't guarantee that,
then I'm not sure how you can guarantee that they will behave sanely.
Even if you can, it's not clear what the sane behavior is: what
happens when a tuple gets routed to an ex-partition? What happens when
an ex-partition needs to be scanned? What prevents problems if a
partition is detached, possibly modified, and then reattached,
possibly with different partition bounds?

I think the two-transaction approach is interesting and I can imagine
that it possibly solves some problems, but it's not clear to me
exactly which problems it solves or how it does so.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: ALTER TABLE .. DETACH PARTITION CONCURRENTLY

Alvaro Herrera-9
On 2020-Aug-04, Robert Haas wrote:

> On Mon, Aug 3, 2020 at 7:49 PM Alvaro Herrera <[hidden email]> wrote:
> > Why two transactions?  The reason is that in order for this to work, we
> > make a catalog change (mark it detached), and commit so that all
> > concurrent transactions can see the change.  A second transaction waits
> > for anybody who holds any lock on the partitioned table and grabs Access
> > Exclusive on the partition (which now no one cares about, if they're
> > looking at the partitioned table), where the DDL action on the partition
> > can be completed.
>
> Is there a more detailed theory of operation of this patch somewhere?
> What exactly do you mean by marking it detached? Committing the change
> makes it possible for all concurrent transactions to see the change,
> but does not guarantee that they will. If you can't guarantee that,
> then I'm not sure how you can guarantee that they will behave sanely.

Sorry for the long delay.  I haven't written up the theory of operation.
I suppose it is complicated enough that it should be documented
somewhere.

To mark it detached means to set pg_inherits.inhdetached=true.  That
column name is a bit of a misnomer, since that column really means "is
in the process of being detached"; the pg_inherits row goes away
entirely once the detach process completes.  This mark guarantees that
everyone will see that row because the detaching session waits long
enough after committing the first transaction and before deleting the
pg_inherits row, until everyone else has moved on.
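
For instance, the intermediate state should be observable in the catalog
along these lines (a sketch assuming the pg_inherits.inhdetached column
added by the patch):

  SELECT inhrelid::regclass AS partition, inhdetached
    FROM pg_inherits
   WHERE inhparent = 'parent'::regclass;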

The other point is that the partition directory code can be asked to
include detached partitions, or not to.  The executor does the former,
and the planner does the latter.  This allows a transient period during
which the partition descriptor returned to planner and executor is
different; this makes the situation equivalent to what would have
happened if the partition was attached during the operation: in executor
we would detect that there is an additional partition that was not seen
by the planner, and we already know how to deal with that situation by
your handling of the ATTACH code.

> Even if you can, it's not clear what the sane behavior is: what
> happens when a tuple gets routed to an ex-partition? What happens when
> an ex-partition needs to be scanned?

During the transient period, any transaction that obtained a partition
descriptor before the inhdetached mark is committed should be able to
get the tuple routing done successfully, but transactions using later
snapshots should get their insertions rejected.  Reads should behave in
the same way.
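
In other words, the expected outcome during that window is roughly this
(illustrative, not an actual test run):

  -- session A: snapshot taken before the inhdetached mark was committed
  INSERT INTO parent VALUES (130);  -- still routes to the partition
  -- session B: snapshot taken after that commit
  INSERT INTO parent VALUES (130);  -- ERROR:  no partition of relation "parent" found for row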

> What prevents problems if a partition is detached, possibly modified,
> and then reattached, possibly with different partition bounds?

This should not be a problem, because the partition needs to be fully
detached before it can be attached again.  And if the partition bounds
are different, that won't be a problem, because the previous partition
bounds won't be present in the pg_class row.  Of course, the new
partition bounds will be checked against the existing contents.
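
That validation is the same one a regular ATTACH PARTITION performs
today; for example, with hypothetical tables:

  ALTER TABLE parent ATTACH PARTITION part FOR VALUES FROM (0) TO (50);
  -- fails if "part" meanwhile acquired rows outside [0, 50)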

There is one fly in the ointment though, which is that if you cancel the
wait and then immediately do the ALTER TABLE DETACH FINALIZE without
waiting for as long as the original execution would have waited, you
might end up killing the partition ahead of time.  One solution to this
would be to cause the FINALIZE action to wait again at start.  This
would cause it to take even longer, but it would be safer.  (If you
don't want it to take longer, you can just not cancel it in the first
place.)  This is not a problem if the server crashes in between (which
is the scenario I had in mind when doing the FINALIZE thing), because of
course no transaction can continue to run across a crash.


I'm going to see if I can get the new delay_execution module to help
test this, to verify whether my claims are true.

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: ALTER TABLE .. DETACH PARTITION CONCURRENTLY

Robert Haas
On Wed, Aug 26, 2020 at 7:40 PM Alvaro Herrera <[hidden email]> wrote:
> To mark it detached means to set pg_inherits.inhdetached=true.  That
> column name is a bit of a misnomer, since that column really means "is
> in the process of being detached"; the pg_inherits row goes away
> entirely once the detach process completes.  This mark guarantees that
> everyone will see that row because the detaching session waits long
> enough after committing the first transaction and before deleting the
> pg_inherits row, until everyone else has moved on.

OK. Do you just wait until the XID of the transaction that set
inhdetached is all-visible, or how do you do it?

> The other point is that the partition directory code can be asked to
> include detached partitions, or not to.  The executor does the former,
> and the planner does the latter.  This allows a transient period during
> which the partition descriptor returned to planner and executor is
> different; this makes the situation equivalent to what would have
> happened if the partition was attached during the operation: in executor
> we would detect that there is an additional partition that was not seen
> by the planner, and we already know how to deal with that situation by
> your handling of the ATTACH code.

Ah ha! That is quite clever and I don't think that I would have
thought of it. So all the plans that were created before you set
inhdetached=true have to be guaranteed to be invalidated or gone
altogether before you can actually delete the pg_inherits row. It
seems like it ought to be possible to ensure that, though I am not
sure of the details exactly. Is it sufficient to wait for all
transactions that have locked the table to go away? I'm not sure
exactly how this stuff interacts with the plan cache.

> There is one fly in the ointment though, which is that if you cancel the
> wait and then immediately do the ALTER TABLE DETACH FINALIZE without
> waiting for as long as the original execution would have waited, you
> might end up killing the partition ahead of time.  One solution to this
> would be to cause the FINALIZE action to wait again at start.  This
> would cause it to take even longer, but it would be safer.  (If you
> don't want it to take longer, you can just not cancel it in the first
> place.)  This is not a problem if the server crashes in between (which
> is the scenario I had in mind when doing the FINALIZE thing), because of
> course no transaction can continue to run across a crash.

Yeah, it sounds like this will require some solution, but I agree that
just waiting "longer" seems acceptable.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: ALTER TABLE .. DETACH PARTITION CONCURRENTLY

Hao Wu-2
Hi hacker,

I tested the patch provided by Alvaro.  It does not seem to work well in REPEATABLE READ.

gpadmin=*# \d+ tpart
                             Partitioned table "public.tpart"
 Column |  Type   | Collation | Nullable | Default | Storage | Stats target | Description
--------+---------+-----------+----------+---------+---------+--------------+-------------
 i      | integer |           |          |         | plain   |              |
 j      | integer |           |          |         | plain   |              |
Partition key: RANGE (i)
Partitions: tpart_1 FOR VALUES FROM (0) TO (100),
            tpart_2 FOR VALUES FROM (100) TO (200)

begin isolation level repeatable read;
BEGIN
gpadmin=*# select * from tpart;
  i  |  j
-----+-----
  10 |  10
  50 |  50
 110 | 110
 120 | 120
 150 | 150
(5 rows)
-- Here, run `ALTER TABLE tpart DETACH PARTITION tpart_2 CONCURRENTLY`
-- but only complete the first transaction.

-- the tuples from tpart_2 are gone.
gpadmin=*# select * from tpart;
 i  | j
----+----
 10 | 10
 50 | 50
(2 rows)

gpadmin=*# \d+ tpart_2
                                  Table "public.tpart_2"
 Column |  Type   | Collation | Nullable | Default | Storage | Stats target | Description
--------+---------+-----------+----------+---------+---------+--------------+-------------
 i      | integer |           |          |         | plain   |              |
 j      | integer |           |          |         | plain   |              |
Partition of: tpart FOR VALUES FROM (100) TO (200)
Partition constraint: ((i IS NOT NULL) AND (i >= 100) AND (i < 200))
Access method: heap

-- the part tpart_2 is not shown as DETACHED
gpadmin=*# \d+ tpart
                             Partitioned table "public.tpart"
 Column |  Type   | Collation | Nullable | Default | Storage | Stats target | Description
--------+---------+-----------+----------+---------+---------+--------------+-------------
 i      | integer |           |          |         | plain   |              |
 j      | integer |           |          |         | plain   |              |
Partition key: RANGE (i)
Partitions: tpart_1 FOR VALUES FROM (0) TO (100),
            tpart_2 FOR VALUES FROM (100) TO (200)

-- next, the insert fails; that's OK.
gpadmin=*# insert into tpart values(130,130);
ERROR:  no partition of relation "tpart" found for row
DETAIL:  Partition key of the failing row contains (i) = (130).


Is this the expected behavior?

Regards,
Hao Wu




Re: ALTER TABLE .. DETACH PARTITION CONCURRENTLY

Alvaro Herrera-9
In reply to this post by Robert Haas
On 2020-Aug-27, Robert Haas wrote:

> On Wed, Aug 26, 2020 at 7:40 PM Alvaro Herrera <[hidden email]> wrote:
> > To mark it detached means to set pg_inherits.inhdetached=true.  That
> > column name is a bit of a misnomer, since that column really means "is
> > in the process of being detached"; the pg_inherits row goes away
> > entirely once the detach process completes.  This mark guarantees that
> > everyone will see that row because the detaching session waits long
> > enough after committing the first transaction and before deleting the
> > pg_inherits row, until everyone else has moved on.
>
> OK. Do you just wait until the XID of the transaction that set
> inhdetached is all-visible, or how do you do it?

I'm just doing WaitForLockers( ... AccessExclusiveLock ...)  on the
partitioned table at the start of the second transaction.  That will
wait until all lockers that have obtained a partition descriptor with
the old definition are gone.  Note we don't actually lock the
partitioned table with that lock level.

In the second transaction we additionally obtain AccessExclusiveLock on
the partition itself, but that's after nobody sees it as a partition
anymore.  That lock level is needed for some of the internal DDL
changes, and should not cause problems.

I thought about using WaitForOlderSnapshots() instead of waiting for
lockers, but it didn't seem to solve any actual problem.

Note that on closing the first transaction, the locks on both tables are
released.  This avoids the deadlock hazards because of the lock upgrades
that would otherwise occur.  This means that the tables could be dropped
or changed in the meantime.  The case where either relation is dropped
is handled by using try_relation_open() in the second transaction; if
either table is gone, then we can just mark the operation as completed.
This part is a bit fuzzy.  One thing that should probably be done is
have a few operations (such as other ALTER TABLE) raise an error when
run on a table with inhdetached=true, because that might get things out
of step and potentially cause other problems.  I've not done that yet.  

> So all the plans that were created before you set
> inhdetached=true have to be guaranteed to be invalidated or gone
> altogether before you can actually delete the pg_inherits row. It
> seems like it ought to be possible to ensure that, though I am not
> sure of the details exactly. Is it sufficient to wait for all
> transactions that have locked the table to go away? I'm not sure
> exactly how this stuff interacts with the plan cache.

Hmm, any cached plan should be released with relcache inval events, per
PlanCacheRelCallback().  There are some comments in plancache.h about
"unsaved" cached plans that I don't really understand :-(

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: ALTER TABLE .. DETACH PARTITION CONCURRENTLY

Hao Wu-2
In reply to this post by Hao Wu-2
Not related to DETACH PARTITION, but I found a bug in ATTACH PARTITION.
Here are the steps to reproduce the issue:

create table tpart(i int, j int) partition by range(i);
create table tpart_1(like tpart);
create table tpart_2(like tpart);
create table tpart_default(like tpart);
alter table tpart attach partition tpart_1 for values from(0) to (100);
alter table tpart attach partition tpart_default default;
insert into tpart_2 values(110,110),(120,120),(150,150);

1: begin;
1: alter table tpart attach partition tpart_2 for values from(100) to (200);
2: begin;
-- insert will be blocked by ALTER TABLE ATTACH PARTITION
2: insert into tpart values (110,110),(120,120),(150,150);
1: end;
2: select * from tpart_default;
2: end;

After that the partition tpart_default contains (110,110),(120,120),(150,150)
inserted in session 2, which obviously violates the partition constraint.

Regards,
Hao Wu

Re: ALTER TABLE .. DETACH PARTITION CONCURRENTLY

Amit Langote
Hi Hao,

On Wed, Sep 2, 2020 at 5:25 PM Hao Wu <[hidden email]> wrote:

>
> Not related to DETACH PARTITION, but I found a bug in ATTACH PARTITION.
> Here are the steps to reproduce the issue:
>
> create table tpart(i int, j int) partition by range(i);
> create table tpart_1(like tpart);
> create table tpart_2(like tpart);
> create table tpart_default(like tpart);
> alter table tpart attach partition tpart_1 for values from(0) to (100);
> alter table tpart attach partition tpart_default default;
> insert into tpart_2 values(110,110),(120,120),(150,150);
>
> 1: begin;
> 1: alter table tpart attach partition tpart_2 for values from(100) to (200);
> 2: begin;
> -- insert will be blocked by ALTER TABLE ATTACH PARTITION
> 2: insert into tpart values (110,110),(120,120),(150,150);
> 1: end;
> 2: select * from tpart_default;
> 2: end;
>
>
> After that the partition tpart_default contains (110,110),(120,120),(150,150)
> inserted in session 2, which obviously violates the partition constraint.

Thanks for the report.  That looks like a bug.

I have started another thread to discuss this bug and a patch to fix
it to keep the discussion here focused on the new feature.

Subject: default partition and concurrent attach partition
https://www.postgresql.org/message-id/CA%2BHiwqFqBmcSSap4sFnCBUEL_VfOMmEKaQ3gwUhyfa4c7J_-nA%40mail.gmail.com

--
Amit Langote
EnterpriseDB: http://www.enterprisedb.com



Re: ALTER TABLE .. DETACH PARTITION CONCURRENTLY

Alvaro Herrera-9
In reply to this post by Hao Wu-2
On 2020-Aug-31, Hao Wu wrote:

> I tested the patch provided by Alvaro.  It does not seem to work well in REPEATABLE READ.

> -- the tuples from tpart_2 are gone.
> gpadmin=*# select * from tpart;
>  i  | j
> ----+----
>  10 | 10
>  50 | 50
> (2 rows)

Interesting example, thanks.  It seems this can be fixed without
breaking anything else by changing the planner so that it includes
detached partitions when we are in a snapshot-isolation transaction.
Indeed, the results from the detach-partition-concurrently-1.spec
isolation test are more satisfying with this change.
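
With that change, a repeatable-read transaction that took its snapshot
before the detach should keep seeing the partition's rows; roughly
(expected behavior, not an actual run):

  BEGIN ISOLATION LEVEL REPEATABLE READ;
  SELECT * FROM tpart;  -- includes tpart_2's rows
  -- DETACH PARTITION CONCURRENTLY commits its first transaction elsewhere
  SELECT * FROM tpart;  -- with v2, still includes tpart_2's rows
  COMMIT;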

The attached v2, on current master, includes that change, as well as
fixes a couple of silly bugs in the previous one.

(Because of experimenting with git workflow I did not keep the 0001
part split in this one, but that part is unchanged from v1.)

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

v2-0001-ALTER-TABLE-.-DETACH-CONCURRENTLY.patch (80K)

Re: ALTER TABLE .. DETACH PARTITION CONCURRENTLY

Robert Haas
On Thu, Sep 10, 2020 at 4:54 PM Alvaro Herrera <[hidden email]> wrote:
> Interesting example, thanks.  It seems this can be fixed without
> breaking anything else by changing the planner so that it includes
> detached partitions when we are in a snapshot-isolation transaction.
> Indeed, the results from the detach-partition-concurrently-1.spec
> isolation test are more satisfying with this change.

Hmm, so I think the idea here is that since we're out-waiting plans
with the old partition descriptor by waiting for lock release, it's OK
for anyone who has a lock to keep using the old partition descriptor
as long as they continuously hold the lock. Is that right? I can't
think of a hole in that logic, but it's probably worth noting in the
comments, in case someone is tempted to change the way that we
out-wait plans with the old partition descriptor to some other
mechanism.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: ALTER TABLE .. DETACH PARTITION CONCURRENTLY

Amit Langote
In reply to this post by Alvaro Herrera-9
Hi Alvaro,

Studying the patch to understand how it works.

On Tue, Aug 4, 2020 at 8:49 AM Alvaro Herrera <[hidden email]> wrote:
> Why two transactions?  The reason is that in order for this to work, we
> make a catalog change (mark it detached), and commit so that all
> concurrent transactions can see the change.  A second transaction waits
> for anybody who holds any lock on the partitioned table and grabs Access
> Exclusive on the partition (which now no one cares about, if they're
> looking at the partitioned table), where the DDL action on the partition
> can be completed.

+   /*
+    * Concurrent mode has to work harder; first we add a new constraint to the
+    * partition that matches the partition constraint.  The reason for this is
+    * that the planner may have made optimizations that depend on the
+    * constraint.  XXX Isn't it sufficient to invalidate the partition's
+    * relcache entry?
...
+       /* Add constraint on partition, equivalent to the partition constraint */
+       n = makeNode(Constraint);
+       n->contype = CONSTR_CHECK;
+       n->conname = NULL;
+       n->location = -1;
+       n->is_no_inherit = false;
+       n->raw_expr = NULL;
+       n->cooked_expr = nodeToString(make_ands_explicit(RelationGetPartitionQual(partRel)));
+       n->initially_valid = true;
+       n->skip_validation = true;
+       /* It's a re-add, since it nominally already exists */
+       ATAddCheckConstraint(wqueue, tab, partRel, n,
+                            true, false, true, ShareUpdateExclusiveLock);

I suspect that we don't really need this defensive constraint.  I mean
even after committing the 1st transaction, the partition being
detached still has relispartition set to true and
RelationGetPartitionQual() still returns the partition constraint, so
it seems the constraint being added above is duplicative.  Moreover,
the constraint is not removed as part of any cleaning up after the
DETACH process, so repeated attach/detach of the same partition
results in the constraints piling up:

create table foo (a int, b int) partition by range (a);
create table foo1 partition of foo for values from (1) to (2);
create table foo2 partition of foo for values from (2) to (3);
alter table foo detach partition foo2 concurrently;
alter table foo attach partition foo2 for values from (2) to (3);
alter table foo detach partition foo2 concurrently;
alter table foo attach partition foo2 for values from (2) to (3);
\d foo2
                Table "public.foo2"
 Column |  Type   | Collation | Nullable | Default
--------+---------+-----------+----------+---------
 a      | integer |           |          |
 b      | integer |           |          |
Partition of: foo FOR VALUES FROM (2) TO (3)
Check constraints:
    "foo2_a_check" CHECK (a IS NOT NULL AND a >= 2 AND a < 3)
    "foo2_a_check1" CHECK (a IS NOT NULL AND a >= 2 AND a < 3)

Noticed a bug/typo in the patched RelationBuildPartitionDesc():

-   inhoids = find_inheritance_children(RelationGetRelid(rel), NoLock);
+   inhoids = find_inheritance_children(RelationGetRelid(rel), NoLock,
+                                       include_detached);

You're passing NoLock for include_detached which means you never
actually end up including detached partitions from here.  I think it
is due to this bug that the partition-concurrent-attach test fails in my
run.  Also, the error seen in the following hunk of
detach-partition-concurrently-1 test is not intentional:

+starting permutation: s1brr s1prep s1s s2d s1s s1exec2 s1c
+step s1brr: BEGIN ISOLATION LEVEL REPEATABLE READ;
+step s1prep: PREPARE f(int) AS INSERT INTO d_listp VALUES ($1);
+step s1s: SELECT * FROM d_listp;
+a
+
+1
+2
+step s2d: ALTER TABLE d_listp DETACH PARTITION d_listp2 CONCURRENTLY;
<waiting ...>
+step s1s: SELECT * FROM d_listp;
+a
+
+1
+step s1exec2: EXECUTE f(2); DEALLOCATE f;
+step s2d: <... completed>
+error in steps s1exec2 s2d: ERROR:  no partition of relation "d_listp" found for row
+step s1c: COMMIT;

As you're intending to make the executor always *include* detached
partitions, the insert should be able to route (2) to d_listp2, the
detached partition.  It's the bug mentioned above that causes the
executor's partition descriptor build to falsely fail to include the
detached partition.

--
Amit Langote
EnterpriseDB: http://www.enterprisedb.com



Re: ALTER TABLE .. DETACH PARTITION CONCURRENTLY

Alvaro Herrera-9
On 2020-Sep-23, Amit Langote wrote:

Hi Amit, thanks for reviewing this patch!

> On Tue, Aug 4, 2020 at 8:49 AM Alvaro Herrera <[hidden email]> wrote:

> I suspect that we don't really need this defensive constraint.  I mean
> even after committing the 1st transaction, the partition being
> detached still has relispartition set to true and
> RelationGetPartitionQual() still returns the partition constraint, so
> it seems the constraint being added above is duplicative.

Ah, thanks for thinking through that.  I had vague thoughts about this
being unnecessary in the current mechanics, but hadn't fully
materialized the thought.  (The patch was completely different a few
unposted iterations ago).

> Moreover, the constraint is not removed as part of any cleaning up
> after the DETACH process, so repeated attach/detach of the same
> partition results in the constraints piling up:

Yeah, I knew about this issue (mentioned in my self-reply on Aug 4) and
didn't worry too much about it because I was thinking I'd rather get rid
of the constraint addition in the first place.

> Noticed a bug/typo in the patched RelationBuildPartitionDesc():
>
> -   inhoids = find_inheritance_children(RelationGetRelid(rel), NoLock);
> +   inhoids = find_inheritance_children(RelationGetRelid(rel), NoLock,
> +                                       include_detached);
>
> You're passing NoLock for include_detached which means you never
> actually end up including detached partitions from here.

I fixed this in the version I posted on Sept 10.  I think you were
reading the version posted at the start of this thread.

Thanks,

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: ALTER TABLE .. DETACH PARTITION CONCURRENTLY

Amit Langote
Hi Alvaro,

Sorry I totally failed to see the v2 you had posted and a couple of
other emails where you mentioned the issues I brought up.

On Thu, Sep 24, 2020 at 12:23 AM Alvaro Herrera
<[hidden email]> wrote:
> On 2020-Sep-23, Amit Langote wrote:
> > I suspect that we don't really need this defensive constraint.  I mean
> > even after committing the 1st transaction, the partition being
> > detached still has relispartition set to true and
> > RelationGetPartitionQual() still returns the partition constraint, so
> > it seems the constraint being added above is duplicative.
>
> Ah, thanks for thinking through that.  I had vague thoughts about this
> being unnecessary in the current mechanics, but hadn't fully
> materialized the thought.  (The patch was completely different a few
> unposted iterations ago).
>
> > Moreover, the constraint is not removed as part of any cleaning up
> > after the DETACH process, so repeated attach/detach of the same
> > partition results in the constraints piling up:
>
> Yeah, I knew about this issue (mentioned in my self-reply on Aug 4) and
> didn't worry too much about it because I was thinking I'd rather get rid
> of the constraint addition in the first place.

Okay, gotcha.

> > Noticed a bug/typo in the patched RelationBuildPartitionDesc():
> >
> > -   inhoids = find_inheritance_children(RelationGetRelid(rel), NoLock);
> > +   inhoids = find_inheritance_children(RelationGetRelid(rel), NoLock,
> > +                                       include_detached);
> >
> > You're passing NoLock for include_detached which means you never
> > actually end up including detached partitions from here.
>
> I fixed this in the version I posted on Sept 10.  I think you were
> reading the version posted at the start of this thread.

I am trying the v2 now and I can confirm that those problems are now fixed.

However, I am a bit curious about including detached partitions in
some cases while not in others, which can result in a (to me)
surprising behavior as follows:

Session 1:

create table foo (a int, b int) partition by range (a);
create table foo1 partition of foo for values from (1) to (2);
create table foo2 partition of foo for values from (2) to (3);

...attach GDB and set breakpoint so as to block right after finishing
the 1st transaction of DETACH PARTITION CONCURRENTLY...
alter table foo detach partition foo2 concurrently;
<hits breakpoint, wait...>

Session 2:

begin;
insert into foo values (2);  -- ok
select * from foo;  -- ?!
 a | b
---+---
(0 rows)

Maybe, it's fine to just always exclude detached partitions, although
perhaps I am missing some corner cases that you have thought of?

Also, I noticed that looking up a parent's partitions via
RelationBuildPartitionDesc or directly will inspect inhdetached to
include or exclude partitions, but checking if a child table is a
partition of a given parent table via get_partition_parent doesn't.
Now if you fix get_partition_parent() to also take into account
inhdetached, for example, to return InvalidOid if true, then the
callers would need to not consider the child table a valid partition.
So, RelationGetPartitionQual() on a detached partition should actually
return NIL, making my earlier claim about not needing the defensive
CHECK constraint invalid.  But maybe that's fine if all places agree
on a detached partition not being a valid partition anymore?

--
Amit Langote
EnterpriseDB: http://www.enterprisedb.com



Re: ALTER TABLE .. DETACH PARTITION CONCURRENTLY

Michael Paquier-2
On Thu, Sep 24, 2020 at 12:51:52PM +0900, Amit Langote wrote:

> Also, I noticed that looking up a parent's partitions via
> RelationBuildPartitionDesc or directly will inspect inhdetached to
> include or exclude partitions, but checking if a child table is a
> partition of a given parent table via get_partition_parent doesn't.
> Now if you fix get_partition_parent() to also take into account
> inhdetached, for example, to return InvalidOid if true, then the
> callers would need to not consider the child table a valid partition.
> So, RelationGetPartitionQual() on a detached partition should actually
> return NIL, making my earlier claim about not needing the defensive
> CHECK constraint invalid.  But maybe that's fine if all places agree
> on a detached partition not being a valid partition anymore?
It would be good to get that answered, and while on it please note
that the patch needs a rebase.
--
Michael


Re: ALTER TABLE .. DETACH PARTITION CONCURRENTLY

Andy Fan
In reply to this post by Alvaro Herrera-9
Hi Alvaro:

On Tue, Aug 4, 2020 at 7:49 AM Alvaro Herrera <[hidden email]> wrote:
I've been working on the ability to detach a partition from a
partitioned table, without causing blockages to concurrent activity.
I think this operation is critical for some use cases.

I wonder if it is possible to implement the detach with a NoWait option:

ALTER TABLE ... DETACH PARTITION ..  [NoWait];

If it can't get the lock, it raises "resource is busy" immediately, without
blocking others; this should be the default behavior.  If people do want to
keep trying, they can set a ddl_lock_timeout to some interval; in that case
it will still block others (so it can't be as good as what you are doing,
but it is very simple), however the user would know exactly what would
happen and could coordinate with their application accordingly.  I'm sorry
if this is a bit off-topic or has been discussed already.

--
Best Regards
Andy Fan

Re: ALTER TABLE .. DETACH PARTITION CONCURRENTLY

Álvaro Herrera
On 2020-Oct-15, Andy Fan wrote:

> I wonder if it is possible to implement the detach with a NoWait option:
>
> ALTER TABLE ... DETACH PARTITION ..  [NoWait];
>
> If it can't get the lock, it raises "resource is busy" immediately,
> without blocking others; this should be the default behavior.  If people
> do want to keep trying, they can set a ddl_lock_timeout to some interval;
> in that case it will still block others (so it can't be as good as what
> you are doing, but it is very simple), however the user would know
> exactly what would happen and could coordinate with their application
> accordingly.  I'm sorry if this is a bit off-topic or has been discussed
> already.

Hi.  I don't think this has been discussed, but it doesn't really solve
the use case I want to -- in many cases where the tables are
continuously busy, this would lead to starvation.  I think the proposal
to make the algorithm work with reduced lock level is much more useful.

I think you can already do NOWAIT behavior, with LOCK TABLE .. NOWAIT
followed by DETACH PARTITION, perhaps with a nonzero statement timeout.
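
Something like this, where the names and the timeout are illustrative:

  SET statement_timeout = '1s';
  BEGIN;
  LOCK TABLE parent IN ACCESS EXCLUSIVE MODE NOWAIT;  -- errors at once if busy
  ALTER TABLE parent DETACH PARTITION part;
  COMMIT;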



Re: ALTER TABLE .. DETACH PARTITION CONCURRENTLY

David Rowley
In reply to this post by Andy Fan
On Thu, 15 Oct 2020 at 14:04, Andy Fan <[hidden email]> wrote:

>
> I wonder if it is possible to implement the detach with a NoWait option:
>
> ALTER TABLE ... DETACH PARTITION ..  [NoWait];
>
> If it can't get the lock, it raises "resource is busy" immediately,
> without blocking others; this should be the default behavior.  If people
> do want to keep trying, they can set a ddl_lock_timeout to some interval;
> in that case it will still block others (so it can't be as good as what
> you are doing, but it is very simple), however the user would know
> exactly what would happen and could coordinate with their application
> accordingly.  I'm sorry if this is a bit off-topic or has been discussed
> already.

How would that differ from setting a low lock_timeout and running the DDL?

I think what Alvaro wants to avoid is taking the AEL in the first
place. When you have multiple long overlapping queries against the
partitioned table, there may be no point in time where there are zero
locks on the table. It does not sound like your idea would help with
that.
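
That is, roughly, with an illustrative value:

  SET lock_timeout = '100ms';
  ALTER TABLE parent DETACH PARTITION part;  -- gives up with an error if the
                                             -- lock isn't acquired in time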

David



Re: ALTER TABLE .. DETACH PARTITION CONCURRENTLY

Andy Fan
Hi David/Alvaro:

On Thu, Oct 15, 2020 at 9:09 AM David Rowley <[hidden email]> wrote:
On Thu, 15 Oct 2020 at 14:04, Andy Fan <[hidden email]> wrote:
>
> I wonder if it is possible to implement the detach with a NoWait option:
>
> ALTER TABLE ... DETACH PARTITION ..  [NoWait];
>
> If it can't get the lock, it raises "resource is busy" immediately,
> without blocking others; this should be the default behavior.  If people
> do want to keep trying, they can set a ddl_lock_timeout to some interval;
> in that case it will still block others (so it can't be as good as what
> you are doing, but it is very simple), however the user would know
> exactly what would happen and could coordinate with their application
> accordingly.  I'm sorry if this is a bit off-topic or has been discussed
> already.

How would that differ from setting a low lock_timeout and running the DDL?
 
They are exactly the same (I didn't know about this parameter when I sent the email).
 
I think what Alvaro wants to avoid is taking the AEL in the first
place.

I agree with this; that's why I said "so it can't be as good as what you are doing".
 
When you have multiple long overlapping queries against the
partitioned table, there may be no point in time where there are zero
locks on the table. It does not sound like your idea would help with that.
 

Based on my current understanding, "detach" will hold an exclusive lock
with a higher priority than other waiters, so it has to wait for any lock
holder that came before it (call it session 1), and at the same time it
blocks all the other waiters that request a lock, even when their lock
mode is compatible with session 1's.  So "detach" can probably get its
lock in a short time (unless some long transaction precedes it).  I'm not
sure if I have misunderstood something here.

Overall I'd be +1 for this patch. 

-- 
Best Regards
Andy Fan