UPDATE of partition key

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
197 messages Options
1234 ... 10
Reply | Threaded
Open this post in threaded view
|

UPDATE of partition key

Amit Khandekar-2
Currently, an update of a partition key of a partition is not allowed,
since it requires to move the row(s) into the applicable partition.

Attached is a WIP patch (update-partition-key.patch) that removes this
restriction. When an UPDATE causes the row of a partition to violate
its partition constraint, then a partition is searched in that subtree
that can accommodate this row, and if found, the row is deleted from
the old partition and inserted in the new partition. If not found, an
error is reported.

There are a few things that can be discussed about :

1. We can run an UPDATE using a child partition at any level in a
nested partition tree. In such case, we should move the row only
within that child subtree.

For e.g. , in a tree such as :
tab ->
   t1 ->
      t1_1
      t1_2
   t2 ->
      t2_1
      t2_2

For "UPDATE t2 set col1 = 'AAA' " , if the modified tuple does not fit
in t2_1 but can fit in t1_1, it should not be moved to t1_1, because
the UPDATE is fired using t2.

2. In the patch, as part of the row movement, ExecDelete() is called
followed by ExecInsert(). This is done that way, because we want to
have the ROW triggers on that (sub)partition executed. If a user has
explicitly created DELETE and INSERT BR triggers for this partition, I
think we should run those. While at the same time, another question
is, what about UPDATE trigger on the same table ? Here again, one can
argue that because this UPDATE has been transformed into a
DELETE-INSERT, we should not run UPDATE trigger for row-movement. But
there can be a counter-argument. For e.g. if a user needs to make sure
about logging updates of particular columns of a row, he will expect
the logging to happen even when that row was transparently moved. In
the patch, I have retained the firing of UPDATE BR trigger.

3. In case of a concurrent update/delete, suppose session A has locked
the row for deleting it. Now a session B has decided to update this
row and that is going to cause row movement, which means it will
delete it first. But when session A is finished deleting it, session B
finds that it is already deleted. In such case, it should not go ahead
with inserting a new row as part of the row movement. For that, I have
added a new parameter 'already_delete' for ExecDelete().

Of course, this still won't completely solve the concurrency anomaly.
In the above case, the UPDATE of Session B gets lost. May be, for a
user that does not tolerate this, we can have a table-level option
that disallows row movement, or will cause an error to be thrown for
one of the concurrent session.

4. The ExecSetupPartitionTupleRouting() is re-used for routing the row
that is to be moved. So in ExecInitModifyTable(), we call
ExecSetupPartitionTupleRouting() even for UPDATE. We can also do this
only during execution time for the very first time we find that we
need to do a row movement. I will think over that, but I am thinking
it might complicate things, as compared to always doing the setup for
UPDATE. WIll check on that.


5. Regarding performance testing, I have compared the results of
row-movement with partition versus row-movement with inheritance tree
using triggers.  Below are the details :

Schema :

CREATE TABLE ptab (a date, b int, c int);

CREATE TABLE ptab (a date, b int, c int) PARTITION BY RANGE (a, b);

CREATE TABLE ptab_1_1 PARTITION OF ptab
for values from ('1900-01-01', 1) to ('1900-01-01', 101)
PARTITION BY range (c);

        CREATE TABLE ptab_1_1_1 PARTITION OF ptab_1_1
        for values from (1) to (51);
        CREATE TABLE ptab_1_1_2 PARTITION OF ptab_1_1
        for values from (51) to (101);
.....
.....
        CREATE TABLE ptab_1_1_n PARTITION OF ptab_1_1
        for values from (n) to (n+m);

......
......

CREATE TABLE ptab_5_n PARTITION OF ptab
for values from ('1905-01-01', 101) to ('1905-01-01', 201)
PARTITION BY range (c);

        CREATE TABLE ptab_1_2_1 PARTITION OF ptab_1_2
        for values from (1) to (51);
        CREATE TABLE ptab_1_2_2 PARTITION OF ptab_1_2
        for values from (51) to (101);
.....
.....
        CREATE TABLE ptab_1_2_n PARTITION OF ptab_1_2
        for values from (n) to (n+m);
.....
.....

Similarly for inheritance :

CREATE TABLE ptab_1_1
(constraint check_ptab_1_1 check (a = '1900-01-01' and b >= 1 and b <
8)) inherits (ptab);
create trigger brutrig_ptab_1_1 before update on ptab_1_1 for each row
execute procedure ptab_upd_trig();
CREATE TABLE ptab_1_1_1
(constraint check_ptab_1_1_1 check (c >= 1 and c < 51))
inherits (ptab_1_1);
create trigger brutrig_ptab_1_1_1 before update on ptab_1_1_1 for each
row execute procedure ptab_upd_trig();
CREATE TABLE ptab_1_1_2
(constraint check_ptab_1_1_2 check (c >= 51 and c < 101))
inherits (ptab_1_1);

create trigger brutrig_ptab_1_1_2 before update on ptab_1_1_2 for each
row execute procedure ptab_upd_trig();

I had to have a BR UPDATE trigger on each of the leaf tables.

Attached is the BR trigger function update_trigger.sql. There it
generates the table name assuming a fixed pattern of distribution of
data over the partitions. It first deletes the row and then inserts a
new one. I also skipped the deletion part, and it did not show any
significant change in results.


parts    partitioned   inheritance   no. of rows   subpartitions
=====    ===========   ===========   ===========   =============

500       10 sec       3 min 02 sec   1,000,000     0
1000      10 sec       3 min 05 sec   1,000,000     0
1000     1 min 38sec   30min 50 sec  10,000,000     0
4000      28 sec       5 min 41 sec   1,000,000     10

part : total number of partitions including subparitions if any.
partitioned : Partitions created using declarative syntax.
inheritence : Partitions created using inheritence , check constraints
and insert,update triggers.
subpartitions : Number of subpartitions for each partition (in a 2-level tree)

Overall the UPDATE in partitions is faster by 10-20 times compared
with inheritance with triggers.

The UPDATE query moved all of the rows into another partition. It was
something like this :
update ptab set a = '1949-01-1' where a <= '1924-01-01'

For a plain table with 1,000,000 rows, the UPDATE took 8 seconds, and
with 10,000,000 rows, it took 1min 32sec.

In general, for both partitioned and inheritence tables, the time
taken linearly rose with the number of rows.


--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

update_trigger.sql (1K) Download Attachment
update-partition-key.patch (7K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: UPDATE of partition key

Robert Haas
On Mon, Feb 13, 2017 at 7:01 AM, Amit Khandekar <[hidden email]> wrote:
> parts    partitioned   inheritance   no. of rows   subpartitions
> =====    ===========   ===========   ===========   =============
>
> 500       10 sec       3 min 02 sec   1,000,000     0
> 1000      10 sec       3 min 05 sec   1,000,000     0
> 1000     1 min 38sec   30min 50 sec  10,000,000     0
> 4000      28 sec       5 min 41 sec   1,000,000     10

That's a big speedup.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|

Re: UPDATE of partition key

David Fetter
In reply to this post by Amit Khandekar-2
On Mon, Feb 13, 2017 at 05:31:56PM +0530, Amit Khandekar wrote:

> Currently, an update of a partition key of a partition is not
> allowed, since it requires to move the row(s) into the applicable
> partition.
>
> Attached is a WIP patch (update-partition-key.patch) that removes
> this restriction. When an UPDATE causes the row of a partition to
> violate its partition constraint, then a partition is searched in
> that subtree that can accommodate this row, and if found, the row is
> deleted from the old partition and inserted in the new partition. If
> not found, an error is reported.

This is great!

Would it be really invasive to HINT something when the subtree is a
proper subtree?

Best,
David.
--
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter      XMPP: david(dot)fetter(at)gmail(dot)com

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|

Re: UPDATE of partition key

Amit Khandekar-2
On 14 February 2017 at 22:24, David Fetter <[hidden email]> wrote:

> On Mon, Feb 13, 2017 at 05:31:56PM +0530, Amit Khandekar wrote:
>> Currently, an update of a partition key of a partition is not
>> allowed, since it requires to move the row(s) into the applicable
>> partition.
>>
>> Attached is a WIP patch (update-partition-key.patch) that removes
>> this restriction. When an UPDATE causes the row of a partition to
>> violate its partition constraint, then a partition is searched in
>> that subtree that can accommodate this row, and if found, the row is
>> deleted from the old partition and inserted in the new partition. If
>> not found, an error is reported.
>
> This is great!
>
> Would it be really invasive to HINT something when the subtree is a
> proper subtree?

I am not quite sure I understood this question. Can you please explain
it a bit more ...


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|

Re: UPDATE of partition key

David Fetter
On Wed, Feb 15, 2017 at 01:06:32PM +0530, Amit Khandekar wrote:

> On 14 February 2017 at 22:24, David Fetter <[hidden email]> wrote:
> > On Mon, Feb 13, 2017 at 05:31:56PM +0530, Amit Khandekar wrote:
> >> Currently, an update of a partition key of a partition is not
> >> allowed, since it requires to move the row(s) into the applicable
> >> partition.
> >>
> >> Attached is a WIP patch (update-partition-key.patch) that removes
> >> this restriction. When an UPDATE causes the row of a partition to
> >> violate its partition constraint, then a partition is searched in
> >> that subtree that can accommodate this row, and if found, the row
> >> is deleted from the old partition and inserted in the new
> >> partition. If not found, an error is reported.
> >
> > This is great!
> >
> > Would it be really invasive to HINT something when the subtree is
> > a proper subtree?
>
> I am not quite sure I understood this question. Can you please
> explain it a bit more ...

Sorry.  When an UPDATE can't happen, there are often ways to hint at
what went wrong and how to correct it.  Violating a uniqueness
constraint would be one example.

When an UPDATE can't happen and the depth of the subtree is a
plausible candidate for what prevents it, there might be a way to say
so.

Let's imagine a table called log with partitions on "stamp" log_YYYY
and subpartitions, also on "stamp", log_YYYYMM.  If you do something
like

    UPDATE log_2017 SET "stamp"='2016-11-08 23:03:00' WHERE ...

it's possible to know that it might have worked had the UPDATE taken
place on log rather than on log_2017.

Does that make sense, and if so, is it super invasive to HINT that?

Best,
David.
--
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter      XMPP: david(dot)fetter(at)gmail(dot)com

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|

Re: UPDATE of partition key

Amit Khandekar-2
On 15 February 2017 at 20:26, David Fetter <[hidden email]> wrote:

> When an UPDATE can't happen, there are often ways to hint at
> what went wrong and how to correct it.  Violating a uniqueness
> constraint would be one example.
>
> When an UPDATE can't happen and the depth of the subtree is a
> plausible candidate for what prevents it, there might be a way to say
> so.
>
> Let's imagine a table called log with partitions on "stamp" log_YYYY
> and subpartitions, also on "stamp", log_YYYYMM.  If you do something
> like
>
>     UPDATE log_2017 SET "stamp"='2016-11-08 23:03:00' WHERE ...
>
> it's possible to know that it might have worked had the UPDATE taken
> place on log rather than on log_2017.
>
> Does that make sense, and if so, is it super invasive to HINT that?

Yeah, I think it should be possible to find the root partition with
the help of pg_partitioned_table, and then run ExecFindPartition()
again using the root. Will check. I am not sure right now how involved
that would turn out to be, but I think that logic would not change the
existing code, so in that sense it is not invasive.


--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|

Re: UPDATE of partition key

Amit Langote-2
On 2017/02/16 15:50, Amit Khandekar wrote:

> On 15 February 2017 at 20:26, David Fetter <[hidden email]> wrote:
>> When an UPDATE can't happen, there are often ways to hint at
>> what went wrong and how to correct it.  Violating a uniqueness
>> constraint would be one example.
>>
>> When an UPDATE can't happen and the depth of the subtree is a
>> plausible candidate for what prevents it, there might be a way to say
>> so.
>>
>> Let's imagine a table called log with partitions on "stamp" log_YYYY
>> and subpartitions, also on "stamp", log_YYYYMM.  If you do something
>> like
>>
>>     UPDATE log_2017 SET "stamp"='2016-11-08 23:03:00' WHERE ...
>>
>> it's possible to know that it might have worked had the UPDATE taken
>> place on log rather than on log_2017.
>>
>> Does that make sense, and if so, is it super invasive to HINT that?
>
> Yeah, I think it should be possible to find the root partition with

I assume you mean root *partitioned* table.

> the help of pg_partitioned_table,

The pg_partitioned_table catalog does not store parent-child
relationships, just information about the partition key of a table.  To
get the root partitioned table, you might want to create a recursive
version of get_partition_parent(), maybe called
get_partition_root_parent().  By the way, get_partition_parent() scans
pg_inherits to find the inheritance parent.

> and then run ExecFindPartition()
> again using the root. Will check. I am not sure right now how involved
> that would turn out to be, but I think that logic would not change the
> existing code, so in that sense it is not invasive.

I couldn't understand why run ExecFindPartition() again on the root
partitioned table, can you clarify?  ISTM, we just want to tell the user
in the HINT that trying the same update query with root partitioned table
might work.  I'm not sure if it would work instead to find some
intermediate partitioned table (that is, between the root and the one that
update query was tried with) to include in the HINT.

Thanks,
Amit




--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|

Re: UPDATE of partition key

Amit Khandekar-2
On 16 February 2017 at 12:57, Amit Langote
<[hidden email]> wrote:

> On 2017/02/16 15:50, Amit Khandekar wrote:
>> On 15 February 2017 at 20:26, David Fetter <[hidden email]> wrote:
>>> When an UPDATE can't happen, there are often ways to hint at
>>> what went wrong and how to correct it.  Violating a uniqueness
>>> constraint would be one example.
>>>
>>> When an UPDATE can't happen and the depth of the subtree is a
>>> plausible candidate for what prevents it, there might be a way to say
>>> so.
>>>
>>> Let's imagine a table called log with partitions on "stamp" log_YYYY
>>> and subpartitions, also on "stamp", log_YYYYMM.  If you do something
>>> like
>>>
>>>     UPDATE log_2017 SET "stamp"='2016-11-08 23:03:00' WHERE ...
>>>
>>> it's possible to know that it might have worked had the UPDATE taken
>>> place on log rather than on log_2017.
>>>
>>> Does that make sense, and if so, is it super invasive to HINT that?
>>
>> Yeah, I think it should be possible to find the root partition with
>
> I assume you mean root *partitioned* table.
>
>> the help of pg_partitioned_table,
>
> The pg_partitioned_table catalog does not store parent-child
> relationships, just information about the partition key of a table.  To
> get the root partitioned table, you might want to create a recursive
> version of get_partition_parent(), maybe called
> get_partition_root_parent().  By the way, get_partition_parent() scans
> pg_inherits to find the inheritance parent.

Yeah. But we also want to make sure that it's a part of declarative
partition tree, and not just an inheritance tree ? I am not sure
whether it is currently possible to have a mix of these two. May be it
is easy to prevent that from happening.

>
>> and then run ExecFindPartition()
>> again using the root. Will check. I am not sure right now how involved
>> that would turn out to be, but I think that logic would not change the
>> existing code, so in that sense it is not invasive.
>
> I couldn't understand why run ExecFindPartition() again on the root
> partitioned table, can you clarify?  ISTM, we just want to tell the user
> in the HINT that trying the same update query with root partitioned table
> might work. I'm not sure if it would work instead to find some
> intermediate partitioned table (that is, between the root and the one that
> update query was tried with) to include in the HINT.

What I had in mind was : Give that hint only if there *was* a
subpartition that could accommodate that row. And if found, we can
only include the subpartition name.

--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|

Re: UPDATE of partition key

Amit Langote-2
On 2017/02/16 17:55, Amit Khandekar wrote:

> On 16 February 2017 at 12:57, Amit Langote wrote:
>> On 2017/02/16 15:50, Amit Khandekar wrote:
>>> On 15 February 2017 at 20:26, David Fetter <[hidden email]> wrote:
>>>> Does that make sense, and if so, is it super invasive to HINT that?
>>>
>>> Yeah, I think it should be possible to find the root partition with
>>
>> I assume you mean root *partitioned* table.
>>
>>> the help of pg_partitioned_table,
>>
>> The pg_partitioned_table catalog does not store parent-child
>> relationships, just information about the partition key of a table.  To
>> get the root partitioned table, you might want to create a recursive
>> version of get_partition_parent(), maybe called
>> get_partition_root_parent().  By the way, get_partition_parent() scans
>> pg_inherits to find the inheritance parent.
>
> Yeah. But we also want to make sure that it's a part of declarative
> partition tree, and not just an inheritance tree ? I am not sure
> whether it is currently possible to have a mix of these two. May be it
> is easy to prevent that from happening.

It is not possible to mix declarative partitioning and regular
inheritance.  So, you cannot have a table in a declarative partitioning
tree that is not a (sub-) partition of the root table.

>>> and then run ExecFindPartition()
>>> again using the root. Will check. I am not sure right now how involved
>>> that would turn out to be, but I think that logic would not change the
>>> existing code, so in that sense it is not invasive.
>>
>> I couldn't understand why run ExecFindPartition() again on the root
>> partitioned table, can you clarify?  ISTM, we just want to tell the user
>> in the HINT that trying the same update query with root partitioned table
>> might work. I'm not sure if it would work instead to find some
>> intermediate partitioned table (that is, between the root and the one that
>> update query was tried with) to include in the HINT.
>
> What I had in mind was : Give that hint only if there *was* a
> subpartition that could accommodate that row. And if found, we can
> only include the subpartition name.

Asking to try the update query with the root table sounds like a good
enough hint.  Trying to find the exact sub-partition (I assume you mean to
imply sub-tree here) seems like an overkill, IMHO.

Thanks,
Amit




--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|

Re: UPDATE of partition key

Amit Khandekar-2
On 16 February 2017 at 14:42, Amit Langote
<[hidden email]> wrote:

> On 2017/02/16 17:55, Amit Khandekar wrote:
>> On 16 February 2017 at 12:57, Amit Langote wrote:
>>> On 2017/02/16 15:50, Amit Khandekar wrote:
>>>> On 15 February 2017 at 20:26, David Fetter <[hidden email]> wrote:
>>>>> Does that make sense, and if so, is it super invasive to HINT that?
>>>>
>>>> Yeah, I think it should be possible to find the root partition with
>>>
>>> I assume you mean root *partitioned* table.
>>>
>>>> the help of pg_partitioned_table,
>>>
>>> The pg_partitioned_table catalog does not store parent-child
>>> relationships, just information about the partition key of a table.  To
>>> get the root partitioned table, you might want to create a recursive
>>> version of get_partition_parent(), maybe called
>>> get_partition_root_parent().  By the way, get_partition_parent() scans
>>> pg_inherits to find the inheritance parent.
>>
>> Yeah. But we also want to make sure that it's a part of declarative
>> partition tree, and not just an inheritance tree ? I am not sure
>> whether it is currently possible to have a mix of these two. May be it
>> is easy to prevent that from happening.
>
> It is not possible to mix declarative partitioning and regular
> inheritance.  So, you cannot have a table in a declarative partitioning
> tree that is not a (sub-) partition of the root table.

Ok, then that makes things easy.

>
>>>> and then run ExecFindPartition()
>>>> again using the root. Will check. I am not sure right now how involved
>>>> that would turn out to be, but I think that logic would not change the
>>>> existing code, so in that sense it is not invasive.
>>>
>>> I couldn't understand why run ExecFindPartition() again on the root
>>> partitioned table, can you clarify?  ISTM, we just want to tell the user
>>> in the HINT that trying the same update query with root partitioned table
>>> might work. I'm not sure if it would work instead to find some
>>> intermediate partitioned table (that is, between the root and the one that
>>> update query was tried with) to include in the HINT.
>>
>> What I had in mind was : Give that hint only if there *was* a
>> subpartition that could accommodate that row. And if found, we can
>> only include the subpartition name.
>
> Asking to try the update query with the root table sounds like a good
> enough hint.  Trying to find the exact sub-partition (I assume you mean to
> imply sub-tree here) seems like an overkill, IMHO.
Yeah ... I was thinking , anyways it's an error condition, so why not
let the server spend a bit more CPU and get the right sub-partition
for the message. If we decide to write code to find the root
partition, then it's just a matter of another function
ExecFindPartition().

Also, I was thinking : give the hint *only* if we know there is a
right sub-partition. Otherwise, it might distract the user.

>
> Thanks,
> Amit
>
>



--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|

Re: UPDATE of partition key

Greg Stark
In reply to this post by Amit Khandekar-2
On 13 February 2017 at 12:01, Amit Khandekar <[hidden email]> wrote:
> There are a few things that can be discussed about :

If you do a normal update the new tuple is linked to the old one using
the ctid forming a chain of tuple versions. This tuple movement breaks
that chain.  So the question I had reading this proposal is what
behaviour depends on ctid and how is it affected by the ctid chain
being broken.

I think the concurrent update case is just a symptom of this. If you
try to update a row that's locked for a concurrent update you normally
wait until the concurrent update finishes, then follow the ctid chain
and recheck the where clause on the target of the link and if it still
matches you perform the update there.

At least you do that if you have isolation_level set to
repeatable_read. If you have isolation level set to serializable then
you just fail with a serialization failure. I think that's what you
should do if you come across a row that's been updated with a broken
ctid chain even in repeatable read mode. Just fail with a
serialization failure and document that in partitioned tables if you
perform updates that move tuples between partitions then you need to
be ensure your updates are prepared for serialization failures.

I think this would require another bit in the tuple info mask
indicating that this is tuple is the last version before a broken ctid
chain -- i.e. that it was updated by moving it to another partition.
Maybe there's some combination of bits you could use though since this
is only needed in a particular situation.

Offhand I don't know what other behaviours are dependent on the ctid
chain. I think you need to go search the docs -- and probably the code
just to be sure -- for any references to ctid to ensure you catch
every impact of breaking the ctid chain.

--
greg


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|

Re: UPDATE of partition key

David Fetter
In reply to this post by Amit Khandekar-2
On Thu, Feb 16, 2017 at 03:39:30PM +0530, Amit Khandekar wrote:

> >>>> and then run ExecFindPartition()
> >>>> again using the root. Will check. I am not sure right now how involved
> >>>> that would turn out to be, but I think that logic would not change the
> >>>> existing code, so in that sense it is not invasive.
> >>>
> >>> I couldn't understand why run ExecFindPartition() again on the root
> >>> partitioned table, can you clarify?  ISTM, we just want to tell the user
> >>> in the HINT that trying the same update query with root partitioned table
> >>> might work. I'm not sure if it would work instead to find some
> >>> intermediate partitioned table (that is, between the root and the one that
> >>> update query was tried with) to include in the HINT.
> >>
> >> What I had in mind was : Give that hint only if there *was* a
> >> subpartition that could accommodate that row. And if found, we can
> >> only include the subpartition name.
> >
> > Asking to try the update query with the root table sounds like a good
> > enough hint.  Trying to find the exact sub-partition (I assume you mean to
> > imply sub-tree here) seems like an overkill, IMHO.
> Yeah ... I was thinking , anyways it's an error condition, so why not
> let the server spend a bit more CPU and get the right sub-partition
> for the message. If we decide to write code to find the root
> partition, then it's just a matter of another function
> ExecFindPartition().
>
> Also, I was thinking : give the hint *only* if we know there is a
> right sub-partition. Otherwise, it might distract the user.

If this is relatively straight-forward, it'd be great.  More
actionable knowledge is better.

Thanks for taking this on.

Best,
David.
--
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter      XMPP: david(dot)fetter(at)gmail(dot)com

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|

Re: UPDATE of partition key

Robert Haas
In reply to this post by Greg Stark
On Thu, Feb 16, 2017 at 5:47 AM, Greg Stark <[hidden email]> wrote:
> On 13 February 2017 at 12:01, Amit Khandekar <[hidden email]> wrote:
>> There are a few things that can be discussed about :
>
> If you do a normal update the new tuple is linked to the old one using
> the ctid forming a chain of tuple versions. This tuple movement breaks
> that chain.  So the question I had reading this proposal is what
> behaviour depends on ctid and how is it affected by the ctid chain
> being broken.

I think this is a good question.

> I think the concurrent update case is just a symptom of this. If you
> try to update a row that's locked for a concurrent update you normally
> wait until the concurrent update finishes, then follow the ctid chain
> and recheck the where clause on the target of the link and if it still
> matches you perform the update there.

Right.  EvalPlanQual behavior, in short.

> At least you do that if you have isolation_level set to
> repeatable_read. If you have isolation level set to serializable then
> you just fail with a serialization failure. I think that's what you
> should do if you come across a row that's been updated with a broken
> ctid chain even in repeatable read mode. Just fail with a
> serialization failure and document that in partitioned tables if you
> perform updates that move tuples between partitions then you need to
> be ensure your updates are prepared for serialization failures.

Now, this part I'm not sure about.  What's pretty clear is that,
barring some redesign of the heap format, we can't keep the CTID chain
intact when the tuple moves to a different relfilenode.  What's less
clear is what to do about that.  We can either (1) give up on
EvalPlanQual behavior in this case and act just as we would if the row
had been deleted; no update happens or (2) throw a serialization
error.  You're advocating for #2, but I'm not sure that's right,
because:

1. It's a lot more work,

2. Your proposed implementation needs an on-disk format change that
uses up a scarce infomask bit, and

3. It's not obvious to me that it's clearly preferable from a user
experience standpoint.  I mean, either way the user doesn't get the
behavior that they want.  Either they're hoping for EPQ semantics and
they instead do a no-op update, or they're hoping for EPQ semantics
and they instead get an ERROR.  Generally speaking, we don't throw
serialization errors today at READ COMMITTED, so if we do so here,
that's going to be a noticeable and perhaps unwelcome change.

More opinions welcome.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|

Re: UPDATE of partition key

Amit Khandekar-2
On 16 February 2017 at 20:53, Robert Haas <[hidden email]> wrote:

> On Thu, Feb 16, 2017 at 5:47 AM, Greg Stark <[hidden email]> wrote:
>> On 13 February 2017 at 12:01, Amit Khandekar <[hidden email]> wrote:
>>> There are a few things that can be discussed about :
>>
>> If you do a normal update the new tuple is linked to the old one using
>> the ctid forming a chain of tuple versions. This tuple movement breaks
>> that chain.  So the question I had reading this proposal is what
>> behaviour depends on ctid and how is it affected by the ctid chain
>> being broken.
>
> I think this is a good question.
>
>> I think the concurrent update case is just a symptom of this. If you
>> try to update a row that's locked for a concurrent update you normally
>> wait until the concurrent update finishes, then follow the ctid chain
>> and recheck the where clause on the target of the link and if it still
>> matches you perform the update there.
>
> Right.  EvalPlanQual behavior, in short.
>
>> At least you do that if you have isolation_level set to
>> repeatable_read. If you have isolation level set to serializable then
>> you just fail with a serialization failure. I think that's what you
>> should do if you come across a row that's been updated with a broken
>> ctid chain even in repeatable read mode. Just fail with a
>> serialization failure and document that in partitioned tables if you
>> perform updates that move tuples between partitions then you need to
>> be ensure your updates are prepared for serialization failures.
>
> Now, this part I'm not sure about.  What's pretty clear is that,
> barring some redesign of the heap format, we can't keep the CTID chain
> intact when the tuple moves to a different relfilenode.  What's less
> clear is what to do about that.  We can either (1) give up on
> EvalPlanQual behavior in this case and act just as we would if the row
> had been deleted; no update happens.

This is what the current patch has done.

> or (2) throw a serialization
> error.  You're advocating for #2, but I'm not sure that's right,
> because:
>
> 1. It's a lot more work,
>
> 2. Your proposed implementation needs an on-disk format change that
> uses up a scarce infomask bit, and
>
> 3. It's not obvious to me that it's clearly preferable from a user
> experience standpoint.  I mean, either way the user doesn't get the
> behavior that they want.  Either they're hoping for EPQ semantics and
> they instead do a no-op update, or they're hoping for EPQ semantics
> and they instead get an ERROR.  Generally speaking, we don't throw
> serialization errors today at READ COMMITTED, so if we do so here,
> that's going to be a noticeable and perhaps unwelcome change.
>
> More opinions welcome.

I am inclined to at least have some option for the user to decide the
behaviour. In the future we can even consider support for walking
through the ctid chain across multiple relfilenodes. But till then, we
need to decide what default behaviour to keep. My inclination is more
towards erroring out in an unfortunate even where there is an UPDATE
while the row-movement is happening. One option is to not get into
finding whether the DELETE was part of partition row-movement or it
was indeed a DELETE, and always error out the UPDATE when
heap_update() returns HeapTupleUpdated, but only if the table is a
leaf partition. But this obviously will cause annoyance because of
chances of getting such errors when there are concurrent updates and
deletes in the same partition. But we can keep a table-level option
for determining whether to error out or silently lose the UPDATE.

Another option I was thinking : When the UPDATE is on a partition key,
acquire ExclusiveLock (not AccessExclusiveLock) only on that
partition, so that the selects will continue to execute, but
UPDATE/DELETE will wait before opening the table for scan. The UPDATE
on partition key is not going to be a very routine operation, it
sounds more like a DBA maintenance operation; so it does not look like
it would come in between usual transactions.


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|

Re: UPDATE of partition key

Thomas Munro-3
In reply to this post by Robert Haas
On Thu, Feb 16, 2017 at 8:53 PM, Robert Haas <[hidden email]> wrote:
> Generally speaking, we don't throw
> serialization errors today at READ COMMITTED, so if we do so here,
> that's going to be a noticeable and perhaps unwelcome change.

Yes we do:

https://www.postgresql.org/docs/9.6/static/transaction-iso.html#XACT-REPEATABLE-READ

--
Thomas Munro
http://www.enterprisedb.com


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|

Re: UPDATE of partition key

Thomas Munro-3
On Mon, Feb 20, 2017 at 3:36 PM, Thomas Munro
<[hidden email]> wrote:
> On Thu, Feb 16, 2017 at 8:53 PM, Robert Haas <[hidden email]> wrote:
>> Generally speaking, we don't throw
>> serialization errors today at READ COMMITTED, so if we do so here,
>> that's going to be a noticeable and perhaps unwelcome change.
>
> Yes we do:
>
> https://www.postgresql.org/docs/9.6/static/transaction-iso.html#XACT-REPEATABLE-READ

Oops -- please ignore, I misread that as repeatable read.

--
Thomas Munro
http://www.enterprisedb.com


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|

Re: UPDATE of partition key

Amit Langote-2
In reply to this post by Amit Khandekar-2
Hi Amit,

Thanks for working on this.

On 2017/02/13 21:01, Amit Khandekar wrote:
> Currently, an update of a partition key of a partition is not allowed,
> since it requires to move the row(s) into the applicable partition.
>
> Attached is a WIP patch (update-partition-key.patch) that removes this
> restriction. When an UPDATE causes the row of a partition to violate
> its partition constraint, then a partition is searched in that subtree
> that can accommodate this row, and if found, the row is deleted from
> the old partition and inserted in the new partition. If not found, an
> error is reported.

That's clearly an improvement over what we have now.

> There are a few things that can be discussed about :
>
> 1. We can run an UPDATE using a child partition at any level in a
> nested partition tree. In such case, we should move the row only
> within that child subtree.
>
> For e.g. , in a tree such as :
> tab ->
>    t1 ->
>       t1_1
>       t1_2
>    t2 ->
>       t2_1
>       t2_2
>
> For "UPDATE t2 set col1 = 'AAA' " , if the modified tuple does not fit
> in t2_1 but can fit in t1_1, it should not be moved to t1_1, because
> the UPDATE is fired using t2.

Makes sense.  One should perform the update by specifying tab such that
the row moves from t2 to t1, before we could determine t1_1 as the target
for the new row.  Specifying t2 directly in that case is clearly the
"violates partition constraint" situation.  I wonder if that's enough a
hint for the user to try updating the parent (or better still, root
parent).  Or as we were discussing, should there be an actual HINT message
spelling that out for the user.

> 2. In the patch, as part of the row movement, ExecDelete() is called
> followed by ExecInsert(). This is done that way, because we want to
> have the ROW triggers on that (sub)partition executed. If a user has
> explicitly created DELETE and INSERT BR triggers for this partition, I
> think we should run those. While at the same time, another question
> is, what about UPDATE trigger on the same table ? Here again, one can
> argue that because this UPDATE has been transformed into a
> DELETE-INSERT, we should not run UPDATE trigger for row-movement. But
> there can be a counter-argument. For e.g. if a user needs to make sure
> about logging updates of particular columns of a row, he will expect
> the logging to happen even when that row was transparently moved. In
> the patch, I have retained the firing of UPDATE BR trigger.

What of UPDATE AR triggers?

As a comment on how row-movement is being handled in code, I wonder if it
could be be made to look similar structurally to the code in ExecInsert()
that handles ON CONFLICT DO UPDATE.  That is,

if (partition constraint fails)
{
    /* row movement */
}
else
{
    /* ExecConstraints() */
    /* heap_update(), EvalPlanQual(), and ExecInsertIndexTuples() */
}

I see that ExecConstraint() won't get called on the source partition's
constraints if row movement occurs.  Maybe, that's unnecessary because the
new row won't be inserted into that partition anyway.

ExecWithCheckOptions() for RLS update check does happen *before* row
movement though.

> 3. In case of a concurrent update/delete, suppose session A has locked
> the row for deleting it. Now a session B has decided to update this
> row and that is going to cause row movement, which means it will
> delete it first. But when session A is finished deleting it, session B
> finds that it is already deleted. In such case, it should not go ahead
> with inserting a new row as part of the row movement. For that, I have
> added a new parameter 'already_delete' for ExecDelete().

Makes sense.  Maybe: already_deleted -> concurrently_deleted.

> Of course, this still won't completely solve the concurrency anomaly.
> In the above case, the UPDATE of Session B gets lost. May be, for a
> user that does not tolerate this, we can have a table-level option
> that disallows row movement, or will cause an error to be thrown for
> one of the concurrent session.

Will this table-level option be specified for a partitioned table once or
for individual partitions?

> 4. The ExecSetupPartitionTupleRouting() is re-used for routing the row
> that is to be moved. So in ExecInitModifyTable(), we call
> ExecSetupPartitionTupleRouting() even for UPDATE. We can also do this
> only during execution time for the very first time we find that we
> need to do a row movement. I will think over that, but I am thinking
> it might complicate things, as compared to always doing the setup for
> UPDATE. WIll check on that.

Hmm.  ExecSetupPartitionTupleRouting(), which does significant amount of
setup work, is fine being called in ExecInitModifyTable() in the insert
case because there are often cases where that's a bulk-insert and hence
cost of the setup work is amortized.  Updates, OTOH, are seldom done in a
bulk manner.  So that might be an argument for doing it late only when
needed.  But that starts to sound less attractive when one realizes that
that will occur for every row that wants to move.

I wonder if updates that will require row movement when done will be done
in a bulk manner (as a maintenance op), so one-time tuple routing setup
seems fine.  Again, enable_row_movement option specified for the parent
sounds like it would be a nice to have.  Only do the setup if it's turned
on, which goes without saying.

> 5. Regarding performance testing, I have compared the results of
> row-movement with partition versus row-movement with inheritance tree
> using triggers.  Below are the details :
>
> Schema :

[ ... ]

> parts    partitioned   inheritance   no. of rows   subpartitions
> =====    ===========   ===========   ===========   =============
>
> 500       10 sec       3 min 02 sec   1,000,000     0
> 1000      10 sec       3 min 05 sec   1,000,000     0
> 1000     1 min 38sec   30min 50 sec  10,000,000     0
> 4000      28 sec       5 min 41 sec   1,000,000     10
>
> part : total number of partitions including subparitions if any.
> partitioned : Partitions created using declarative syntax.
> inheritence : Partitions created using inheritence , check constraints
> and insert,update triggers.
> subpartitions : Number of subpartitions for each partition (in a 2-level tree)
>
> Overall the UPDATE in partitions is faster by 10-20 times compared
> with inheritance with triggers.
>
> The UPDATE query moved all of the rows into another partition. It was
> something like this :
> update ptab set a = '1949-01-1' where a <= '1924-01-01'
>
> For a plain table with 1,000,000 rows, the UPDATE took 8 seconds, and
> with 10,000,000 rows, it took 1min 32sec.

Nice!

> In general, for both partitioned and inheritence tables, the time
> taken linearly rose with the number of rows.

Hopefully not also with the number of partitions though.

I will look more closely at the code soon.

Thanks,
Amit




--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|

Re: UPDATE of partition key

Robert Haas
In reply to this post by Amit Khandekar-2
On Mon, Feb 20, 2017 at 2:58 PM, Amit Khandekar <[hidden email]> wrote:

> I am inclined to at least have some option for the user to decide the
> behaviour. In the future we can even consider support for walking
> through the ctid chain across multiple relfilenodes. But till then, we
> need to decide what default behaviour to keep. My inclination is more
> towards erroring out in an unfortunate even where there is an UPDATE
> while the row-movement is happening. One option is to not get into
> finding whether the DELETE was part of partition row-movement or it
> was indeed a DELETE, and always error out the UPDATE when
> heap_update() returns HeapTupleUpdated, but only if the table is a
> leaf partition. But this obviously will cause annoyance because of
> chances of getting such errors when there are concurrent updates and
> deletes in the same partition. But we can keep a table-level option
> for determining whether to error out or silently lose the UPDATE.

I'm still a fan of the "do nothing and just document that this is a
weirdness of partitioned tables" approach, because implementing
something will be complicated, will ensure that this misses this
release if not the next one, and may not be any better for users.  But
probably we need to get some more opinions from other people, since I
can imagine people being pretty unhappy if the consensus happens to be
at odds with my own preferences.

> Another option I was thinking : When the UPDATE is on a partition key,
> acquire ExclusiveLock (not AccessExclusiveLock) only on that
> partition, so that the selects will continue to execute, but
> UPDATE/DELETE will wait before opening the table for scan. The UPDATE
> on partition key is not going to be a very routine operation, it
> sounds more like a DBA maintenance operation; so it does not look like
> it would come in between usual transactions.

I think that's going to make users far more unhappy than breaking the
EPQ behavior ever would.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|

Re: UPDATE of partition key

David G Johnston
On Friday, February 24, 2017, Robert Haas <[hidden email]> wrote:
On Mon, Feb 20, 2017 at 2:58 PM, Amit Khandekar <<a href="javascript:;" onclick="_e(event, &#39;cvml&#39;, &#39;amitdkhan.pg@gmail.com&#39;)">amitdkhan.pg@...> wrote:
> I am inclined to at least have some option for the user to decide the
> behaviour. In the future we can even consider support for walking
> through the ctid chain across multiple relfilenodes. But till then, we
> need to decide what default behaviour to keep. My inclination is more
> towards erroring out in an unfortunate even where there is an UPDATE
> while the row-movement is happening. One option is to not get into
> finding whether the DELETE was part of partition row-movement or it
> was indeed a DELETE, and always error out the UPDATE when
> heap_update() returns HeapTupleUpdated, but only if the table is a
> leaf partition. But this obviously will cause annoyance because of
> chances of getting such errors when there are concurrent updates and
> deletes in the same partition. But we can keep a table-level option
> for determining whether to error out or silently lose the UPDATE.

I'm still a fan of the "do nothing and just document that this is a
weirdness of partitioned tables" approach, because implementing
something will be complicated, will ensure that this misses this
release if not the next one, and may not be any better for users.  But
probably we need to get some more opinions from other people, since I
can imagine people being pretty unhappy if the consensus happens to be
at odds with my own preferences.


For my own sanity - the move update would complete successfully and break every ctid chain that it touches.  Any update lined up behind it in the lock queue would discover their target record has been deleted and would experience whatever behavior their isolation level dictates for such a situation.  So multi-partition update queries will fail to update some records if they happen to move between partitions even if they would otherwise match the query's predicate.

Is there any difference in behavior between this and a SQL writeable CTE performing the same thing via delete-returning-insert?

David J.


Reply | Threaded
Open this post in threaded view
|

Re: UPDATE of partition key

Robert Haas
On Fri, Feb 24, 2017 at 1:18 PM, David G. Johnston
<[hidden email]> wrote:
> For my own sanity - the move update would complete successfully and break
> every ctid chain that it touches.  Any update lined up behind it in the lock
> queue would discover their target record has been deleted and would
> experience whatever behavior their isolation level dictates for such a
> situation.  So multi-partition update queries will fail to update some
> records if they happen to move between partitions even if they would
> otherwise match the query's predicate.

Right.  That's the behavior for which I am advocating, on the grounds
that it's the simplest to implement and if we all agree on something
else more complicated later, we can do it then.

> Is there any difference in behavior between this and a SQL writeable CTE
> performing the same thing via delete-returning-insert?

Not to my knowledge.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
1234 ... 10
Previous Thread Next Thread