Deleting more efficiently from large partitions


Deleting more efficiently from large partitions

Wells Oliver-2
Hi all. I have a partitioned table (by month from a date column), where each partition contains something like 400m rows.

Each partition is defined by a PK with a uuid and date field (the parent table is partitioned by range on the date), and two other columns.

In doing a delete for a specific date, e.g. DELETE FROM t WHERE date = '2019-09-01' AND uuid IN (SELECT uuid FROM temptable), it runs very efficiently.

I am trying to write a processing script that deletes for potentially multiple dates & uuid values, and it just takes hours, trying:

DELETE FROM t WHERE date = (SELECT DISTINCT date from temp) AND uuid IN (select uuid from tempuuds) -- no go, hours.

Tried USING, e.g. DELETE FROM t USING temp WHERE t.date = temp.date AND t.uuid = temp.uuid -- no go, hours.

I just can't delete from this table without an explicit date and a set of uuids using a WHERE ... IN approach, but I need to. I was thinking of writing a plpgsql function that loops through the dates and issues a more explicit DELETE statement per date, but there must be a better way using indexing or something.
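A minimal sketch of that looping idea (purely hypothetical, reusing the t/temp/tempuuds names above) could be:

```sql
-- Hypothetical sketch: one DELETE per distinct date, so each inner statement
-- matches the single-date form that already runs quickly.
DO $$
DECLARE
    d date;
BEGIN
    FOR d IN SELECT DISTINCT date FROM temp LOOP
        DELETE FROM t
         WHERE date = d
           AND uuid IN (SELECT uuid FROM tempuuds);
    END LOOP;
END $$;
```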

Appreciate any tips.


--
Wells Oliver
[hidden email]

Re: Deleting more efficiently from large partitions

Michaeldba@sqlexec.com
What version of PG?
What does EXPLAIN show?

Sent from my iPad

On Jun 15, 2020, at 9:39 PM, Wells Oliver <[hidden email]> wrote:

Re: Deleting more efficiently from large partitions

Greg Spiegelberg
In reply to this post by Wells Oliver-2
On Mon, Jun 15, 2020 at 7:39 PM Wells Oliver <[hidden email]> wrote:

Have you considered partitioning by day instead of month?  Could eliminate an index you may have on the date column.
How many days are in the many-days DELETE?  Could you simply wrap it in a transaction and do one DELETE per day?

You could potentially get better performance by removing the JOIN/sub-SELECT, using:
DELETE FROM mytable WHERE date_col = ANY( ARRAY['2020-01-01', '2020-01-13']::date[] );
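Combining the two suggestions above, a sketch of one DELETE per day inside a single transaction (mytable and date_col as in the example, tempuuds assumed from the original post) might look like:

```sql
-- Sketch: wrap per-day DELETEs in one transaction; each statement carries a
-- literal date, so the planner can prune down to a single monthly partition.
BEGIN;
DELETE FROM mytable
 WHERE date_col = DATE '2020-01-01'
   AND uuid IN (SELECT uuid FROM tempuuds);
DELETE FROM mytable
 WHERE date_col = DATE '2020-01-13'
   AND uuid IN (SELECT uuid FROM tempuuds);
COMMIT;
```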

HTH
-Greg

Re: Deleting more efficiently from large partitions

Wells Oliver-2
So the PK is date + uuid + two other columns. I also have an index (btree) on uuid. When I try to delete via DELETE FROM mpd WHERE uuid IN (SELECT uuid FROM temptable), it takes a very long time as it scans every partition. The EXPLAIN output is huge; here is some of it.


This will take hours if I put a couple of thousand uuid records in the temp table. It's untenable. The DELETE here does NOT seem to be using the index on uuid.

----------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Delete on mpd  (cost=0.42..264953274.12 rows=5828371861 width=12) (actual time=286791.526..286791.526 rows=0 loops=1)
   Delete on mpd_2015_01
  ...
   Delete on mpd_2021_12
   Buffers: shared hit=834134 read=4148792 dirtied=16844, local read=463956 written=1023
   ->  Nested Loop Semi Join  (cost=0.42..526.20 rows=550 width=12) (actual time=0.010..0.010 rows=0 loops=1)
         ->  Seq Scan on mpd_2015_01  (cost=0.00..21.00 rows=1100 width=22) (actual time=0.008..0.008 rows=0 loops=1)
         ->  Index Scan using uuid_idx on temptable  (cost=0.42..120.11 rows=4088 width=22) (never executed)
               Index Cond: (uuid = mpd_2015_01.uuid)
   ->  Nested Loop Semi Join  (cost=0.42..526.20 rows=550 width=12) (actual time=0.005..0.005 rows=0 loops=1)
         ->  Seq Scan on mpd_2015_02  (cost=0.00..21.00 rows=1100 width=22) (actual time=0.005..0.005 rows=0 loops=1)
         ->  Index Scan using uuid_idx on temptable  (cost=0.42..120.11 rows=4088 width=22) (never executed)
               Index Cond: (uuid = mpd_2015_02.uuid)
   ->  Nested Loop Semi Join  (cost=0.42..526.20 rows=550 width=12) (actual time=0.002..0.003 rows=0 loops=1)
         ->  Seq Scan on mpd_2015_03  (cost=0.00..21.00 rows=1100 width=22) (actual time=0.002..0.002 rows=0 loops=1)
         ->  Index Scan using uuid_idx on temptable  (cost=0.42..120.11 rows=4088 width=22) (never executed)
               Index Cond: (uuid = mpd_2015_03.uuid)
   ->  Nested Loop  (cost=20606.19..7121506.57 rows=149367744 width=12) (actual time=7934.416..7934.416 rows=0 loops=1)
         Buffers: shared hit=478 read=558, local read=10086 written=1023
         ->  HashAggregate  (cost=20306.34..20308.34 rows=200 width=22) (actual time=7555.857..7556.264 rows=259 loops=1)
               Group Key: temptable.uuid
               Buffers: local read=10086 written=1023
               ->  Seq Scan on temptable  (cost=0.00..18262.27 rows=817627 width=22) (actual time=7257.016..7399.478 rows=817627 loops=1)
                     Buffers: local read=10086 written=1023
         ->  Bitmap Heap Scan on mpd_2015_04  (cost=299.85..35376.78 rows=12921 width=22) (actual time=1.457..1.457 rows=0 loops=259)
               Recheck Cond: (uuid = temptable.uuid)
               Buffers: shared hit=478 read=558
               ->  Bitmap Index Scan on mpd_2015_04_uuid_idx  (cost=0.00..296.62 rows=12921 width=0) (actual time=1.455..1.455 rows=0 loops=259)
                     Index Cond: (uuid = temptable.uuid)
                     Buffers: shared hit=478 read=558
  ...
   ->  Hash Join  (cost=20310.84..646068.95 rows=10984718 width=12) (actual time=13374.218..13374.219 rows=0 loops=1)
         Hash Cond: (mpd_2020_02.uuid = temptable.uuid)
         Buffers: shared read=226189, local read=10086
         ->  Seq Scan on mpd_2020_02  (cost=0.00..445883.36 rows=21969436 width=22) (actual time=2.129..10268.765 rows=21969107 loops=1)
               Buffers: shared read=226189
         ->  Hash  (cost=20308.34..20308.34 rows=200 width=22) (actual time=292.617..292.617 rows=259 loops=1)
               Buckets: 1024  Batches: 1  Memory Usage: 22kB
               Buffers: local read=10086
               ->  HashAggregate  (cost=20306.34..20308.34 rows=200 width=22) (actual time=292.501..292.552 rows=259 loops=1)
                     Group Key: temptable.uuid
                     Buffers: local read=10086
                     ->  Seq Scan on temptable  (cost=0.00..18262.27 rows=817627 width=22) (actual time=0.023..136.102 rows=817627 loops=1)
                           Buffers: local read=10086
   ->  Hash Join  (cost=20310.84..390240.09 rows=6494055 width=12) (actual time=8221.586..8221.586 rows=0 loops=1)
         Hash Cond: (mpd_2020_03.uuid = temptable.uuid)
         Buffers: shared read=133708, local read=10086
         ->  Seq Scan on mpd_2020_03  (cost=0.00..263589.10 rows=12988110 width=22) (actual time=2.118..6207.373 rows=12987440 loops=1)
               Buffers: shared read=133708
         ->  Hash  (cost=20308.34..20308.34 rows=200 width=22) (actual time=289.169..289.169 rows=259 loops=1)
               Buckets: 1024  Batches: 1  Memory Usage: 22kB
               Buffers: local read=10086
               ->  HashAggregate  (cost=20306.34..20308.34 rows=200 width=22) (actual time=289.054..289.103 rows=259 loops=1)
                     Group Key: temptable.uuid
                     Buffers: local read=10086
                     ->  Seq Scan on temptable  (cost=0.00..18262.27 rows=817627 width=22) (actual time=0.018..135.729 rows=817627 loops=1)
                           Buffers: local read=10086
   ->  Nested Loop Semi Join  (cost=0.42..526.20 rows=550 width=12) (actual time=0.007..0.007 rows=0 loops=1)
         ->  Seq Scan on mpd_2020_04  (cost=0.00..21.00 rows=1100 width=22) (actual time=0.006..0.006 rows=0 loops=1)
         ->  Index Scan using uuid_idx on temptable  (cost=0.42..120.11 rows=4088 width=22) (never executed)
               Index Cond: (uuid = mpd_2020_04.uuid)
  ...
   ->  Nested Loop Semi Join  (cost=0.42..526.20 rows=550 width=12) (actual time=0.003..0.003 rows=0 loops=1)
         ->  Seq Scan on mpd_2021_12  (cost=0.00..21.00 rows=1100 width=22) (actual time=0.002..0.002 rows=0 loops=1)
         ->  Index Scan using uuid_idx on temptable  (cost=0.42..120.11 rows=4088 width=22) (never executed)
               Index Cond: (uuid = mpd_2021_12.uuid)
 Planning Time: 473.612 ms
 JIT:
   Functions: 932
   Options: Inlining true, Optimization true, Expressions true, Deforming true
   Timing: Generation 99.333 ms, Inlining 9.228 ms, Optimization 4079.459 ms, Emission 3159.415 ms, Total 7347.435 ms
 Execution Time: 286901.250 ms
(843 rows)

On Tue, Jun 16, 2020 at 6:57 AM Greg Spiegelberg <[hidden email]> wrote:


--
Wells Oliver
[hidden email]

Re: Deleting more efficiently from large partitions

jim schmidt
In reply to this post by Wells Oliver-2
how does

delete from t where (date, uuid) in (select temp.date, tempuuds.uuid from temp, tempuuds)

perform?
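One hedged way to sanity-check that form before running it for real is to look at the plan, since a date that only arrives via a subquery may not allow plan-time partition pruning:

```sql
-- Check the plan first: if every monthly partition still shows a scan node,
-- the (date, uuid) IN (...) form is not being pruned at plan time.
EXPLAIN
DELETE FROM t
 WHERE (date, uuid) IN (SELECT temp.date, tempuuds.uuid FROM temp, tempuuds);
```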

On Mon, Jun 15, 2020 at 8:39 PM Wells Oliver <[hidden email]> wrote:

