partitioning performance tests after recent patches


Floris Van Nee
Hi all,

After all the great work that was done on partitioning over the past months, I wanted to take a closer look at performance. I prepared some performance tests for use cases I often encounter on partitioned tables. My goal was to test the performance and stability of the recent changes to query planning and run-time pruning, comparing native partitioning on HEAD and PG11 with TimescaleDB partitioning on PG11. I'm posting the results here now. I'm sure many of these results are not new to most people here, but maybe there's some useful information in there. :-) Furthermore, I have a few questions about the results I obtained.

Several test cases were prepared. These test cases were run for cases with 16, 64, 256, 1024 and 4096 partitions using pgbench functionality for 60 seconds each. The exact commands that were run can be found in the attached output file. Note that most of the SELECT queries that are benchmarked are basically testing planning time, as this is generally the part that becomes much slower when more partitions are added. The queries are written such that only one or two partitions are interesting for that particular query and the rest should be discarded as early as possible. Execution time is very small for these queries as they just do a simple index scan on the remaining chunk.

The test cases were (see benchmark.sql for the SQL commands for setup and test cases):
1. Insert batches of 1000 rows per transaction
2. Simple SELECT query pruning on a static timestamp
3. The same SELECT query with static timestamp but with an added 'ORDER BY a, updated_at DESC LIMIT 1', which matches the index defined on the table
4. The same simple SELECT query, but now it's wrapped inside an inlineable SQL function, called with a static timestamp
5. The same simple SELECT query, but now it's wrapped inside a non-inlineable SQL function, called with a static timestamp
6. The same simple SELECT query, but now it's wrapped inside a plpgsql function, called with a static timestamp
7. Simple SELECT query pruning on a timestamp now()
8. The same SELECT query with dynamic timestamp but with an added 'ORDER BY a, updated_at DESC LIMIT 1', which matches the index defined on the table
9. The same simple SELECT query, but now it's wrapped inside an inlineable SQL function, called with a dynamic timestamp
10. The same simple SELECT query, but now it's wrapped inside a non-inlineable SQL function, called with a dynamic timestamp
11. The same simple SELECT query, but now it's wrapped inside a plpgsql function, called with a dynamic timestamp
12. The same query as 2) but then in an inlineable function
13. The same query as 3) but then in an inlineable function
14. A SELECT with nested loop (10 iterations) with opportunities for run-time pruning - some rows from a table are selected and the timestamp from rows in that table is used to join on another partitioned table

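For reference, the general shape of the setup these cases assume is sketched below; the table, column, and partition names are illustrative (the exact commands are in the attached benchmark.sql):

```sql
-- Illustrative setup (not the exact benchmark DDL): a range-partitioned
-- table with one partition per day and an index matching cases 3 and 8.
CREATE TABLE p (
    a          text,
    b          int,
    updated_at timestamptz
) PARTITION BY RANGE (updated_at);

CREATE TABLE p_1 PARTITION OF p
    FOR VALUES FROM ('2019-01-01') TO ('2019-01-02');
-- ... one CREATE TABLE ... PARTITION OF per partition ...

CREATE INDEX ON p (a, updated_at DESC);

-- Case 2: static timestamps, so pruning can happen entirely at plan time.
SELECT * FROM p
WHERE a = 'abc'
  AND updated_at BETWEEN '2019-01-01 12:00' AND '2019-01-01 20:00';

-- Case 7: now() is not evaluated until execution, so pruning is deferred
-- to executor startup (run-time pruning).
SELECT * FROM p
WHERE a = 'abc'
  AND updated_at BETWEEN now() AND now() + interval '1d';
```
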
The full results can be found in the attached file (results.txt). I also produced graphs of the results, which can be found on TimescaleDB's GitHub page [1]. Please take a look at those figures for an easy overview of the results. In general, the performance of HEAD looks really good.

While looking at these results, there were a few questions that I couldn't answer.
1) It seems like the queries inside plpgsql functions (case 6 and 11) perform relatively well in PG11 compared to a non-inlineable SQL function (case 5 and 10), when a table consists of many partitions. As far as I know, both plpgsql and non-inlineable SQL functions are executed with generic plans. What can explain this difference? Are non-inlineable SQL function plans not reused between transactions, while plpgsql plans are?
2) Is running non-inlined SQL functions with a generic plan even the best option all the time? Wouldn't it be better to adopt a similar approach to what plpgsql does, where it tries to test if using a generic plan is beneficial? The non-inlineable SQL functions suffer a big performance hit for a large number of partitions, because they cannot rely on static planning-time pruning.
3) What could be causing the big performance difference between case 7 (simple SELECT) and 8 (simple SELECT with ORDER BY <index> LIMIT 1)? For 4096 partitions, TPS of 7) is around 5, while adding the ORDER BY <index> LIMIT 1 makes TPS drop well below 1. In theory, run-time pruning of the right chunk should take exactly the same amount of time in both cases, because both are pruning timestamp now() on the same number of partitions. The resulting plans are also identical with the exception of the top LIMIT-node (in PG11 they differ slightly as a MergeAppend is chosen for the ORDER BY instead of an Append, in HEAD with ordered append this is not necessary anymore). Am I missing something here?
4) A more general question about run-time pruning in nested loops, like the one for case 14. I believe I read in one of the previous threads that run-time pruning only reoccurs if it determines that the value that determines which partitions must be excluded has changed in between iterations. How is this defined? Eg. let's say partitions are 1-day wide and the first iteration of the loop filters on the partitioned table for timestamp between 14-04-2019 12:00 and 14-04-2019 20:00 (dynamically determined). Then the second iteration comes along and now filters on values between 14-04-2019 12:00 and 14-04-2019 19:00. The partition that should be scanned hasn't changed, because both timestamps fall into the same partition. Is the full process of run-time pruning applied again, or is there some kind of shortcut that first checks if the previous pruning result is still valid even if the value has changed slightly? If not, would this be a possible optimization, as I think it's a case that occurs very often? I don't know the run-time pruning code very well though, so it may just be a crazy idea that can't be practically achieved.

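A hypothetical shape for the case-14 query (table and column names are made up): the timestamps come from the outer table's rows, so they are only known at execution time, and the partitioned table must be re-pruned on each nested-loop iteration:

```sql
-- driver is a small non-partitioned table whose rows supply the timestamps.
-- For each of the ~10 outer rows, run-time pruning selects the matching
-- partition(s) of p before the inner index scan runs.
SELECT p.*
FROM driver d
JOIN p
  ON p.a = d.a
 AND p.updated_at BETWEEN d.start_ts AND d.end_ts
WHERE d.id <= 10;
```
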
There was one other thing I noticed, and I believe it was raised by Tom in a separate thread as well: the setup code itself is really slow. Creating partitions takes a long time; it takes several minutes to create the 4096-partition table.

Thanks again for the great work on partitioning! Almost every case that I tested is way better than the comparable case in PG11.

-Floris



Attachments: benchmark.sql.txt (7K), results.txt (122K)

Re: partitioning performance tests after recent patches

David Rowley
On Mon, 15 Apr 2019 at 07:19, Floris Van Nee <[hidden email]> wrote:
> 3) What could be causing the big performance difference between case 7 (simple SELECT) and 8 (simple SELECT with ORDER BY <index> LIMIT 1)? For 4096 partitions, TPS of 7) is around 5, while adding the ORDER BY <index> LIMIT 1 makes TPS drop well below 1. In theory, run-time pruning of the right chunk should take exactly the same amount of time in both cases, because both are pruning timestamp now() on the same number of partitions. The resulting plans are also identical with the exception of the top LIMIT-node (in PG11 they differ slightly as a MergeAppend is chosen for the ORDER BY instead of an Append, in HEAD with ordered append this is not necessary anymore). Am I missing something here?

With the information provided, I don't really see any reason why the
ORDER BY LIMIT would slow it down if the plan is the same apart from
the LIMIT node. Please share the EXPLAIN ANALYZE output of each.

> 4) A more general question about run-time pruning in nested loops, like the one for case 14. I believe I read in one of the previous threads that run-time pruning only reoccurs if it determines that the value that determines which partitions must be excluded has changed in between iterations. How is this defined? Eg. let's say partitions are 1-day wide and the first iteration of the loop filters on the partitioned table for timestamp between 14-04-2019 12:00 and 14-04-2019 20:00 (dynamically determined). Then the second iteration comes along and now filters on values between 14-04-2019 12:00 and 14-04-2019 19:00. The partition that should be scanned hasn't changed, because both timestamps fall into the same partition. Is the full process of run-time pruning applied again, or is there some kind of shortcut that first checks if the previous pruning result is still valid even if the value has changed slightly? If not, would this be a possible optimization, as I think it's a case that occurs very often? I don't know the run-time pruning code very well though, so it may just be a crazy idea that can't be practically achieved.

Currently, there's no shortcut. It knows which parameters partition
pruning depends on and it re-prunes whenever the value of one of these
changes.

I'm not really sure how rechecking would work exactly. There are cases
where it wouldn't be possible, say the condition was: partkey >= $1
and there was no partition for $1 since it was beyond the range of the
defined range partitions.  How could we tell if we can perform the
shortcut if the next param value falls off the lower bound of the
defined partitions?  The first would include no partitions and the
second includes all partitions, but the actual value of $1 belongs to
no partition in either case so we can't check to see if it matches the
same partition.  Perhaps it could work for equality operators when
just a single partition is matched in the first place, it might then
be possible to do a short-circuit recheck to see if the same partition
matches the next set of values.  The problem with that is that
run-time pruning code in the executor does not care about which
operators are used. It just passes those details off to the pruning
code to deal with it.  Perhaps something can be decided in the planner
in analyze_partkey_exprs() to have it set a "do_recheck" flag to tell
the executor to check before pruning again...  Or maybe it's okay to
just try a recheck when we match to just a single partition and just
recheck the new values are allowed in that partition when re-pruning.
However, that might be just too dumb, since for inequality
operators the original values may never even have fallen inside the
partition's bounds in the first place.

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: partitioning performance tests after recent patches

Floris Van Nee
Hi David,

Thanks for your reply. I really appreciate your work on run-time pruning!
Here's the output of explain/analyze for HEAD. At run-time, technically all partitions could be pruned directly. However, one partition remains in the output of explain/analyze because of other difficulties with removing all of them, if I remember correctly? Still, that partition is never executed.  The only difference I can see is the Limit node on top, as well as apparently another  partition appearing in the analyze output (4096_4096, last partition, remains in the first plan. 4096_1, the first partition, remains the second plan).

-- select_now.sql
explain(analyze, verbose, buffers on)
select * from :tbl where a='abc' and updated_at between now() and now()+interval '1d';

Append  (cost=0.16..8949.61 rows=4096 width=112) (actual time=0.000..0.000 rows=0 loops=1)
  Subplans Removed: 4095
  ->  Index Scan using p4096_4096_a_updated_at_idx on public.p4096_4096  (cost=0.16..2.18 rows=1 width=112) (never executed)
        Output: p4096_4096.a, p4096_4096.b, p4096_4096.c, p4096_4096.d, p4096_4096.updated_at
        Index Cond: ((p4096_4096.a = 'abc'::text) AND (p4096_4096.updated_at >= now()) AND (p4096_4096.updated_at <= (now() + '1 day'::interval)))
Planning Time: 237.603 ms
Execution Time: 0.475 ms

-- select_now_limit.sql
explain(analyze, verbose, buffers on)
select * from :tbl where a='abc' and updated_at between now() and now()+interval '1d'
order by a, updated_at desc limit 1;

Limit  (cost=645.53..647.56 rows=1 width=112) (actual time=0.002..0.002 rows=0 loops=1)
  Output: p4096_1.a, p4096_1.b, p4096_1.c, p4096_1.d, p4096_1.updated_at
  ->  Append  (cost=645.53..8949.61 rows=4096 width=112) (actual time=0.000..0.000 rows=0 loops=1)
        Subplans Removed: 4095
        ->  Index Scan using p4096_1_a_updated_at_idx on public.p4096_1  (cost=0.57..2.03 rows=1 width=54) (never executed)
              Output: p4096_1.a, p4096_1.b, p4096_1.c, p4096_1.d, p4096_1.updated_at
              Index Cond: ((p4096_1.a = 'abc'::text) AND (p4096_1.updated_at >= now()) AND (p4096_1.updated_at <= (now() + '1 day'::interval)))
Planning Time: 3897.687 ms
Execution Time: 0.491 ms

Regarding the nested loops - thanks for your explanation. I can see this is more complicated than I initially thought. It may be doable to determine if your set of pruned partitions is still valid, but it's more difficult to determine if, on top of that, extra partitions must be included due to widening of the range.

-Floris




Re: partitioning performance tests after recent patches

Amit Langote
In reply to this post by Floris Van Nee
Hi,

Thanks a lot for the very exhaustive testing.

David already replied to some of these, but let me comment on a couple
of points.

Please be advised that these may not be the numbers (or the scaling
pattern of these numbers) you'll see when PG 12 is actually released,
because we may yet change something that makes performance suffer a
bit.  In particular, we are contemplating some changes around the safety
of the planner's handling of cached partitioning metadata (in light of the
reduced lock levels for adding/removing partitions) that might reduce the
TPS figures, with an impact that would worsen as the number of partitions
increases.  Nothing is final yet, though; if interested, you can follow
that discussion at [1].

On 2019/04/15 4:19, Floris Van Nee wrote:

> While looking at these results, there were a few questions that I couldn't answer.
> 1) It seems like the queries inside plpgsql functions (case 6 and 11) perform relatively well in PG11 compared to a non-inlineable SQL function (case 5 and 10), when a table consists of many partitions. As far as I know, both plpgsql and non-inlineable SQL functions are executed with generic plans. What can explain this difference? Are non-inlineable SQL function plans not reused between transactions, while plpgsql plans are?
> 2) Is running non-inlined SQL functions with a generic plan even the best option all the time? Wouldn't it be better to adopt a similar approach to what plpgsql does, where it tries to test if using a generic plan is beneficial? The non-inlineable SQL functions suffer a big performance hit for a large number of partitions, because they cannot rely on static planning-time pruning.

I'd never noticed this before.  It does indeed seem that SQL
functions and plpgsql functions are handled using completely different
code paths, and that only the latter can take advantage of static
planning-time pruning.

> There was one other thing I noticed, and I believe it was raised by Tom in a separate thread as well: the setup code itself is really slow. Creating partitions takes a long time; it takes several minutes to create the 4096-partition table.

Yeah, that's rather bad.  I'm thinking of doing something about that for PG 13.

Thanks,
Amit

[1] https://www.postgresql.org/message-id/27380.1555270166%40sss.pgh.pa.us




Re: partitioning performance tests after recent patches

David Rowley
In reply to this post by Floris Van Nee
On Mon, 15 Apr 2019 at 19:33, Floris Van Nee <[hidden email]> wrote:

> Here's the output of explain/analyze for HEAD. At run-time, technically all partitions could be pruned directly. However, one partition remains in the output of explain/analyze because of other difficulties with removing all of them, if I remember correctly? Still, that partition is never executed.  The only difference I can see is the Limit node on top, as well as apparently another  partition appearing in the analyze output (4096_4096, last partition, remains in the first plan. 4096_1, the first partition, remains the second plan).

I had a look at this, and it's down to get_eclass_for_sort_expr()
having a hard time because the EquivalenceClass has so many members.
This must be done for each partition, so the search time is quadratic
in the number of partitions. We only hit this in the second plan
because build_index_paths() finds that there are useful pathkeys from
query_pathkeys; of course, that does not happen for the first query
since it has no ORDER BY clause.

Tom and I were doing a bit of work in [1] to speed up cases where there
are many EquivalenceClasses by storing a Bitmapset in each RelOptInfo
to mark the indexes of the eq_classes it has members in.  This
does not really help this case, since we're slow due to lots of members
rather than lots of classes, but perhaps something similar can be done
to allow members to be found more quickly.  I'm not sure exactly how
that can be done without having something like an array of Lists
indexed by relid in each EquivalenceClass, which does not sound great
from a memory consumption point of view.  Maybe putting the
EquivalenceMembers in some data structure that we don't have to search
linearly would be a better fix, although we don't currently have any
means to hash or binary search node types.  Perhaps it's time we did.

[1] https://commitfest.postgresql.org/23/1984/

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: partitioning performance tests after recent patches

Tom Lane
In reply to this post by Amit Langote-2
Amit Langote <[hidden email]> writes:
> On 2019/04/15 4:19, Floris Van Nee wrote:
>> 2) Is running non-inlined SQL functions with a generic plan even the best option all the time? Wouldn't it be better to adopt a similar approach to what plpgsql does, where it tries to test if using a generic plan is beneficial? The non-inlineable SQL functions suffer a big performance hit for a large number of partitions, because they cannot rely on static planning-time pruning.

> I'd never noticed this before.  It indeed seems to be the case that SQL
> functions and plpgsql functions are handled using completely different
> code paths, of which only for the latter it's possible to use static
> planning-time pruning.

Yeah.  Another big problem with the current implementation of SQL
functions is that there's no possibility of cross-query plan caching.
At some point I'd like to throw away functions.c and start over
with an implementation more similar to how plpgsql does it (in
particular, with persistent state and use of the plan cache).
It hasn't gotten to the top of the to-do queue though, mostly because
I think not many people use SQL-language functions except when they
want them to be inlined.

                        regards, tom lane