[POC] Faster processing at Gather node


[POC] Faster processing at Gather node

rafia.sabih
Hello everybody,

While analysing the performance of TPC-H queries with the newly developed parallel operators (parallel index scan, parallel bitmap heap scan, etc.), we noticed that the time spent in the Gather node is significant. On investigation we found that, with the current method, each tuple is copied to the shared queue and the receiver is notified individually. Since this copying happens in the shared queue, it incurs a lot of locking and latching overhead.

So, in this POC patch I copy all the tuples into a local queue first, avoiding those locks and latches entirely. Once the local queue is filled to capacity, the tuples are transferred to the shared queue, and once they have all been transferred the receiver is notified.
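
In code, the idea looks roughly like this (a minimal sketch with illustrative names, not the patch's actual identifiers; note that the real patch preserves per-tuple message framing when a batch is copied into the shared queue, which this sketch glosses over by sending the batch as a single message):

#include "postgres.h"
#include "storage/shm_mq.h"

#define LOCAL_QUEUE_SIZE    (64 * 1024)     /* batch size, for illustration */

typedef struct LocalTupleQueue
{
    char        buf[LOCAL_QUEUE_SIZE];
    Size        used;               /* bytes currently buffered locally */
} LocalTupleQueue;

/*
 * Append one tuple to the backend-local buffer without any locking.
 * Only when the buffer fills up do we touch the shared queue, so the
 * lock/latch cost is paid once per batch instead of once per tuple.
 */
static void
local_queue_put(LocalTupleQueue *lq, shm_mq_handle *mqh,
                const void *tuple, Size nbytes)
{
    Assert(nbytes <= LOCAL_QUEUE_SIZE);

    if (lq->used + nbytes > LOCAL_QUEUE_SIZE)
    {
        /* One locked transfer and one receiver notification per batch. */
        (void) shm_mq_send(mqh, lq->used, lq->buf, false);
        lq->used = 0;
    }

    memcpy(lq->buf + lq->used, tuple, nbytes);
    lq->used += nbytes;
}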

With this patch I see a significant performance improvement for simple queries:

head:
explain  analyse select * from t where i < 30000000;
                                                         QUERY PLAN                                                          
-----------------------------------------------------------------------------------------------------------------------------
 Gather  (cost=0.00..83225.55 rows=29676454 width=19) (actual time=1.379..35871.235 rows=29999999 loops=1)
   Workers Planned: 64
   Workers Launched: 64
   ->  Parallel Seq Scan on t  (cost=0.00..83225.55 rows=463695 width=19) (actual time=0.125..1415.521 rows=461538 loops=65)
         Filter: (i < 30000000)
         Rows Removed by Filter: 1076923
 Planning time: 0.180 ms
 Execution time: 38503.478 ms
(8 rows)

patch:
 explain  analyse select * from t where i < 30000000;
                                                         QUERY PLAN                                                         
----------------------------------------------------------------------------------------------------------------------------
 Gather  (cost=0.00..83225.55 rows=29676454 width=19) (actual time=0.980..24499.427 rows=29999999 loops=1)
   Workers Planned: 64
   Workers Launched: 64
   ->  Parallel Seq Scan on t  (cost=0.00..83225.55 rows=463695 width=19) (actual time=0.088..968.406 rows=461538 loops=65)
         Filter: (i < 30000000)
         Rows Removed by Filter: 1076923
 Planning time: 0.158 ms
 Execution time: 27331.849 ms
(8 rows)

head:
 explain  analyse select * from t where i < 40000000;
                                                         QUERY PLAN                                                          
-----------------------------------------------------------------------------------------------------------------------------
 Gather  (cost=0.00..83225.55 rows=39501511 width=19) (actual time=0.890..38438.753 rows=39999999 loops=1)
   Workers Planned: 64
   Workers Launched: 64
   ->  Parallel Seq Scan on t  (cost=0.00..83225.55 rows=617211 width=19) (actual time=0.074..1235.180 rows=615385 loops=65)
         Filter: (i < 40000000)
         Rows Removed by Filter: 923077
 Planning time: 0.113 ms
 Execution time: 41609.855 ms
(8 rows)

patch:
explain  analyse select * from t where i < 40000000;
                                                         QUERY PLAN                                                         
----------------------------------------------------------------------------------------------------------------------------
 Gather  (cost=0.00..83225.55 rows=39501511 width=19) (actual time=1.085..31806.671 rows=39999999 loops=1)
   Workers Planned: 64
   Workers Launched: 64
   ->  Parallel Seq Scan on t  (cost=0.00..83225.55 rows=617211 width=19) (actual time=0.083..954.342 rows=615385 loops=65)
         Filter: (i < 40000000)
         Rows Removed by Filter: 923077
 Planning time: 0.151 ms
 Execution time: 35341.429 ms
(8 rows)

head:
explain  analyse select * from t where i < 45000000;
                                                           QUERY PLAN                                                           
--------------------------------------------------------------------------------------------------------------------------------
 Gather  (cost=0.00..102756.80 rows=44584013 width=19) (actual time=0.563..49156.252 rows=44999999 loops=1)
   Workers Planned: 32
   Workers Launched: 32
   ->  Parallel Seq Scan on t  (cost=0.00..102756.80 rows=1393250 width=19) (actual time=0.069..1905.436 rows=1363636 loops=33)
         Filter: (i < 45000000)
         Rows Removed by Filter: 1666667
 Planning time: 0.106 ms
 Execution time: 52722.476 ms
(8 rows)

patch:
 explain  analyse select * from t where i < 45000000;
                                                           QUERY PLAN                                                           
--------------------------------------------------------------------------------------------------------------------------------
 Gather  (cost=0.00..102756.80 rows=44584013 width=19) (actual time=0.545..37501.200 rows=44999999 loops=1)
   Workers Planned: 32
   Workers Launched: 32
   ->  Parallel Seq Scan on t  (cost=0.00..102756.80 rows=1393250 width=19) (actual time=0.068..2165.430 rows=1363636 loops=33)
         Filter: (i < 45000000)
         Rows Removed by Filter: 1666667
 Planning time: 0.087 ms
 Execution time: 41458.969 ms
(8 rows)

The performance improvement is greatest when the selectivity is around 20-30%, a range for which parallelism is currently not selected at all.

I am testing the performance impact of this on the TPC-H queries; in the meantime, I would appreciate feedback on the design.

--
Regards,
Rafia Sabih



Attachment: faster_gather.patch (18K)

Re: [POC] Faster processing at Gather node

Robert Haas
On Fri, May 19, 2017 at 7:55 AM, Rafia Sabih
<[hidden email]> wrote:

> While analysing the performance of TPC-H queries with the newly developed
> parallel operators (parallel index scan, parallel bitmap heap scan, etc.),
> we noticed that the time spent in the Gather node is significant. On
> investigation we found that, with the current method, each tuple is copied
> to the shared queue and the receiver is notified individually. Since this
> copying happens in the shared queue, it incurs a lot of locking and
> latching overhead.
>
> So, in this POC patch I copy all the tuples into a local queue first,
> avoiding those locks and latches entirely. Once the local queue is filled
> to capacity, the tuples are transferred to the shared queue, and once they
> have all been transferred the receiver is notified.

What if, instead of doing this, we switched the shm_mq stuff to use atomics?
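
A rough sketch of that direction (hypothetical code, not what shm_mq.c
does today: currently mq_bytes_read and mq_bytes_written are plain
uint64s protected by the mq_mutex spinlock; since each counter has
only one writer, unlocked atomic reads and writes with explicit
barriers might suffice):

#include "postgres.h"
#include "port/atomics.h"

/* The two position counters as atomics instead of locked fields. */
typedef struct shm_mq_counters
{
    pg_atomic_uint64 mq_bytes_read;     /* advanced only by the receiver */
    pg_atomic_uint64 mq_bytes_written;  /* advanced only by the sender */
} shm_mq_counters;

/*
 * Sender side: publish newly written bytes without taking a lock.
 * The write barrier ensures the ring-buffer data becomes visible
 * before the counter update, so the receiver never reads stale bytes.
 */
static void
sender_inc_bytes_written(shm_mq_counters *mq, Size n)
{
    pg_write_barrier();
    pg_atomic_write_u64(&mq->mq_bytes_written,
                        pg_atomic_read_u64(&mq->mq_bytes_written) + n);
}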

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [POC] Faster processing at Gather node

akapila
On Fri, May 19, 2017 at 5:58 PM, Robert Haas <[hidden email]> wrote:

> On Fri, May 19, 2017 at 7:55 AM, Rafia Sabih
> <[hidden email]> wrote:
>> While analysing the performance of TPC-H queries with the newly developed
>> parallel operators (parallel index scan, parallel bitmap heap scan, etc.),
>> we noticed that the time spent in the Gather node is significant. On
>> investigation we found that, with the current method, each tuple is copied
>> to the shared queue and the receiver is notified individually. Since this
>> copying happens in the shared queue, it incurs a lot of locking and
>> latching overhead.
>>
>> So, in this POC patch I copy all the tuples into a local queue first,
>> avoiding those locks and latches entirely. Once the local queue is filled
>> to capacity, the tuples are transferred to the shared queue, and once they
>> have all been transferred the receiver is notified.
>
> What if, instead of doing this, we switched the shm_mq stuff to use atomics?
>

That is one of the very first things we tried, but it didn't show any
improvement, probably because sending tuples one by one over the
shm_mq is not cheap.  Also, independently, we tried reducing the
frequency of SetLatch (used to notify the receiver), but that didn't
improve the results either.  Now, one thing that could be tried is to
use atomics in shm_mq and also reduce the frequency of notifying the
receiver, but I am not sure that can give us better results than this
idea.  A couple of other ideas have been tried to improve the speed of
Gather, like avoiding the extra copy of the tuple that we need to make
before sending it (tqueueReceiveSlot->ExecMaterializeSlot) and
increasing the tuple queue size, but none of those showed any
noticeable improvement.  I am aware of all this because Dilip and I
were involved offlist in brainstorming ideas with Rafia to improve the
speed of Gather.  It might have been better to show the results of the
ideas that didn't work out, but I guess Rafia hasn't shared those on
the intuition that nobody would be interested in hearing what didn't
work.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [POC] Faster processing at Gather node

Alexander Kuzmenkov
In reply to this post by rafia.sabih
Hi Rafia,

I like the idea of reducing locking overhead by sending tuples in bulk.
The implementation could probably be simpler: you could extend the
shm_mq API to decouple notifying the receiver from actually putting
data into the queue (i.e., make shm_mq_notify_receiver public and add a
variant of shm_mq_sendv that doesn't send the notification). From Amit's
letter I understand that you have already tried something along these
lines and the performance wasn't good. What was the bottleneck then? If
it's the locking around mq_bytes_read/written, it could be rewritten
with atomics. I think this approach is worth trying because it doesn't
add much code, doesn't add any additional copying, and improves shm_mq
performance in general.
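
To make that concrete, a sketch of the proposed extension (the
_nonotify variant and the exposed notification function are
hypothetical; shm_mq_sendv, shm_mq_get_queue, and the currently-static
shm_mq_notify_receiver all exist in shm_mq.c today):

#include "postgres.h"
#include "storage/shm_mq.h"

/* Hypothetical: shm_mq_sendv minus the SetLatch on the receiver. */
extern shm_mq_result shm_mq_sendv_nonotify(shm_mq_handle *mqh,
                                           shm_iovec *iov, int iovcnt,
                                           bool nowait);
/* Hypothetical: the static notification routine, made public. */
extern void shm_mq_notify_receiver(shm_mq *mq);

/* Sender-side usage: enqueue a whole batch, wake the receiver once. */
static void
send_batch(shm_mq_handle *mqh, shm_iovec *iov, int ntuples)
{
    int         i;

    for (i = 0; i < ntuples; i++)
        (void) shm_mq_sendv_nonotify(mqh, &iov[i], 1, false);

    shm_mq_notify_receiver(shm_mq_get_queue(mqh));
}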

--
Alexander Kuzmenkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: [POC] Faster processing at Gather node

akapila
On Fri, Sep 8, 2017 at 11:07 PM, Alexander Kuzmenkov
<[hidden email]> wrote:
> Hi Rafia,
>
> I like the idea of reducing locking overhead by sending tuples in bulk. The
> implementation could probably be simpler: you could extend the shm_mq API
> to decouple notifying the receiver from actually putting data into the
> queue (i.e., make shm_mq_notify_receiver public and add a variant of
> shm_mq_sendv that doesn't send the notification).
>

Rafia can comment on the details, but I would like to bring to your
notice that we need a kind of local buffer (queue) for Gather Merge
processing as well, where the data needs to be fetched from the queues
in order.  So, there is always a chance that some of the workers have
filled their queues while waiting for the master to extract the data.
I think the patch posted by Rafia on the nearby thread [1] addresses
both problems with one patch.


[1] - https://www.postgresql.org/message-id/CAOGQiiNiMhq5Pg3LiYxjfi2B9eAQ_q5YjS%3DfHiBJmbSOF74aBQ%40mail.gmail.com

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [POC] Faster processing at Gather node

rafia.sabih
On Sat, Sep 9, 2017 at 8:14 AM, Amit Kapila <[hidden email]> wrote:

> On Fri, Sep 8, 2017 at 11:07 PM, Alexander Kuzmenkov
> <[hidden email]> wrote:
>> Hi Rafia,
>>
>> I like the idea of reducing locking overhead by sending tuples in bulk. The
>> implementation could probably be simpler: you could extend the shm_mq API
>> to decouple notifying the receiver from actually putting data into the
>> queue (i.e., make shm_mq_notify_receiver public and add a variant of
>> shm_mq_sendv that doesn't send the notification).
>>
>
> Rafia can comment on the details, but I would like to bring to your
> notice that we need a kind of local buffer (queue) for Gather Merge
> processing as well, where the data needs to be fetched from the queues
> in order.  So, there is always a chance that some of the workers have
> filled their queues while waiting for the master to extract the data.
> I think the patch posted by Rafia on the nearby thread [1] addresses
> both problems with one patch.
>
>
> [1] - https://www.postgresql.org/message-id/CAOGQiiNiMhq5Pg3LiYxjfi2B9eAQ_q5YjS%3DfHiBJmbSOF74aBQ%40mail.gmail.com
>

Thanks Alexander for your interest in this work. As rightly pointed
out by Amit, when experimenting with this patch we found that there
are cases when the master is busy and unable to read tuples from the
shared queue, so the workers get stuck because they cannot process any
more tuples. When experimenting along these lines, I found that Q12 of
TPC-H shows a great performance improvement when
shared_tuple_queue_size is increased [1].
It was then that we realised that merging this with the idea of giving
the illusion of a larger tuple queue through a local queue [1] could
be more beneficial. To explain precisely what merging the two ideas
means: we now write tuples to the local queue once the shared queue is
full, and as soon as enough tuples have accumulated in the local queue
we copy them from the local queue to the shared queue in one memcpy
call. This gives good performance improvements in quite a few cases.
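
In code, the merged scheme is roughly the following (illustrative
names again, reusing the LocalTupleQueue sketch from upthread; the
real shm_mq API resumes a partially-sent message on the next call, a
detail this sketch glosses over):

static void
enqueue_tuple(shm_mq_handle *mqh, LocalTupleQueue *lq,
              const void *tuple, Size nbytes)
{
    /* Fast path: nothing buffered yet, try a non-blocking direct send. */
    if (lq->used == 0 &&
        shm_mq_send(mqh, nbytes, tuple, true) == SHM_MQ_SUCCESS)
        return;

    /* Shared queue full (or a batch is pending): buffer locally ... */
    Assert(lq->used + nbytes <= LOCAL_QUEUE_SIZE); /* overflow omitted */
    memcpy(lq->buf + lq->used, tuple, nbytes);
    lq->used += nbytes;

    /* ... and opportunistically flush the whole batch with one copy. */
    if (shm_mq_send(mqh, lq->used, lq->buf, true) == SHM_MQ_SUCCESS)
        lq->used = 0;
}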

I'd be glad if you could have a look at this patch and share your
suggestions. :-)

[1] - https://www.postgresql.org/message-id/CAOGQiiNiMhq5Pg3LiYxjfi2B9eAQ_q5YjS%3DfHiBJmbSOF74aBQ%40mail.gmail.com

--
Regards,
Rafia Sabih
EnterpriseDB: http://www.enterprisedb.com/



Re: [POC] Faster processing at Gather node

Alexander Kuzmenkov
Thanks Rafia and Amit; now I understand the ideas behind the patch
better. I'll see if I can take a look at the new one.

--

Alexander Kuzmenkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


