JOIN on partitions is very slow

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
6 messages Options
Reply | Threaded
Open this post in threaded view
|

JOIN on partitions is very slow

daya airody
Hi folks,

We are using postgreSQL database and I am hitting some limits. I have partitions on company_sale_account table
based on company name

We generate a report on accounts matched between the two. Below is the query:

SELECT DISTINCT cpsa1.*
FROM company_sale_account cpsa1  
 JOIN  company_sale_account cpsa2  ON cpsa1.sale_account_id = cpsa2.sale_account_id
 WHERE  cpsa1.company_name = 'company_a'  
 AND cpsa2.company_name = 'company_b'


We have setup BTREE indexes on sale_account_id column on both the tables.
This worked fine till recently. Now, we have 10 million rows in
company_a partition and 7 million rows in company_b partition. This query is taking
more than 10 minutes.

Below is the explain plan output for it:

  Buffers: shared hit=20125996 read=47811 dirtied=75, temp read=1333427 written=1333427
  I/O Timings: read=19619.322
  ->  Sort  (cost=167950986.43..168904299.23 rows=381325118 width=132) (actual time=517017.334..603691.048 rows=16854094 loops=1)
        Sort Key: cpsa1.crm_account_id, ((cpsa1.account_name)::text), ((cpsa1.account_owner)::text), ((cpsa1.account_type)::text), cpsa1.is_customer, ((date_part('epoch'::text, cpsa1.created_date))::integer), ((hstore_to_json(cpsa1.custom_crm_fields))::tex (...)
        Sort Method: external merge  Disk: 2862656kB
        Buffers: shared hit=20125996 read=47811 dirtied=75, temp read=1333427 written=1333427
        I/O Timings: read=19619.322
        ->  Nested Loop  (cost=0.00..9331268.39 rows=381325118 width=132) (actual time=1.680..118698.570 rows=16854094 loops=1)
              Buffers: shared hit=20125977 read=47811 dirtied=75
              I/O Timings: read=19619.322
              ->  Append  (cost=0.00..100718.94 rows=2033676 width=33) (actual time=0.014..1783.243 rows=2033675 loops=1)
                    Buffers: shared hit=75298 dirtied=75
                    ->  Seq Scan on company_sale_account cpsa2  (cost=0.00..0.00 rows=1 width=516) (actual time=0.001..0.001 rows=0 loops=1)
                          Filter: ((company_name)::text = 'company_b'::text)
                    ->  Seq Scan on company_sale_account_concur cpsa2_1  (cost=0.00..100718.94 rows=2033675 width=33) (actual time=0.013..938.145 rows=2033675 loops=1)
                          Filter: ((company_name)::text = 'company_b'::text)
                          Buffers: shared hit=75298 dirtied=75
              ->  Append  (cost=0.00..1.97 rows=23 width=355) (actual time=0.034..0.047 rows=8 loops=2033675)
                    Buffers: shared hit=20050679 read=47811
                    I/O Timings: read=19619.322
                    ->  Seq Scan on company_sale_account cpsa1  (cost=0.00..0.00 rows=1 width=4525) (actual time=0.000..0.000 rows=0 loops=2033675)
                          Filter: (((company_name)::text = 'company_a'::text) AND ((cpsa2.sale_account_id)::text = (sale_account_id)::text))
                    ->  Index Scan using ix_csa_adp_sale_account on company_sale_account_adp cpsa1_1  (cost=0.56..1.97 rows=22 width=165) (actual time=0.033..0.042 rows=8 loops=2033675)
                          Index Cond: ((sale_account_id)::text = (cpsa2.sale_account_id)::text)
                          Filter: ((company_name)::text = 'company_a'::text)
                          Buffers: shared hit=20050679 read=47811
                          I/O Timings: read=19619.322
Planning time: 30.853 ms
Execution time: 618218.321 ms


Do you have any suggestion on how to tune postgres.
Please share your thoughts. It would be a great help to me.
Reply | Threaded
Open this post in threaded view
|

Re: JOIN on partitions is very slow

Michael Lewis
Are you able to tweak the query or is that generated by an ORM? What version of Postgres? Which configs have you changed from default? How many partitions do you have? Is there an index on company name?

Anytime I see distinct keyword, I expect it to be a performance bottleneck and wonder about rewriting the query. Even just using group by can be much faster because of how it gets executed.
Reply | Threaded
Open this post in threaded view
|

Re: JOIN on partitions is very slow

daya airody
Yes. I can tweak the query. Version of postgres is 9.5.15. I have about 20 partitions for company_sale_account table. 
I do have an index on company name.

I need to use DISTINCT as i need to remove the duplicates.

Thanks for your time.



On Sun, Mar 22, 2020 at 11:38 PM Michael Lewis <[hidden email]> wrote:
Are you able to tweak the query or is that generated by an ORM? What version of Postgres? Which configs have you changed from default? How many partitions do you have? Is there an index on company name?

Anytime I see distinct keyword, I expect it to be a performance bottleneck and wonder about rewriting the query. Even just using group by can be much faster because of how it gets executed.
Reply | Threaded
Open this post in threaded view
|

Re: JOIN on partitions is very slow

Michael Lewis
On Mon, Mar 23, 2020 at 1:40 AM daya airody <[hidden email]> wrote:
Yes. I can tweak the query. Version of postgres is 9.5.15. I have about 20 partitions for company_sale_account table. 
I do have an index on company name.

I need to use DISTINCT as i need to remove the duplicates.

DISTINCT is a sign of improper joins most of the time in my experience. Often, just changing to group by is faster

SELECT cpsa1.*
FROM company_sale_account cpsa1  
 JOIN  company_sale_account cpsa2  ON cpsa1.sale_account_id = cpsa2.sale_account_id
 WHERE  cpsa1.company_name = 'company_a'  
 AND cpsa2.company_name = 'company_b'
GROUP BY cpsa1.id; --assuming primary key exists, and I forget if the feature that allows only naming primary key in group by might have been introduced with 9.6

It should be noted that 9.5 is about 1 year from being EOL'd so it would be prudent to update to v11 or 12 when possible.

How does the below query perform? By the way, "top posting" (replying with all previous email thread below your reply) is discouraged on these forums. It makes the reviewing archived posts more cumbersome. Instead, please reply with only your message and copying the relevant parts of prior conversation that you are responding to. 

SELECT cpsa1.*
FROM company_sale_account cpsa1  
WHERE cpsa1.company_name = 'company_a' AND EXISTS(SELECT FROM  company_sale_account cpsa2 WHER cpsa1.sale_account_id = cpsa2.sale_account_id AND cpsa2.company_name = 'company_b' );
 
Reply | Threaded
Open this post in threaded view
|

Re: JOIN on partitions is very slow

Thomas Kellerer-4
Michael Lewis schrieb am 23.03.2020 um 17:16:
> Yes. I can tweak the query. Version of postgres is 9.5.15. I have
> about 20 partitions for company_sale_account table. I do have an
> index on company name.
>
> I need to use DISTINCT as i need to remove the duplicates.
>
>
> DISTINCT is a sign of improper joins most of the time in my
> experience. Often, just changing to group by is faster

As none of the columns of the joined table are used, most probably
this should be re-written as an EXISTS condition.
Then neither GROUP BY nor DISTINCT is needed.





Reply | Threaded
Open this post in threaded view
|

Re: JOIN on partitions is very slow

daya airody


As none of the columns of the joined table are used, most probably
this should be re-written as an EXISTS condition.
Then neither GROUP BY nor DISTINCT is needed.

 
I need the columns from joined tables. To keep it simple, I didn't include them in the query. EXISTS solution won't work for me.