NOT IN vs. NOT EXISTS performance

classic Classic list List threaded Threaded
4 messages Options
Reply | Threaded
Open this post in threaded view
|

NOT IN vs. NOT EXISTS performance

Lincoln Swaine-Moore
Hi all,

I've figured out how to solve the performance issues I've been encountering with a particular query, but I'm interested in better understanding the intuition behind why the better query is so much more performant.

The query in question involves a NOT IN filter from a CTE:

WITH temp as (
  SELECT temp_id
  FROM other_table
  WHERE ...
)
SELECT COUNT(*)
FROM main_table m
WHERE m.main_id NOT IN (SELECT temp_id from temp);

The query plan for (the non-masked version of) this query includes:
Filter: (NOT (SubPlan 1))
   Rows Removed by Filter: 1
   SubPlan 1
     ->  CTE Scan on temp  (cost=0.00..1950.60 rows=97530 width=418) (actual time=0.018..4.170 rows=15697 loops=100731)

My understanding is that PostgreSQL is not able to make this into a hashed Subplan because the expected number of rows * width of rows is too large for the work_mem setting, and so instead it has to do many repeated linear passes to find whether the various values of main_id are in the list of temp_ids.

The resolution to this problem was discovered via https://explainextended.com/2009/09/16/not-in-vs-not-exists-vs-left-join-is-null-postgresql/, and involves rewriting as so:

WITH temp as (
  SELECT temp_id
  FROM other_table
  WHERE ...
)
SELECT COUNT(*)
FROM main_table m 
WHERE NOT EXISTS (SELECT temp_id from temp where temp_id = m.main_id);

The query plan for this query (which I believe is equivalent to what it would be if I instead explicitly LEFT JOINed main_table to temp and used a WHERE to filter out NULL values of temp_id) does not involve a high number of loops like above:

->  Merge Anti Join  (cost=147305.45..149523.68 rows=192712 width=4) (actual time=5050.773..5622.266 rows=158850 loops=1)
   Merge Cond: (m.main_id = temp.temp_id)
   ->  Sort  (cost=115086.98..115637.10 rows=220050 width=19) (actual time=1290.829..1655.066 rows=199226 loops=1)
         Sort Key: m.main_id
         Sort Method: external merge  Disk: 5632kB
   ->  Materialize  (cost=32218.47..32764.37 rows=109180 width=418) (actual time=3759.936..3787.724 rows=38268 loops=1)
         ->  Sort  (cost=32218.47..32491.42 rows=109180 width=418) (actual time=3759.933..3771.117 rows=38268 loops=1)
               Sort Key: temp.temp_id
               Sort Method: quicksort  Memory: 3160kB
               ->  CTE Scan on temp  (cost=0.00..2183.60 rows=109180 width=418) (actual time=2316.745..3735.486 rows=38268 loops=1)

Instead it sorts using disk, and uses a MERGE ANTI JOIN, which I understand to eliminate the need for multiple passes through the table temp.

My primary question is: why is this approach only possible (for data too large for memory) when using NOT EXISTS, and not when using NOT IN? 

I understand that there is a slight difference in the meaning of the two expressions, in that NOT IN will produce NULL if there are any NULL values in the right hand side (in this case there are none, and the queries should return the same COUNT). But if anything, I would expect that to improve performance of the NOT IN operation, since a single pass through that data should reveal if there are any NULL values, at which point that information could be used to short-circuit. So I am a bit baffled.

Thanks very much for your help!

Best,
Lincoln

--
Lincoln Swaine-Moore

Reply | Threaded
Open this post in threaded view
|

Re: NOT IN vs. NOT EXISTS performance

David Rowley-3
On 9 November 2018 at 08:35, Lincoln Swaine-Moore
<[hidden email]> wrote:

> My primary question is: why is this approach only possible (for data too
> large for memory) when using NOT EXISTS, and not when using NOT IN?
>
> I understand that there is a slight difference in the meaning of the two
> expressions, in that NOT IN will produce NULL if there are any NULL values
> in the right hand side (in this case there are none, and the queries should
> return the same COUNT). But if anything, I would expect that to improve
> performance of the NOT IN operation, since a single pass through that data
> should reveal if there are any NULL values, at which point that information
> could be used to short-circuit. So I am a bit baffled.

The problem is that the planner makes the plan and would have to know
beforehand that no NULLs could exist on either side of the join. For
more simple cases it could make use of NOT NULL constaints, but more
complex cases exist, such as:

SELECT * FROM t1 LEFT JOIN t2 ON t1.x = t2.y WHERE t2.y NOT IN(SELECT
z FROM t3);

There's a bit more reading about the complexity of this in [1]

[1] https://www.postgresql.org/message-id/flat/CAMkU%3D1zPVbez_HWao781L8PzFk%2Bd1J8VaJuhyjUHaRifk6OcUA%40mail.gmail.com#7c6d3178c18103d8508f3ec5982b1b8e

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Reply | Threaded
Open this post in threaded view
|

Re: NOT IN vs. NOT EXISTS performance

Merlin Moncure-2
On Thu, Nov 8, 2018 at 3:12 PM David Rowley
<[hidden email]> wrote:

>
> On 9 November 2018 at 08:35, Lincoln Swaine-Moore
> <[hidden email]> wrote:
> > My primary question is: why is this approach only possible (for data too
> > large for memory) when using NOT EXISTS, and not when using NOT IN?
> >
> > I understand that there is a slight difference in the meaning of the two
> > expressions, in that NOT IN will produce NULL if there are any NULL values
> > in the right hand side (in this case there are none, and the queries should
> > return the same COUNT). But if anything, I would expect that to improve
> > performance of the NOT IN operation, since a single pass through that data
> > should reveal if there are any NULL values, at which point that information
> > could be used to short-circuit. So I am a bit baffled.
>
> The problem is that the planner makes the plan and would have to know
> beforehand that no NULLs could exist on either side of the join.

Yeah, the core issue is the SQL rules that define NOT IN behaves as:
postgres=# select 1 not in (select 2);
 ?column?
──────────
 t
(1 row)

postgres=# select 1 not in (select 2 union all select null);
 ?column?
──────────

(1 row)

There's a certain logic to it but it's a death sentence for performance.

merlin

Reply | Threaded
Open this post in threaded view
|

Re: NOT IN vs. NOT EXISTS performance

Lincoln Swaine-Moore
Thanks, both! 

That's a very interesting thread. I was confident this was a subject that had been discussed--just wasn't sure where--so thank you for forwarding.

I guess the big-picture summary is that NOT IN's definition introduces complexity (the nature of which I now understand better) that is usually unwarranted by the question the querier is asking. So NOT EXISTS will almost always be preferable when a subquery is involved, unless the behavior around NULL values is specifically desired.

On Fri, Nov 9, 2018 at 8:45 AM Merlin Moncure <[hidden email]> wrote:
On Thu, Nov 8, 2018 at 3:12 PM David Rowley
<[hidden email]> wrote:
>
> On 9 November 2018 at 08:35, Lincoln Swaine-Moore
> <[hidden email]> wrote:
> > My primary question is: why is this approach only possible (for data too
> > large for memory) when using NOT EXISTS, and not when using NOT IN?
> >
> > I understand that there is a slight difference in the meaning of the two
> > expressions, in that NOT IN will produce NULL if there are any NULL values
> > in the right hand side (in this case there are none, and the queries should
> > return the same COUNT). But if anything, I would expect that to improve
> > performance of the NOT IN operation, since a single pass through that data
> > should reveal if there are any NULL values, at which point that information
> > could be used to short-circuit. So I am a bit baffled.
>
> The problem is that the planner makes the plan and would have to know
> beforehand that no NULLs could exist on either side of the join.

Yeah, the core issue is the SQL rules that define NOT IN behaves as:
postgres=# select 1 not in (select 2);
 ?column?
──────────
 t
(1 row)

postgres=# select 1 not in (select 2 union all select null);
 ?column?
──────────

(1 row)

There's a certain logic to it but it's a death sentence for performance.

merlin


--
Lincoln Swaine-Moore