Merge join doesn't seem to break early when I (and planner) think it should - 10.4

Timothy Garnett
Hi all,

Query plan quick link: https://explain.depesz.com/s/JVxn
Version: PostgreSQL 10.4 on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 5.4.0-6ubuntu1~16.04.10) 5.4.0 20160609, 64-bit

tbl: ~780 million rows, bigint primary key (keytbl), col1 is smallint and there is an index on (col1, col2)
tmp_tbl: ~10 million rows; columns are identical to tbl, no indices or primary key
Both tables are analyzed.

I'm doing an update of tbl from tmp_tbl, joining on the primary key column of tbl. The planner picks a merge join, which takes about 4 hours. If I force it to use a hash join instead (set local enable_mergejoin=false), it takes about 2.5 minutes (see https://explain.depesz.com/s/vtLe for the resulting plan; this is my current workaround). The index scan done for the merge join below [*] is what eats up the time, and the planner knows it's expensive [*], but I think it expects the scan to stop early [**]. However, judging by the rows removed by the filter [***] and the time taken, it appears to index scan the entire table. I don't think the planner's wrong here: the merge join should break early, since the max value of keytbl in tmp_tbl is less than all but a small portion of tbl (see **** below). So it should only have to go ~12.5 million rows into the index scan before stopping. Continued below...
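For reference, the workaround can be scoped to a single transaction so the planner setting doesn't leak to other queries. This is a sketch: the real SET list and extra WHERE conditions are redacted in this post, so the assignment below is a placeholder.

```sql
-- Sketch of the workaround: disable merge join for this statement only.
-- SET LOCAL confines the change to the current transaction.
BEGIN;
SET LOCAL enable_mergejoin = off;

UPDATE tbl t
   SET col2 = tt.col2            -- placeholder; real SET list is redacted
  FROM tmp_tbl tt
 WHERE t.keytbl = tt.keytbl
   AND t.col1 IN (13, 14);

COMMIT;                          -- enable_mergejoin reverts automatically
```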

explain analyze:
Update on schema.tbl t  (cost=5831491.55..7428707.35 rows=120536 width=523) (actual time=12422900.337..12422900.337 rows=0 loops=1)
  ->  Merge Join  (cost=5831491.55..7428707.35[**] rows=120536 width=523) (actual time=121944.122..12406202.383 rows=86663 loops=1)
        Merge Cond: (t.keytbl = tt.keytbl)
        Join Filter: <removed, see link>
        Rows Removed by Join Filter: 9431176
        ->  Index Scan using tbl_pkey on schema.tbl t  (cost=0.57..302404355.44[*] rows=9680745 width=273) (actual time=99112.377..12354593.205 rows=9517839 loops=1)
              Filter: (t.col1 = ANY ('{13,14}'::integer[]))
              Rows Removed by Filter: 769791484[***]
        ->  Materialize  (cost=2807219.47..2855692.46 rows=9694598 width=489) (actual time=19432.549..31462.007 rows=9616269 loops=1)
              ->  Sort  (cost=2807219.47..2831455.96 rows=9694598 width=489) (actual time=19432.537..23493.665 rows=9616269 loops=1)
                    Sort Key: tt.keytbl
                    Sort Method: quicksort  Memory: 3487473kB
                    ->  Seq Scan on schema.tmp_tbl tt  (cost=0.00..389923.98 rows=9694598 width=489) (actual time=0.023..8791.086 rows=9692217 loops=1)
Planning time: 4.454 ms
Execution time: 12438992.607 ms


select max(keytbl) from tmp_tbl;
  max         3940649685073901
select count(*) from tbl where keytbl <= 3940649685073901;
  count       12454354 [****]
select max(keytbl) from tbl;
  max         147211412825225362



So far I've been unable to create a smaller / toy example that exhibits the same behavior. Some things that may be unusual about the situation: keytbl is bigint and the values are large (all are > 2^48) and sparse/dense (big chunks where the id advances by 1, separated by large (> 2^48) regions with no rows); the top 200k or so rows of tmp_tbl by keytbl don't have a corresponding row in tbl; and this is a bit of an older dot release (10.4). I have a workaround (disabling merge join for the query), so I'm mostly trying to figure out what's going on and whether I'm understanding the situation correctly.

It's interesting that even if it worked as expected, the merge join plan seems a lot riskier: if analyze missed a single large outlier value of keytbl in tmp_tbl, or if a row with a large keytbl was inserted into tmp_tbl since the last analyze, the plan could be forced to walk the entire index of tbl (which, based on the filter count, looks like it involves touching every row of this large table for the filter, even rows with no corresponding row to merge to).
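One cheap way to spot such an outlier before trusting an early-stopping plan is to compare the driving table's maximum key against the big table's key space. This is a sketch using the tables above; the count(*) should be satisfied by an index-only scan on tbl's primary key:

```sql
-- If max(tmp_tbl.keytbl) sits far into tbl's key space, the merge join's
-- index scan cannot stop early, and the hash join is the safer choice.
SELECT (SELECT max(keytbl) FROM tmp_tbl) AS tmp_max,
       (SELECT count(*)
          FROM tbl
         WHERE keytbl <= (SELECT max(keytbl) FROM tmp_tbl)) AS rows_before_stop;
```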

Additional info:
Table schema (for tbl and tmp_tbl)
            Column             |            Type             | Collation | Nullable |      Default      
-------------------------------+-----------------------------+-----------+----------+-------------------
 keytbl                        | bigint                      |           | not null |
                               | smallint                    |           | not null |
 col1                          | smallint                    |           | not null |
 col2                          | integer                     |           | not null |
                               | integer                     |           |          |
                               | integer                     |           |          |
                               | character varying(100)      |           |          |
                               | character varying(100)      |           |          |
                               | date                        |           | not null |
                               | integer                     |           |          |
                               | smallint                    |           |          |
                               | smallint                    |           |          |
                               | integer                     |           |          |
                               | smallint                    |           |          |
                               | bigint                      |           |          |
                               | smallint                    |           |          |
                               | character varying(2)        |           |          |
                               | text                        |           |          |
                               | character varying(3)        |           |          |
                               | numeric(14,2)               |           |          |
                               | smallint                    |           |          |
                               | numeric(13,3)               |           |          |
                               | smallint                    |           |          |
                               | boolean                     |           |          |
                               | smallint                    |           |          |
                               | character varying(2)        |           |          |
                               | smallint                    |           |          |
                               | character varying(2)        |           |          |
                               | smallint                    |           |          |
                               | integer                     |           |          |
                               | integer                     |           |          |
                               | bigint                      |           |          |
                               | smallint                    |           | not null | 0
                               | timestamp without time zone |           | not null | CURRENT_TIMESTAMP
                               | timestamp without time zone |           | not null | CURRENT_TIMESTAMP
                               | timestamp without time zone |           | not null | CURRENT_TIMESTAMP
                               | timestamp without time zone |           | not null | CURRENT_TIMESTAMP
                               | integer                     |           |          |
                               | integer                     |           |          |
                               | integer                     |           |          |
Indexes (tbl only, not on tmp_tbl):
    "xxxx" PRIMARY KEY, btree (keytbl)
    "index_1" UNIQUE, btree (col1, col2)
    "index_2" btree (x, z DESC) WHERE x IS NOT NULL
    "index_3" btree (y, z DESC) WHERE y IS NOT NULL


Query:
 explain analyze verbose 
      UPDATE "tbl" t
         SET xxxxx
        FROM "tmp_tbl" tt
       WHERE t."keytbl" = tt."keytbl" AND t.col1 IN (13,14) AND (xxx)

Thanks,
Tim

Re: Merge join doesn't seem to break early when I (and planner) think it should - 10.4

Jeff Janes
On Thu, Dec 26, 2019 at 6:57 PM Timothy Garnett <[hidden email]> wrote:

So far I've been unable to create a smaller / toy example that exhibits the same behavior. Some things that may be unusual about the situation: keytbl is bigint and the values are large (all are > 2^48) and sparse/dense (big chunks where the id advances by 1 separated by large (> 2^48) regions with no rows), the top 200k or so rows of tmp_table by keytbl don't have a corresponding row in tbl, and this is a bit of an older dot release (10.4).  I have a workaround (disabling merge join for the query) so I'm mostly trying to figure out what's going on and if I'm understanding the situation correctly.

Can you share the toy example, using things like random() and generate_series() to populate it?  Preferably scaled down to 10 million rows or so in the larger table.
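A sketch of the kind of setup being asked for, scaled down and using hypothetical names (toy_tbl, toy_tmp). The sparse/dense key pattern from the original post is approximated by dense runs of a million ids, each run offset by a multiple of 2^48:

```sql
-- Hypothetical toy reproduction: ~10M bigint keys in dense runs separated
-- by huge empty gaps, as in the original post (all values > 2^48 apart).
CREATE TABLE toy_tbl AS
SELECT (chunk::bigint << 48) + i        AS keytbl,
       (13 + (random() * 4)::int)::smallint AS col1
  FROM generate_series(0, 9)       AS chunk,
       generate_series(1, 1000000) AS i;
ALTER TABLE toy_tbl ADD PRIMARY KEY (keytbl);

-- Small table covering only the low end of the key space,
-- so a correct merge join could stop early.
CREATE TABLE toy_tmp AS
SELECT keytbl, col1
  FROM toy_tbl
 WHERE keytbl < (2::bigint << 48);  -- only the first two chunks

ANALYZE toy_tbl;
ANALYZE toy_tmp;

EXPLAIN ANALYZE
UPDATE toy_tbl t
   SET col1 = tt.col1
  FROM toy_tmp tt
 WHERE t.keytbl = tt.keytbl
   AND t.col1 IN (13, 14);
```

Whether this reproduces the full-index scan presumably depends on the statistics gathered, so it is only a starting point, not a confirmed reproduction.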

Does it reproduce in 10.11?  If not, then there is really nothing worth looking into.  Any fix that can be done would certainly not be re-released into 10.4.  And does it reproduce in 12.1 or 13dev?  Because chances are any improvement wouldn't even be back-patched into any minor release at all.
 
It's interesting that even if it worked as expected, the merge join plan seems a lot riskier in that if the analyze didn't catch a single large outlier value of keytbl in tmp_tbl or a row with a large value for keytbl was inserted into tmp_tbl since the last analyze it could be forced to walk the entire index of the tbl (which based on the filter count looks like it involves touching each row of this large table for the filter even if it doesn't have a corresponding row to merge to).

There has been discussion of building a riskiness factor into the planner, but it has never gone anywhere.  Everything has its own risk (with Hash Joins, for example, the data could be pathological and everything might hash to a few buckets, or 32 bits of hashcode might not be enough bits).  By the time you can adequately analyze all the risks, you would probably learn enough to just make the planner better absolutely, without adding another dimension to all the things it considers.
 
Cheers,

Jeff