insert and query performance on big string table with pg_trgm

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
8 messages Options
Reply | Threaded
Open this post in threaded view
|

insert and query performance on big string table with pg_trgm

Matthew Hall
Hello PGSQL experts,

I've used your great database pretty heavily for the last 4 years, and during
that time it's helped me to solve an amazingly wide variety of data
challenges. Last week, I finally ran into something weird enough I couldn't
figure it out by myself. I'm using a self-compiled copy from latest 10.x
stable branch, Ubuntu 16.04 LTS, inserts with psycopg2, queries (so far) with
psql for testing, later JDBC (PostgreSQL 10.1 on x86_64-pc-linux-gnu, compiled
by gcc (Ubuntu 5.4.0-6ubuntu1~16.04.5) 5.4.0 20160609, 64-bit).

I have this great big table of strings, about 180 million rows. I want to be
able to search this table for substring matches and overall string similarity
against various inputs (an ideal use case for pg_trgm, from what I can see in
the docs and the research articles for such indexing).

I need a unique b-tree index on the strings, to prevent duplicates in the
input in the beginning, and from adding new strings in the future, and the
{gin,gist}_trgm_ops index to speed up the matching. I couldn't fully
understand from the docs if my use case was a better fit for GIN, or for GIST.
Some parts of the docs implied GIST would be faster, but only for less than
100K entries, at which point GIN would be faster. I am hoping someone could
comment.

Here is the table:

                                   Unlogged table "public.huge_table"
   Column    |           Type           | Collation | Nullable |                    Default
-------------+--------------------------+-----------+----------+-----------------------------------------------
 id          | bigint                   |           | not null | nextval('huge_table_id_seq'::regclass)
 inserted_ts | timestamp with time zone |           |          | transaction_timestamp()
 value       | character varying        |           |          |
Indexes:
    "huge_table_pkey" PRIMARY KEY, btree (id)
    "huge_table_value_idx" UNIQUE, btree (value)
    "huge_table_value_trgm" gin (value gin_trgm_ops)

I managed to load the table initially in about 9 hours, after doing some
optimizations below based on various documentation (the server is 8-core Xeon
E5504, 16 GB RAM, 4 Hitachi 1TB 7200 RPM in a RAID 5 via Linux MD):

* compiled latest 10.x stable code branch from Git
* unlogged table (risky but made a big difference)
* shared_buffers 6 GB
* work_mem 32 MB
* maintenance_work_mem 512 MB
* effective_cache_size 10 GB
* synchronous_commit off
* wal_buffers 16 MB
* max_wal_size 4 GB
* checkpoint_completion_target 0.9
* auto_explain, and slow log for >= 1000 msecs (to debug this)

I'm noticing that the performance of inserts starts slipping quite a bit, as
the data is growing. It starts out fast, <1 sec per batch of 5000, but
eventually slows to 5-10 sec. per batch, sometimes randomly more.

In this example, it was just starting to slow, taking 4 secs to insert 5000
values:

2017-11-18 08:10:21 UTC [29578-11250] arceo@osint LOG:  duration: 4034.901 ms  plan:
        Query Text: INSERT INTO huge_table (value) VALUES
        ('value1'),
        ... 4998 more values ...
        ('value5000')
        ON CONFLICT (value) DO NOTHING
        Insert on huge_table  (cost=0.00..87.50 rows=5000 width=48)
          Conflict Resolution: NOTHING
          Conflict Arbiter Indexes: huge_table_value_idx
          ->  Values Scan on "*VALUES*"  (cost=0.00..87.50 rows=5000 width=48)

When it's inserting, oddly enough, the postgres seems mostly CPU limited,
where I would have expected more of an IO limit personally, and the memory
isn't necessarily over-utilized either, so it makes me wonder if I missed some
things.

KiB Mem : 16232816 total,   159196 free,   487392 used, 15586228 buff/cache
KiB Swap: 93702144 total, 93382320 free,   319816 used.  8714944 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
29578 postgres  20   0 6575672 6.149g 6.139g R  86.0 39.7  45:24.97 postgres

As for queries, doing a simple query like this one seems to require around 30
seconds to a minute. My volume is not crazy high but I am hoping I could get
this down to less than 30 seconds, because other stuff above this code will
start to time out otherwise:

osint=# explain analyze select * from huge_table where value ilike '%keyword%';
                                                                     QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on huge_table  (cost=273.44..61690.09 rows=16702 width=33) (actual time=2897.847..58438.545 rows=16423 loops=1)
   Recheck Cond: ((value)::text ~~* '%keyword%'::text)
   Rows Removed by Index Recheck: 3
   Heap Blocks: exact=5954
   ->  Bitmap Index Scan on huge_table_value_trgm  (cost=0.00..269.26 rows=16702 width=0) (actual time=2888.846..2888.846 rows=16434 loops=1)
         Index Cond: ((value)::text ~~* '%keyword%'::text)
 Planning time: 0.252 ms
 Execution time: 58442.413 ms
(8 rows)

Thanks for reading this and letting me know any recommendations.

Sincerely,
Matthew Hall
Reply | Threaded
Open this post in threaded view
|

Re: insert and query performance on big string table with pg_trgm

Jeff Janes
On Mon, Nov 20, 2017 at 2:54 PM, Matthew Hall <[hidden email]> wrote:

While I have not done exhaustive testing, from the tests I have done I've never found gist to be better than gin with trgm indexes.
 

Here is the table:

                                   Unlogged table "public.huge_table"
   Column    |           Type           | Collation | Nullable |                    Default
-------------+--------------------------+-----------+----------+-----------------------------------------------
 id          | bigint                   |           | not null | nextval('huge_table_id_seq'::regclass)
 inserted_ts | timestamp with time zone |           |          | transaction_timestamp()
 value       | character varying        |           |          |
Indexes:
    "huge_table_pkey" PRIMARY KEY, btree (id)
    "huge_table_value_idx" UNIQUE, btree (value)
    "huge_table_value_trgm" gin (value gin_trgm_ops)

Do you really need the artificial primary key, when you already have another column that would be used as the primary key?  If you need to use this it a foreign key in another type, then very well might.  But maintaining two unique indexes doesn't come free.

Are all indexes present at the time you insert?  It will probably be much faster to insert without the gin index (at least) and build it after the load.

Without knowing this key fact, it is hard to interpret the rest of your data.
 

I managed to load the table initially in about 9 hours, after doing some
optimizations below based on various documentation (the server is 8-core Xeon
E5504, 16 GB RAM, 4 Hitachi 1TB 7200 RPM in a RAID 5 via Linux MD):
 ...
 
 
* maintenance_work_mem 512 MB

Building a gin index in bulk could benefit from more memory here. 

* synchronous_commit off

If you already are using unlogged tables, this might not be so helpful, but does increase the risk of the rest of your system.

 
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
29578 postgres  20   0 6575672 6.149g 6.139g R  86.0 39.7  45:24.97 postgres

You should expand the command line (by hitting 'c', at least in my version of top) so we can see which postgres process this is.
 

As for queries, doing a simple query like this one seems to require around 30
seconds to a minute. My volume is not crazy high but I am hoping I could get
this down to less than 30 seconds, because other stuff above this code will
start to time out otherwise:

osint=# explain analyze select * from huge_table where value ilike '%keyword%';

explain (analyze, buffers), please.  And hopefully with track_io_timing=on.

If you repeat the same query, is it then faster, or is it still slow?

Cheers,

Jeff
Reply | Threaded
Open this post in threaded view
|

Re: insert and query performance on big string table with pg_trgm

Matthew Hall
Hi Jeff,

Thanks so much for writing. You've got some great points.

> On Nov 20, 2017, at 5:42 PM, Jeff Janes <[hidden email]> wrote:
> While I have not done exhaustive testing, from the tests I have done I've never found gist to be better than gin with trgm indexes.

Thanks, this helps considerably, as the documentation was kind of confusing and I didn't want to get it wrong if I could avoid it.

> Do you really need the artificial primary key, when you already have another column that would be used as the primary key?  If you need to use this it a foreign key in another type, then very well might.  But maintaining two unique indexes doesn't come free.

OK, fair enough, I'll test with it removed and see what happens.

> Are all indexes present at the time you insert?  It will probably be much faster to insert without the gin index (at least) and build it after the load.

There is some flexibility on the initial load, but the updates in the future will require the de-duplication capability. I'm willing to accept that might be somewhat slower on the load process, to get the accurate updates, provided we could try meeting the read-side goal I wrote about, or at least figure out why it's impossible, so I can understand what I need to fix to make it possible.

> Without knowing this key fact, it is hard to interpret the rest of your data.

I'm assuming you're referring to the part about the need for the primary key, and the indexes during loading? I did try to describe that in the earlier mail, but obviously I'm new at writing these, so sorry if I didn't make it more clear. I can get rid of the bigserial PK and the indexes could be made separately, but I would need a way to de-duplicate on future reloading... that's why I had the ON CONFLICT DO NOTHING expression on the INSERT. So we'd still want to learn why the INSERT is slow to fix up the update processes that would happen in the future.

> * maintenance_work_mem 512 MB
>
> Building a gin index in bulk could benefit from more memory here.

Fixed it; I will re-test w/ 1 GB. Have you got any recommended values so I don't screw it up?

> * synchronous_commit off
>
> If you already are using unlogged tables, this might not be so helpful, but does increase the risk of the rest of your system.

Fixed it; the unlogged mode change came later than this did.

>   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
> 29578 postgres  20   0 6575672 6.149g 6.139g R  86.0 39.7  45:24.97 postgres
>
> You should expand the command line (by hitting 'c', at least in my version of top) so we can see which postgres process this is.

Good point, I'll write back once I retry w/ your other advice.

> explain (analyze, buffers), please.  And hopefully with track_io_timing=on.

track_io_timing was missing because sadly I had only found it in one document at the very end of the investigation, after doing the big job which generated all of the material posted. It's there now, so here is some better output on the query:

explain (analyze, buffers) select * from huge_table where value ilike '%canada%';

 Bitmap Heap Scan on huge_table  (cost=273.44..61690.09 rows=16702 width=33) (actual time=5701.511..76469.688 rows=110166 loops=1)
   Recheck Cond: ((value)::text ~~* '%canada%'::text)
   Rows Removed by Index Recheck: 198
   Heap Blocks: exact=66657
   Buffers: shared hit=12372 read=56201 dirtied=36906
   I/O Timings: read=74195.734
   ->  Bitmap Index Scan on huge_table_value_trgm  (cost=0.00..269.26 rows=16702 width=0) (actual time=5683.032..5683.032 rows=110468 loops=1)
         Index Cond: ((value)::text ~~* '%canada%'::text)
         Buffers: shared hit=888 read=1028
         I/O Timings: read=5470.839
 Planning time: 0.271 ms
 Execution time: 76506.949 ms

I will work some more on the insert piece.

> If you repeat the same query, is it then faster, or is it still slow?

If you keep the expression exactly the same, it still takes a few seconds as could be expected for such a torture test query, but it's still WAY faster than the first such query. If you change it out to a different expression, it's longer again of course. There does seem to be a low-to-medium correlation between the number of rows found and the query completion time.

> Cheers,
> Jeff

Thanks,
Matthew.
Reply | Threaded
Open this post in threaded view
|

Re: insert and query performance on big string table with pg_trgm

Jeff Janes


On Nov 21, 2017 00:05, "Matthew Hall" <[hidden email]> wrote:

> Are all indexes present at the time you insert?  It will probably be much faster to insert without the gin index (at least) and build it after the load.

There is some flexibility on the initial load, but the updates in the future will require the de-duplication capability. I'm willing to accept that might be somewhat slower on the load process, to get the accurate updates, provided we could try meeting the read-side goal I wrote about, or at least figure out why it's impossible, so I can understand what I need to fix to make it possible.

As long as you don't let anyone use the table between the initial load and when the index build finishes, you don't have to compromise on correctness.  But yeah, makes sense to worry about query speed first.






> If you repeat the same query, is it then faster, or is it still slow?

If you keep the expression exactly the same, it still takes a few seconds as could be expected for such a torture test query, but it's still WAY faster than the first such query. If you change it out to a different expression, it's longer again of course. There does seem to be a low-to-medium correlation between the number of rows found and the query completion time.

To make this quick, you will need to get most of the table and most of the index cached into RAM.  A good way to do that is with pg_prewarm.  Of course that only works if you have enough RAM in the first place.

What is the size of the table and the gin index?


Cheers,

Jeff

Reply | Threaded
Open this post in threaded view
|

Re: insert and query performance on big string table with pg_trgm

Gábor SZŰCS
Don't know if it would make PostgreSQL happier but how about adding a hash value column and creating the unique index on that one? May block some false duplicates but the unique index would be way smaller, speeding up inserts.

2017. nov. 25. 7:35 ezt írta ("Jeff Janes" <[hidden email]>):


On Nov 21, 2017 00:05, "Matthew Hall" <[hidden email]> wrote:

> Are all indexes present at the time you insert?  It will probably be much faster to insert without the gin index (at least) and build it after the load.

There is some flexibility on the initial load, but the updates in the future will require the de-duplication capability. I'm willing to accept that might be somewhat slower on the load process, to get the accurate updates, provided we could try meeting the read-side goal I wrote about, or at least figure out why it's impossible, so I can understand what I need to fix to make it possible.

As long as you don't let anyone use the table between the initial load and when the index build finishes, you don't have to compromise on correctness.  But yeah, makes sense to worry about query speed first.






> If you repeat the same query, is it then faster, or is it still slow?

If you keep the expression exactly the same, it still takes a few seconds as could be expected for such a torture test query, but it's still WAY faster than the first such query. If you change it out to a different expression, it's longer again of course. There does seem to be a low-to-medium correlation between the number of rows found and the query completion time.

To make this quick, you will need to get most of the table and most of the index cached into RAM.  A good way to do that is with pg_prewarm.  Of course that only works if you have enough RAM in the first place.

What is the size of the table and the gin index?


Cheers,

Jeff

Reply | Threaded
Open this post in threaded view
|

Re: insert and query performance on big string table with pg_trgm

Matthew Hall
In reply to this post by Matthew Hall
On Nov 21, 2017, at 12:05 AM, Matthew Hall <[hidden email]> wrote:
>> Do you really need the artificial primary key, when you already have another column that would be used as the primary key?  If you need to use this it a foreign key in another type, then very well might.  But maintaining two unique indexes doesn't come free.
>
> OK, fair enough, I'll test with it removed and see what happens.

With the integer primary key removed, it still takes ~9 hours to load the table, so it didn't seem to make a big difference.

> Fixed it; I will re-test w/ 1 GB. Have you got any recommended values so I don't screw it up?

I also took this step for maintenance_work_mem.

Queries on the table still take a long time with the PK removed:

# explain (analyze, buffers) select * from huge_table where value ilike '%yahoo%';

 Bitmap Heap Scan on huge_table  (cost=593.72..68828.97 rows=18803 width=25) (actual time=3224.100..70059.839 rows=20909 loops=1)
   Recheck Cond: ((value)::text ~~* '%yahoo%'::text)
   Rows Removed by Index Recheck: 17
   Heap Blocks: exact=6682
   Buffers: shared hit=544 read=6760 dirtied=4034
   I/O Timings: read=69709.611
   ->  Bitmap Index Scan on huge_table_value_trgm_idx  (cost=0.00..589.02 rows=18803 width=0) (actual time=3216.545..3216.545 rows=20926 loops=1)
         Index Cond: ((value)::text ~~* '%yahoo%'::text)
         Buffers: shared hit=352 read=270
         I/O Timings: read=3171.872
 Planning time: 0.283 ms
 Execution time: 70065.157 ms
(12 rows)

The slow process during inserts is:

postgres: username dbname [local] INSERT

The slow statement example is:

2017-12-06 04:27:11 UTC [16085-10378] username@dbname LOG:  duration: 5028.190 ms  plan:
        Query Text: INSERT INTO huge_table (value) VALUES
        .... 5000 values at once ...
        ON CONFLICT (value) DO NOTHING
        Insert on huge_table  (cost=0.00..75.00 rows=5000 width=40)
          Conflict Resolution: NOTHING
          Conflict Arbiter Indexes: huge_table_value_idx
          ->  Values Scan on "*VALUES*"  (cost=0.00..75.00 rows=5000 width=40)

> What is the size of the table and the gin index?

The table is 10 GB. The gin index is 5.8 GB.

> [From Gabor Szucs] [H]ow about adding a hash value column and creating the unique index on that one? May block some false duplicates but the unique index would be way smaller, speeding up inserts.

The mean length of the input items is about 18 bytes. The max length of the input items is about 67 bytes. The size of the md5 would of course be 16 bytes. I'm testing it now, and I'll write another update.

Matthew.
Reply | Threaded
Open this post in threaded view
|

Re: insert and query performance on big string table with pg_trgm

Sergei Kornilov
>    Buffers: shared hit=544 read=6760 dirtied=4034
>    I/O Timings: read=69709.611
You has very slow (or busy) disks, not postgresql issue. Reading 6760 * 8KB in 70 seconds is very bad result.

For better performance you need better disks, at least raid10 (not raid5). Much more memory in shared_buffers can help with read performance and so reduce disk utilization, but write operations still will be slow.

Sergei

Reply | Threaded
Open this post in threaded view
|

Re: insert and query performance on big string table with pg_trgm

Matthew Hall

> On Dec 5, 2017, at 11:23 PM, Sergei Kornilov <[hidden email]> wrote:
> You has very slow (or busy) disks, not postgresql issue. Reading 6760 * 8KB in 70 seconds is very bad result.
>
> For better performance you need better disks, at least raid10 (not raid5). Much more memory in shared_buffers can help with read performance and so reduce disk utilization, but write operations still will be slow.
>
> Sergei

Sergei,

Thanks so much for confirming, this really helps a lot to know what to do. I thought the disk could be some of my issue, but I wanted to make sure I did all of the obvious tuning first. I have learned some very valuable things which I'll be able to use on future challenges like this which I didn't learn previously.

Based on this advice from everyone, I'm setting up a box with more RAM, lots of SSDs, and RAID 10. I'll write back in a few more days after I've completed it.

I can also confirm that the previous advice about using a hash / digest based unique index seemed to make the loading process slower for me, not faster, which is an interesting result to consider for future users following this thread (if any). I don't yet have specific data how much slower, because it's actually still going!

Sincerely,
Matthew.
Previous Thread Next Thread