pg full text search very slow for Chinese characters


pg full text search very slow for Chinese characters

Jimmy Huang

Hi Team,

 

Can anyone shed some light on why postgres 11 is extremely slow in my case?

 

I am making a mirror of zh.wikisource.org and I have downloaded 303049 pages and stored them in a postgres 11 database.

 

My postgres instance is based on docker image postgres:11 and runs on my MacBook Pro i7 16GB.

 

The database schema is as follows:

 

Table pages(id, url, html, downloaded, inserted_at, updated_at) and books(id, name, info, preface, text, html, url, parent_id, inserted_at, updated_at, info_html, preface_html)

 

A wikisource web page is downloaded and its HTML text is inserted into the "pages" table, column "html".

Later, books.{name, info, preface, text, html, info_html, preface_html} are extracted from pages.html. The text column of books is a plain-text version of the content of the html column of table pages.

 

On average there are 7635 characters (each character is 3 bytes long because of UTF-8 encoding) in the text column of table books, and I want to add full text search to books(text).
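(Back of the envelope: 303049 rows × 7635 characters × 3 bytes is roughly 7 GB of raw text to tokenize and index.)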

 

I tried pg_trgm and my own customized token parser https://github.com/huangjimmy/pg_cjk_parser

 

To my surprise, postgres 11 is extremely slow when creating a full text index.

 

I added a column of tsvector type and tried to create an index on that column. Postgres could not finish creating a GIN index after running for a long time, and I had to cancel the execution.

I then tried to create a partial full text index covering 500 rows, and it took Postgres 2 to 3 minutes to build it. Based on this estimate, Postgres will need at least one day to create a full GIN full text search index for all 303049 rows. I think this is ridiculously slow.

When I created an fts index on books(name) or books(info), it took just 3 minutes. However, name and info are extremely short compared to books(text).
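The indexing steps were roughly as follows (a simplified sketch; 'zh_cfg' is a placeholder for whatever text search configuration is built on top of pg_cjk_parser):

    -- Add a tsvector column, populate it, and index it.
    -- 'zh_cfg' is a placeholder text search configuration; substitute the
    -- configuration actually in use.
    ALTER TABLE books ADD COLUMN text_tsv tsvector;

    UPDATE books
       SET text_tsv = to_tsvector('zh_cfg', coalesce(text, ''));

    CREATE INDEX books_text_tsv_idx ON books USING gin (text_tsv);

    -- A query that can use the index:
    SELECT id, name
      FROM books
     WHERE text_tsv @@ to_tsquery('zh_cfg', '孔子');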

 

I switched to Elasticsearch, and it turned out to be extremely efficient for my case: it indexed all 303049 rows in 3 hours.

 

Jimmy Huang

[hidden email]


Re: pg full text search very slow for Chinese characters

Cory Nemelka
Well, there is a lot of information we would need to diagnose this: how much tuning have you done, etc.

My advice is pretty simple. Don't expect performance on a notebook, and unless you are planning on hosting it on a notebook, use the notebook for development only. Test performance on a properly configured and tuned server.

--cnemelka




Re: pg full text search very slow for Chinese characters

Michael Lewis
In reply to this post by Jimmy Huang
>My postgres instance is based on docker image postgres:11 and runs on my MacBook Pro i7 16GB.

How much RAM and other resources did you give to this VM?

>To my surprise, postgres 11 is extremely slow when creating a full text index. I added a column of tsvector type and tried to create an index on that column. Pg could not finish creating a GIN index for a long time and I had to cancel the execution. I then tried to create a partial full text index for 500 rows and it took postgres 2 to 3 minutes to create the index.

Did you customize any config? maintenance_work_mem specifically is relevant to the time needed to create an index, and its default value is only 64MB. Especially if you are running a spinning hard drive rather than an SSD, this could be problematic.
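For example, something along these lines (the 1GB value is only an illustration; size it to the RAM you actually give the container):

    -- Check the current setting:
    SHOW maintenance_work_mem;

    -- Raise it for the current session before building the index:
    SET maintenance_work_mem = '1GB';

    -- Or persist it for the whole instance (takes effect after a reload):
    ALTER SYSTEM SET maintenance_work_mem = '1GB';
    SELECT pg_reload_conf();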

Re: pg full text search very slow for Chinese characters

Cory Nemelka

I apologize for my top post. :D Won't happen again.

Re: pg full text search very slow for Chinese characters

Jimmy Huang
In reply to this post by Michael Lewis

Everything is at its default value.

I just checked maintenance_work_mem and indeed it is 64MB.

At first I gave 2GB of RAM to Docker and later increased it to 4GB.

Increasing the RAM from 2GB to 4GB did not make much difference.

I will try increasing maintenance_work_mem and see if it helps.

 


Re: pg full text search very slow for Chinese characters

Tom Lane-2
In reply to this post by Jimmy Huang
Jimmy Huang <[hidden email]> writes:
> I tried pg_trgm and my own customized token parser https://github.com/huangjimmy/pg_cjk_parser

pg_trgm is going to be fairly useless for indexing text that's mostly
multibyte characters, since its unit of indexable data is just 3 bytes
(not characters).  I don't know of any comparable issue in the core
tsvector logic, though.  The numbers you're quoting do sound quite awful,
but I share Cory's suspicion that it's something about your setup rather
than an inherent Postgres issue.

                        regards, tom lane



Re: pg full text search very slow for Chinese characters

Andreas Joseph Krogh-2
On Tuesday, September 10, 2019 at 18:21:45, Tom Lane <[hidden email]> wrote:

> The numbers you're quoting do sound quite awful,
> but I share Cory's suspicion that it's something about your setup rather
> than an inherent Postgres issue.
 
 
We experienced quite awful performance when we hosted the DB on virtual servers (~5 years ago), and it turned out we had hit the write-cache limit (then 8GB), which resulted in ~1MB/s IO throughput.
Running iozone might help track down IO problems.
 
 
--
Andreas Joseph Krogh

Re: pg full text search very slow for Chinese characters

Kyotaro Horiguchi-4
Hi.

At Tue, 10 Sep 2019 18:42:26 +0200 (CEST), Andreas Joseph Krogh <[hidden email]> wrote:

> > pg_trgm is going to be fairly useless for indexing text that's mostly
> > multibyte characters, since its unit of indexable data is just 3 bytes
> > (not characters).

Multibyte characters also quickly bloat the index with many, many small buckets, one for every 3-character combination drawn from thousands of distinct characters, which makes it useless.

pg_bigm, which is based on bigrams (2-grams), works better for multibyte characters.

https://pgbigm.osdn.jp/index_en.html
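A rough sketch of typical usage (assuming the pg_bigm extension is installed on the server; see the documentation above for the authoritative syntax):

    CREATE EXTENSION IF NOT EXISTS pg_bigm;

    -- 2-gram GIN index on the large text column:
    CREATE INDEX books_text_bigm_idx ON books USING gin (text gin_bigm_ops);

    -- pg_bigm speeds up LIKE pattern matches on the indexed column:
    SELECT id, name
      FROM books
     WHERE text LIKE '%中华民国%';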

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center