[PROPOSAL] Shared Ispell dictionaries


[PROPOSAL] Shared Ispell dictionaries

a.zakirov
Hello, hackers!

Introduction
------------

I'm going to implement a patch which will store Ispell dictionaries in shared memory.

There is an extension, shared_ispell [1], developed by Tomas Vondra. But it is a bad candidate for inclusion in contrib, because it has to know a lot about the IspellDict struct in order to copy it into shared memory.

Why
---

A shared Ispell dictionary gives the following improvements:
- less memory consumption - currently an Ispell dictionary is loaded into the memory of every backend, and some dictionaries require more than 100 MB
- no overhead on the first call of a full text search function (such as to_tsvector(), to_tsquery())

Implementation
--------------

It is necessary to change all structures related to IspellDict: SPNode, AffixNode, AFFIX, CMPDAffix and IspellDict itself. Because of that, none of them should use pointers. The other structures are used only during dictionary building.
It would be good to store the StopList struct in shared memory too.

All fields of the IspellDict struct which are used only during dictionary building will be moved into a new IspellDictBuild struct to decrease the required shared memory size. They will be released via buildCxt.

Each dictionary will be stored in its own DSM segment. Structures for regular expressions won't be stored in shared memory; they are compiled for every backend.
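
As a rough illustration of the DSM part, here is a minimal sketch of copying a compiled, pointer-free dictionary image into its own segment (the function name and flow are assumptions of mine, not the patch itself):

    #include "postgres.h"
    #include "storage/dsm.h"

    /* Sketch: place a flattened dictionary image into a new DSM segment
     * and keep the segment alive after this backend detaches. */
    static dsm_handle
    store_dict_in_dsm(const void *dict_image, Size dict_size)
    {
        dsm_segment *seg = dsm_create(dict_size, 0);

        memcpy(dsm_segment_address(seg), dict_image, dict_size);
        dsm_pin_segment(seg);           /* survive beyond this backend */

        return dsm_segment_handle(seg);
    }

Another backend would then call dsm_attach() with the stored handle and use the image in place.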

The patch will be ready and added into the 2018-03 commitfest.

Thank you for your attention. Any thoughts?


1 - github.com/tvondra/shared_ispell or github.com/postgrespro/shared_ispell

--
Arthur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company


Re: [PROPOSAL] Shared Ispell dictionaries

Álvaro Herrera
Arthur Zakirov wrote:

> Implementation
> --------------
>
> It is necessary to change all structures related to IspellDict:
> SPNode, AffixNode, AFFIX, CMPDAffix and IspellDict itself. Because of
> that, none of them should use pointers. The other structures are used
> only during dictionary building.

So what are you going to use instead?

> It would be good to store the StopList struct in shared memory too.

Sure (probably a separate patch though).

> All fields of the IspellDict struct which are used only during
> dictionary building will be moved into a new IspellDictBuild struct
> to decrease the required shared memory size. They will be released
> via buildCxt.
>
> Each dictionary will be stored in its own DSM segment.

All that sounds reasonable.

> The patch will be ready and added into the 2018-03 commitfest.

So this will be a large patch not submitted to 2018-01?  Depending on
size/complexity I'm not sure it's OK to submit 2018-03 only -- it may be
too late.

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [PROPOSAL] Shared Ispell dictionaries

Pavel Stehule


2017-12-26 17:55 GMT+01:00 Alvaro Herrera <[hidden email]>:
> Arthur Zakirov wrote:
>
> > Implementation
> > --------------
> >
> > It is necessary to change all structures related to IspellDict:
> > SPNode, AffixNode, AFFIX, CMPDAffix and IspellDict itself. Because of
> > that, none of them should use pointers. The other structures are used
> > only during dictionary building.
>
> So what are you going to use instead?
>
> > It would be good to store the StopList struct in shared memory too.
>
> Sure (probably a separate patch though).
>
> > All fields of the IspellDict struct which are used only during
> > dictionary building will be moved into a new IspellDictBuild struct
> > to decrease the required shared memory size. They will be released
> > via buildCxt.
> >
> > Each dictionary will be stored in its own DSM segment.
>
> All that sounds reasonable.
>
> > The patch will be ready and added into the 2018-03 commitfest.
>
> So this will be a large patch not submitted to 2018-01?  Depending on
> size/complexity I'm not sure it's OK to submit 2018-03 only -- it may be
> too late.

Tomas had some workable patches related to this topic

Regards

Pavel




Re: [PROPOSAL] Shared Ispell dictionaries

a.zakirov
In reply to this post by Álvaro Herrera
Thank you for your feedback.

On Tue, Dec 26, 2017 at 01:55:57PM -0300, Alvaro Herrera wrote:
> So what are you going to use instead?

For example, AffixNode and AffixNodeData represent the prefix tree of an
affix list. They are currently accessed via the Suffix and Prefix
pointers of the IspellDict struct. Instead, all affix nodes should be
placed into an array and accessed by offset. The Suffix array goes
first, the Prefix array goes after it. AffixNodeData will access a
child node by an offset too.

The AffixNodeData struct has an array of pointers to AFFIX structs. This
array, with all the AFFIX data, can be stored within AffixNodeData. Or
AffixNodeData can have an array of indexes into a single AFFIX array,
stored within IspellDict before or after the Suffix and Prefix arrays.

The same goes for the prefix tree of a word list, represented by the
SPNode struct. It might be stored as an array after the Prefix array.

The AffixData and CompoundAffix arrays go after them.

To allocate IspellDict in this case it is necessary to calculate the
needed memory size. I think the arrays mentioned above will be built
first and then memcpy'ed into IspellDict, if that doesn't take too much
time.
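
To illustrate the offset-based addressing (the struct and macro below are illustrative names of mine, not the actual ones):

    #include "postgres.h"

    /* Pointer-free node: the child is referenced by its byte offset
     * within the flat node array instead of by a pointer. */
    typedef struct FlatAffixNode
    {
        uint32      val;        /* node label */
        uint32      child;      /* byte offset of the child node within
                                 * the flat array */
    } FlatAffixNode;

    /* Resolve an offset against the base address of the shared area */
    #define NodeByOffset(base, offset) \
        ((FlatAffixNode *) ((char *) (base) + (offset)))

Because every reference is relative to the base address, the same image is valid at whatever address each backend happens to map the segment.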

Hope it makes sense and is reasonable.

>
> So this will be a large patch not submitted to 2018-01?  Depending on
> size/complexity I'm not sure it's OK to submit 2018-03 only -- it may be
> too late.
>

Oh, I see. I'll try to prepare the patch while 2018-01 is open.

--
Arthur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company


Re: [PROPOSAL] Shared Ispell dictionaries

Álvaro Herrera
Arthur Zakirov wrote:

> On Tue, Dec 26, 2017 at 01:55:57PM -0300, Alvaro Herrera wrote:
> > So what are you going to use instead?
>
> [ ... ]
>
> To allocate IspellDict in this case it is necessary to calculate the
> needed memory size. I think the arrays mentioned above will be built
> first and then memcpy'ed into IspellDict, if that doesn't take too
> much time.

OK, that sounds sensible at first blush.  If there are many processes
concurrently doing text searches, then the amount of memory saved may
be large enough to justify the additional processing (more so if it's
just one more memcpy cycle).

I hope that there is some way to cope with the ispell data changing
underneath -- maybe you'll need some sort of RCU?

> > So this will be a large patch not submitted to 2018-01?  Depending on
> > size/complexity I'm not sure it's OK to submit 2018-03 only -- it may be
> > too late.
>
> Oh, I see. I'll try to prepare the patch while 2018-01 is open.

It isn't necessary that the patch you present to 2018-01 is final and
complete (so don't kill yourself to achieve that) -- a preliminary patch
that reviewers can comment on is enough, as long as the final patch you
present to 2018-03 is not *too* different.  But any medium-large patch
whose first post is to the last commitfest of a cycle is likely to be
thrown out to the next cycle's first commitfest very quickly.

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [PROPOSAL] Shared Ispell dictionaries

a.zakirov
In reply to this post by Pavel Stehule
On Tue, Dec 26, 2017 at 07:03:48PM +0100, Pavel Stehule wrote:
>
> Tomas had some workable patches related to this topic
>

Tomas, do you plan to propose it?

--
Arthur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company


Re: [PROPOSAL] Shared Ispell dictionaries

a.zakirov
In reply to this post by a.zakirov
Hello, hackers,

On Tue, Dec 26, 2017 at 07:48:27PM +0300, Arthur Zakirov wrote:
> The patch will be ready and added into the 2018-03 commitfest.
>

I attached the patch itself.

0001-Fix-ispell-memory-handling.patch:

Some strings are allocated via compact_palloc0(). But they are not
persistent, so they should be allocated in a temporary memory context.
Also, a couple of strings are not released if the .aff file has the new
format.

0002-Retreive-shmem-location-for-ispell.patch:

Adds the ispell_shmem_location() function, which looks up the location
of a dictionary using the .dict and .aff file names. If the location
hasn't been allocated in DSM earlier, it allocates it. A shared hash
table is used to search for the location.

The maximum number of elements in the hash table is NUM_DICTIONARIES=20
for now. It would be better to use a GUC variable. Also, if the number
of elements has reached the limit, it would be good to use the backend's
local memory instead of shared memory.
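
The lookup then works against entries of roughly this shape (the key struct is the one from the patch; the entry layout and its name are my assumption):

    #include "postgres.h"
    #include "storage/dsm.h"

    /* Key: the dictionary is identified by its files */
    typedef struct
    {
        char        dictfile[MAXPGPATH];
        char        afffile[MAXPGPATH];
    } TsearchDictKey;

    /* Entry: remembers which DSM segment holds the compiled dictionary */
    typedef struct
    {
        TsearchDictKey key;
        dsm_handle  handle;
    } TsearchDictEntry;         /* hypothetical name */

On a miss the dictionary is compiled and a segment is created; on a hit the backend simply attaches to the existing segment.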

0003-Store-ispell-structures-in-shmem.patch:

Introduces the IspellDictBuild and IspellDictData structures and removes
the IspellDict structure. IspellDictBuild is used while building the
dictionary, if it hasn't been allocated in DSM earlier, within the
dispell_build() function. IspellDictBuild has a pointer to the
IspellDictData structure, which is filled with the persistent data.

After the dictionary is built, IspellDictData is copied into the
DSM location and the temporary data of IspellDictBuild is released.

All prefix trees are stored as flat arrays now. Those arrays are
allocated and stored using the NodeArray struct. The required node can
be retrieved by its node offset. The AffixData and Affix arrays have an
additional offset array to retrieve an element by index.

The Affix field (an array of AFFIX) of IspellDictBuild is persistent
data too, but it is constructed as a temporary array first, because the
Affix array needs to be sorted via qsort() within NISortAffixes().

So IspellDictData stores:
- AffixData - an array of strings, accessed via AffixDataOffset
- Affix - an array of AFFIX, accessed via AffixOffset
- DictNodes, PrefixNodes, SuffixNodes - prefix trees as plain arrays
- CompoundAffix - an array of CMPDAffix, accessed sequentially

I had to remove compact_palloc0(), added by Pavel in commit
3e5f9412d0a818be77c974e5af710928097b91f3. The Ispell dictionary doesn't
need such allocation anymore; it was used for a lot of small
allocations. I will definitely check the performance of the Czech
dictionary.


There are still things to do:
- add the GUC-variable for hash table limit
- fix bugs
- improve comments
- performance testing

--
Arthur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company

0001-Fix-ispell-memory-handling.patch (1K)
0002-Retreive-shmem-location-for-ispell.patch (7K)
0003-Store-ispell-structures-in-shmem.patch (79K)

Re: [PROPOSAL] Shared Ispell dictionaries

a.zakirov
On Sun, Dec 31, 2017 at 06:28:13PM +0300, Arthur Zakirov wrote:
>
> There are still things to do:
> - add the GUC-variable for hash table limit
> - fix bugs
> - improve comments
> - performance testing
>

Here is the second version of the patch.

0002-Retreive-shmem-location-for-ispell-v2.patch:

Fixed some bugs and added the GUC variable "shared_dictionaries".

Added documentation for it. I'm not sure about the order of the
configuration parameters in section "19.4.1. Memory":
"shared_dictionaries" now goes after "shared_buffers". Maybe it would be
good to make a patch which sorts the parameters in alphabetical order?

0003-Store-ispell-structures-in-shmem-v2.patch:

Fixed some bugs; regression tests pass now. I added more comments
and fixed old ones. I also tested with Hunspell dictionaries [1]. They
work well too.

Results of performance testing of Ispell and Hunspell dictionaries will
be ready soon.


1 - github.com/postgrespro/hunspell_dicts

--
Arthur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company

0001-Fix-ispell-memory-handling-v2.patch (1K)
0002-Retreive-shmem-location-for-ispell-v2.patch (11K)
0003-Store-ispell-structures-in-shmem-v2.patch (90K)

Re: [PROPOSAL] Shared Ispell dictionaries

Tomas Vondra-4
In reply to this post by a.zakirov
Hi Arthur,

Sorry for the delay, I somehow missed this thread ...

On 12/27/2017 10:20 AM, Arthur Zakirov wrote:
> On Tue, Dec 26, 2017 at 07:03:48PM +0100, Pavel Stehule wrote:
>>
>> Tomas had some workable patches related to this topic
>>
>
> Tomas, do you plan to propose it?
>

I believe Pavel was referring to this extension:

    https://github.com/tvondra/shared_ispell

I wasn't going to submit that as an in-core solution, but I'm happy you're
making improvements in that direction. I'll take a look at your patch
shortly.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [PROPOSAL] Shared Ispell dictionaries

a.zakirov
Thank you for your answer.

On Mon, Jan 08, 2018 at 06:12:37PM +0100, Tomas Vondra wrote:
>
> I believe Pavel was referring to this extension:
>
>     https://github.com/tvondra/shared_ispell
>

Oh, understood.

> I wasn't going to submit that as in-core solution, but I'm happy you're
> making improvements in that direction. I'll take a look at your patch
> shortly.
>

The second version of the patch is posted above. But I've noticed a performance regression in ts_lexize() and I will try to find where the overhead hides.

--
Arthur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company


Re: [PROPOSAL] Shared Ispell dictionaries

Tomas Vondra-4
In reply to this post by a.zakirov
Hi Arthur,

I've done some initial review of the patch today, and here are some
thoughts:

0001-Fix-ispell-memory-handling-v2.patch

This makes sense. The patch simply replaces two cpstrdup() calls with
MemoryContextStrdup, but I see spell.c already has two macros to
allocate memory in the buildCxt. What about adding tmpstrdup to copy a
string into the context? I admit this is mostly nitpicking though.


0002-Retreive-shmem-location-for-ispell-v2.patch

I think the GUC name should make it clear it's a maximum number of
something, just like "max_parallel_workers" and other such GUCs. When I
first saw "shared_dictionaries" in the patch I thought it's a list of
dictionary names, or something like that.

I have a bunch of additional design questions and proposals (not
necessarily required for v1, but perhaps useful for shaping it).


1) Why do we actually need the limit? Is it really necessary / useful?

When I wrote shared_ispell back in 2012, all we had were fixed segments
allocated at start, and so similar limits were a built-in restriction.
But after the DSM stuff was introduced I imagined it would not be necessary.

I realize the current implementation requires that, because the hash
table is still created in an old-style memory context (and only the
dictionaries are in DSM segments).

But that seems fairly straightforward to fix by maintaining the hash
table in a separate DSM segment too. So lookup of the dictionary DSM
would have to first check what the current hash table segment is, and
then continue as now.

I'm not sure if dynahash can live in a DSM segment, but we already have
a hash table that supports that in dshash.c (which is also concurrent,
although I'm not sure if that's a major advantage for this use case).
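
For reference, a dshash-based table could look roughly like this (the entry struct and the tranche id below are placeholders I made up, not a tested design):

    #include "postgres.h"
    #include "lib/dshash.h"
    #include "storage/dsm.h"
    #include "storage/lwlock.h"

    /* key as in the 0002 patch */
    typedef struct
    {
        char        dictfile[MAXPGPATH];
        char        afffile[MAXPGPATH];
    } TsearchDictKey;

    /* placeholder entry layout; key must be the initial bytes */
    typedef struct
    {
        TsearchDictKey key;
        dsm_handle  handle;     /* segment with the compiled dictionary */
    } TsearchDictEntry;

    static const dshash_parameters dict_table_params = {
        sizeof(TsearchDictKey),
        sizeof(TsearchDictEntry),
        dshash_memcmp,                  /* raw byte-wise key comparison */
        dshash_memhash,                 /* raw byte-wise key hashing */
        LWTRANCHE_FIRST_USER_DEFINED    /* placeholder tranche id */
    };

One backend would create it with dshash_create(area, &dict_table_params, NULL); others attach via the table handle, and lookups go through dshash_find_or_insert() followed by dshash_release_lock().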


2) Do we actually want/need some limits? Which ones?

That is not to say we don't need/want some limits, but the current limit
may not be the droid we're looking for, for a couple of reasons.

Firstly, currently it only matters during startup, when the dynahash is
created. So to change the limit (e.g. to increase it) you actually have
to restart the database, which is obviously a major hassle.

Secondly, dynahash tweaks the values to get proper behavior. For
example, it's not using the values directly but some higher value of 2^N
form, which means the limit may not be enforced immediately when hitting
the GUC, but unexpectedly somewhat later.

And finally, I believe this is log-worthy - right now the dictionary
load silently switches to backend memory (thus incurring all the parsing
overhead). This certainly deserves at least a log message.

Actually, I'm not sure "number of dictionaries" is a particularly useful
limit in the first place - that's not a number I really care about. But
I do care about the amount of memory consumed by the loaded dictionaries.

So I do suggest adding such "max memory for shared dictionaries" limit.
I'm not sure we can enforce it strictly, because when deciding where to
load the dict we haven't parsed it yet and so don't know how much memory
will be required. But I believe a lazy check should be fine (load it,
and if we exceeded the total memory disable loading additional ones).


3) How do I unload a dictionary from the shared memory?

Assume we've reached the limit (it does not matter if it's the number of
dictionaries or memory used by them). How do I resolve that without
restarting the database? How do I unload a dictionary (which may be
unused) from shared memory?

    ALTER TEXT SEARCH DICTIONARY x UNLOAD


4) How do I reload a dictionary?

Assume I've updated the dictionary files (added new words into the
files, or something like that). How do I reload the dictionary? Do I
have to restart the server, DROP/CREATE everything again, or what?

What about instead having something like this:

    ALTER TEXT SEARCH DICTIONARY x RELOAD


5) Actually, how do I list currently loaded dictionaries (and how much
memory they use in the shared memory)?


6) What other restrictions would be useful?

I think it should be possible to specify which ispell dictionaries may
be loaded into shared memory, and which should always be loaded into
local backend memory. That is, something like

    CREATE TEXT SEARCH DICTIONARY x (
        TEMPLATE = ispell,
        DictFile = czech,
        AffFile = czech,
        StopWords = czech,
        SharedMemory = true/false (default: false)
    );

because otherwise the dictionaries will compete for shared memory, and
it's unclear which of them will get loaded. For a server with a single
application that may not be a huge issue, but think about servers shared
by multiple applications, etc.

In the extension this was achieved kinda explicitly by definition of a
separate 'shared_ispell' template, but if you modify the current one
that won't work, of course.


7) You mentioned you had to get rid of the compact_palloc0 - can you
elaborate a bit why that was necessary? Also, when benchmarking the
impact of this make sure to measure not only the time but also memory
consumption.

In fact, that was the main reason why Pavel implemented it in 2010,
because the czech dictionary takes quite a bit of memory, and without
the shared memory a copy was kept in every backend.

Of course, maybe that would be mostly irrelevant thanks to this patch
(due to changes to the representation and keeping just a single copy).


8) One more thing - I've noticed that the hash table uses this key:

    typedef struct
    {
        char        dictfile[MAXPGPATH];
        char        afffile[MAXPGPATH];
    } TsearchDictKey;

That is, full paths to the two files, and I'm not sure that's a very
good idea. Firstly, it's a bit wasteful (1kB per path). But more
importantly it means all dictionaries referencing the same files will
share the same chunk of shared memory - not only within a single
database, but across the whole cluster. That may lead to surprising
behavior, because e.g. unloading a dictionary in one database will
affect dictionaries in all other databases referencing the same files.


regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [PROPOSAL] Shared Ispell dictionaries

a.zakirov
Hello,

Thank you Tomas for your review.

On Sat, Jan 13, 2018 at 03:25:55AM +0100, Tomas Vondra wrote:
> allocate memory in the buildCxt. What about adding tmpstrdup to copy a
> string into the context? I admit this is mostly nitpicking though.

I agree about tmpstrdup(). It will be self-consistent with tmpalloc().

> 1) Why do we actually need the limit? Is it really necessary / useful?
> ...
> I realize the current implementation requires that, because the hash
> table is still created in an old-style memory context (and only the
> dictionaries are in DSM segments).

Yes, indeed. I tried to implement dynahash via DSM, but failed. It
seems to me that dynahash can work only in an old-style memory context.

> I'm not sure if dynahash can live in a DSM segment, but we already have
> a hash table that supports that in dshash.c (which is also concurrent,
> although I'm not sure if that's a major advantage for this use case).

Thank you a lot for pointing me to dshash.c. I think this is just what we
need. I will try to use it in the new version of the patch.

> 2) Do we actually want/need some limits? Which ones?
> ...
> And finally, I believe this is log-worthy - right now the dictionary
> load silently switches to backend memory (thus incurring all the parsing
> overhead). This certainly deserves at least a log message.

I think such a log message may be useful, so I will add it too.

> So I do suggest adding such "max memory for shared dictionaries" limit.
> I'm not sure we can enforce it strictly, because when deciding where to
> load the dict we haven't parsed it yet and so don't know how much memory
> will be required. But I believe a lazy check should be fine (load it,
> and if we exceeded the total memory disable loading additional ones).

With dshash in DSM it seems that the shared_dictionaries GUC variable is
not needed anymore. I agree that another GUC variable (for example,
max_shared_dictionaries_size) may be useful. But maybe it's worth
checking the size of a dictionary only after actually compiling it? We
could do the following (a rough sketch follows the list):
- within ispell_shmem_location() build the dictionary using the callback
  function
- the callback function returns the dictionary's size; if the dictionary
  doesn't fit into the remaining shared space, ispell_shmem_location()
  just returns a pointer to the palloc'ed and compiled dictionary
  without creating a DSM segment.
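
A sketch of that flow, assuming a running total kept in shared state (all names below are hypothetical except the DSM calls and elog):

    #include "postgres.h"
    #include "storage/dsm.h"

    extern int  max_shared_dictionaries_size;   /* GUC, in kilobytes */
    extern Size shared_dicts_used;              /* shared running total */
    extern void *copy_dict_to_dsm(void *compiled, Size size);  /* as in 0002 */

    static void *
    dict_location(void *compiled, Size size)
    {
        /* lazy limit check: the size is known only after compiling */
        if (shared_dicts_used + size >
            (Size) max_shared_dictionaries_size * 1024)
        {
            elog(LOG, "text search dictionary kept in backend memory");
            return compiled;        /* palloc'ed, stays backend-local */
        }

        shared_dicts_used += size;  /* locking omitted for brevity */
        return copy_dict_to_dsm(compiled, size);
    }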

> 3) How do I unload a dictionary from the shared memory?
> ...
>     ALTER TEXT SEARCH DICTIONARY x UNLOAD
>
> 4) How do I reload a dictionary?
> ...
>     ALTER TEXT SEARCH DICTIONARY x RELOAD

I think this syntax will be very useful not only for Ispell but for
other dictionaries too. So the init_function of a text search template
may return pointers to C functions which unload and reload dictionaries.
This approach doesn't require changing the catalog by adding additional
functions for the template [1] (a sketch of the idea follows).

If the init_function of a template doesn't return the pointers, then the
template doesn't support unloading or reloading, and the UNLOAD and
RELOAD commands should throw an error if a user calls them for such a
template.
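
One possible shape of that contract (purely illustrative; neither this struct nor its members exist anywhere yet):

    #include "postgres.h"

    /* Optional operations a text search template could hand back from
     * its init_function instead of new catalog columns. */
    typedef struct TSTemplateOps
    {
        void        (*unload) (Oid dictid);
        void        (*reload) (Oid dictid);
    } TSTemplateOps;

A template that returns NULL (or leaves a member NULL) would simply not support the corresponding command.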

> 5) Actually, how do I list currently loaded dictionaries (and how much
> memory they use in the shared memory)?

This may be very useful too. Such a function could be called
pg_get_shared_dictionaries().

> 6) What other restrictions would be useful?
> ...
>     CREATE TEXT SEARCH DICTIONARY x (
>         TEMPLATE = ispell,
>         DictFile = czech,
>         AffFile = czech,
>         StopWords = czech,
>         SharedMemory = true/false (default: false)
>     );

Hm, I didn't think about such an option. It would be a very simple way
for a user to control shared dictionaries.

> 7) You mentioned you had to get rid of the compact_palloc0 - can you
> elaborate a bit why that was necessary? Also, when benchmarking the
> impact of this make sure to measure not only the time but also memory
> consumption.

As I understand from commit 3e5f9412d0a818be77c974e5af710928097b91f3,
compact_palloc0() reduces the overhead of a lot of palloc calls for
small chunks of data. The persistent data of the patch should not suffer
from this overhead, because persistent data is allocated in big chunks.

But now I realize that we could keep compact_palloc0() for small chunks
of temporary data. So it may be worth keeping compact_palloc0().

> 8) One more thing - I've noticed that the hash table uses this key:
> ...
> That is, full paths to the two files, and I'm not sure that's a very
> good idea. Firstly, it's a bit wasteful (1kB per path). But more
> importantly it means all dictionaries referencing the same files will
> share the same chunk of shared memory - not only within a single
> database, but across the whole cluster. That may lead to surprising
> behavior, because e.g. unloading a dictionary in one database will
> affect dictionaries in all other databases referencing the same files.

Hm, indeed. It's worth using only file names instead of full paths. And
it is a good idea to use more information besides the file names. It
could be the Oid of a database and maybe the Oid of a namespace, because
a dictionary can be created in different schemas.

I think your proposals may be implemented in several patches, so they
can be applied independently but consistently. I suppose I will prepare
a new version of the patch with fixes and with the initial design of
the new functions and commands soon.


1 - https://www.postgresql.org/docs/current/static/sql-createtstemplate.html

--
Arthur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company


Re: [PROPOSAL] Shared Ispell dictionaries

Tomas Vondra-4


On 01/13/2018 04:22 PM, Arthur Zakirov wrote:

> Hello,
>
> Thank you Tomas for your review.
>
> On Sat, Jan 13, 2018 at 03:25:55AM +0100, Tomas Vondra wrote:
>> allocate memory in the buildCxt. What about adding tmpstrdup to copy a
>> string into the context? I admit this is mostly nitpicking though.
>
> ... snip ...
>> 8) One more thing - I've noticed that the hash table uses this key:
>> ...
>> That is, full paths to the two files, and I'm not sure that's a very
>> good idea. Firstly, it's a bit wasteful (1kB per path). But more
>> importantly it means all dictionaries referencing the same files will
>> share the same chunk of shared memory - not only within a single
>> database, but across the whole cluster. That may lead to surprising
>> behavior, because e.g. unloading a dictionary in one database will
>> affect dictionaries in all other databases referencing the same files.
>
> Hm, indeed. It's worth using only file names instead of full paths. And
> it is a good idea to use more information besides the file names. It
> could be the Oid of a database and maybe the Oid of a namespace, because
> a dictionary can be created in different schemas.
>

I doubt using filenames (without the directory paths) solves anything,
really. The keys still have to be MAXPGPATH long because someone could
create a very long filename. But I don't think memory consumption is
such a big deal, really. With 1000 dictionaries it's still just ~2MB of
data, which is negligible compared to the amount of memory saved by
sharing the dictionaries.

Not sure if we really need to add the database/schema OIDs. I mentioned
the unexpected consequences (cross-db sharing) but maybe that's a
feature we should keep (it reduces memory usage). So perhaps this should
be another CREATE TEXT SEARCH DICTIONARY parameter, allowing sharing the
dictionary with other databases?

Aren't we overengineering this?

> I think your proposals may be implemented in several patches, so they
> can be applied independently but consistently. I suppose I will prepare
> a new version of the patch with fixes and with the initial design of
> the new functions and commands soon.
>

Yes, splitting patches into smaller, more focused bits is a good idea.

BTW the current patch fails to document the dictionary sharing. It only
mentions it when describing the shared_dictionaries GUC. IMHO the right
place for additional details is

https://www.postgresql.org/docs/10/static/textsearch-dictionaries.html


regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [PROPOSAL] Shared Ispell dictionaries

a.zakirov
On Sat, Jan 13, 2018 at 10:33:14PM +0100, Tomas Vondra wrote:
> Not sure if we really need to add the database/schema OIDs. I mentioned
> the unexpected consequences (cross-db sharing) but maybe that's a
> feature we should keep (it reduces memory usage). So perhaps this should
> be another CREATE TEXT SEARCH DICTIONARY parameter, allowing sharing the
> dictionary with other databases?
>
> Aren't we overengineering this?

Another related problem I've noticed is a memory leak. When a dictionary
is loaded and then dropped, it won't be unloaded. I see several
approaches:

1 - Use the Oid of the dictionary itself as the key instead of dictfile
and afffile. When the dictionary is dropped it can easily be unloaded if
it was loaded. Implementation should be easy, but the drawback is more
memory consumption.
2 - Use a reference counter with cross-db sharing. When the dictionary
is loaded, the counter increases. When all records of a loaded
dictionary are dropped, it will be unloaded.
3 - Or reference counters without cross-db sharing, to avoid possible
confusion. Here dictfile, afffile and the database Oid will be used as
the key.

--
Arthur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company


Re: [PROPOSAL] Shared Ispell dictionaries

Tomas Vondra-4

On 01/15/2018 08:02 PM, Arthur Zakirov wrote:

> On Sat, Jan 13, 2018 at 10:33:14PM +0100, Tomas Vondra wrote:
>> Not sure if we really need to add the database/schema OIDs. I mentioned
>> the unexpected consequences (cross-db sharing) but maybe that's a
>> feature we should keep (it reduces memory usage). So perhaps this should
>> be another CREATE TEXT SEARCH DICTIONARY parameter, allowing sharing the
>> dictionary with other databases?
>>
>> Aren't we overengineering this?
>
> Another related problem I've noticed is a memory leak. When a
> dictionary is loaded and then dropped, it won't be unloaded.
>

Good point.

> I see several approaches:
> 1 - Use the Oid of the dictionary itself as the key instead of dictfile
> and afffile. When the dictionary is dropped it can easily be unloaded
> if it was loaded. Implementation should be easy, but the drawback is
> more memory consumption.
> 2 - Use a reference counter with cross-db sharing. When the dictionary
> is loaded, the counter increases. When all records of a loaded
> dictionary are dropped, it will be unloaded.
> 3 - Or reference counters without cross-db sharing, to avoid possible
> confusion. Here dictfile, afffile and the database Oid will be used as
> the key.
>

I think you're approaching the problem from the wrong direction, and
hence asking the wrong question.

I think the primary question is "Do we want to share dictionaries
across databases?" and the answer will determine which of the three
options is the right one.

Another important consideration is the complexity of the patch. In
fact, I suggest making it your goal to keep the initial patch as simple
as possible. If something is "nice to have", it can wait for v2.


regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [PROPOSAL] Shared Ispell dictionaries

a.zakirov
In reply to this post by a.zakirov
On Sat, Jan 13, 2018 at 06:22:41PM +0300, Arthur Zakirov wrote:
> I think your proposals may be implemented in several patches, so they
> can be applied independently but consistently. I suppose I will prepare
> a new version of the patch with fixes and with the initial design of
> the new functions and commands soon.

I've attached a new version of the patch.

0001-Fix-ispell-memory-handling-v3.patch:

> allocate memory in the buildCxt. What about adding tmpstrdup to copy a
> string into the context? I admit this is mostly nitpicking though.

Fixed. Added tmpstrdup.

0002-Retreive-shmem-location-for-ispell-v3.patch:

dshash.c is used now instead of dynahash.c. The hash table is created during the first call of a text search function in an instance. The hash table uses the OID of a dictionary instead of file names, so there is no cross-db sharing at all.

Added the max_shared_dictionaries_size GUC instead of shared_dictionaries. In the current version it can be set only at server start. If a dictionary is allocated in a backend's memory instead of shared memory, a LOG message is raised which includes the OID of the dictionary.

Fixed the memory leak. During removal of a dictionary, and on invalidation of the dictionaries cache, ts_dict_shmem_release() is called. It unpins the mapping of the dictionary; if the reference count reaches zero, the DSM segment is unpinned too. So the allocated shared memory is released by Postgres.
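
The release path is roughly the following (a sketch; the DictRef bookkeeping struct is my assumption, only the DSM calls are the real API):

    #include "postgres.h"
    #include "storage/dsm.h"

    /* hypothetical per-backend reference to a shared dictionary */
    typedef struct DictRef
    {
        dsm_segment *seg;       /* this backend's mapping */
        dsm_handle   handle;    /* the segment itself */
        int         *refcnt;    /* shared reference count */
    } DictRef;

    static void
    dict_shmem_release(DictRef *ref)
    {
        dsm_detach(ref->seg);       /* drop this backend's mapping */

        /* last user: let Postgres free the segment
         * (locking around the counter omitted for brevity) */
        if (--(*ref->refcnt) == 0)
            dsm_unpin_segment(ref->handle);
    }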

0003-Store-ispell-structures-in-shmem-v3.patch:

Added documentation fixes. dispell_init() (tmplinit too) has a second argument now, dictid.

0004-Update-tmplinit-arguments-v3.patch:

It is necessary to fix all dictionaries, including contrib extensions, because of the second argument of tmplinit.

tmplinit has the following signature now:

dict_init(internal, internal)

0005-pg-ts-shared-dictinaries-view-v3.patch:

Added the pg_ts_shared_dictionaries() function and the pg_ts_shared_dictionaries system view. They return the list of dictionaries currently in shared memory, with the columns:
- dictoid
- schemaname
- dictname
- size

0006-Shared-memory-ispell-option-v3.patch:

Added the SharedMemory option for the Ispell dictionary template. It is true by default, because I think it would be good if people didn't have to do anything to get dictionaries allocated in shared memory.

Setting SharedMemory=false via ALTER TEXT SEARCH DICTIONARY doesn't have an immediate effect, because ALTER doesn't force invalidation of the dictionaries cache, if I'm not mistaken.

> 3) How do I unload a dictionary from the shared memory?
> ...
>     ALTER TEXT SEARCH DICTIONARY x UNLOAD
>
> 4) How do I reload a dictionary?
> ...
>     ALTER TEXT SEARCH DICTIONARY x RELOAD

I thought about it. And it seems to me that we can use functions ts_unload() and ts_reload() instead of new syntax. We already have text search functions like ts_lexize() and ts_debug(), and it is better to keep consistency. I think there are two approaches for ts_unload():
- use DSM's pin and unpin methods and the invalidation callback, as was done when fixing the memory leak. It has the drawback that it won't have an immediate effect, because the DSM will be released only when all backends unpin the DSM mapping.
- use DSA and the dsa_free() method. As far as I understand, dsa_free() frees the allocated memory immediately. But it requires more work, because we will need some more locks. For instance, what happens when someone calls ts_lexize() and someone else calls dsa_free() at the same time?

> 7) You mentioned you had to get rid of the compact_palloc0 - can you
> elaborate a bit why that was necessary? Also, when benchmarking the
> impact of this make sure to measure not only the time but also memory
> consumption.

It seems to me that there is no need for compact_palloc0() anymore. Tests show that the Czech dictionary doesn't consume more memory after the patch.

Tests
-----

I've measured the creation time of the dictionaries on my 64-bit machine. You can get them from [1]. Here master is commit 434e6e1484418c55561914600de9e180fc408378. I've measured the French dictionary too because it has an even bigger affix file than the Czech dictionary.

With patch:
czech_hunspell - 247 ms
english_hunspell - 59 ms
french_hunspell - 103 ms

Master:
czech_hunspell - 224 ms
english_hunspell - 52 ms
french_hunspell - 101 ms

Memory:

With patch (shared memory size + backend's memory):
czech_hunspell - 9573049 + 192584 total in 5 blocks; 1896 free (11 chunks); 190688 used
english_hunspell - 1985299 + 21064 total in 6 blocks; 7736 free (13 chunks); 13328 used
french_hunspell - 4763456 + 626960 total in 7 blocks; 7680 free (14 chunks); 619280 used

Here the French dictionary uses more backend memory because it has a big affix file. Regular expression structures are still stored in backend memory.

Master (backend's memory):
czech_hunspell - 17181544 total in 2034 blocks; 3584 free (10 chunks); 17177960 used
english_hunspell - 4160120 total in 506 blocks; 2792 free (10 chunks); 4157328 used
french_hunspell - 11439184 total in 1187 blocks; 18832 free (171 chunks); 11420352 used

You can see that the dictionaries now take almost half as much memory.

pgbench with select only script:

SELECT ts_lexize('czech_hunspell', 'slon');
patch: 30431 TPS
master: 30419 TPS

SELECT ts_lexize('english_hunspell', 'elephant'):
patch: 35029 TPS
master: 35276 TPS

SELECT ts_lexize('french_hunspell', 'éléphante');
patch: 22264 TPS
master: 22744 TPS

1 - https://github.com/postgrespro/hunspell_dicts

--
Arthur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company

0001-Fix-ispell-memory-handling-v3.patch (1K)
0002-Retreive-shmem-location-for-ispell-v3.patch (18K)
0003-Store-ispell-structures-in-shmem-v3.patch (91K)
0004-Update-tmplinit-arguments-v3.patch (10K)
0005-pg-ts-shared-dictinaries-view-v3.patch (12K)
0006-Shared-memory-ispell-option-v3.patch (6K)

Re: [PROPOSAL] Shared Ispell dictionaries

Tomas Vondra-4
Hi,

On 01/24/2018 06:20 PM, Arthur Zakirov wrote:
> On Sat, Jan 13, 2018 at 06:22:41PM +0300, Arthur Zakirov wrote:
>> I think your proposals may be implemented in several patches, so
>> they can be applied independently but consistently. I suppose I
>> will prepare a new version of the patch with fixes and with the
>> initial design of the new functions and commands soon.
>
> I attached new version of the patch.
>

Thanks. I don't have time to review/test this before FOSDEM, but a
couple of comments regarding some of the points you mentioned.

>> 3) How do I unload a dictionary from the shared memory?
>> ...
>>     ALTER TEXT SEARCH DICTIONARY x UNLOAD
>>
>> 4) How do I reload a dictionary?
>> ...
>>     ALTER TEXT SEARCH DICTIONARY x RELOAD
>
> I thought about it. And it seems to me that we can use functions
> ts_unload() and ts_reload() instead of new syntax. We already have
> text search functions like ts_lexize() and ts_debug(), and it is
> better to keep consistency.

This argument seems a bit strange. Both ts_lexize() and ts_debug() are
operating on text values, and are meant to be executed as functions from
SQL - particularly ts_lexize(). It's hard to imagine this implemented as
DDL commands.

The unload/reload is something that operates on a database object
(dictionary), which already has create/drop/alter DDL. So it seems
somewhat natural to treat unload/reload as another DDL action.

Taken to an extreme, this argument would essentially mean we should not
have any DDL commands because we have SQL functions.

That being said, I'm not particularly attached to having this DDL now.
Implementing it seems straight-forward (particularly when we already
have the stuff implemented as functions), and some of the other open
questions seem more important to tackle now.

> I think there are two approaches for ts_unload():
> - use DSM's pin and unpin methods and the invalidation callback, as
> was done when fixing the memory leak. It has the drawback that it
> won't have an immediate effect, because the DSM will be released only
> when all backends unpin the DSM mapping.
> - use DSA and the dsa_free() method. As far as I understand,
> dsa_free() frees the allocated memory immediately. But it requires
> more work, because we will need some more locks. For instance, what
> happens when someone calls ts_lexize() and someone else calls
> dsa_free() at the same time?


No opinion on this yet, I have to think about it for a bit and look at
the code first.

>> 7) You mentioned you had to get rid of the compact_palloc0 - can you
>> elaborate a bit why that was necessary? Also, when benchmarking the
>> impact of this make sure to measure not only the time but also memory
>> consumption.
>
> It seems to me that there is no need for compact_palloc0() anymore.
> Tests show that the Czech dictionary doesn't consume more memory after
> the patch.
>

That's interesting. I'll do some additional tests to verify the finding.


regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [PROPOSAL] Shared Ispell dictionaries

a.zakirov
2018-01-24 20:57 GMT+03:00 Tomas Vondra <[hidden email]>:
> Thanks. I don't have time to review/test this before FOSDEM, but a
> couple of comments regarding some of the points you mentioned.

Thank you for your thoughts.

> > I thought about it. And it seems to me that we can use functions
> > ts_unload() and ts_reload() instead of new syntax. We already have
> > text search functions like ts_lexize() and ts_debug(), and it is
> > better to keep consistency.
>
> This argument seems a bit strange. Both ts_lexize() and ts_debug() are
> operating on text values, and are meant to be executed as functions from
> SQL - particularly ts_lexize(). It's hard to imagine this implemented as
> DDL commands.
>
> The unload/reload is something that operates on a database object
> (dictionary), which already has create/drop/alter DDL. So it seems
> somewhat natural to treat unload/reload as another DDL action.
>
> Taken to an extreme, this argument would essentially mean we should not
> have any DDL commands because we have SQL functions.
>
> That being said, I'm not particularly attached to having this DDL now.
> Implementing it seems straight-forward (particularly when we already
> have the stuff implemented as functions), and some of the other open
> questions seem more important to tackle now.

I understood your opinion. I don't have a strong opinion on the subject
yet. And I agree that they can be implemented as future improvements for
shared dictionaries.

--
Arthur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company

Re: [PROPOSAL] Shared Ispell dictionaries

Ildus Kurbangaliev
In reply to this post by a.zakirov
On Wed, 24 Jan 2018 20:20:41 +0300
Arthur Zakirov <[hidden email]> wrote:

Hi, I've done some review of the patch.

In 0001 there are a few lines where only the indentation has changed.

0002:
- TsearchShmemSize - calculating the size using hash_estimate_size seems
redundant since you use a DSA hash now.
- ts_dict_shmem_release - LWLockAcquire at the beginning makes no
  sense, since dict_table couldn't change anyway.

0003:
- ts_dict_shmem_location could return IspellDictData, that would make
  more sense.

0006:
It's very subjective, but I think it would be nicer to call the option
Shared (as a property of the dictionary) or UseSharedMemory; a boolean
option called SharedMemory sounds weird.

Overall the patches look good, all tests passed. I tried to break them
in a few places where I thought they could be unsafe but didn't succeed.

--
---
Ildus Kurbangaliev
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company


Re: [PROPOSAL] Shared Ispell dictionaries

a.zakirov
Hello,

Thank you for your review! Good catches.

On Thu, Jan 25, 2018 at 03:26:46PM +0300, Ildus Kurbangaliev wrote:
> In 0001 there are a few lines where only the indentation has changed.

Fixed.

> 0002:
> - TsearchShmemSize - calculating the size using hash_estimate_size seems
> redundant since you use a DSA hash now.

Fixed. True, there is no need for hash_estimate_size anymore.

> - ts_dict_shmem_release - LWLockAcquire at the beginning makes no
>   sense, since dict_table couldn't change anyway.

Fixed. In an earlier version tsearch_ctl was used here, but I forgot to remove the LWLockAcquire.

> 0003:
> - ts_dict_shmem_location could return IspellDictData, that would make
>   more sense.

I assume that ts_dict_shmem_location can be used by various types of dictionaries, not only by Ispell. So void * is more suitable here.

> 0006:
> It's very subjective, but I think it would be nicer to call the option
> Shared (as a property of the dictionary) or UseSharedMemory; a boolean
> option called SharedMemory sounds weird.

Agree. In our offline conversation we came to Shareable, that is, a dictionary which can be shared. It may be more appropriate because setting Shareable=true doesn't guarantee that a dictionary will be allocated in shared memory, due to the max_shared_dictionaries_size GUC.

Attached is the new version of the patch.

--
Arthur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company

0001-Fix-ispell-memory-handling-v4.patch (1K)
0002-Retreive-shmem-location-for-ispell-v4.patch (17K)
0003-Store-ispell-structures-in-shmem-v4.patch (91K)
0004-Update-tmplinit-arguments-v4.patch (10K)
0005-pg-ts-shared-dictinaries-view-v4.patch (9K)
0006-Shared-memory-ispell-option-v4.patch (8K)