Global shared meta cache


Global shared meta cache

Ideriha, Takeshi
Hi, hackers!

My customer created hundreds of thousands of partition tables and tried to select data from hundreds of applications,
which resulted in enormous memory consumption because the total grows as the number of backends multiplied by the per-backend local cache size (e.g., 100 backends x 1GB = 100GB).
Relation caches are loaded into each backend's local memory.

To address this issue I'm trying to move the metadata caches, such as the catcache and relcache, into shared memory.

This topic seems to have been discussed several times.
For instance this thread:
https://www.postgresql.org/message-id/CA%2BTgmobjDw_SWsxyJwT9z-YOwWv0ietuQx5fb%3DWEYdDfvCbzGQ%40mail.gmail.com 

In my understanding, it discussed moving the catcache and relcache from the current backend-local memory into shared memory,
and was mostly concerned with performance overhead.

Robert Haas wrote:
> I think it would be interested for somebody to build a prototype here
> that ignores all the problems but the first and uses some
> straightforward, relatively unoptimized locking strategy for the first
> problem. Then benchmark it. If the results show that the idea has
> legs, then we can try to figure out what a real implementation would
> look like.
> (One possible approach: use Thomas Munro's DHT stuff to build the shared cache.)

I was inspired by this comment and am now developing a prototype (please see attached),
but I haven't yet put the cache structure in shared memory.
Instead, I put dummy data in shared memory, initialized at startup,
and acquire/release a lock just before/after searching for or creating a catcache entry.

I haven't considered the relcache or catcache lists either.
It is difficult to do everything at once in the right direction,
so I'm making a small prototype first to check whether I'm on the proper path.

I tested pgbench to compare master branch with my patch.

0) Environment
   - RHEL 7.4
   - 16 cores
   - 128 GB memory

1) Initialized with pgbench -i -s10

2) benchmarked 3 times for each condition and took the average TPS.
                                     | master branch | prototype | proto/master (%)
   ----------------------------------+---------------+-----------+-----------------
   pgbench -c48 -T60 -Msimple -S     |       131297  |   130541  | 101%
   pgbench -c48 -T60 -Msimple        |         4956  |     4965  |  95%
   pgbench -c48 -T60 -Mprepared -S   |       129688  |   132538  |  97%
   pgbench -c48 -T60 -Mprepared      |         5113  |     4615  |  84%

  This result seems to show that, except for the prepared protocol without -S (i.e., not SELECT-only), it didn't make much difference.
   

What do you think about it?
Before I dig deeper, I want to hear your thoughts.

Best regards,
Takeshi Ideriha


001_global_meta_cache.patch (6K)

RE: Global shared meta cache

srielau
Takeshi-san,

>My customer created hundreds of thousands of partition tables and tried to select data from hundreds of applications,
>which resulted in enormous consumption of memory because it consumed # of backend multiplied by #
> of local memory (ex. 100 backends X 1GB = 100GB).
>Relation caches are loaded on each backend local memory.
My team and I have been working to make caches shared for the past two years, but we have chosen not to share the system and relation caches.
The reason is that these caches play a big role in transactional DDL processing.
When you do DDL, your backend can see all the changes because you update your own cache, but no one else can see them until you commit.
You will find that dealing with that is the true complexity.
 
 
Have you tried simply capping the size of these caches?
That's a rather straightforward piece of work and will get you quite far.
We run with a 20MB syscache and a 10MB relcache with 100k+ objects and hundreds of backends.
A dumb LRU is plenty good for the purpose.
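A capped cache with a dumb LRU, as suggested, might look like the following standalone sketch (all structures and names here are hypothetical, not the actual syscache/relcache code; a per-access tick stands in for a real LRU list):

```c
#include <assert.h>
#include <limits.h>

/* Capped cache: each entry records the tick of its last access; when the
 * cache is full, the least recently used entry is evicted. */
#define CAP 4

struct lru_entry { int key; int value; unsigned long last_used; int used; };

static struct lru_entry tbl[CAP];
static unsigned long tick;

/* Lookup: a hit refreshes the entry's last_used tick. */
int lru_get(int key, int *value_out)
{
    for (int i = 0; i < CAP; i++)
        if (tbl[i].used && tbl[i].key == key)
        {
            tbl[i].last_used = ++tick;
            *value_out = tbl[i].value;
            return 1;
        }
    return 0;
}

/* Insert: prefer a free slot; otherwise evict the oldest entry. */
void lru_put(int key, int value)
{
    int victim = 0;
    unsigned long oldest = ULONG_MAX;

    for (int i = 0; i < CAP; i++)
    {
        if (!tbl[i].used) { victim = i; break; }  /* free slot wins */
        if (tbl[i].last_used < oldest)
        {
            oldest = tbl[i].last_used;
            victim = i;
        }
    }
    tbl[victim].key = key;
    tbl[victim].value = value;
    tbl[victim].last_used = ++tick;
    tbl[victim].used = 1;
}
```

This keeps per-backend memory bounded at CAP entries; as noted below in the thread, it alleviates bloat per backend but the total still scales with the number of backends.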
 
That being said I would love to see these caches shared. :-)
 
Cheers
Serge
Salesforce 
 
 

Re: Global shared meta cache

Andres Freund
In reply to this post by Ideriha, Takeshi
Hi,

On 2018-06-26 06:48:28 +0000, Ideriha, Takeshi wrote:

> > I think it would be interested for somebody to build a prototype here
> > that ignores all the problems but the first and uses some
> > straightforward, relatively unoptimized locking strategy for the first
> > problem. Then benchmark it. If the results show that the idea has
> > legs, then we can try to figure out what a real implementation would
> > look like.
> > (One possible approach: use Thomas Munro's DHT stuff to build the shared cache.)
>
> I'm inspired by this comment and now developing a prototype (please see attached),
> but I haven't yet put cache structure on shared memory.

> Instead, I put dummy data on shared memory which is initialized at startup,
> and then acquire/release lock just before/after searching/creating catcache entry.

> I haven't considered relcache and catcachelist either.
> It is difficult for me to do everything at one time with right direction.
> So I'm trying to make small prototype and see what I'm walking on the proper way.
>
> I tested pgbench to compare master branch with my patch.
>
> 0) Environment
>    - RHEL 7.4
>    - 16 cores
>    - 128 GB memory
>
> 1) Initialized with pgbench -i -s10
>
> 2) benchmarked 3 times for each conditions and got the average result of TPS.
>                                      |master branch | prototype      | proto/master (%)
>    ------------------------------------------------------------------------------------
>    pgbench -c48 -T60 -Msimple -S   | 131297   |130541   |101%
>    pgbench -c48 -T60 -Msimple      | 4956     |4965   |95%
>    pgbench -c48 -T60 -Mprepared -S |129688   |132538   |97%
>    pgbench -c48 -T60 -Mprepared    |5113   |4615   |84%
>
>   This result seems to show except for prepared protocol with "not only SELECT" it didn't make much difference.

This seems like a pretty large regression to me. And that's an
extremely simplistic case, with tiny caches, and barely any changes to
the cache contents.

Greetings,

Andres Freund


Re: Global shared meta cache

AJG
In reply to this post by Ideriha, Takeshi
Ideriha, Takeshi wrote

> 2) benchmarked 3 times for each conditions and got the average result of
> TPS.
>                                      |master branch | prototype      |
> proto/master (%)
>  
> ------------------------------------------------------------------------------------
>    pgbench -c48 -T60 -Msimple -S   | 131297   |130541   |101%
>    pgbench -c48 -T60 -Msimple      | 4956     |4965   |95%
>    pgbench -c48 -T60 -Mprepared -S |129688   |132538   |97%
>    pgbench -c48 -T60 -Mprepared    |5113   |4615   |84%
>
>
> 001_global_meta_cache.patch (6K)
> <http://www.postgresql-archive.org/attachment/6026686/0/001_global_meta_cache.patch>


Hello,
Apologies for the question; I thought I would just double-check the percentages
that have been presented.
Is the percentage calculation correct?
#1 and #3 look inverted to me (lower when they should be higher and vice
versa), and #2 and #4 look incorrect generally (the percentages look much
larger than they should be, based on the numbers).

I.e., for Msimple -S the prototype had slightly worse TPS performance (130541)
versus master (131297); I would expect the percentage to be e.g. 99%, not 101%.

But I may be misunderstanding something :)

Also, Msimple is 4956 master versus 4965 prototype, just a 9 TPS change (a
very slight improvement in TPS), but the percentage provided is 95%. I would
expect it to be just over 100%.
Again, maybe I'm not understanding, and I'm hoping it is just my error :)
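For reference, recomputing proto/master directly from the TPS numbers reported upthread gives the ratios one would expect (a throwaway check, not part of any patch):

```c
#include <assert.h>

/* proto/master as a rounded percentage. */
int pct(double proto, double master)
{
    return (int)(proto / master * 100.0 + 0.5);
}

/* Recomputed from the reported table:
 *   simple -S:   130541 / 131297  ->  99%
 *   simple:        4965 /   4956  -> 100%
 *   prepared -S: 132538 / 129688  -> 102%
 *   prepared:      4615 /   5113  ->  90%
 */
```

These recomputed values match the inversion suspected above (rows #1 and #3 swapped direction, #2 and #4 off), which the original poster later confirms was a calculation mistake.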



--
Sent from: http://www.postgresql-archive.org/PostgreSQL-hackers-f1928748.html


RE: Global shared meta cache

Tsunakawa, Takayuki
In reply to this post by Ideriha, Takeshi
From: Ideriha, Takeshi [mailto:[hidden email]]
> 1) Initialized with pgbench -i -s10
...
>    pgbench -c48 -T60 -Msimple      | 4956     |4965
> |95%

The scaling factor should be much greater than the number of clients.  Otherwise, multiple clients would conflict on the same rows of the branches table, which might hide the additional overhead of the global metacache.

And I think -j should be 12 or so on your 16-core server, so that the pgbench client can maximize its concurrency.


Regards
Takayuki Tsunakawa






Re: Global shared meta cache

konstantin knizhnik
In reply to this post by Ideriha, Takeshi


On 26.06.2018 09:48, Ideriha, Takeshi wrote:

> [...]
> I tested pgbench to compare master branch with my patch.
>
> 0) Environment
>     - RHEL 7.4
>     - 16 cores
>     - 128 GB memory
>
> 1) Initialized with pgbench -i -s10
>
> 2) benchmarked 3 times for each conditions and got the average result of TPS.
>                                       |master branch | prototype      | proto/master (%)
>     ------------------------------------------------------------------------------------
>     pgbench -c48 -T60 -Msimple -S   | 131297   |130541   |101%
>     pgbench -c48 -T60 -Msimple      | 4956     |4965   |95%
>     pgbench -c48 -T60 -Mprepared -S |129688   |132538   |97%
>     pgbench -c48 -T60 -Mprepared    |5113   |4615   |84%
>
>    This result seems to show except for prepared protocol with "not only SELECT" it didn't make much difference.
>    
>
> What do you think about it?
> Before I dig deeper, I want to hear your thoughts.
>
> Best regards,
> Takeshi Ideriha
>

Hi,
I really think that we need to move to global caches (and especially
catalog caches) in Postgres.
Modern NUMA servers may have hundreds of cores, and to be able to utilize
all of them we may need to start a large number (hundreds) of backends.
The memory overhead of a local cache multiplied by 1000 can be quite significant.

But I am not sure that just using an RW lock will be enough to replace the
local cache with a global one.
I am quite skeptical about the performance results you have provided.
Once the dataset completely fits in memory (which is true in your case),
select-only pgbench with prepared statements should be about two times
faster than without prepared statements. And in your case performance with
prepared statements is even worse.

I wonder if you repeated each measurement multiple times, to make
sure that it is not just a fluctuation.
Also, which PostgreSQL configuration did you use? If it is the default
postgresql.conf with a 128MB shared buffers size,
then you are measuring the time of disk access, and the catalog cache is
not relevant for performance in this case.

Below are the results I got with pgbench scale 100 (with scale 10 the
results are slightly better) on my desktop with just 16GB of RAM and 4 cores:

                                    | master branch | prototype | proto/master (%)
    --------------------------------+---------------+-----------+-----------------
    pgbench -c10 -T60 -Msimple -S   |       187189  |   182123  |  97%
    pgbench -c10 -T60 -Msimple      |        15495  |    15112  |  97%
    pgbench -c10 -T60 -Mprepared -S |        98273  |    92810  |  94%
    pgbench -c10 -T60 -Mprepared    |        25796  |    25169  |  97%

As you can see, there are no surprises here: the negative effect of the
shared cache is largest for the case of non-prepared selects
(because the selects themselves are much faster than updates, and during
compilation we have to access relations multiple times).

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



RE: Global shared meta cache

Ideriha, Takeshi
In reply to this post by AJG
>-----Original Message-----
>From: AJG [mailto:[hidden email]]
>Sent: Wednesday, June 27, 2018 3:21 AM
>To: [hidden email]
>Subject: Re: Global shared meta cache
>
>Ideriha, Takeshi wrote
>> 2) benchmarked 3 times for each conditions and got the average result
>> of TPS.
>>                                      |master branch | prototype      |
>> proto/master (%)
>>
>> ------------------------------------------------------------------------------------
>>    pgbench -c48 -T60 -Msimple -S   | 131297   |130541   |101%
>>    pgbench -c48 -T60 -Msimple      | 4956     |4965   |95%
>>    pgbench -c48 -T60 -Mprepared -S |129688   |132538   |97%
>>    pgbench -c48 -T60 -Mprepared    |5113   |4615   |84%
>>
>>
>> 001_global_meta_cache.patch (6K)
>> <http://www.postgresql-archive.org/attachment/6026686/0/001_global_
>> meta_cache.patch>
>
>
>Hello,
>Apologies for question. I thought I would just double check percentages that have
>been presented.
>Is the percentage calculation correct?
>as #1 and #3 look inverted to me (say lower when should be higher and vice versa),
>and
>#2 and #4 look incorrect generally (percentages look much larger than they should be
>based on numbers.
>
>I.e. Msimple -S the protype had slightly worse tps performance (130541) versus
>Master (131297). I would expect the percentage to be e.g. 99% not 101%
>
>But I may be misunderstanding something :)
>
>Also, Msimple is 4956 master versus 4965 prototype. Just 9 tps change. A very slight
>improvement in tps. but the percentage provided is 95%. I would expect it to be just
>over 100%?
>Again, maybe im not understanding, and hoping it is just my error :)
Hi,
Thank you for the comments, and sorry for the late reply.
Thanks to you, I noticed I made a mistake: as you pointed out, my calculation was wrong.

I also need to change some settings in postgresql.conf and pgbench,
so I'm going to measure performance again and submit the results.

Regards,
Takeshi Ideriha



Re: Global shared meta cache

Robert Haas
In reply to this post by konstantin knizhnik
On Mon, Jul 2, 2018 at 5:59 AM, Konstantin Knizhnik
<[hidden email]> wrote:
> But I am not sure that just using RW lock will be enough replace local cache
> with global.

I'm pretty sure it won't.  In fact, no matter what kind of locking you
use, it's bound to cost something.  There is no such thing as a free
lunch.  It does not seem realistic to me to suppose that we're going
to just get rid of all of our backend-private caches and replace them
with a shared cache and somehow there will be no performance
regression.  Maybe someone will come up with something that is
surprisingly awesome, but I doubt it.

I think we need to take a little bit broader view of this problem.
For instance, maybe we could have backend-local caches that are kept
relatively small, and then a larger shared cache that can hold more
entries.  There are code comments someplace that say that
CLOBBER_CACHE_ALWAYS is about 100x slower than a regular build, and
CLOBBER_CACHE_RECURSIVELY is about 10000x slower.  Obviously, loading an
entry into our backend-private cache must be a LOT slower than
consulting one.  If the shared cache is only, say, 3x slower than the
backend-private cache, then we might be able to get away with having
only the really-frequently-accessed stuff in the actual private cache
and the somewhat-frequently-accessed stuff in the shared cache.  Now
that's probably still going to cost something, but maybe we can make
that "something" very small in the cases people are actually likely to
hit.  I would guess that we'd want to try to use something like the
st_changecount protocol to make reads very cheap and writes
comparatively more expensive, since by and large cache invalidations
aren't that frequent.
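The changecount idea can be sketched as a seqlock-style counter, loosely modeled on the backend-status st_changecount protocol (a standalone, single-writer illustration; the names are made up, and a production version needs carefully chosen memory barriers beyond what is shown here):

```c
#include <assert.h>
#include <stdatomic.h>

/* A single-writer slot: the writer bumps the counter to an odd value
 * before modifying the payload and back to an even value afterwards;
 * readers retry if they see an odd count, or if the count changed while
 * they were reading. Reads thus take no lock at all. */
struct slot
{
    atomic_uint changecount;
    int         data;
};

void slot_write(struct slot *s, int value)
{
    atomic_fetch_add_explicit(&s->changecount, 1,
                              memory_order_release);  /* now odd: write in progress */
    s->data = value;
    atomic_fetch_add_explicit(&s->changecount, 1,
                              memory_order_release);  /* even again: write complete */
}

int slot_read(struct slot *s)
{
    for (;;)
    {
        unsigned before = atomic_load_explicit(&s->changecount,
                                               memory_order_acquire);
        if (before & 1)
            continue;                   /* writer in progress; retry */

        int value = s->data;

        unsigned after = atomic_load_explicit(&s->changecount,
                                              memory_order_acquire);
        if (before == after)
            return value;               /* no concurrent write observed */
    }
}
```

The design choice this illustrates is exactly the asymmetry described above: readers are very cheap (two counter loads), while writers pay two atomic increments, which is acceptable if invalidations are rare.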

Another approach would be to consider whether we're caching too much
useless junk in the first place.  For instance, maybe there's some
stuff in the relcache that takes up a lot of memory compared to how
much of a performance boost it produces, or maybe there's some stuff
that could be represented more compactly.

One experiment I think would be interesting is to study how much
catcache traffic we're actually generating and see if we can find any
way to reduce it.  For instance, somebody could write code to record
the file and line number for every catcache lookup and then run a test
workload (or several test workloads).  That might give us some ideas
about stuff we could just decide not to cache, especially if we also
knew how large each cache ended up being.
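Recording the call site of every catcache lookup, as proposed, could be as simple as a wrapper macro that captures __FILE__/__LINE__ (a standalone sketch with hypothetical names; real instrumentation would hook SearchCatCache() itself):

```c
#include <assert.h>
#include <string.h>

/* Tally lookups per (file, line) call site. */
#define MAX_SITES 64

struct site { const char *file; int line; long count; };
static struct site sites[MAX_SITES];
static int nsites;

void record_site(const char *file, int line)
{
    for (int i = 0; i < nsites; i++)
        if (sites[i].line == line && strcmp(sites[i].file, file) == 0)
        {
            sites[i].count++;
            return;
        }
    if (nsites < MAX_SITES)
        sites[nsites++] = (struct site){ file, line, 1 };
}

long site_count(const char *file, int line)
{
    for (int i = 0; i < nsites; i++)
        if (sites[i].line == line && strcmp(sites[i].file, file) == 0)
            return sites[i].count;
    return 0;
}

/* Stand-in for the real lookup; the macro attributes each call
 * to the place in the source that made it. */
int cache_lookup_impl(int key) { return key * 2; }
#define cache_lookup(key) \
    (record_site(__FILE__, __LINE__), cache_lookup_impl(key))
```

Dumping the table at backend exit, sorted by count, would show which call sites dominate catcache traffic and therefore which entries are candidates for not caching at all.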

With respect to partitioning specifically, it seems like we might be
able to come up with some way of planning that doesn't need a full
relcache entry for every partition, particularly if there are no
partition-local objects (indexes, triggers, etc.).  But that seems
like a hard refactoring, and even if we did it, what about execution
time?  So much code expects to be handed a Relation.  Still, I have a
suspicion that there might be some way to do better here with enough
work.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Global shared meta cache

konstantin knizhnik


On 05.07.2018 17:00, Robert Haas wrote:

> On Mon, Jul 2, 2018 at 5:59 AM, Konstantin Knizhnik
> <[hidden email]> wrote:
>> But I am not sure that just using RW lock will be enough replace local cache
>> with global.
> I'm pretty sure it won't.  In fact, no matter what kind of locking you
> use, it's bound to cost something.  There is no such thing as a free
> lunch.  It does not seem realistic to me to suppose that we're going
> to just get rid of all of our backend-private caches and replace them
> with a shared cache and somehow there will be no performance
> regression.  Maybe someone will come up with something that is
> surprisingly awesome, but I doubt it.
>
> I think we need to take a little bit broader view of this problem.
> For instance, maybe we could have backend-local caches that are kept
> relatively small, and then a larger shared cache that can hold more
> entries.  There are code comments someplace that say that
> CLOBBER_CACHE_ALWAYS is about 100x slower than a regular build, and
> CLOBBER_CACHE_RECURSIVELY is about 10000x slower.  Obviously, loading an
> entry into our backend-private cache must be a LOT slower than
> consulting one.  If the shared cache is only, say, 3x slower than the
> backend-private cache, then we might be able to get away with having
> only the really-frequently-accessed stuff in the actual private cache
> and the somewhat-frequently-accessed stuff in the shared cache.  Now
> that's probably still going to cost something, but maybe we can make
> that "something" very small in the cases people are actually likely to
> hit.  I would guess that we'd want to try to use something like the
> st_changecount protocol to make reads very cheap and writes
> comparatively more expensive, since by and large cache invalidations
> aren't that frequent.
> [...]
Running at 94% of master on my desktop seems not such a significant
degradation.
But I tried this patch on a more powerful server with 24 physical cores,
and there the negative effect of global cache synchronization was much
more dramatic:
196k TPS vs. 395k TPS for select-only pgbench with -S -c 100 -j 10.
Almost two times!

So we really need a more sophisticated and smarter solution to the problem
of global caches.
It looks like a combination of small local and big global caches is the
best alternative.




--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: Global shared meta cache

Andres Freund
In reply to this post by Robert Haas
Hi,

On 2018-07-05 10:00:13 -0400, Robert Haas wrote:
> I think we need to take a little bit broader view of this problem.
> For instance, maybe we could have backend-local caches that are kept
> relatively small, and then a larger shared cache that can hold more
> entries.

I think it's pretty much *required* that we have that. Not just for
speed, but for correctness. It'll otherwise be very hard to deal with
transactional DDL.  I'm pretty sure that we'll have to have everything
modified by the local transaction in that cache. There are a lot of
interesting additional problems with snapshots still, but that seems like
a baseline requirement.


> Obviously, loading an entry into our backend-private cache must be a
> LOT slower than consulting one.

I'm not actually sure it makes *that* much of a difference, if the local
cache is small.


> I would guess that we'd want to try to use something like the
> st_changecount protocol to make reads very cheap and writes
> comparatively more expensive, since by and large cache invalidations
> aren't that frequent.

That strikes me as hard to get right and fatal to get wrong. I suspect
normal RW locking ought to do fine.


> One experiment I think would be interesting is to study how much
> catcache traffic we're actually generating and see if we can find any
> way to reduce it.  For instance, somebody could write code to record
> the file and line number for every catcache lookup and then run a test
> workload (or several test workloads).  That might give us some ideas
> about stuff we could just decide not to cache, especially if we also
> knew how large each cache ended up being.

We definitely do a lot of redundant lookups for the same entries.

Greetings,

Andres Freund


RE: Global shared meta cache

Ideriha, Takeshi
In reply to this post by srielau
>-----Original Message-----
>From: [hidden email] [mailto:[hidden email]]
>Sent: Wednesday, June 27, 2018 2:04 AM
>To: Ideriha, Takeshi/出利葉 健 <[hidden email]>; pgsql-hackers
><[hidden email]>
>Subject: RE: Global shared meta cache
>
>Takeshi-san,
>
>
>>My customer created hundreds of thousands of partition tables and tried
>>to select data from hundreds of applications, which resulted in
>>enormous consumption of memory because it consumed # of backend multiplied by
>#  of local memory (ex. 100 backends X 1GB = 100GB).
>>Relation caches are loaded on each backend local memory.
>My team and I have been working to make caches shared for the past two years, but
>the system and rel caches we have chosen not to share..
>Reason being that these caches play a big role in transactional DDL processing.
>When you do DDL your backend can see all the changes since you update your own
>cache, but no anyone else's until you commit.
>You will find that dealing with that will be the true complexity.

Hi Serge,

Thank you for sharing your experience.
I hadn't thought much about the DDL visibility problem.
Introducing version control like MVCC to the catcache might solve the problem, but it seems like too much to me.
It may be better to keep a local catcache for in-progress transactions rather than sharing everything.
(Other hackers also pointed this out.)

>Have you tried to simply cap the size of these caches?
>That's a rather straight forward piece of work and will get you quite far.
>We run with a 20MB syscache and a 10MB relcache with 100k+ objects and hundreds
>of backends A dumb LRU is plenty good for the purpose.
>

I haven't tried it yet, but I read some relevant discussion:
 https://www.postgresql.org/message-id/flat/20161219.201505.11562604.horiguchi.kyotaro@...

I think the cap solution alleviates memory bloat in some cases, but it is still problematic if there are very many backends.

>That being said I would love to see these caches shared. :-)
Thank you!

Regards,
Takeshi

Re: Global shared meta cache

Amit Langote-2
In reply to this post by Robert Haas
On 2018/07/05 23:00, Robert Haas wrote:
> With respect to partitioning specifically, it seems like we might be
> able to come up with some way of planning that doesn't need a full
> relcache entry for every partition, particularly if there are no
> partition-local objects (indexes, triggers, etc.).
We won't know that there are no partition-local objects until we open them
though, right?  As you said, there might be a way to refactor things such
that just knowing that there are no partition-local objects becomes
cheaper than doing a full-fledged RelationBuildDesc.

Thanks,
Amit



RE: Global shared meta cache

Ideriha, Takeshi
In reply to this post by konstantin knizhnik
Hi, Konstantin

>Hi,
>I really think that we need to move to global caches (and especially catalog caches) in
>Postgres.
>Modern NUMA servers may have hundreds of cores and to be able to utilize all of them,
>we may need to start large number (hundreds) of backends.
>Memory overhead of local cache multiplied by 1000 can be quite significant.

Yeah, thank you for the comment.


>I am quite skeptical concerning performance results you have provided.
>Once dataset completely fits in memory (which is true in your case), select-only
>pgbench with prepared statements should be about two times faster, than without
>prepared statements. And in your case performance with prepared statements is even
>worser.
>
>I wonder if you have repeated each measurement multiple time, to make sure that it
>is not just a fluctuation.
>Also which postgresql configuration you have used. If it is default postgresql.conf with
>128Mb shared buffers size, then you are measuring time of disk access and catalog
>cache is not relevant for performance in this case.
>
>Below are result I got with pgbench scale 100 (with scale 10 results are slightly better)
>at my desktop with just 16Gb of RAM and 4 ccore.:
>
>                                    | master branch | prototype | proto/master (%)
>    --------------------------------+---------------+-----------+-----------------
>    pgbench -c10 -T60 -Msimple -S   |       187189  |   182123  |  97%
>    pgbench -c10 -T60 -Msimple      |        15495  |    15112  |  97%
>    pgbench -c10 -T60 -Mprepared -S |        98273  |    92810  |  94%
>    pgbench -c10 -T60 -Mprepared    |        25796  |    25169  |  97%
>
>As you see there are no surprises here: negative effect of shared cache is the largest
>for the case of non-prepared selects (because selects themselves are much faster
>than updates and during compilation we have to access relations multiple times).
>

As you pointed out, my shared_buffers and scaling factor were too small.
I did the benchmark again with new settings, and my result seems to reproduce yours.

On a machine with 128GB of memory and 16 cores, shared_buffers was set to 32GB and
the db was initialized with -s100.

TPS result follows (mean of 10 measurements; rounded off to integers):
                                         | master branch | prototype | proto/master (%)
   --------------------------------------+---------------+-----------+-----------------
   pgbench -c48 -T60 -j16 -Msimple -S    |       122140  |   114103  |  93
   pgbench -c48 -T60 -j16 -Msimple       |         7858  |     7822  | 100
   pgbench -c48 -T60 -j16 -Mprepared -S  |       221740  |   210778  |  95
   pgbench -c48 -T60 -j16 -Mprepared     |         9257  |     8998  |  97
 
As you mentioned, SELECT-only queries have more overhead.

(By the way, I think in a later email you mentioned results when the number of concurrent clients is larger.
 I'll also try to check that.)

====================
Takeshi Ideriha
Fujitsu Limited



RE: Global shared meta cache

Ideriha, Takeshi
In reply to this post by Amit Langote-2
Hi,

Thank you for the discussion a while ago.
I'm afraid I haven't replied to everyone.

To move this development forward I have attached a PoC patch.

I introduced a GUC called shared_catcache_mem to specify
how much memory is supposed to be allocated in the shared memory area.
It defaults to zero, which indicates that no catalog cache is shared
and caches are allocated in each backend's MemoryContext (same as current Postgres).

At this moment the patch only allocates the catalog cache header and CatCache data in the shared memory area.
It doesn't do much work yet, just starting and stopping the postgres server with shared_catcache_mem non-zero.

The shared version of CatCacheHdr is put in the postgres-initialized shared memory so that backends can attach to it
and build SysCache[] to store pointers to each CatCache.
Each CatCache, CatCTup and CatCList is also allocated in the shared memory area,
whose size limit is the value of shared_catcache_mem, backed by DSA.
This area is first created in the postgres-initialized shared memory and then re-initialized as a DSA area,
because the address of the postgres-initialized shared area does not change among processes,
which hopefully makes it easy to handle pointers in shared memory.
(Though I'm still struggling to grasp the ideas of DSA and the underlying DSM..)

The following are major items I haven't touched:
- making the hash table of each CatCache shared, for which I'm going to take advantage of dshash
- how to evict shared cache entries (an LRU mechanism)
- how to treat cache visibility and invalidation coming from transactions, including DDL
- how to alleviate the slowness compared to current PostgreSQL
- making the relcache shared as well as the catcache

If you have any insights/reactions/suggestions, please feel free to comment.

====================
Takeshi Ideriha
Fujitsu Limited



Attachment: 0001-PoC-Allocate-catcache-on-the-shared-memory.patch (24K)

RE: Global shared meta cache

Ideriha, Takeshi
Hi,

>From: Ideriha, Takeshi [mailto:[hidden email]]
>Sent: Wednesday, October 3, 2018 3:18 PM
>At this moment this patch only allocates catalog cache header and CatCache data on
>the shared memory area.
I'm trying to handle this allocation work in another thread [1] in a broader way.

>The followings are major items I haven't touched:
>- how to treat cache visibility and invalidation coming from transactions including DDL

Some of you commented on this point before, but at that time I had less knowledge
and couldn't reply to them immediately. Sorry for that.

Right now I hit upon two things.
Plan A is that all of the work is done in shared memory and no local cache is used.
Plan B is that both a shared cache and local caches are used.
Based on the discussion in this thread several months ago, plan B is probably better.
But there are some variations of plan B, so I'd like to hear opinions.

A. Use only shared memory
Because everything would be done inside shared memory, this needs the same machinery as the current shared_buffers.
That is, handling transactions (including DDL) properly needs MVCC, and cleaning up obsolete caches needs vacuum.
Taking advantage of MVCC and vacuum would work, but it seems to me pretty tough to implement.
So another option is plan B, which handles version control of caches and cleans them up in a different way.

B. Use both shared memory and local memory
The basic policy is that the shared memory keeps the latest version of each cache as much as possible, and each cache carries version information (xmin, xmax).
The local cache is a kind of cache of the shared one, and its lifetime is temporary.

[Search cache]
When a backend wants to use relation or catalog cache in a transaction, it tries to find them in a following order:
1. local cache
2. shared cache
3. disk

At first there is no local cache, so the backend searches the shared cache and, if found, loads the entry into local memory.
If the wanted cache is not found in shared memory either, the backend fetches it from disk.
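The lookup order above can be sketched as follows (an illustrative Python simulation, not PostgreSQL code; all names are hypothetical):

```python
def search_cache(key, local_cache, shared_cache, read_from_disk):
    """Plan B lookup order: 1. local cache, 2. shared cache, 3. disk."""
    if key in local_cache:                    # 1. local hit
        return local_cache[key]
    if key in shared_cache:                   # 2. shared hit: load into local
        local_cache[key] = shared_cache[key]
        return local_cache[key]
    tup = read_from_disk(key)                 # 3. double miss: fetch from disk
    shared_cache[key] = tup                   # make it visible to other backends
    local_cache[key] = tup
    return tup

local, shared = {}, {"pg_class:foo": "foo-tuple"}
hit = search_cache("pg_class:foo", local, shared, lambda k: "disk:" + k)
miss = search_cache("pg_class:bar", local, shared, lambda k: "disk:" + k)
```

Whether a disk-fetched entry is also pushed into the shared cache at this point is a design choice; here it is, so other backends benefit from the fetch.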

[Lifetime of local cache]
When ALTER TABLE/DROP TABLE is issued in a transaction, relevant local cache should be different from the original one.
On this point I'm thinking two cases.
B-1: Create a local cache at the first reference and keep it until the transaction ends.
     The relevant local cache is updated or deleted when DROP/ALTER is issued, and freed when the transaction commits or aborts.
B-2: The lifetime of a local cache is one snapshot. If the isolation level is read committed, the local cache is deleted every time a new command is issued.

In case B-1, sinval message machinery is necessary to update the local cache, the same as the current machinery.
Case B-2, on the other hand, doesn't need sinval messages because the local cache is deleted once the snapshot expires.
There is a trade-off between B-1 and B-2: B-1 would outperform B-2,
but B-2 would use less memory.

[Invalidation of shared cache]
I'm thinking that invalidating the shared cache can be the responsibility of a backend that wants to see the latest version,
rather than the one that committed the DROP/ALTER command. In my sketch each cache carries its own version information,
so a transaction can compare its snapshot with the shared cache's version; if the cache is not the wanted one, it can be obtained from disk.
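The version comparison could look roughly like this (a simplified Python sketch under the assumption that xmin/xmax are plain xids compared against a single snapshot xid; real snapshot semantics are more involved):

```python
INVALID_XID = 0  # hypothetical marker meaning "xmax not set yet"

def cache_visible(entry, snapshot_xid):
    """An entry is the wanted version if it was created at or before our
    snapshot and has not been superseded as of that snapshot."""
    if entry["xmin"] > snapshot_xid:
        return False                         # created after our snapshot
    xmax = entry["xmax"]
    return xmax == INVALID_XID or xmax > snapshot_xid

old_version = {"xmin": 100, "xmax": 200}     # superseded by DDL at xid 200
new_version = {"xmin": 200, "xmax": INVALID_XID}
```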

Do you have any thoughts?

[1] https://www.postgresql.org/message-id/flat/4E72940DA2BF16479384A86D54D0988A6F1EE452%40G01JPEXMBKW04 
Regards,
Takeshi Ideriha


Re: Global shared meta cache

Michael Paquier-2
On Mon, Nov 26, 2018 at 12:12:09PM +0000, Ideriha, Takeshi wrote:
> On this allocation stuffs I'm trying to handle it in another thread
>  [1] in a broader way.

Based on the latest updates of this thread, this is waiting for
review, so moved to next CF.
--
Michael


RE: Global shared meta cache

Ideriha, Takeshi
In reply to this post by Ideriha, Takeshi
>From: Ideriha, Takeshi [mailto:[hidden email]]
>Do you have any thoughts?
>
Hi, I updated my idea, hoping to get some feedback.

[TL; DR]
The basic idea is following 4 points:
A. User can choose which database to put a cache (relation and catalog) on shared memory and how much memory is used
B. Caches of committed data are on the shared memory. Caches of uncommitted data are on the local memory.
C. Caches on the shared memory have xid information (xmin, xmax)
D. Evict not recently used cache from shared memory


[A]
Regarding point A, I imagine some databases are accessed by lots of clients while others aren't.
So I introduced a new parameter in postgresql.conf, "shared_meta_cache",
which is disabled by default and needs server restart to enable.
ex. shared_meta_cache = 'db1:500MB, db2:100MB'.

Some catcaches, like the one for pg_database, are shared across all databases,
so such shared catcaches are allocated in a dedicated space within shared memory.
This space can be controlled by a "shared_meta_global_catcache" parameter, named after the global directory.
But I want this parameter to be hidden in postgresql.conf to keep things simple for users; it's too detailed.
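For illustration, the value of the proposed shared_meta_cache parameter could be parsed roughly like this (a Python sketch; the function name and supported unit suffixes are my assumptions, not part of the patch):

```python
UNITS = {"kB": 1024, "MB": 1024 ** 2, "GB": 1024 ** 3}

def parse_shared_meta_cache(value):
    """Turn a setting like 'db1:500MB, db2:100MB' into {dbname: bytes}."""
    result = {}
    for item in value.split(","):
        dbname, _, size = item.strip().partition(":")
        for unit, factor in UNITS.items():
            if size.endswith(unit):
                result[dbname] = int(size[: -len(unit)]) * factor
                break
        else:
            result[dbname] = int(size)       # no unit suffix: plain bytes
    return result

cfg = parse_shared_meta_cache("db1:500MB, db2:100MB")
```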

[B & C]
Regarding B & C, the motivation is that we don't want other backends to see uncommitted tables.
Search order is local memory -> shared memory -> disk.
A backend searches the cache in shared memory based on its own snapshot and the xid information of each cache entry.
When a cache is not found in shared memory, a cache entry with xmin is created in shared memory (but not in the local one).

When a cache definition is changed by DDL, a new cache is created locally, so subsequent commands in that transaction refer to the local cache as needed.
On commit, the local cache is cleared and the shared cache is updated: xmax is added to the old entry,
and a new entry with xmin is also created. The idea behind creating the new entry immediately is that a newly created or altered table
is likely to be used by subsequent transactions. At this point we can perhaps reuse the current invalidation mechanism,
even though no invalidation message is sent to other backends.
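The commit-time update described above might be sketched like this (illustrative Python only; it keeps cache versions in a list per key, which is my assumption about the data layout):

```python
INVALID_XID = 0

def commit_ddl(shared_versions, key, new_data, commit_xid):
    """At commit: stamp xmax on the still-current shared entry, then add a
    fresh entry with xmin = commit_xid, since the new table definition is
    likely to be used by the next transactions."""
    versions = shared_versions.setdefault(key, [])
    for v in versions:
        if v["xmax"] == INVALID_XID:
            v["xmax"] = commit_xid           # old version superseded at commit
    versions.append({"data": new_data, "xmin": commit_xid, "xmax": INVALID_XID})

shared = {"pg_class:foo": [{"data": "v1", "xmin": 10, "xmax": INVALID_XID}]}
commit_ddl(shared, "pg_class:foo", "v2", 50)
```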

[D]
As for D, I'm planning to benchmark with simple LRU first; if the performance is bad, I'll switch to another algorithm like clock sweep.
We don't need eviction for the local cache because its lifetime is a single transaction, and I don't want it to bloat.
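The simple LRU baseline for the shared cache could look like this (a Python sketch of the eviction policy only; the real implementation would of course live in shared memory with locking):

```python
from collections import OrderedDict

class SharedCacheLRU:
    """Simple LRU eviction for the shared cache, as a first thing to benchmark."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()          # ordered oldest-first

    def get(self, key):
        if key not in self.entries:
            return None
        self.entries.move_to_end(key)         # mark as recently used
        return self.entries[key]

    def put(self, key, value):
        self.entries[key] = value
        self.entries.move_to_end(key)
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict least recently used

cache = SharedCacheLRU(2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")          # touch "a" so "b" becomes the eviction victim
cache.put("c", 3)
```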

best regards,
Takeshi Ideriha




RE: Global shared meta cache

Ideriha, Takeshi
>From: Ideriha, Takeshi [mailto:[hidden email]]
>[TL; DR]
>The basic idea is following 4 points:
>A. User can choose which database to put a cache (relation and catalog) on shared
>memory and how much memory is used
>B. Caches of committed data are on the
>shared memory. Caches of uncommitted data are on the local memory.
>C. Caches on the shared memory have xid information (xmin, xmax)
>D. Evict not recently used cache from shared memory

I updated some thoughts about B and C for CatCache.
I'd be very happy if you could comment.

>[B & C]
>Regarding B & C, the motivation is we don't want other backends to see uncommitted
>tables.
>Search order is local memory -> shared memory -> disk.
>Local process searches cache in shared memory based from its own snapshot and xid
>of cache.
>When cache is not found in shared memory, cache with xmin is made in shared
>memory ( but not in local one).
>
>When cache definition is changed by DDL, new cache is created in local one, and thus
>next commands refer to local cache if needed.
>When it's committed, local cache is cleared and shared cache is updated. This update
>is done by adding xmax to old cache and also make a new one with xmin. The idea
>behind adding a new one is that newly created cache (new table or altered table) is
>likely to be used in next transactions. At this point maybe we can make use of current
>invalidation mechanism, even though invalidation message to other backends is not
>sent.

My current thoughts:
- Each catcache has (maybe partial) HeapTupleHeader
- put every catcache on shared memory and no local catcache
- but catcache for aborted tuple is not put on shared memory
- Hash table exists per kind of CatCache
- These hash tables exist per database, and are shared
  - e.g.) there is one hash table for pg_class of each DB

Why I'm leaning toward not using a local cache:
- At commit time you would need to copy the local cache to the global cache, which
  would delay response time.
- Even if an uncommitted catcache entry is in shared memory, other transactions cannot
  see it. In my idea entries carry xid information and visibility is checked
  by comparing the xmin/xmax of the entry with the snapshot.

OK then, if we put the catcache in shared memory, we need to check entry visibility.
But if we used exactly the same visibility check mechanism as for heap tuples,
it would take many more steps than the current local catcache search.
The current visibility check is based on snapshot checks plus commit/abort checks.
So I'm thinking of only storing in-progress or committed caches. This would
save the time for checking catcache status (commit/abort) while searching the cache.
Basically I'm going to use the current visibility check mechanism minus the commit/
abort check (in other words, the clog check).

This is how it works:
- When creating a catcache entry, copy the heap tuple including its HeapTupleHeader
- When an update/delete command on a catalog tuple finishes,
  update the xmax of the corresponding cache entry
- If there is a cache entry whose xmin is an aborted xid, delete the entry
- If there is a cache entry whose xmax is an aborted xid, reset its xmax
- At commit time, no action on the shared cache is needed
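The abort-side rules above can be sketched as follows (a Python simulation of the bookkeeping only; names are hypothetical):

```python
INVALID_XID = 0

def apply_abort(shared, aborted_xid):
    """Abort handling per the rules above: a cache created by the aborted
    xid is deleted; an xmax stamped by the aborted xid is reset.
    Commit, by contrast, requires no action on the shared cache."""
    for key in list(shared):
        entry = shared[key]
        if entry["xmin"] == aborted_xid:
            del shared[key]                  # creator rolled back
        elif entry["xmax"] == aborted_xid:
            entry["xmax"] = INVALID_XID      # superseding update rolled back

shared = {
    "new_tbl": {"xmin": 5, "xmax": INVALID_XID},  # created by xid 5
    "old_tbl": {"xmin": 1, "xmax": 5},            # xmax set by xid 5
}
apply_abort(shared, 5)
```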

Pending items are
- thoughts about shared relcache
- "vacuum" process for shared cache

Regards,
Ideriha Takeshi




RE: Global shared meta cache

Ideriha, Takeshi
Hi, everyone.

>From: Ideriha, Takeshi [mailto:[hidden email]]
>My current thoughts:
>- Each catcache has (maybe partial) HeapTupleHeader
>- put every catcache on shared memory and no local catcache
>- but catcache for aborted tuple is not put on shared memory
>- Hash table exists per kind of CatCache
>- These hash tables exists for each database and shared
>  - e.g) there is a hash table for pg_class of a DB

I talked about the shared CatCache (SysCache) with Thomas at PGCon and he
suggested using sinval to control cache visibility instead of xids.
Based on this I've changed my design. I'll send a PoC patch in a week
but want to share my idea beforehand. I'm sorry this email is long,
but I'd be happy to get comments.

Basically I won't make the shared catcache the default; it will be an option.

Both local and shared memory have catcache hash tables. A shared hash
entry is the catctup itself, and a local hash entry is a pointer to the
shared catctup. Actually, the local hash entry does not hold a direct
pointer; it points to a handle of the shared catctup. The handle points
to the shared catctup and is itself located in shared memory. This is
intended to avoid dangling pointers in local hash entries when a shared
catctup is evicted by LRU. (The details of LRU will be covered in another
email because I'll implement it later.)
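The handle indirection can be illustrated like this (a Python sketch; the class and function names are hypothetical, and `catctup = None` stands in for eviction of the shared tuple):

```python
class CatCTupHandle:
    """Handle living in shared memory: local hash entries point at the
    handle, never directly at the catctup, so LRU eviction of the catctup
    cannot leave a dangling pointer in a backend's local hash table."""
    def __init__(self, catctup):
        self.catctup = catctup   # set to None when the tuple is evicted

def local_lookup(local_hash, key):
    handle = local_hash.get(key)
    if handle is None or handle.catctup is None:
        return None              # local miss or evicted: go to the shared table
    return handle.catctup

handle = CatCTupHandle("pg_class tuple for foo")
local_hash = {"foo": handle}
before = local_lookup(local_hash, "foo")
handle.catctup = None            # simulate LRU eviction of the shared catctup
after = local_lookup(local_hash, "foo")
```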

* Search and Insert
Current postgres searches the (local) hash table and, on a miss, searches
the actual catalog (shared buffers and disk) and builds the cache, or
builds a negative cache entry if the tuple is not found.

In the new architecture, if the cache is not found in the local hash
table, postgres tries the shared one before consulting shared buffers.
In detail: postgres first looks up the pointer in the local hash table;
if found, it dereferences it and gets the catctup. If not, it searches
the shared hash table; if the catctup is found there, it inserts a
pointer to it into the local hash table. If it doesn't exist in the
shared hash table either, postgres searches the actual catalog, builds
the cache, and in most cases inserts it into the shared hash table
with a pointer to it in the local one. The exception is a cache built
from an uncommitted catalog tuple, which must not be seen by other
processes; such an entry is built in local memory and pushed directly
into the local table, never the shared one. Lastly, if the tuple we're
looking for does not exist at all, a negative entry is put into the
shared hash table.
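Putting those cases together, the search-and-insert path might look like this (a Python simulation; the `uncommitted` flag and `NEGATIVE` sentinel are illustrative stand-ins for the real visibility and negative-entry machinery):

```python
NEGATIVE = "<negative entry>"

def lookup(key, local_hash, shared_hash, catalog):
    """Search local -> shared -> catalog. A cache built from an uncommitted
    catalog tuple goes only into the local table; a missing tuple becomes a
    negative entry in the shared table."""
    if key in local_hash:
        return local_hash[key]
    if key in shared_hash:
        local_hash[key] = shared_hash[key]   # remember the pointer locally
        return local_hash[key]
    tup = catalog.get(key)
    if tup is None:
        shared_hash[key] = NEGATIVE          # shared negative entry
        return NEGATIVE
    local_hash[key] = tup
    if not tup["uncommitted"]:               # others must not see this tuple
        shared_hash[key] = tup
    return tup

catalog = {"foo": {"uncommitted": False}, "tmp": {"uncommitted": True}}
local, shared = {}, {}
lookup("foo", local, shared, catalog)
lookup("tmp", local, shared, catalog)
lookup("gone", local, shared, catalog)
```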

* Invalidation and visibility control
Now let's talk about invalidation. Current cache invalidation is based
on local and shared invalidation queues (sinval). When a transaction
commits, sinval messages are queued into the shared queue; other
processes read and process them at their own timing.

For the shared catcache I follow the current sinval design in most parts,
but change what happens when a sinval message is queued and when it is
read by a process. When messages are added to the shared queue, the
corresponding shared caches (matched by hash value) get their "obsolete"
flag turned on. When a sinval message is read by a process, that process
deletes its local hash entries (pointers to handles). Each process can
keep seeing a shared catctup as long as its local pointer is valid;
because the sinval messages have not been processed yet, it's fine to
keep seeing the possibly old cache. After the local entry is invalidated,
the process searches the shared hash table, where it always finds a
catctup whose obsolete flag is off. The process sees the right shared
cache after reading the invalidation messages because it checks the
obsolete flag and because uncommitted caches never exist in shared
memory at all.
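The two sides of the obsolete-flag scheme can be sketched as (illustrative Python; entries are bucketed by hash value, and all names are hypothetical):

```python
def queue_sinval(shared_hash, hash_value):
    """Sender side: when a message is queued, matching shared caches are
    marked obsolete (but stay readable via existing local pointers)."""
    for entry in shared_hash.get(hash_value, []):
        entry["obsolete"] = True

def read_sinval(local_hash, hash_value):
    """Receiver side: reading the message only drops the local pointers."""
    local_hash.pop(hash_value, None)

def shared_search(shared_hash, hash_value):
    """A fresh shared lookup only ever returns non-obsolete entries."""
    return [e for e in shared_hash.get(hash_value, []) if not e["obsolete"]]

shared = {42: [{"data": "old pg_class row", "obsolete": False}]}
local = {42: "pointer-to-handle"}
queue_sinval(shared, 42)     # DDL committed elsewhere
still_visible = local[42]    # old pointer still usable until the msg is read
read_sinval(local, 42)
fresh = shared_search(shared, 42)
```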

There is a subtle point here. Always finding a shared catctup without the
obsolete mark assumes the process has already read the sinval messages,
so before searching the shared table, I make the process read the sinval
queue. Once it is read, the local cache state is consistent with the
action of fetching a new cache. This read happens at almost the same
point as in current postgres, because it occurs after a local cache miss
in both designs: after a cache miss today, a process opens the relation
and takes a heavyweight lock, and at that point it in fact reads the
sinval messages. (These things are well summarized in Robert Haas's talk
at PGCon [1].)

Lastly, we need to invalidate a shared catctup itself at some point, but
we cannot delete it as long as someone is looking at it, so I'll
introduce a refcounter. It is incremented and decremented at the same
points where current postgres manipulates the local refcounters of
catctup and catclist (that is, SearchCatCache/ReleaseCatCache), which
avoids a catctup being deleted while a catclist uses it or vice versa.
A shared catctup is therefore deleted when its shared refcount drops to
zero and its obsolete flag is on. Once it has vanished from the shared
cache, the obsolete cache never comes back, because a process that fails
to find it in the shared hash table has already read the sinval messages
(in any case it reads them when opening a table and taking a lock).
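The refcount-plus-obsolete deletion rule can be illustrated as (a Python sketch; `acquire`/`release` mimic the SearchCatCache/ReleaseCatCache pairing described above, without any of the real locking):

```python
def acquire(entry):
    """SearchCatCache-style: pin the shared entry while it is in use."""
    entry["refcount"] += 1
    return entry

def release(shared_list, entry):
    """ReleaseCatCache-style: unpin; physically delete only when the entry
    is both unreferenced and marked obsolete."""
    entry["refcount"] -= 1
    if entry["refcount"] == 0 and entry["obsolete"]:
        shared_list.remove(entry)

bucket = [{"data": "row", "refcount": 0, "obsolete": False}]
e = acquire(bucket[0])
e["obsolete"] = True          # invalidation arrives while the entry is pinned
pinned = e in bucket          # cannot be deleted yet: refcount > 0
release(bucket, e)            # last reference dropped: entry is removed
```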


As a first step, setting the performance issue aside, I'll make a PoC
using SharedMemoryContext (ShmContext) [2], which I'm developing to
allocate/free shared items via palloc/pfree.

[1] https://www.pgcon.org/2019/schedule/attachments/548_Challenges%20of%20Concurrent%20DDL.pdf
[2] https://commitfest.postgresql.org/23/2166/

---
Regards,
Takeshi Ideriha



Re: Global shared meta cache

konstantin knizhnik
Takeshi-san,

I am sorry for the late response - I was waiting for a new version of the
patch from you to review.
I read your last proposal and it seems very reasonable.
From my point of view we cannot reach an acceptable level of performance
if we have no local cache at all.
So, as you proposed, we should maintain a local cache for uncommitted data.

I think the size of the global cache should be limited (you have
introduced a GUC for it).
In principle it is possible to use dynamic shared memory and have an
unlimited global cache.
But I do not see much sense in that.

I do not completely understand from your description when you are going
to evict entries from the local cache.
Just once the transaction is committed? I think it would be more efficient
to also specify a memory threshold for the local cache size
and use LRU or some other eviction policy to remove data from it.

So if the working set (accessed relations) fits in the local cache limit,
there will be no performance penalty compared with the current implementation.
There should be no difference at all on pgbench or other benchmarks
with a relatively small number of relations.

If an entry is not found in the local cache, then we should look for it in
the global cache, and in case of a double cache miss read it from disk.
I do not completely understand why we need to store references to global
cache entries in the local cache and use reference counters for global
cache entries.
Why can we not maintain just two independent caches?

While there really are databases with hundreds or even thousands of
tables, an application still tends to work with only a small subset of
them.
So I think the "working set" can still fit in memory. This is why I
think that on a local cache miss with a global cache hit, we should
copy the data from the global cache to the local cache,
making it possible to access it in the future without any synchronization.

As far as we need to keep all uncommitted data in the local cache, there
is still a chance of local memory overflow (if some transaction creates
or alters too many tables).
But I think that is a very exotic and rare use case. The memory overflow
problem usually arises when we have a large number of backends,
each maintaining its own catalog cache.
So I think we should have a "soft" limit for the local cache and a "hard"
limit for the global cache.
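A "soft" local limit as proposed above might work like this (an illustrative Python sketch; the key point is that uncommitted entries, which exist nowhere else, are never evicted, so the cache can temporarily exceed the limit):

```python
from collections import OrderedDict

def enforce_soft_limit(local_cache, soft_limit):
    """Evict least-recently-used committed entries until the cache fits,
    but never evict uncommitted ones (they have no shared copy)."""
    for key in list(local_cache):              # iterate oldest first
        if len(local_cache) <= soft_limit:
            break
        if local_cache[key]["committed"]:
            del local_cache[key]

local = OrderedDict([
    ("uncommitted_tbl", {"committed": False}),  # must survive eviction
    ("tbl_a", {"committed": True}),
    ("tbl_b", {"committed": True}),
])
enforce_soft_limit(local, 1)
```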

I haven't thought much about cache invalidation. I read your proposal,
but frankly speaking do not understand why it should be so complicated.
Why can't we immediately invalidate the entry in the global cache and
lazily (as is done now using invalidation signals) invalidate the local
caches?



On 26.06.2019 9:23, Ideriha, Takeshi wrote:

> [...]

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


