Cache relation sizes?

Cache relation sizes?

Thomas Munro-3
Hello,

PostgreSQL likes to probe the size of relations with lseek(SEEK_END) a
lot.  For example, a fully prewarmed pgbench -S transaction consists
of recvfrom(), lseek(SEEK_END), lseek(SEEK_END), sendto().  I think
lseek() is probably about as cheap as a syscall can be so I doubt it
really costs us much, but it's still a context switch and it stands
out when tracing syscalls, especially now that all the lseek(SEEK_SET)
calls are gone (commit c24dcd0cfd).

If we had a different kind of buffer mapping system of the kind that
Andres has described, there might be a place in shared memory that
could track the size of the relation.  Even if we could do that, I
wonder if it would still be better to do a kind of per-backend
lock-free caching, like this:

1.  Whenever a file size has been changed by extending or truncating
(ie immediately after the syscall), bump a shared "size_change"
invalidation counter.

2.  Somewhere in SMgrRelation, store the last known size_change
counter and the last known size.  In _mdnblocks(), if the counter
hasn't moved, we can use the cached size and skip the call to
FileSize().

3.  To minimise false cache invalidations (caused by other relations'
size changes), instead of using a single size_change counter in shared
memory, use an array of N and map relation OIDs onto them.

4.  As for memory coherency, I think it might be enough to use uint32
without locks or read barriers on the read size, since you have a view
of memory at least as new as your snapshot (the taking of which
included a memory barrier).  That's good enough because we don't need
to see blocks added after our snapshot was taken (the same assumption
applies today, this just takes further advantage of it), and
truncations can't happen while we have a share lock on the relation
(the taking of which also includes memory barrier, covering the case
where the truncation happened after our snapshot and the acquisition
of the share lock on the relation).  In other words, there is heavy
locking around truncation already, and for extension we don't care
about recent extensions so we can be quite relaxed about memory.
Right?
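
For concreteness, a rough sketch of steps 1-3 could look something like
the code below.  MdSharedData and relsize_change_counter are borrowed
from the fragment of the throw-away patch quoted later in this thread
(which uses a single counter rather than an array); the per-SMgrRelation
fields and helper names here are purely illustrative.

/*
 * Illustrative sketch only.  Shared state: an array of N change
 * counters; relations map onto slots by OID (step 3).
 */
#include "port/atomics.h"
#include "storage/block.h"
#include "storage/smgr.h"

#define RELSIZE_COUNTER_SLOTS 64

typedef struct MdSharedData
{
    pg_atomic_uint32 relsize_change_counter[RELSIZE_COUNTER_SLOTS];
} MdSharedData;

static MdSharedData *MdShared;      /* lives in shared memory */

static inline pg_atomic_uint32 *
relsize_counter_for(Oid relid)
{
    return &MdShared->relsize_change_counter[relid % RELSIZE_COUNTER_SLOTS];
}

/* Step 1: call immediately after the extend/truncate syscall. */
static void
relsize_change_bump(Oid relid)
{
    pg_atomic_fetch_add_u32(relsize_counter_for(relid), 1);
}

/*
 * Step 2: in _mdnblocks(), skip FileSize() when the counter hasn't moved.
 * smgr_size_counter and smgr_cached_nblocks stand for new (hypothetical)
 * fields in SMgrRelationData; real_nblocks stands for the existing
 * lseek(SEEK_END)-based path.
 */
static BlockNumber
mdnblocks_cached(SMgrRelation reln, Oid relid,
                 BlockNumber (*real_nblocks) (SMgrRelation))
{
    uint32      counter = pg_atomic_read_u32(relsize_counter_for(relid));

    if (reln->smgr_cached_nblocks != InvalidBlockNumber &&
        reln->smgr_size_counter == counter)
        return reln->smgr_cached_nblocks;           /* fast path: no lseek */

    reln->smgr_cached_nblocks = real_nblocks(reln); /* FileSize()/lseek */
    reln->smgr_size_counter = counter;
    return reln->smgr_cached_nblocks;
}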

I don't have a patch for this (though I did once try it as a
throw-away hack and it seemed to work), but I just wanted to share the
idea and see if anyone sees a problem with the logic/interlocking, or
has a better idea for how to do this.  It occurred to me that I might
be missing something or this would have been done already...

--
Thomas Munro
http://www.enterprisedb.com


Re: Cache relation sizes?

Andres Freund
Hi,

On 2018-11-07 11:40:22 +1300, Thomas Munro wrote:
> PostgreSQL likes to probe the size of relations with lseek(SEEK_END) a
> lot.  For example, a fully prewarmed pgbench -S transaction consists
> of recvfrom(), lseek(SEEK_END), lseek(SEEK_END), sendto().  I think
> lseek() is probably about as cheap as a syscall can be so I doubt it
> really costs us much, but it's still a context switch and it stands
> out when tracing syscalls, especially now that all the lseek(SEEK_SET)
> calls are gone (commit c24dcd0cfd).

I'd really really like to see some benchmarking before embarking on a
more complex scheme.  I aesthetically dislike those lseeks, but ...


> If we had a different kind of buffer mapping system of the kind that
> Andres has described, there might be a place in shared memory that
> could track the size of the relation.  Even if we could do that, I
> wonder if it would still be better to do a kind of per-backend
> lock-free caching, like this:

Note that introducing that isn't primarily motivated by getting rid of
the size-determining lseeks, but by reducing the locking for extending
*and* truncating relations.


Greetings,

Andres Freund


Re: Cache relation sizes?

Edmund Horner
In reply to this post by Thomas Munro-3
On Wed, 7 Nov 2018 at 11:41, Thomas Munro <[hidden email]> wrote:

>
> Hello,
>
> PostgreSQL likes to probe the size of relations with lseek(SEEK_END) a
> lot.  For example, a fully prewarmed pgbench -S transaction consists
> of recvfrom(), lseek(SEEK_END), lseek(SEEK_END), sendto().  I think
> lseek() is probably about as cheap as a syscall can be so I doubt it
> really costs us much, but it's still a context switch and it stands
> out when tracing syscalls, especially now that all the lseek(SEEK_SET)
> calls are gone (commit c24dcd0cfd).
>
> If we had a different kind of buffer mapping system of the kind that
> Andres has described, there might be a place in shared memory that
> could track the size of the relation.  Even if we could do that, I
> wonder if it would still be better to do a kind of per-backend
> lock-free caching, like this:

On behalf of those looking after databases running over NFS (sigh!), I
think this is definitely worth exploring.  Looking at the behaviour of
my (9.4.9) server, there's an lseek(SEEK_END) for every relation
(table or index) used by a query, which is a lot of them for a heavily
partitioned database.  The lseek counts seem to be the same with
native partitions and 10.4.

As an incredibly rough benchmark, a "SELECT * FROM t ORDER BY pk LIMIT
0" on a table with 600 partitions, which builds a
MergeAppend/IndexScan plan, invokes lseek around 1200 times, and takes
600ms to return when repeated.  (It's much slower the first time,
because the backend has to open the files, and read index blocks.  I
found that increasing max_files_per_process above the number of
tables/indexes in the query made a huge difference, too!)  Testing
separately, 1200 lseeks on that NFS mount takes around 400ms.

I'm aware of other improvements since 9.4.9 that would likely improve
things (the pread/pwrite change; possibly using native partitioning
instead of inheritance), but I imagine reducing lseeks would too.

(Of course, an even better improvement is to not put your data
directory on an NFS mount (sigh).)


Re: Cache relation sizes?

David Rowley-3
In reply to this post by Andres Freund
On 7 November 2018 at 11:46, Andres Freund <[hidden email]> wrote:

> Hi,
>
> On 2018-11-07 11:40:22 +1300, Thomas Munro wrote:
>> PostgreSQL likes to probe the size of relations with lseek(SEEK_END) a
>> lot.  For example, a fully prewarmed pgbench -S transaction consists
>> of recvfrom(), lseek(SEEK_END), lseek(SEEK_END), sendto().  I think
>> lseek() is probably about as cheap as a syscall can be so I doubt it
>> really costs us much, but it's still a context switch and it stands
>> out when tracing syscalls, especially now that all the lseek(SEEK_SET)
>> calls are gone (commit c24dcd0cfd).
>
> I'd really really like to see some benchmarking before embarking on a
> more complex scheme.  I aesthetically dislike those lseeks, but ...

I agree. It would be good to see benchmarks on this first.  Those
could be as simple as some crude local backend cache that stuffs the
return value of RelationGetNumberOfBlocks() in estimate_rel_size()
into a hashtable and does not take into account the fact that it might
change.  That should be okay for some read-only benchmarking.
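
As a hedged illustration, such a throwaway cache might look roughly like
the following (the function, struct and hash table names are made up, and
it deliberately never invalidates, so it is only safe for read-only
benchmarking):

#include "storage/bufmgr.h"     /* RelationGetNumberOfBlocks() */
#include "utils/hsearch.h"
#include "utils/rel.h"

typedef struct NBlocksCacheEntry
{
    Oid         relid;          /* hash key */
    BlockNumber nblocks;
} NBlocksCacheEntry;

static HTAB *nblocks_cache = NULL;

static BlockNumber
cached_relation_nblocks(Relation rel)
{
    Oid         relid = RelationGetRelid(rel);
    NBlocksCacheEntry *entry;
    bool        found;

    if (nblocks_cache == NULL)
    {
        HASHCTL     ctl;

        MemSet(&ctl, 0, sizeof(ctl));
        ctl.keysize = sizeof(Oid);
        ctl.entrysize = sizeof(NBlocksCacheEntry);
        nblocks_cache = hash_create("nblocks cache", 256, &ctl,
                                    HASH_ELEM | HASH_BLOBS);
    }

    entry = hash_search(nblocks_cache, &relid, HASH_ENTER, &found);
    if (!found)
        entry->nblocks = RelationGetNumberOfBlocks(rel);    /* lseek path */
    return entry->nblocks;      /* possibly stale, by design */
}

For the benchmark run, estimate_rel_size() would call this in place of
RelationGetNumberOfBlocks().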

The partitioning case is probably a less interesting case to improve
if we get [1] as we'll no longer ask for the size of any pruned
partitions. Queries that don't prune any partitions are less likely to
notice the extra overhead of the lseek(SEEK_END) since they've
probably got more work to do elsewhere.

[1] https://commitfest.postgresql.org/20/1778/

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Cache relation sizes?

Thomas Munro-3
On Fri, Nov 9, 2018 at 4:42 PM David Rowley
<[hidden email]> wrote:

> On 7 November 2018 at 11:46, Andres Freund <[hidden email]> wrote:
> > On 2018-11-07 11:40:22 +1300, Thomas Munro wrote:
> >> PostgreSQL likes to probe the size of relations with lseek(SEEK_END) a
> >> lot.  For example, a fully prewarmed pgbench -S transaction consists
> >> of recvfrom(), lseek(SEEK_END), lseek(SEEK_END), sendto().  I think
> >> lseek() is probably about as cheap as a syscall can be so I doubt it
> >> really costs us much, but it's still a context switch and it stands
> >> out when tracing syscalls, especially now that all the lseek(SEEK_SET)
> >> calls are gone (commit c24dcd0cfd).
> >
> > I'd really really like to see some benchmarking before embarking on a
> > more complex scheme.  I aesthetically dislike those lseeks, but ...
>
> I agree. It would be good to see benchmarks on this first.  Those
> could be as simple as just some crude local backend cache that stuff
> the return value of RelationGetNumberOfBlocks in estimate_rel_size
> into a hashtable and does not take into account the fact that it might
> change. Should be okay to do some read-only benchmarking.

Oh, I just found the throw-away patch I wrote ages ago down the back
of the sofa.  Here's a rebase.  It somehow breaks initdb so you have
to initdb with unpatched.  Unfortunately I couldn't seem to measure
any speed-up on a random EDB test lab Linux box using pgbench -S (not
"prepared"), but that test doesn't involve many tables, and also it's
an older kernel without KPTI mitigations.  Attached in case anyone
else would like to try it.

--
Thomas Munro
http://www.enterprisedb.com

Attachment: 0001-Cache-file-sizes-to-avoid-lseek-calls.patch (8K)

Re: Cache relation sizes?

David Rowley-3
On Fri, 16 Nov 2018 at 12:06, Thomas Munro
<[hidden email]> wrote:
> Oh, I just found the throw-away patch I wrote ages ago down the back
> of the sofa.  Here's a rebase.  It somehow breaks initdb so you have
> to initdb with unpatched.  Unfortunately I couldn't seem to measure
> any speed-up on a random EDB test lab Linux box using pgbench -S (not
> "prepared"), but that test doesn't involve many tables, and also it's
> an older kernel without KPTI mitigations.  Attached in case anyone
> else would like to try it.

Over on [1] there's some talk about how, when using PREPAREd statements
on a table with many partitions where some of the parameters help
prune many of the partitions, there's a huge spike in query latency on
the 6th execution of the statement, and only on the 6th.  This will be
down to the fact that GetCachedPlan() builds a generic plan on the 6th
execution and most likely discards it because it appears too expensive
due to the lack of any partition pruning.  The custom plan's cost is
likely much, much cheaper, so the generic plan is planned but never
used.  This may be a good real-life candidate to test this patch with.
I know from benchmarks I performed several months ago that the lseek()
call to determine the relation size was a large part of the cost of
planning with many partitions.

[1] https://www.postgresql.org/message-id/flat/25C1C6B2E7BE044889E4FE8643A58BA963D89214%40G01JPEXMBKW03

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


RE: Cache relation sizes?

Jamison, Kirk
Hello,

I also find this proposed feature to be beneficial for performance, especially when we want to extend or truncate large tables.
As mentioned by David, currently there is a query latency spike when we make a generic plan for a partitioned table with many partitions.
I tried to apply Thomas' patch for that use case. Aside from measuring the planning and execution time,
I also monitored the lseek calls using simple strace, with and without the patch.

Below are the test results.
Setup 8192 table partitions.
(1) set plan_cache_mode = 'force_generic_plan';

  [Without Patch]
    prepare select_stmt(int) as select * from t where id = $1;
    explain (timing off, analyze) execute select_stmt(8192);
    […]
    Planning Time: 1678.680 ms
    Execution Time: 643.495 ms

    $ strace -p [pid] -e trace=lseek -c
    % time    seconds  usecs/call  calls   errors   syscall
    ---------------------------------------------------------------------------
    100.00    0.017247      1     16385          lseek

  [With Patch]
    […]
    Planning Time: 1596.566 ms
    Execution Time: 653.646 ms

    $ strace -p [pid] -e trace=lseek -c
    % time    seconds  usecs/call  calls   errors   syscall
    ---------------------------------------------------------------------------
    100.00    0.009196      1     8192           lseek

It was mentioned in the other thread [1] that creating a generic plan for the first time is very expensive.
Although this patch did not seem to reduce the planning time for force_generic_plan,
it seems that the number of lseek calls was reduced by half during the first execution of the generic plan.


(2) plan_cache_mode = 'auto'
    reset plan_cache_mode; -- resets to auto / custom plan

  [Without Patch]
    […]
    Planning Time: 768.669 ms
    Execution Time: 0.776 ms

    $ strace -p [pid] -e trace=lseek -c
    % time    seconds  usecs/call  calls   errors   syscall
    ---------------------------------------------------------------------------
    100.00    0.015117     2       8193             lseek

  [With Patch]
    […]
    Planning Time: 181.690 ms
    Execution Time: 0.615 ms

    $ strace -p [pid] -e trace=lseek -c
    […]
    NO (zero) lseek calls.

Without the patch, there were around 8193 lseek calls.
With the patch applied, there were no lseek calls when creating the custom plan.


(3) set plan_cache_mode = 'force_generic_plan';
    -- force it to generic plan again to use the cached plan (no re-planning)

  [Without Patch]
    […]
    Planning Time: 14.294 ms
    Execution Time: 601.141 ms

    $ strace -p [pid] -e trace=lseek -c
    % time    seconds  usecs/call  calls   errors   syscall
    ---------------------------------------------------------------------------
    0.00    0.000000        0        1              lseek

  [With Patch]
    […]
    Planning Time: 13.976 ms
    Execution Time: 570.518 ms
   
    $ strace -p [pid] -e trace=lseek -c
    […]
    NO (zero) lseek calls.

----
Assuming I did the test correctly, I am not sure why the patch did not affect the generic planning performance of a table with many partitions.
However, the number of lseek calls was greatly reduced with Thomas' patch.
I also did not get a considerable speed-up in average latency using pgbench -S (read-only, unprepared).
I am assuming this might be applicable to other use cases as well.
(I have only tested the patch so far; I haven't dug into its details yet.)

Would you like to submit this to the commitfest to get more reviews for possible idea/patch improvement?


[1] https://www.postgresql.org/message-id/flat/CAEepm%3D3SSw-Ty1DFcK%3D1rU-K6GSzYzfdD4d%2BZwapdN7dTa6%3DnQ%40mail.gmail.com


Regards,
Kirk Jamison

Re: Cache relation sizes?

Thomas Munro-3
On Thu, Dec 27, 2018 at 8:00 PM Jamison, Kirk <[hidden email]> wrote:
> I also find this proposed feature to be beneficial for performance, especially when we want to extend or truncate large tables.
> As mentioned by David, currently there is a query latency spike when we make generic plan for partitioned table with many partitions.
> I tried to apply Thomas' patch for that use case. Aside from measuring the planning and execution time,
> I also monitored the lseek calls using simple strace, with and without the patch.

Thanks for looking into this and testing!

> Setup 8192 table partitions.

> (1) set plan_cache_mode = 'force_generic_plan';
>     Planning Time: 1678.680 ms
>     Planning Time: 1596.566 ms

> (2) plan_cache_mode = 'auto'
>     Planning Time: 768.669 ms
>     Planning Time: 181.690 ms

> (3) set plan_cache_mode = 'force_generic_plan';
>     Planning Time: 14.294 ms
>     Planning Time: 13.976 ms

> If I did the test correctly, I am not sure though as to why the patch did not affect the generic planning performance of table with many partitions.
> However, the number of lseek calls was greatly reduced with Thomas’ patch.
> I also did not get considerable speed up in terms of latency average using pgbench –S (read-only, unprepared).
> I am assuming this might be applicable to other use cases as well.
> (I just tested the patch, but haven’t dug up the patch details yet).

The result for (2) is nice.  Even though you had to use 8192
partitions to see it.

> Would you like to submit this to the commitfest to get more reviews for possible idea/patch improvement?

For now I think this is still in the experiment/hack phase and I have a
ton of other stuff percolating in this commitfest already (and a week
of family holiday in the middle of January).  But if you have ideas
about the validity of the assumptions, the reason it breaks initdb, or
any other aspect of this approach (or alternatives), please don't let
me stop you, and of course please feel free to submit this, an
improved version or an alternative proposal yourself!  Unfortunately I
wouldn't have time to nurture it this time around, beyond some
drive-by comments.

Assorted armchair speculation:  I wonder how much this is affected by
the OS and KPTI, virtualisation technology, PCID support, etc.  Back
in the good old days, Linux's lseek(SEEK_END) stopped acquiring the
inode mutex when reading the size, at least in the generic
implementation used by most filesystems (I wonder if our workloads
were indirectly responsible for that optimisation?) so maybe it became
about as fast as a syscall could possibly be, but now the baseline for
how fast syscalls can be has moved and it also depends on your
hardware, and it also has external costs that depend on what memory
you touch in between syscalls.  Also, other operating systems might
still acquire a per-underlying-file/vnode/whatever lock (<checks
source code>... yes) and the contention for that might depend on what
else is happening, so that a single standalone test wouldn't capture
that but a super busy DB with a rapidly expanding and contracting
table that many other sessions are trying to observe with
lseek(SEEK_END) could slow down more.

--
Thomas Munro
http://www.enterprisedb.com


RE: Cache relation sizes?

Jamison, Kirk
Hi Thomas,

On Friday, December 28, 2018 6:43 AM Thomas Munro <[hidden email]> wrote:
> [...]if you have ideas about the validity of the assumptions, the reason it breaks initdb, or any other aspect of this approach (or alternatives), please don't let me stop you, and of course please feel free to submit this, an improved version or an alternative proposal [...]

Sure. Thanks. I'd like to try to work on the idea. I also took a look at the code, and I hope you don't mind if I ask for clarifications (explanation/advice/opinions) on the following, since my postgres experience is not substantial enough yet.

(1) I noticed that you used a "relsize_change_counter" stored in MdSharedData. Is my understanding below correct?

The intention is to cache the rel_size per-backend (lock-free), with an array of relsize_change_counters used to skip the lseek syscall when the counter has not changed.
In _mdnblocks(), if the counter did not change, the cached rel size (nblocks) is used, skipping the call to FileSize() (the lseek that gets and caches the rel size). That means that in current Postgres master, the lseek syscall (through FileSize()) is called whenever we want to get the rel size (nblocks).

On the other hand, the simplest method I thought that could also work is to only cache the file size (nblock) in shared memory, not in the backend process, since both nblock and relsize_change_counter are uint32 data type anyway. If relsize_change_counter can be changed without lock, then nblock can be changed without lock, is it right? In that case, nblock can be accessed directly in shared memory. In this case, is the relation size necessary to be cached in backend?

(2) Is the MdSharedData temporary or permanent in shared memory?
from the patch:
 typedef struct MdSharedData
 {
  /* XXX could have an array of these, and use rel OID % nelements? */
  pg_atomic_uint32 relsize_change_counter;
 } MdSharedData;
 
 static MdSharedData *MdShared;

What I intend to have is a permanent hashtable that keeps the file size (and eventually, in future development, table addresses) in shared memory for faster access by backend processes. The idea is to keep track of the unallocated blocks, based on how much the relation has been extended or truncated. Memory for this hashtable would be dynamically allocated.
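
A rough sketch of such a shared hash table, keyed by RelFileNode and
created at startup with ShmemInitHash(), might look like this (all names
and sizes are illustrative; a dynamically allocated variant would
presumably use dsa/dshash instead, as suggested in the reply below):

#include "storage/relfilenode.h"
#include "storage/shmem.h"
#include "utils/hsearch.h"

typedef struct RelSizeEntry
{
    RelFileNode rnode;          /* hash key */
    BlockNumber nblocks;        /* last known size in blocks */
} RelSizeEntry;

static HTAB *RelSizeHash;

void
RelSizeShmemInit(void)
{
    HASHCTL     info;

    MemSet(&info, 0, sizeof(info));
    info.keysize = sizeof(RelFileNode);
    info.entrysize = sizeof(RelSizeEntry);

    RelSizeHash = ShmemInitHash("relation size cache",
                                1024,       /* init_size: illustrative */
                                8192,       /* max_size: illustrative */
                                &info,
                                HASH_ELEM | HASH_BLOBS);
}

Note that the table's maximum size has to be fixed up front here, which
is exactly the pre-determined shared memory sizing that is raised as a
complexity concern further down the thread.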

Thanks,
Kirk Jamison

RE: Cache relation sizes?

Ideriha, Takeshi
>From: Jamison, Kirk [mailto:[hidden email]]

>On the other hand, the simplest method I thought that could also work is to only cache
>the file size (nblock) in shared memory, not in the backend process, since both nblock
>and relsize_change_counter are uint32 data type anyway. If relsize_change_counter
>can be changed without lock, then nblock can be changed without lock, is it right? In
>that case, nblock can be accessed directly in shared memory. In this case, is the
>relation size necessary to be cached in backend?

(Aside from which idea is better...)
If you want to put the relation size in shared memory, then I don't think a cache in the backend
is necessary, because every time the relation size is updated you would need to invalidate the
caches in the backends. Taking a shared lock on the cache for reads and an exclusive lock for
updates is simple without a backend cache.

>(2) Is the MdSharedData temporary or permanent in shared memory?
That data structure seems to be initialized at the time of InitPostgres, which means it's permanent,
because postgres-initialized shared memory doesn't have a chance to drop it as far as I know.
(If you want a temporary data structure, then another mechanism like dsm/dsa/dshash is a candidate.)

Regards,
Takeshi Ideriha

RE: Cache relation sizes?

Tsunakawa, Takayuki
In reply to this post by Jamison, Kirk
From: Jamison, Kirk [mailto:[hidden email]]
> On the other hand, the simplest method I thought that could also work is
> to only cache the file size (nblock) in shared memory, not in the backend
> process, since both nblock and relsize_change_counter are uint32 data type
> anyway. If relsize_change_counter can be changed without lock, then nblock
> can be changed without lock, is it right? In that case, nblock can be accessed
> directly in shared memory. In this case, is the relation size necessary
> to be cached in backend?

Although I haven't looked deeply at Thomas's patch yet, there's currently no place to store the size per relation in shared memory.  You have to wait for the global metacache that Ideriha-san is addressing.  Then, you can store the relation size in the RelationData structure in relcache.



> (2) Is the MdSharedData temporary or permanent in shared memory?
> from the patch:
>  typedef struct MdSharedData
>  {
>   /* XXX could have an array of these, and use rel OID % nelements?
> */
>   pg_atomic_uint32 relsize_change_counter;
>  } MdSharedData;
>
>  static MdSharedData *MdShared;

Permanent in shared memory.


Regards
Takayuki Tsunakawa


Re: Cache relation sizes?

Kyotaro HORIGUCHI-2
At Wed, 6 Feb 2019 06:29:15 +0000, "Tsunakawa, Takayuki" <[hidden email]> wrote in <0A3221C70F24FB45833433255569204D1FB955DF@G01JPEXMBYT05>

> From: Jamison, Kirk [mailto:[hidden email]]
> > On the other hand, the simplest method I thought that could also work is
> > to only cache the file size (nblock) in shared memory, not in the backend
> > process, since both nblock and relsize_change_counter are uint32 data type
> > anyway. If relsize_change_counter can be changed without lock, then nblock
> > can be changed without lock, is it right? In that case, nblock can be accessed
> > directly in shared memory. In this case, is the relation size necessary
> > to be cached in backend?
>
> Although I haven't looked deeply at Thomas's patch yet, there's currently no place to store the size per relation in shared memory.  You have to wait for the global metacache that Ideriha-san is addressing.  Then, you can store the relation size in the RelationData structure in relcache.

Just one counter in the patch *seems* to give a significant gain
compared to the complexity, given that lseek is costly or brings
latency, especially on workloads where files are scarcely
changed.  Though I didn't run it on a test bench.

> > (2) Is the MdSharedData temporary or permanent in shared memory?
> > from the patch:
> >  typedef struct MdSharedData
> >  {
> >   /* XXX could have an array of these, and use rel OID % nelements?
> > */
> >   pg_atomic_uint32 relsize_change_counter;
> >  } MdSharedData;
> >
> >  static MdSharedData *MdShared;
>
> Permanent in shared memory.

I'm not sure what duration 'permanent' refers to there, but it
disappears when the server stops.  Anyway, it doesn't need to be
permanent beyond a server restart.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center



RE: Cache relation sizes?

Tsunakawa, Takayuki
From: Kyotaro HORIGUCHI [mailto:[hidden email]]
> Just one counter in the patch *seems* to give significant gain
> comparing to the complexity, given that lseek is so complex or it
> brings latency, especially on workloads where file is scarcely
> changed. Though I didn't run it on a test bench.

I expect so, too.


> I'm not sure the duration of the 'permanent' there, but it
> disappears when server stops. Anyway it doesn't need to be
> permanent beyond a server restart.

Right, it exists while the server is running.


Regards
Takayuki Tsunakawa




RE: Cache relation sizes?

Jamison, Kirk
In reply to this post by Kyotaro HORIGUCHI-2
On February 6, 2019, 08:25AM +0000, Kyotaro HORIGUCHI wrote:

>At Wed, 6 Feb 2019 06:29:15 +0000, "Tsunakawa, Takayuki" <[hidden email]> wrote:
>> Although I haven't looked deeply at Thomas's patch yet, there's currently no place to store the size per relation in shared memory.  You have to wait for the global metacache that Ideriha-san is addressing.  Then, you can store the relation size in the RelationData structure in relcache.

>Just one counter in the patch *seems* to give significant gain comparing to the complexity, given that lseek is so complex or it brings latency, especially on workloads where file is scarcely changed. Though I didn't run it on a test bench.

> > > (2) Is the MdSharedData temporary or permanent in shared memory?
> > Permanent in shared memory.
> I'm not sure the duration of the 'permanent' there, but it disappears when server stops. Anyway it doesn't need to be permanent beyond a server restart.


Thank you for the insights.
In the previous email I did a simple test using syscall tracing;
the patch significantly reduced the number of lseek syscalls.
(But that simple test might not be enough to describe the performance benefit.)

Regarding Tsunakawa-san's comment,
in Thomas' patch, he made a place in shared memory that stores the
relsize_change_counter, so I am thinking of utilizing the same,
but for caching the relsize itself.

Perhaps I should explain further the intention for the design.

The first step is to cache the file size in shared memory. Similar to
the intention of the patch written by Thomas, the goal is to reduce the
number of lseek(SEEK_END) calls by caching the relation size without
using additional locks. The difference is that the rel size would be
cached in shared memory itself. I wonder if there are problems you can
see with this approach.

Eventually, the next step is to have a structure in shared memory
that caches file addresses along with their sizes (Andres' idea of
putting an open-relations table in shared memory). With a
structure that groups buffers into file relation units, we can get
the file size directly from shared memory, so the assumption is that it
would be faster whenever we truncate/extend our relations, because we
can track the offset of the size changes and use ranges for managing
truncation, etc.
That next step is a complex direction that needs serious discussion,
but I am wondering if we can proceed with the first step for now, if
the idea and direction are valid.

Regards,
Kirk Jamison



Re: Cache relation sizes?

Andres Freund
On 2019-02-06 08:50:45 +0000, Jamison, Kirk wrote:

> On February 6, 2019, 08:25AM +0000, Kyotaro HORIGUCHI wrote:
>
> >At Wed, 6 Feb 2019 06:29:15 +0000, "Tsunakawa, Takayuki" <[hidden email]> wrote:
> >> Although I haven't looked deeply at Thomas's patch yet, there's currently no place to store the size per relation in shared memory.  You have to wait for the global metacache that Ideriha-san is addressing.  Then, you can store the relation size in the RelationData structure in relcache.
>
> >Just one counter in the patch *seems* to give significant gain comparing to the complexity, given that lseek is so complex or it brings latency, especially on workloads where file is scarcely changed. Though I didn't run it on a test bench.
>
> > > > (2) Is the MdSharedData temporary or permanent in shared memory?
> > > Permanent in shared memory.
> > I'm not sure the duration of the 'permanent' there, but it disappears when server stops. Anyway it doesn't need to be permanent beyond a server restart.
>
>
> Thank you for the insights.
> I did a simple test in the previous email using simple syscall tracing,
> the patch significantly reduced the number of lseek syscall.
> (but that simple test might not be enough to describe the performance benefit)
>
> Regarding Tsunakawa-san's comment,
> in Thomas' patch, he made a place in shared memory that stores the
> relsize_change_counter, so I am thinking of utilizing the same,
> but for caching the relsize itself.
>
> Perhaps I should explain further the intention for the design.
>
> First step, to cache the file size in the shared memory. Similar to the
> intention or purpose of the patch written by Thomas, to reduce the
> number of lseek(SEEK_END) by caching the relation size without using
> additional locks. The difference is by caching a rel size on the shared
> memory itself. I wonder if there are problems that you can see with
> this approach.
>
> Eventually, the next step is to have a structure in shared memory
> that caches file addresses along with their sizes (Andres' idea of
> putting an open relations table in the shared memory). With a
> structure that group buffers into file relation units, we can get
> file size directly from shared memory, so the assumption is it would
> be faster whenever we truncate/extend our relations because we can
> track the offset of the changes in size and use range for managing
> the truncation, etc..
> The next step is a complex direction that needs serious discussion,
> but I am wondering if we can proceed with the first step for now if
> the idea and direction are valid.

Maybe I'm missing something here, but why is it actually necessary to
have the sizes in shared memory, if we're just talking about caching
sizes?  It's pretty darn cheap to determine the filesize of a file that
has been recently stat()/lseek()/ed, and introducing per-file shared
data adds *substantial* complexity, because the amount of shared memory
needs to be pre-determined.  The reason I want to put per-relation data
into shared memory is different, it's about putting the buffer mapping
into shared memory, and that, as a prerequisite, also need per-relation
data. And there's a limit on the number of relations that need to be
open (one per cached page at max), and space can be freed by evicting
pages.

Greetings,

Andres Freund


RE: Cache relation sizes?

Jamison, Kirk
On February 6, 2019, 8:57 AM +0000, Andres Freund wrote:

> Maybe I'm missing something here, but why is it actually necessary to
> have the sizes in shared memory, if we're just talking about caching
> sizes?  It's pretty darn cheap to determine the filesize of a file that
> has been recently stat()/lseek()/ed, and introducing per-file shared
> data adds *substantial* complexity, because the amount of shared memory
> needs to be pre-determined.  The reason I want to put per-relation data
> into shared memory is different, it's about putting the buffer mapping
> into shared memory, and that, as a prerequisite, also need per-relation
> data. And there's a limit of the number of relations that need to be
> open (one per cached page at max), and space can be freed by evicting
> pages.

Ahh.. You are right about the logic of putting it in shared memory.
As for Thomas' toy patch, multiple files share one counter in shmem.
Although it currently works, it is likely to cause unnecessary cache misses.
His eventual plan, though, is to use an array of N counters
and map relation OIDs onto them.
But as your point about complexity says, in shared memory we cannot
give each file its own area without allocating an amount of memory
that depends on the number of files.

Regarding the allocation of per-relation data in shared memory, I
thought at first it could be a separate component, so I asked about
the validity of the idea. But now I see the point raised.

Regards,
Kirk Jamison



Re: Cache relation sizes?

Kyotaro HORIGUCHI-2
At Wed, 13 Feb 2019 05:48:28 +0000, "Jamison, Kirk" <[hidden email]> wrote in <D09B13F772D2274BB348A310EE3027C643880D@g01jpexmbkw24>

> On February 6, 2019, 8:57 AM +0000, Andres Freund wrote:
> > Maybe I'm missing something here, but why is it actually necessary to
> > have the sizes in shared memory, if we're just talking about caching
> > sizes?  It's pretty darn cheap to determine the filesize of a file that
> > has been recently stat()/lseek()/ed, and introducing per-file shared
> > data adds *substantial* complexity, because the amount of shared memory
> > needs to be pre-determined.  The reason I want to put per-relation data
> > into shared memory is different, it's about putting the buffer mapping
> > into shared memory, and that, as a prerequisite, also need per-relation
> > data. And there's a limit of the number of relations that need to be
> > open (one per cached page at max), and space can be freed by evicting
> > pages.
>
> Ahh.. You are right about the logic of putting it in the shared memory.
> As for Thomas' toy patch, multiple files share one counter in shmem.
> Although it currently works, it might likely to miss.
> Though his eventual plan of the idea is to use an array of N counters
> and map relation OIDs onto them.
> But as your point about complexity says, in shared memory we cannot
> share the same area with multiple files, so that needs an area to
> allocate depending on the number of files.
>
> Regarding the allocation of per-relation data in shared memory, I
> thought it can be a separated component at first so I asked for
> validity of the idea. But now I consider the point raised.

I still believe that one shared memory element for every
non-mapped relation is not only too complex but also too much, as
Andres (and implicitly I) wrote. I feel that just one flag for
all works fine, but partitioned flags (that is, relations or files
that correspond to the same hash value share one flag) can reduce the
shared memory elements to a fixed small number.

Note: I'm still not sure how much lseek impacts performance.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center



Re: Cache relation sizes?

Kyotaro HORIGUCHI-2


On Thu, 14 Feb 2019 at 20:41, Kyotaro HORIGUCHI <[hidden email]> wrote:
> At Wed, 13 Feb 2019 05:48:28 +0000, "Jamison, Kirk" <[hidden email]> wrote in <D09B13F772D2274BB348A310EE3027C643880D@g01jpexmbkw24>
> > On February 6, 2019, 8:57 AM +0000, Andres Freund wrote:
> > > Maybe I'm missing something here, but why is it actually necessary to
> > > have the sizes in shared memory, if we're just talking about caching
> > > sizes?  It's pretty darn cheap to determine the filesize of a file that
> > > has been recently stat()/lseek()/ed, and introducing per-file shared
> > > data adds *substantial* complexity, because the amount of shared memory
> > > needs to be pre-determined.  The reason I want to put per-relation data
> > > into shared memory is different, it's about putting the buffer mapping
> > > into shared memory, and that, as a prerequisite, also need per-relation
> > > data. And there's a limit of the number of relations that need to be
> > > open (one per cached page at max), and space can be freed by evicting
> > > pages.
> >
> > Ahh.. You are right about the logic of putting it in the shared memory.
> > As for Thomas' toy patch, multiple files share one counter in shmem.
> > Although it currently works, it might likely to miss.
> > Though his eventual plan of the idea is to use an array of N counters
> > and map relation OIDs onto them.
> > But as your point about complexity says, in shared memory we cannot
> > share the same area with multiple files, so that needs an area to
> > allocate depending on the number of files.
> >
> > Regarding the allocation of per-relation data in shared memory, I
> > thought it can be a separated component at first so I asked for
> > validity of the idea. But now I consider the point raised.
>
> I still believe that one shared memory element for every
> non-mapped relation is not only too-complex but also too-much, as
> Andres (and implicitly I) wrote. I feel that just one flag for
> all works fine but partitioned flags (that is, relations or files
> corresponds to the same hash value share one flag) can reduce the
> shared memory elements to a fixed small number.
>
> Note: I'm still not sure how much lseek impacts performance.

Of course all the "flag"s above actually are "update counter"s :)

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: Cache relation sizes?

Thomas Munro-5
In reply to this post by Kyotaro HORIGUCHI-2
On Tue, Dec 31, 2019 at 4:43 PM Kyotaro HORIGUCHI
<[hidden email]> wrote:
> I still believe that one shared memory element for every
> non-mapped relation is not only too-complex but also too-much, as
> Andres (and implicitly I) wrote. I feel that just one flag for
> all works fine but partitioned flags (that is, relations or files
> corresponds to the same hash value share one flag) can reduce the
> shared memory elements to a fixed small number.

There is one potentially interesting case that doesn't require any
kind of shared cache invalidation AFAICS.  XLogReadBufferExtended()
calls smgrnblocks() for every buffer access, even if the buffer is
already in our buffer pool.  I tried to make yet another quick
experiment-grade patch to cache the size[1], this time for use in
recovery only.
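
A sketch of the shape of that recovery-only cache follows; the
smgr_cached_nblocks field and the exact placement are hypothetical here,
standing in for whatever the experimental branch [1] actually does.

/*
 * Illustrative sketch: cache the last smgrnblocks() result per fork in
 * the SMgrRelation and reuse it during recovery, on the assumption that
 * only the recovery process extends or truncates relations then.
 * smgrextend()/smgrtruncate() would also have to update or invalidate
 * the cached value.
 */
BlockNumber
smgrnblocks(SMgrRelation reln, ForkNumber forknum)
{
    BlockNumber result;

    if (InRecovery && reln->smgr_cached_nblocks[forknum] != InvalidBlockNumber)
        return reln->smgr_cached_nblocks[forknum];      /* no lseek */

    result = smgrsw[reln->smgr_which].smgr_nblocks(reln, forknum);

    if (InRecovery)
        reln->smgr_cached_nblocks[forknum] = result;

    return result;
}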

initdb -D pgdata
postgres -D pgdata -c checkpoint_timeout=60min

In another shell:
pgbench -i -s100 postgres
pgbench -M prepared -T60 postgres
killall -9 postgres
mv pgdata pgdata-save

Master branch:

cp -r pgdata-save pgdata
strace -c -f postgres -D pgdata
[... wait for "redo done", then hit ^C ...]
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
...
 18.61   22.492286          26    849396           lseek
  6.95    8.404369          30    277134           pwrite64
  6.63    8.009679          28    277892           pread64
  0.50    0.604037          39     15169           sync_file_range
...

Patched:

rm -fr pgdata
cp -r pgdata-save pgdata
strace -c -f ~/install/bin/postgres -D pgdata
[... wait for "redo done", then hit ^C ...]
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
...
 16.33    8.097631          29    277134           pwrite64
 15.56    7.715052          27    277892           pread64
  1.13    0.559648          39     14137           sync_file_range
...
  0.00    0.001505          25        59           lseek

> Note: I'm still not sure how much lseek impacts performance.

It doesn't seem great that we are effectively making system calls for
most WAL records we replay, but, sadly, in this case the patch didn't
really make any measurable difference when run without strace on this
Linux VM.  I suspect there is some workload and stack where it would
make a difference (CF the read(postmaster pipe) call for every WAL
record that was removed), but this is just something I noticed in
passing while working on something else, so I haven't investigated
much.

[1] https://github.com/postgres/postgres/compare/master...macdice:cache-nblocks
(just a test, unfinished, probably has bugs)



Re: Cache relation sizes?

Andres Freund
Hi,

On 2019-12-31 17:05:31 +1300, Thomas Munro wrote:
> There is one potentially interesting case that doesn't require any
> kind of shared cache invalidation AFAICS.  XLogReadBufferExtended()
> calls smgrnblocks() for every buffer access, even if the buffer is
> already in our buffer pool.

Yea, that's really quite bad*. The bit about doing so even when already
in the buffer pool is particularly absurd.  Needing to have special
handling in mdcreate() for XLogReadBufferExtended() always calling it is
also fairly ugly.


> It doesn't seem great that we are effectively making system calls for
> most WAL records we replay, but, sadly, in this case the patch didn't
> really make any measurable difference when run without strace on this
> Linux VM.  I suspect there is some workload and stack where it would
> make a difference (CF the read(postmaster pipe) call for every WAL
> record that was removed), but this is just something I noticed in
> passing while working on something else, so I haven't investigated
> much.

I wonder if that's just because your workload is too significantly
bottlenecked elsewhere:

> postgres -D pgdata -c checkpoint_timeout=60min

> In another shell:
> pgbench -i -s100 postgres
> pgbench -M prepared -T60 postgres
> killall -9 postgres
> mv pgdata pgdata-save

With scale 100 but the default shared_buffers, you'll frequently hit
the OS for reads/writes, which will require the same metadata in the
kernel, but then also memcpys between kernel and userspace.

A word of caution about strace's -c: In my experience the total time
measurements are very imprecise somehow. I think it might be that some
of the overhead of ptracing will be attributed to the syscalls or such,
which means frequent syscalls appear relatively more expensive than they
really are.

Greetings,

Andres Freund

* it insults my sense of aesthetics

