Cache relation sizes?


Re: Cache relation sizes?

Thomas Munro
On Tue, Feb 4, 2020 at 2:23 AM Andres Freund <[hidden email]> wrote:

> On 2019-12-31 17:05:31 +1300, Thomas Munro wrote:
> > There is one potentially interesting case that doesn't require any
> > kind of shared cache invalidation AFAICS.  XLogReadBufferExtended()
> > calls smgrnblocks() for every buffer access, even if the buffer is
> > already in our buffer pool.
>
> Yea, that's really quite bad*. The bit about doing so even when already
> in the buffer pool is particularly absurd.  Needing to have special
> handling in mdcreate() for XLogReadBufferExtended() always calling it is
> also fairly ugly.
Yeah.  It seems like there are several things to fix there.  So now
I'm wondering if we should start out by trying to cache the size in
the smgr layer for recovery only, like in the attached, and then later
try to extend the scheme to cover the shared case as discussed at the
beginning of the thread.
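
To make that concrete, the cached path amounts to something like this
(just a sketch, hand-waving the details; field names here are
illustrative, see the attached patch for the real thing):

    /*
     * Sketch only: remember the last smgrnblocks() result in the
     * SMgrRelation and reuse it during recovery, where no other
     * backend can extend or truncate the relation behind our back.
     * smgrextend() and smgrtruncate() must keep the cached value in
     * sync.
     */
    BlockNumber
    smgrnblocks(SMgrRelation reln, ForkNumber forknum)
    {
        BlockNumber result;

        if (InRecovery &&
            reln->smgr_cached_nblocks[forknum] != InvalidBlockNumber)
            return reln->smgr_cached_nblocks[forknum];

        result = smgrsw[reln->smgr_which].smgr_nblocks(reln, forknum);

        if (InRecovery)
            reln->smgr_cached_nblocks[forknum] = result;

        return result;
    }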

> A word of caution about strace's -c: In my experience the total time
> measurements are very imprecise somehow. I think it might be that some
> of the overhead of ptracing will be attributed to the syscalls or such,
> which means frequent syscalls appear relatively more expensive than they
> really are.

Yeah, those times are large, meaningless tracing overheads.  While
some systems might in fact be happy replaying a couple of million
lseeks per second, (1) I'm pretty sure some systems would not be happy
about that (see claims in this thread) and (2) it means you can't
practically use strace on recovery because it slows it down to a
crawl, which is annoying.

Attachment: 0001-Use-cached-smgrnblocks-results-in-recovery.patch (15K)

Re: Cache relation sizes?

Thomas Munro
On Thu, Feb 13, 2020 at 7:18 PM Thomas Munro <[hidden email]> wrote:
> ... (1) I'm pretty sure some systems would not be happy
> about that (see claims in this thread) ...

I poked a couple of people off-list and learned that, although the
Linux and FreeBSD systems I tried could do a million lseek(SEEK_END)
calls in 60-100ms, a couple of different Windows systems took between
1.1 and 3.4 seconds (worse times when running as non-administrator),
which seems to be clearly in the territory that can put a dent in your
recovery speeds on that OS.  I also learned that GetFileSizeEx() is
"only" about twice as fast, which is useful information for that other
thread about which syscall to use for this, but it's kind of
irrelevant to this thread, which is about how we can get rid of these
crazy syscalls altogether.
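
(For anyone who wants to measure this on their own system, a throwaway
program along these lines is enough; this is only a sketch, not the
exact code anyone ran:)

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/time.h>
    #include <unistd.h>

    int
    main(int argc, char **argv)
    {
        int fd;
        struct timeval start, end;
        long usec;

        if (argc < 2 || (fd = open(argv[1], O_RDONLY)) < 0)
        {
            fprintf(stderr, "usage: %s <file>\n", argv[0]);
            return 1;
        }

        gettimeofday(&start, NULL);
        for (int i = 0; i < 1000000; i++)
            lseek(fd, 0, SEEK_END);        /* the syscall in question */
        gettimeofday(&end, NULL);

        usec = (end.tv_sec - start.tv_sec) * 1000000L +
               (end.tv_usec - start.tv_usec);
        printf("1M lseek(SEEK_END) calls took %ld ms\n", usec / 1000);

        close(fd);
        return 0;
    }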



Re: Cache relation sizes?

Thomas Munro
On Fri, Feb 14, 2020 at 1:50 PM Thomas Munro <[hidden email]> wrote:

> On Thu, Feb 13, 2020 at 7:18 PM Thomas Munro <[hidden email]> wrote:
> > ... (1) I'm pretty sure some systems would not be happy
> > about that (see claims in this thread) ...
>
> I poked a couple of people off-list and learned that, although the
> Linux and FreeBSD systems I tried could do a million lseek(SEEK_END)
> calls in 60-100ms, a couple of different Windows systems took between
> 1.1 and 3.4 seconds (worse times when running as non-administrator),
> which seems to be clearly in the territory that can put a dent in your
> recovery speeds on that OS.  I also learned that GetFileSizeEx() is
> "only" about twice as fast, which is useful information for that other
> thread about which syscall to use for this, but it's kind of
> irrelevant to this thread, which is about how we can get rid of these
> crazy syscalls altogether.

I received a report off-list from someone who experimented with the
patch I shared earlier on this thread[1], using a crash recovery test
similar to one I showed on the WAL prefetching thread[2] (which he was
also testing, separately).

He observed that the lseek() rate in recovery was actually a
significant problem for his environment on unpatched master, showing
up as the top sampled function in perf, and that with the patch the
(identical) crash recovery completed in 41s instead of 65s, with a
sane-looking perf profile (= 58% speedup).  His test system was an AWS
i3.16xlarge running an unspecified version of Linux.

I think it's possible that all of the above reports can be explained
by variations in the speculative execution bug mitigations that are
enabled by default on different systems, but I haven't tried to
investigate that yet.

[1] https://github.com/postgres/postgres/compare/master...macdice:cache-nblocks
[2] https://www.postgresql.org/message-id/CA%2BhUKG%2BOcWr8nHqa3%3DZoPTGgdDcrgjSC4R2sT%2BjrUunBua3rpg%40mail.gmail.com



Re: Cache relation sizes?

Thomas Munro
On Sat, Apr 11, 2020 at 4:10 PM Thomas Munro <[hidden email]> wrote:
> I received a report off-list from someone who experimented with the
> patch I shared earlier on this thread[1], using a crash recovery test
> similar to one I showed on the WAL prefetching thread[2] (which he was
> also testing, separately).

Rebased.  I'll add this to the open commitfest.

Attachment: v2-0001-Cache-smgrnblocks-results-in-recovery.patch (15K)

Re: Cache relation sizes?

Thomas Munro
On Sat, Jun 20, 2020 at 10:32 AM Thomas Munro <[hidden email]> wrote:
> Rebased.  I'll add this to the open commitfest.

I traced the recovery process while running pgbench -M prepared -c16
-j16 -t10000 (= 160,000 transactions).  With the patch, the number of
lseeks went from 1,080,661 (6.75 per pgbench transaction) to just 85.

I went ahead and pushed this patch.

There's still the matter of crazy numbers of lseeks in regular
backends; looking at all processes while running the above test, I get
1,469,060 (9.18 per pgbench transaction) without -M prepared, and
193,722 with -M prepared (1.21 per pgbench transaction).  Fixing that
with this approach will require bullet-proof shared invalidation, but
I think it's doable, in later work.



Re: Cache relation sizes?

Thomas Munro
On Fri, Jul 31, 2020 at 2:36 PM Thomas Munro <[hidden email]> wrote:
> There's still the matter of crazy numbers of lseeks in regular
> backends; looking at all processes while running the above test, I get
> 1,469,060 (9.18 per pgbench transaction) without -M prepared, and
> 193,722 with -M prepared (1.21 per pgbench transaction).  Fixing that
> with this approach will require bullet-proof shared invalidation, but
> I think it's doable, in later work.

I couldn't help hacking on this a bit.  Perhaps instead of
bullet-proof general shared invalidation, we should have a way for
localised bits of code to state that they are ok with a "relaxed"
value.  Then they should explain the theory for why that is safe in
each case based on arguments about memory barrier pairs, but leave all
other callers making the system call so that we don't torpedo the
whole project by making it too hard.  For now, the main cases I'm
interested in are the ones that show up all the time as the dominant
system call in various workloads:

(1) Sequential scan relation-size probe.  This should be OK with a
relaxed value.  You can't miss the invalidation for a truncation,
because the read barrier in your lock acquisition on the relation
pairs with the write barrier in the exclusive lock release of any
session that truncated, and you can't miss any relation
extension that your snapshot can see, because the release of the
extension lock pairs with the lock involved in snapshot acquisition.

(2) Planner relation-size probe, which should be OK with a relaxed
value.  Similar arguments give you a fresh enough view, I think.

Or maybe there is a theory along these lines that already covers every
use of smgrnblocks(), so a separate mode isn't required... I don't
know!

The attached sketch-quality patch shows some of the required elements
working, though I ran out of steam trying to figure out how to thread
this thing through the right API layers, so for now it always asks for
a relaxed value in table_block_relation_size().
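
The core of the relaxed read looks something like this (type and
function names invented for this sketch; the real patch has to plumb
it through the smgr layer):

    /*
     * A per-relation size kept in shared memory.  Callers may read it
     * without taking any lock, relying on the barrier pairing argued
     * above: the caller's own lock acquisition pairs with the lock
     * release of any session that truncated or extended the relation.
     */
    typedef struct SMgrSharedRelation
    {
        RelFileNode      rnode;
        pg_atomic_uint64 nblocks[MAX_FORKNUM + 1];
    } SMgrSharedRelation;

    static inline BlockNumber
    smgrnblocks_relaxed(SMgrSharedRelation *sr, ForkNumber forknum)
    {
        /* No explicit barrier here; see the analysis above. */
        return (BlockNumber) pg_atomic_read_u64(&sr->nblocks[forknum]);
    }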

Attachment: 0001-WIP-Cache-smgrnblocks-in-more-cases.patch (18K)

Re: Cache relation sizes?

Konstantin Knizhnik


On 01.08.2020 00:56, Thomas Munro wrote:

> On Fri, Jul 31, 2020 at 2:36 PM Thomas Munro <[hidden email]> wrote:
>> There's still the matter of crazy numbers of lseeks in regular
>> backends; looking at all processes while running the above test, I get
>> 1,469,060 (9.18 per pgbench transaction) without -M prepared, and
>> 193,722 with -M prepared (1.21 per pgbench transaction).  Fixing that
>> with this approach will require bullet-proof shared invalidation, but
>> I think it's doable, in later work.
> I couldn't help hacking on this a bit.  Perhaps instead of
> bullet-proof general shared invalidation, we should have a way for
> localised bits of code to state that they are ok with a "relaxed"
> value.  Then they should explain the theory for why that is safe in
> each case based on arguments about memory barrier pairs, but leave all
> other callers making the system call so that we don't torpedo the
> whole project by making it too hard.  For now, the main cases I'm
> interested in are the ones that show up all the time as the dominant
> system call in various workloads:
>
> (1) Sequential scan relation-size probe.  This should be OK with a
> relaxed value.  You can't miss the invalidation for a truncation,
> because the read barrier in your lock acquisition on the relation
> pairs with the write barrier in the exclusive lock release of any
> session that truncated, and you can't miss any relation
> extension that your snapshot can see, because the release of the
> extension lock pairs with the lock involved in snapshot acquisition.
>
> (2) Planner relation-size probe, which should be OK with a relaxed
> value.  Similar arguments give you a fresh enough view, I think.
>
> Or maybe there is a theory along these lines that already covers every
> use of smgrnblocks(), so a separate mode isn't required... I don't
> know!
>
> The attached sketch-quality patch shows some of the required elements
> working, though I ran out of steam trying to figure out how to thread
> this thing through the right API layers, so for now it always asks for
> a relaxed value in table_block_relation_size().
So in this thread three solutions are proposed:
1. "bullet-proof general shared invalidation"
2. recovery-only solution avoiding shared memory and invalidation
3. "relaxed" shared memory cache with simplified invalidation

If solving such an important but still specific problem as caching
relation sizes requires so much effort, then maybe it is time to go
further in the direction of a shared catalog?
A shared relation cache could easily store relation sizes as well.
In addition it would solve a lot of other problems:
- noticeable overhead of local relcache warming
- large memory consumption with a large number of relations:
  O(max_connections * n_relations)
- a sophisticated invalidation protocol and related performance issues

Certainly, access to a shared cache requires extra synchronization,
but DDL operations are relatively rare.
So in most cases we would take only shared locks.  Maybe the locking
overhead would not be too large?
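
Just to illustrate the locking pattern I mean (a sketch only, all
names invented):

    /*
     * Sketch of a shared relcache entry protected by an LWLock.
     * Ordinary queries take the lock in shared mode, so readers do not
     * block each other; only (relatively rare) DDL that modifies the
     * entry needs exclusive mode.
     */
    typedef struct SharedRelCacheEntry
    {
        LWLock      lock;
        Oid         relid;
        BlockNumber nblocks;
        /* ... other cached catalog data ... */
    } SharedRelCacheEntry;

    static BlockNumber
    shared_relcache_nblocks(SharedRelCacheEntry *entry)
    {
        BlockNumber result;

        LWLockAcquire(&entry->lock, LW_SHARED);
        result = entry->nblocks;
        LWLockRelease(&entry->lock);

        return result;
    }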




Re: Cache relation sizes?

Pavel Stehule


On Mon, Aug 3, 2020 at 5:54 PM Konstantin Knizhnik <[hidden email]> wrote:


> Certainly, access to a shared cache requires extra synchronization,
> but DDL operations are relatively rare.
> So in most cases we would take only shared locks.  Maybe the locking
> overhead would not be too large?

Some applications use CREATE TEMP TABLE, DROP TEMP TABLE, or CREATE TABLE AS SELECT .. very frequently.

Regards

Pavel





Re: Cache relation sizes?

Thomas Munro
On Tue, Aug 4, 2020 at 3:54 AM Konstantin Knizhnik
<[hidden email]> wrote:
> So in this thread three solutions are proposed:
> 1. "bullet-proof general shared invalidation"
> 2. recovery-only solution avoiding shared memory and invalidation
> 3. "relaxed" shared memory cache with simplified invalidation

Hi Konstantin,

By the way, point 2 is now committed (c5315f4f).  As for 1 vs 3, I
wasn't proposing two different invalidation techniques: in both
approaches, I'm calling the cached values "relaxed", meaning that
their freshness is controlled by memory barriers elsewhere that the
caller has to worry about.  I'm just suggesting for idea 3 that it
might be a good idea to use relaxed values only in a couple of hot
code paths where we do the analysis required to convince ourselves
that memory barriers are already in the right places to make it safe.
By "bullet-proof" I meant that we could in theory convince ourselves
that *all* users of smgrnblocks() can safely use relaxed values, but
that's hard.

That said, the sketch patch I posted certainly needs more work, and
maybe someone has a better idea on how to do it.

> If solving such an important but still specific problem as caching
> relation sizes requires so much effort, then maybe it is time to go
> further in the direction of a shared catalog?

I wouldn't say it requires too much effort, at least not for the
conservative approach (3).  But I also agree with what you're saying, in the long
term:

> A shared relation cache could easily store relation sizes as well.
> In addition it would solve a lot of other problems:
> - noticeable overhead of local relcache warming
> - large memory consumption with a large number of relations:
>   O(max_connections * n_relations)
> - a sophisticated invalidation protocol and related performance issues
> Certainly, access to a shared cache requires extra synchronization,
> but DDL operations are relatively rare.
> So in most cases we would take only shared locks.  Maybe the locking
> overhead would not be too large?

Yeah, I would be very happy if we got high-performance shared
sys/rel/plan/... caches in the future, and separately, having the
relation size available in shmem is something that has come up in
discussions about other topics too (tree-based buffer mapping,
multi-relation data files, ...).  I agree with you that our cache
memory usage is a big problem, and it will be great to fix that one
day.  I don't think that should stop us from making small improvements
to the existing design in the meantime, though.  "The perfect is the
enemy of the good."  Look at all this trivial stuff:

https://wiki.postgresql.org/wiki/Syscall_Reduction

I don't have high-quality data to report yet, but from simple tests
I'm seeing orders of magnitude fewer syscalls per pgbench transaction
in recovery when comparing 11 to 14devel, due to various changes.
Admittedly, the size probes in regular backends aren't quite as bad as
the recovery ones were, because they're already limited by the rate at
which you can throw request/response cycles at the DB, but there are
still some cases, like planning with high partition counts or on slow
filesystems, that can benefit measurably from caching.

