WIP: WAL prefetch (another approach)

WIP: WAL prefetch (another approach)

Thomas Munro-5
Hello hackers,

Based on ideas from earlier discussions[1][2], here is an experimental
WIP patch to improve recovery speed by prefetching blocks.  If you set
wal_prefetch_distance to a positive distance, measured in bytes, then
the recovery loop will look ahead in the WAL and call PrefetchBuffer()
for referenced blocks.  This can speed things up with cold caches
(example: after a server reboot) and working sets that don't fit in
memory (example: large scale pgbench).

Results vary, but in contrived larger-than-memory pgbench crash
recovery experiments on a Linux development system, I've seen recovery
running as much as 20x faster with full_page_writes=off and
wal_prefetch_distance=8kB.  FPWs reduce the potential speed-up as
discussed in the other thread.

Some notes:

* PrefetchBuffer() is only beneficial if your kernel and filesystem
have a working POSIX_FADV_WILLNEED implementation.  That includes
Linux ext4 and xfs, but excludes macOS and Windows.  In future we
might use asynchronous I/O to bring data all the way into our own
buffer pool; hopefully the PrefetchBuffer() interface wouldn't change
much and this code would automatically benefit.

* For now, for proof-of-concept purposes, the patch uses a second
XLogReader to read ahead in the WAL.  I am thinking about how to write
a two-cursor XLogReader that reads and decodes each record just once.

* It can handle simple crash recovery and streaming replication
scenarios, but doesn't yet deal with complications like timeline
changes (the way to do that might depend on how the previous point
works out).  The integration with WAL receiver probably needs some
work, I've been testing pretty narrow cases so far, and the way I
hijacked read_local_xlog_page() probably isn't right.

* On filesystems with block size <= BLCKSZ, it's a waste of a syscall
to try to prefetch a block that we have a FPW for, but otherwise it
can avoid a later stall due to a read-before-write at pwrite() time,
so I added a second GUC wal_prefetch_fpw to make that optional.

Earlier work, and how this patch compares:

* Sean Chittenden wrote pg_prefaulter[1], an external process that
uses worker threads to pread() referenced pages some time before
recovery does, and demonstrated very good speed-up, triggering a lot
of discussion of this topic.  My WIP patch differs mainly in that it's
integrated with PostgreSQL, and it uses POSIX_FADV_WILLNEED rather
than synchronous I/O from worker threads/processes.  Sean wouldn't
have liked my patch much because he was working on ZFS and that
doesn't support POSIX_FADV_WILLNEED, but with a small patch to ZFS it
works pretty well, and I'll try to get that upstreamed.

* Konstantin Knizhnik proposed a dedicated PostgreSQL process that
would do approximately the same thing[2].  My WIP patch differs mainly
in that it does the prefetching work in the recovery loop itself, and
uses PrefetchBuffer() rather than FilePrefetch() directly.  This
avoids a bunch of communication and complications, but admittedly does
introduce new system calls into a hot loop (for now); perhaps I could
pay for that by removing more lseek(SEEK_END) noise.  It also deals
with various edge cases relating to created, dropped and truncated
relations a bit differently.  It also tries to avoid generating
sequential WILLNEED advice, based on experimental evidence[3] that
that affects Linux's readahead heuristics negatively, though I don't
understand the exact mechanism there.

Here are some cases where I expect this patch to perform badly:

* Your WAL has multiple intermixed sequential access streams (ie
sequential access to N different relations), so that sequential access
is not detected, and then all the WILLNEED advice prevents Linux's
automagic readahead from working well.  Perhaps that could be
mitigated by having a system that can detect up to N concurrent
streams, where N is more than the current 1, or by flagging buffers in
the WAL as part of a sequential stream.  I haven't looked into this.

* The data is always found in our buffer pool, so PrefetchBuffer() is
doing nothing useful and you might as well not be calling it or doing
the extra work that leads up to that.  Perhaps that could be mitigated
with an adaptive approach: too many PrefetchBuffer() hits and we stop
trying to prefetch, too many XLogReadBufferForRedo() misses and we
start trying to prefetch.  That might work nicely for systems that
start out with cold caches but eventually warm up.  I haven't looked
into this.

* The data is actually always in the kernel's cache, so the advice is
a waste of a syscall.  That might imply that you should probably be
running with a larger shared_buffers (?).  It's technically possible
to ask the operating system if a region is cached on many systems,
which could in theory be used for some kind of adaptive heuristic that
would disable pointless prefetching, but I'm not proposing that.
Ultimately this problem would be avoided by moving to true async I/O,
where we'd be initiating the read all the way into our buffers (ie it
replaces the later pread() so it's a wash, at worst).

* The prefetch distance is set too low so that pread() waits are not
avoided, or your storage subsystem can't actually perform enough
concurrent I/O to get ahead of the random access pattern you're
generating, so no distance would be far enough ahead.  To help with
the former case, perhaps we could invent something smarter than a
user-supplied distance (something like "N cold block references
ahead", possibly using effective_io_concurrency, rather than "N bytes
ahead").

[1] https://www.pgcon.org/2018/schedule/track/Case%20Studies/1204.en.html
[2] https://www.postgresql.org/message-id/flat/49df9cd2-7086-02d0-3f8d-535a32d44c82%40postgrespro.ru
[3] https://github.com/macdice/some-io-tests

Attachment: wal-prefetch-another-approach-v1.tgz (14K)

Re: WIP: WAL prefetch (another approach)

Tomas Vondra-4
On Thu, Jan 02, 2020 at 02:39:04AM +1300, Thomas Munro wrote:

>Hello hackers,
>
>Based on ideas from earlier discussions[1][2], here is an experimental
>WIP patch to improve recovery speed by prefetching blocks.  If you set
>wal_prefetch_distance to a positive distance, measured in bytes, then
>the recovery loop will look ahead in the WAL and call PrefetchBuffer()
>for referenced blocks.  This can speed things up with cold caches
>(example: after a server reboot) and working sets that don't fit in
>memory (example: large scale pgbench).
>

Thanks, I only did a very quick review so far, but the patch looks fine.


>Results vary, but in contrived larger-than-memory pgbench crash
>recovery experiments on a Linux development system, I've seen recovery
>running as much as 20x faster with full_page_writes=off and
>wal_prefetch_distance=8kB.  FPWs reduce the potential speed-up as
>discussed in the other thread.
>

OK, so how did you test that? I'll do some tests with a traditional
streaming replication setup, multiple sessions on the primary (and maybe
a weaker storage system on the replica). I suppose that's another setup
that should benefit from this.

> ...
>
>Earlier work, and how this patch compares:
>
>* Sean Chittenden wrote pg_prefaulter[1], an external process that
>uses worker threads to pread() referenced pages some time before
>recovery does, and demonstrated very good speed-up, triggering a lot
>of discussion of this topic.  My WIP patch differs mainly in that it's
>integrated with PostgreSQL, and it uses POSIX_FADV_WILLNEED rather
>than synchronous I/O from worker threads/processes.  Sean wouldn't
>have liked my patch much because he was working on ZFS and that
>doesn't support POSIX_FADV_WILLNEED, but with a small patch to ZFS it
>works pretty well, and I'll try to get that upstreamed.
>

How long would it take to get POSIX_FADV_WILLNEED onto ZFS systems, if
everything goes fine? I'm not sure what the usual life-cycle is, but I
assume it may take a couple of years to reach most production systems.

What other common filesystems are missing support for this?

Presumably we could do what Sean's extension does, i.e. use a couple of
bgworkers, each doing simple pread() calls. Of course, that's
unnecessarily complicated on systems that have FADV_WILLNEED.

> ...
>
>Here are some cases where I expect this patch to perform badly:
>
>* Your WAL has multiple intermixed sequential access streams (ie
>sequential access to N different relations), so that sequential access
>is not detected, and then all the WILLNEED advice prevents Linux's
>automagic readahead from working well.  Perhaps that could be
>mitigated by having a system that can detect up to N concurrent
>streams, where N is more than the current 1, or by flagging buffers in
>the WAL as part of a sequential stream.  I haven't looked into this.
>

Hmmm, wouldn't it be enough to prefetch blocks in larger batches (not
one by one) and do some sort of sorting? That should allow readahead
to kick in.

>* The data is always found in our buffer pool, so PrefetchBuffer() is
>doing nothing useful and you might as well not be calling it or doing
>the extra work that leads up to that.  Perhaps that could be mitigated
>with an adaptive approach: too many PrefetchBuffer() hits and we stop
>trying to prefetch, too many XLogReadBufferForRedo() misses and we
>start trying to prefetch.  That might work nicely for systems that
>start out with cold caches but eventually warm up.  I haven't looked
>into this.
>

I think the question is what's the cost of doing such unnecessary
prefetch. Presumably it's fairly cheap, especially compared to the
opposite case (not prefetching a block not in shared buffers). I wonder
how expensive would the adaptive logic be on cases that never need a
prefetch (i.e. datasets smaller than shared_buffers).

>* The data is actually always in the kernel's cache, so the advice is
>a waste of a syscall.  That might imply that you should probably be
>running with a larger shared_buffers (?).  It's technically possible
>to ask the operating system if a region is cached on many systems,
>which could in theory be used for some kind of adaptive heuristic that
>would disable pointless prefetching, but I'm not proposing that.
>Ultimately this problem would be avoided by moving to true async I/O,
>where we'd be initiating the read all the way into our buffers (ie it
>replaces the later pread() so it's a wash, at worst).
>

Makes sense.

>* The prefetch distance is set too low so that pread() waits are not
>avoided, or your storage subsystem can't actually perform enough
>concurrent I/O to get ahead of the random access pattern you're
>generating, so no distance would be far enough ahead.  To help with
>the former case, perhaps we could invent something smarter than a
>user-supplied distance (something like "N cold block references
>ahead", possibly using effective_io_concurrency, rather than "N bytes
>ahead").
>

In general, I find it quite non-intuitive to configure prefetching by
specifying WAL distance. I mean, how would you know what's a good value?
If you know the storage hardware, you probably know the optimal queue
depth, i.e. the number of requests needed to get the best throughput.

But how do you deduce the WAL distance from that? I don't know. Plus
right after the checkpoint the WAL contains FPW, reducing the number of
blocks in a given amount of WAL (compared to right before a checkpoint).
So I expect users might pick an unnecessarily high WAL distance. OTOH with
FPW we don't quite need aggressive prefetching, right?

Could we instead specify the number of blocks to prefetch? We'd probably
need to track additional details needed to determine number of blocks to
prefetch (essentially LSN for all prefetch requests).

Another thing to consider might be skipping recently prefetched blocks.
Consider you have a loop that does DML, where each statement creates a
separate WAL record, but it can easily touch the same block over and
over (say inserting to the same page). That means the prefetches are
not really needed, but I'm not sure how expensive it really is.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: WIP: WAL prefetch (another approach)

Thomas Munro-5
On Fri, Jan 3, 2020 at 7:10 AM Tomas Vondra
<[hidden email]> wrote:

> On Thu, Jan 02, 2020 at 02:39:04AM +1300, Thomas Munro wrote:
> >Based on ideas from earlier discussions[1][2], here is an experimental
> >WIP patch to improve recovery speed by prefetching blocks.  If you set
> >wal_prefetch_distance to a positive distance, measured in bytes, then
> >the recovery loop will look ahead in the WAL and call PrefetchBuffer()
> >for referenced blocks.  This can speed things up with cold caches
> >(example: after a server reboot) and working sets that don't fit in
> >memory (example: large scale pgbench).
> >
>
> Thanks, I only did a very quick review so far, but the patch looks fine.

Thanks for looking!

> >Results vary, but in contrived larger-than-memory pgbench crash
> >recovery experiments on a Linux development system, I've seen recovery
> >running as much as 20x faster with full_page_writes=off and
> >wal_prefetch_distance=8kB.  FPWs reduce the potential speed-up as
> >discussed in the other thread.
>
> OK, so how did you test that? I'll do some tests with a traditional
> streaming replication setup, multiple sessions on the primary (and maybe
> a weaker storage system on the replica). I suppose that's another setup
> that should benefit from this.

Using a 4GB RAM 16 thread virtual machine running Linux debian10
4.19.0-6-amd64 with an ext4 filesystem on NVMe storage:

postgres -D pgdata \
  -c full_page_writes=off \
  -c checkpoint_timeout=60min \
  -c max_wal_size=10GB \
  -c synchronous_commit=off

# in another shell
pgbench -i -s300 postgres
psql postgres -c checkpoint
pgbench -T60 -Mprepared -c4 -j4 postgres
killall -9 postgres

# save the crashed pgdata dir for repeated experiments
mv pgdata pgdata-save

# repeat this with values like wal_prefetch_distance=-1, 1kB, 8kB, 64kB, ...
rm -fr pgdata
cp -r pgdata-save pgdata
postgres -D pgdata -c wal_prefetch_distance=-1

What I see on my desktop machine is around 10x speed-up:

wal_prefetch_distance=-1 -> 62s (same number for unpatched)
wal_prefetch_distance=8kB -> 6s
wal_prefetch_distance=64kB -> 5s

On another dev machine I managed to get a 20x speedup, using a much
longer test.  It's probably more interesting to try out some more
realistic workloads rather than this cache-destroying uniform random
stuff, though.  It might be interesting to test on systems with high
random read latency, but high concurrency; I can think of a bunch of
network storage environments where that's the case, but I haven't
looked into them, beyond some toy testing with (non-Linux) NFS over a
slow network (results were promising).

> >Earlier work, and how this patch compares:
> >
> >* Sean Chittenden wrote pg_prefaulter[1], an external process that
> >uses worker threads to pread() referenced pages some time before
> >recovery does, and demonstrated very good speed-up, triggering a lot
> >of discussion of this topic.  My WIP patch differs mainly in that it's
> >integrated with PostgreSQL, and it uses POSIX_FADV_WILLNEED rather
> >than synchronous I/O from worker threads/processes.  Sean wouldn't
> >have liked my patch much because he was working on ZFS and that
> >doesn't support POSIX_FADV_WILLNEED, but with a small patch to ZFS it
> >works pretty well, and I'll try to get that upstreamed.
> >
>
> How long would it take to get POSIX_FADV_WILLNEED onto ZFS systems, if
> everything goes fine? I'm not sure what the usual life-cycle is, but I
> assume it may take a couple of years to reach most production systems.

Assuming they like it enough to commit it (and initial informal
feedback on the general concept has been positive -- it's not messing
with their code at all, it's just boilerplate code to connect the
relevant Linux and FreeBSD VFS callbacks), it could indeed be quite a
while before it appears in conservative package repos, but I don't
know, it depends where you get your OpenZFS/ZoL module from.

> What other common filesystems are missing support for this?

Using our build farm as a way to know which operating systems we care
about as a community, in no particular order:

* I don't know for exotic or network filesystems on Linux
* AIX 7.2's manual says "Valid option, but this value does not perform
any action" for every kind of advice except POSIX_FADV_NOWRITEBEHIND
(huh, nonstandard advice).
* Solaris's posix_fadvise() was a dummy libc function, as of 10 years
ago when they closed the source; who knows after that.
* FreeBSD's UFS and NFS support other advice through a default handler
but unfortunately ignore WILLNEED (I have patches for those too, not
good enough to send anywhere yet).
* OpenBSD has no such syscall
* NetBSD has the syscall, and I can see that it's hooked up to
readahead code, so that's probably the only unqualified yes in this
list
* Windows has no equivalent syscall; the closest thing might be to use
ReadFileEx() to initiate an async read into a dummy buffer; maybe you
can use a zero event so it doesn't even try to tell you when the I/O
completes, if you don't care?
* macOS has no such syscall, but you could in theory do an aio_read()
into a dummy buffer.  On the other hand I don't think that interface
is a general solution for POSIX systems, because on at least Linux and
Solaris, aio_read() is emulated by libc with a whole bunch of threads
and we are allergic to those things (and even if we weren't, we
wouldn't want a whole threadpool in every PostgreSQL process, so you'd
need to hand off to a worker process, and then why bother?).
* HPUX, I don't know

We could test any of those with a simple test I wrote[1], but I'm not
likely to test any non-open-source OS myself due to lack of access.
Amazingly, HPUX's posix_fadvise() doesn't appear to conform to POSIX:
it sets errno and returns -1, while POSIX says that it should return
an error number.  Checking our source tree, I see that in
pg_flush_data(), we also screwed that up and expect errno to be set,
though we got it right in FilePrefetch().

In any case, Linux must be at the very least 90% of PostgreSQL
installations.  Incidentally, sync_file_range() without wait is a sort
of opposite of WILLNEED (it means something like
"POSIX_FADV_WILLSYNC"), and no one seems terribly upset that we really
only have that on Linux (the emulations are pretty poor AFAICS).

> Presumably we could do what Sean's extension does, i.e. use a couple of
> bgworkers, each doing simple pread() calls. Of course, that's
> unnecessarily complicated on systems that have FADV_WILLNEED.

That is a good idea, and I agree.  I have a patch set that does
exactly that.  It's nearly independent of the WAL prefetch work; it
just changes how PrefetchBuffer() is implemented, affecting bitmap
index scans, vacuum and any future user of PrefetchBuffer.  If you
apply these patches too then WAL prefetch will use it (just set
max_background_readers = 4 or whatever):

https://github.com/postgres/postgres/compare/master...macdice:bgreader

That's simplified from an abandoned patch I had lying around because I
was experimenting with prefetching all the way into shared buffers
this way.  The simplified version just does pread() into a dummy
buffer, for the side effect of warming the kernel's cache, pretty much
like pg_prefaulter.  There are some tricky questions around whether
it's better to wait or not when the request queue is full; the way I
have that is far too naive, and that question is probably related to
your point about being cleverer about how many prefetch blocks you
should try to have in flight.  A future version of PrefetchBuffer()
might lock the buffer then tell the worker (or some kernel async I/O
facility) to write the data into the buffer.  If I understand
correctly, to make that work we need Robert's IO lock/condition
variable transplant[2], and Andres's scheme for a suitable
interlocking protocol, and no doubt some bulletproof cleanup
machinery.  I'm not working on any of that myself right now because I
don't want to step on Andres's toes.

> >Here are some cases where I expect this patch to perform badly:
> >
> >* Your WAL has multiple intermixed sequential access streams (ie
> >sequential access to N different relations), so that sequential access
> >is not detected, and then all the WILLNEED advice prevents Linux's
> >automagic readahead from working well.  Perhaps that could be
> >mitigated by having a system that can detect up to N concurrent
> >streams, where N is more than the current 1, or by flagging buffers in
> >the WAL as part of a sequential stream.  I haven't looked into this.
> >
>
> Hmmm, wouldn't it be enough to prefetch blocks in larger batches (not
> one by one), and doing some sort of sorting? That should allow readahead
> to kick in.

Yeah, but I don't want to do too much work in the startup process, or
get too opinionated about how the underlying I/O stack works.  I think
we'd need to do things like that in a direct I/O future, but we'd
probably offload it (?).  I figured the best approach for early work
in this space would be to just get out of the way if we detect
sequential access.

> >* The data is always found in our buffer pool, so PrefetchBuffer() is
> >doing nothing useful and you might as well not be calling it or doing
> >the extra work that leads up to that.  Perhaps that could be mitigated
> >with an adaptive approach: too many PrefetchBuffer() hits and we stop
> >trying to prefetch, too many XLogReadBufferForRedo() misses and we
> >start trying to prefetch.  That might work nicely for systems that
> >start out with cold caches but eventually warm up.  I haven't looked
> >into this.
> >
>
> I think the question is what's the cost of doing such unnecessary
> prefetch. Presumably it's fairly cheap, especially compared to the
> opposite case (not prefetching a block not in shared buffers). I wonder
> how expensive would the adaptive logic be on cases that never need a
> prefetch (i.e. datasets smaller than shared_buffers).

Hmm.  It's basically a buffer map probe.  I think the adaptive logic
would probably be some kind of periodically resetting counter scheme,
but you're probably right to suspect that it might not even be worth
bothering with, especially if a single XLogReader can be made to do
the readahead with no real extra cost.  Perhaps we should work on
making the cost of all prefetching overheads as low as possible first,
before trying to figure out whether it's worth building a system for
avoiding it.

> >* The prefetch distance is set too low so that pread() waits are not
> >avoided, or your storage subsystem can't actually perform enough
> >concurrent I/O to get ahead of the random access pattern you're
> >generating, so no distance would be far enough ahead.  To help with
> >the former case, perhaps we could invent something smarter than a
> >user-supplied distance (something like "N cold block references
> >ahead", possibly using effective_io_concurrency, rather than "N bytes
> >ahead").
> >
>
> In general, I find it quite non-intuitive to configure prefetching by
> specifying WAL distance. I mean, how would you know what's a good value?
> If you know the storage hardware, you probably know the optimal queue
> depth, i.e. the number of requests needed to get the best throughput.

FWIW, on pgbench tests on flash storage I've found that 1KB only helps
a bit, 8KB is great, and more than that doesn't get any better.  Of
course, this is meaningless in general; a zipfian workload might need
to look a lot further ahead than a uniform one to find anything worth
prefetching, and that's exactly what you're complaining about, and I
agree.

> But how do you deduce the WAL distance from that? I don't know. Plus
> right after the checkpoint the WAL contains FPW, reducing the number of
> blocks in a given amount of WAL (compared to right before a checkpoint).
> So I expect users might pick an unnecessarily high WAL distance. OTOH with
> FPW we don't quite need aggressive prefetching, right?

Yeah, so you need to be touching blocks more than once between
checkpoints, if you want to see speed-up on a system with blocks <=
BLCKSZ and FPW on.  If checkpoints are far enough apart you'll
eventually run out of FPWs and start replaying non-FPW stuff.  Or you
could be on a filesystem with larger blocks than PostgreSQL.

> Could we instead specify the number of blocks to prefetch? We'd probably
> need to track additional details needed to determine number of blocks to
> prefetch (essentially LSN for all prefetch requests).

Yeah, I think you're right, we should probably try to make a little
queue to track LSNs and count prefetch requests in and out.  I think
you'd also want PrefetchBuffer() to tell you if the block was already
in the buffer pool, so that you don't count blocks that it decided not
to prefetch.  I guess PrefetchBuffer() needs to return an enum (I
already had it returning a bool for another purpose relating to an
edge case in crash recovery, when relations have been dropped by a
later WAL record).  I will think about that.

> Another thing to consider might be skipping recently prefetched blocks.
> Consider you have a loop that does DML, where each statement creates a
> separate WAL record, but it can easily touch the same block over and
> over (say inserting to the same page). That means the prefetches are
> not really needed, but I'm not sure how expensive it really is.

There are two levels of defence against repeatedly prefetching the
same block: PrefetchBuffer() checks for blocks that are already in our
cache, and before that, PrefetchState remembers the last block so that
we can avoid fetching that block (or the following block).

[1] https://github.com/macdice/some-io-tests
[2] https://www.postgresql.org/message-id/CA%2BTgmoaj2aPti0yho7FeEf2qt-JgQPRWb0gci_o1Hfr%3DC56Xng%40mail.gmail.com



Re: WIP: WAL prefetch (another approach)

Thomas Munro-5
On Fri, Jan 3, 2020 at 5:57 PM Thomas Munro <[hidden email]> wrote:
> On Fri, Jan 3, 2020 at 7:10 AM Tomas Vondra
> <[hidden email]> wrote:
> > Could we instead specify the number of blocks to prefetch? We'd probably
> > need to track additional details needed to determine number of blocks to
> > prefetch (essentially LSN for all prefetch requests).

Here is a new WIP version of the patch set that does that.  Changes:

1.  It now uses effective_io_concurrency to control how many
concurrent prefetches to allow.  It's possible that we should have a
different GUC to control "maintenance" users of concurrency I/O as
discussed elsewhere[1], but I'm staying out of that for now; if we
agree to do that for VACUUM etc, we can change it easily here.  Note
that the value is percolated through the ComputeIoConcurrency()
function which I think we should discuss, but again that's off topic,
I just want to use the standard infrastructure here.

2.  You can now change the relevant GUCs (wal_prefetch_distance,
wal_prefetch_fpw, effective_io_concurrency) at runtime and reload for
them to take immediate effect.  For example, you can enable the
feature on a running replica by setting wal_prefetch_distance=8kB
(from the default of -1, which means off), and something like
effective_io_concurrency=10, and telling the postmaster to reload.

3.  The new code is moved out to a new file
src/backend/access/transam/xlogprefetcher.c, to minimise new bloat in
the mighty xlog.c file.  Functions were renamed to make their purpose
clearer, and a lot of comments were added.

4.  The WAL receiver now exposes the current 'write' position via an
atomic value in shared memory, so we don't need to hammer the WAL
receiver's spinlock.

5.  There is some rudimentary user documentation of the GUCs.

[1] https://www.postgresql.org/message-id/13619.1557935593%40sss.pgh.pa.us

Attachments:
0001-Allow-PrefetchBuffer-to-be-called-with-a-SMgrRela-v2.patch (6K)
0002-Rename-GetWalRcvWriteRecPtr-to-GetWalRcvFlushRecP-v2.patch (5K)
0003-Add-WalRcvGetWriteRecPtr-new-definition-v2.patch (4K)
0004-Allow-PrefetchBuffer-to-report-missing-file-in-re-v2.patch (9K)
0005-Prefetch-referenced-blocks-during-recovery-v2.patch (41K)

Re: WIP: WAL prefetch (another approach)

Thomas Munro-5
On Wed, Feb 12, 2020 at 7:52 PM Thomas Munro <[hidden email]> wrote:
> 1.  It now uses effective_io_concurrency to control how many
> concurrent prefetches to allow.  It's possible that we should have a
> different GUC to control "maintenance" users of concurrency I/O as
> discussed elsewhere[1], but I'm staying out of that for now; if we
> agree to do that for VACUUM etc, we can change it easily here.  Note
> that the value is percolated through the ComputeIoConcurrency()
> function which I think we should discuss, but again that's off topic,
> I just want to use the standard infrastructure here.

I started a separate thread[1] to discuss that GUC, because it's
basically an independent question.  Meanwhile, here's a new version of
the WAL prefetch patch, with the following changes:

1.  A monitoring view:

  postgres=# select * from pg_stat_wal_prefetcher ;
   prefetch | skip_hit | skip_new | skip_fpw | skip_seq | distance | queue_depth
  ----------+----------+----------+----------+----------+----------+-------------
      95854 |   291458 |      435 |        0 |    26245 |   261800 |          10
  (1 row)

That shows a bunch of counters for blocks prefetched and skipped for
various reasons.  It also shows the current read-ahead distance (in
bytes of WAL) and queue depth (an approximation of how many I/Os might
be in flight, used for rate limiting; I'm struggling to come up with a
better short name for this).  This can be used to see the effects of
experiments with different settings, eg:

  alter system set effective_io_concurrency = 20;
  alter system set wal_prefetch_distance = '256kB';
  select pg_reload_conf();

2.  A log message when WAL prefetching begins and ends, so you can see
what it did during crash recovery:

  LOG:  WAL prefetch finished at 0/C5E98758; prefetch = 1112628,
    skip_hit = 3607540, skip_new = 45592, skip_fpw = 0, skip_seq = 177049,
    avg_distance = 247907.942532, avg_queue_depth = 22.261352

3.  A bit of general user documentation.

[1] https://www.postgresql.org/message-id/flat/CA%2BhUKGJUw08dPs_3EUcdO6M90GnjofPYrWp4YSLaBkgYwS-AqA%40mail.gmail.com

Attachments:
0001-Allow-PrefetchBuffer-to-be-called-with-a-SMgrRelatio.patch (6K)
0002-Rename-GetWalRcvWriteRecPtr-to-GetWalRcvFlushRecPtr.patch (5K)
0003-Add-WalRcvGetWriteRecPtr-new-definition.patch (4K)
0004-Allow-PrefetchBuffer-to-report-the-outcome.patch (9K)
0005-Prefetch-referenced-blocks-during-recovery.patch (61K)

Re: WIP: WAL prefetch (another approach)

Alvaro Herrera-9
I tried my luck at a quick read of this patchset.
I didn't manage to go over 0005 though, but I agree with Tomas that
having this be configurable in terms of bytes of WAL is not very
user-friendly.

First of all, let me join the crowd chanting that this is badly needed;
I don't need to repeat what Chittenden's talk showed.  "WAL recovery is
now 10x-20x faster" would be a good item for pg13 press release,
I think.

> From a61b4e00c42ace5db1608e02165f89094bf86391 Mon Sep 17 00:00:00 2001
> From: Thomas Munro <[hidden email]>
> Date: Tue, 3 Dec 2019 17:13:40 +1300
> Subject: [PATCH 1/5] Allow PrefetchBuffer() to be called with a SMgrRelation.
>
> Previously a Relation was required, but it's annoying to have
> to create a "fake" one in recovery.

LGTM.

It's a pity to have to include smgr.h in bufmgr.h.  Maybe it'd be sane
to use a forward struct declaration and "struct SMgrRelation *" instead.


> From acbff1444d0acce71b0218ce083df03992af1581 Mon Sep 17 00:00:00 2001
> From: Thomas Munro <[hidden email]>
> Date: Mon, 9 Dec 2019 17:10:17 +1300
> Subject: [PATCH 2/5] Rename GetWalRcvWriteRecPtr() to GetWalRcvFlushRecPtr().
>
> The new name better reflects the fact that the value it returns
> is updated only when received data has been flushed to disk.
>
> An upcoming patch will make use of the latest data that was
> written without waiting for it to be flushed, so use more
> precise function names.

Ugh.  (Not for your patch -- I mean for the existing naming convention).
It would make sense to rename WalRcvData->receivedUpto in this commit,
maybe to flushedUpto.


> From d7fa7d82c5f68d0cccf441ce9e8dfa40f64d3e0d Mon Sep 17 00:00:00 2001
> From: Thomas Munro <[hidden email]>
> Date: Mon, 9 Dec 2019 17:22:07 +1300
> Subject: [PATCH 3/5] Add WalRcvGetWriteRecPtr() (new definition).
>
> A later patch will read received WAL to prefetch referenced blocks,
> without waiting for the data to be flushed to disk.  To do that,
> it needs to be able to see the write pointer advancing in shared
> memory.
>
> The function formerly bearing this name was recently renamed to
> WalRcvGetFlushRecPtr(), which better described what it does.

> + pg_atomic_init_u64(&WalRcv->writtenUpto, 0);

Umm, how come you're using WalRcv here instead of walrcv?  I would flag
this patch for sneaky nastiness if this weren't mostly harmless.  (I
think we should do away with local walrcv pointers altogether.  But that
should be a separate patch, I think.)

> + pg_atomic_uint64 writtenUpto;

Are we already using uint64s for XLogRecPtrs anywhere?  This seems
novel.  Given this, I wonder if the comment near "mutex" needs an
update ("except where atomics are used"), or perhaps just move the
member to after the line with mutex.


I didn't understand the purpose of inc_counter() as written.  Why not
just pg_atomic_fetch_add_u64(..., 1)?

>  /*
>   * smgrprefetch() -- Initiate asynchronous read of the specified block of a relation.
> + *
> + * In recovery only, this can return false to indicate that a file
> + * doesn't exist (presumably it has been dropped by a later WAL
> + * record).
>   */
> -void
> +bool
>  smgrprefetch(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum)

I think this API, where the behavior of a low-level module changes
depending on InRecovery, is confusingly crazy.  I'd rather have the
callers specifying whether they're OK with a file that doesn't exist.

> +extern PrefetchBufferResult SharedPrefetchBuffer(SMgrRelation smgr_reln,
> + ForkNumber forkNum,
> + BlockNumber blockNum);
>  extern void PrefetchBuffer(Relation reln, ForkNumber forkNum,
>     BlockNumber blockNum);

Umm, I would keep the return values of both these functions in sync.
It's really strange that PrefetchBuffer does not return
PrefetchBufferResult, don't you think?

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: WIP: WAL prefetch (another approach)

Thomas Munro-5
Hi Alvaro,

On Sat, Mar 14, 2020 at 10:15 AM Alvaro Herrera
<[hidden email]> wrote:
> I tried my luck at a quick read of this patchset.

Thanks!  Here's a new patch set, and some inline responses to your feedback:

> I didn't manage to go over 0005 though, but I agree with Tomas that
> having this be configurable in terms of bytes of WAL is not very
> user-friendly.

The primary control is now maintenance_io_concurrency, which is
basically what Tomas suggested.

The byte-based control is just a cap to prevent it reading a crazy
distance ahead, that also functions as the on/off switch for the
feature.  In this version I've added "max" to the name, to make that
clearer.

> First of all, let me join the crowd chanting that this is badly needed;
> I don't need to repeat what Chittenden's talk showed.  "WAL recovery is
> now 10x-20x faster" would be a good item for pg13 press release,
> I think.

We should be careful about over-promising here: Sean basically had a
best case scenario for this type of technology, partly due to his 16kB
filesystem blocks.  Common results may be a lot more pedestrian,
though it could get more interesting if we figure out how to get rid
of FPWs...

> > From a61b4e00c42ace5db1608e02165f89094bf86391 Mon Sep 17 00:00:00 2001
> > From: Thomas Munro <[hidden email]>
> > Date: Tue, 3 Dec 2019 17:13:40 +1300
> > Subject: [PATCH 1/5] Allow PrefetchBuffer() to be called with a SMgrRelation.
> >
> > Previously a Relation was required, but it's annoying to have
> > to create a "fake" one in recovery.
>
> LGTM.
>
> It's a pity to have to include smgr.h in bufmgr.h.  Maybe it'd be sane
> to use a forward struct declaration and "struct SMgrRelation *" instead.

OK, done.

While staring at this, I decided that SharedPrefetchBuffer() was a
weird word order, so I changed it to PrefetchSharedBuffer().  Then, by
analogy, I figured I should also change the pre-existing function
LocalPrefetchBuffer() to PrefetchLocalBuffer().  Do you think this is
an improvement?

> > From acbff1444d0acce71b0218ce083df03992af1581 Mon Sep 17 00:00:00 2001
> > From: Thomas Munro <[hidden email]>
> > Date: Mon, 9 Dec 2019 17:10:17 +1300
> > Subject: [PATCH 2/5] Rename GetWalRcvWriteRecPtr() to GetWalRcvFlushRecPtr().
> >
> > The new name better reflects the fact that the value it returns
> > is updated only when received data has been flushed to disk.
> >
> > An upcoming patch will make use of the latest data that was
> > written without waiting for it to be flushed, so use more
> > precise function names.
>
> Ugh.  (Not for your patch -- I mean for the existing naming convention).
> It would make sense to rename WalRcvData->receivedUpto in this commit,
> maybe to flushedUpto.

Ok, I renamed that variable and a related one.  There are more things
you could rename if you pull on that thread some more, including
pg_stat_wal_receiver's received_lsn column, but I didn't do that in
this patch.

> > From d7fa7d82c5f68d0cccf441ce9e8dfa40f64d3e0d Mon Sep 17 00:00:00 2001
> > From: Thomas Munro <[hidden email]>
> > Date: Mon, 9 Dec 2019 17:22:07 +1300
> > Subject: [PATCH 3/5] Add WalRcvGetWriteRecPtr() (new definition).
> >
> > A later patch will read received WAL to prefetch referenced blocks,
> > without waiting for the data to be flushed to disk.  To do that,
> > it needs to be able to see the write pointer advancing in shared
> > memory.
> >
> > The function formerly bearing this name was recently renamed to
> > WalRcvGetFlushRecPtr(), which better described what it does.
>
> > +     pg_atomic_init_u64(&WalRcv->writtenUpto, 0);
>
> Umm, how come you're using WalRcv here instead of walrcv?  I would flag
> this patch for sneaky nastiness if this weren't mostly harmless.  (I
> think we should do away with local walrcv pointers altogether.  But that
> should be a separate patch, I think.)

OK, done.

> > +     pg_atomic_uint64 writtenUpto;
>
> Are we already using uint64s for XLogRecPtrs anywhere?  This seems
> novel.  Given this, I wonder if the comment near "mutex" needs an
> update ("except where atomics are used"), or perhaps just move the
> member to after the line with mutex.

Moved.

We use [u]int64 in various places in the replication code.  Ideally
I'd have a magic way to say atomic<XLogRecPtr> so I didn't have to
assume that pg_atomic_uint64 is the right atomic integer width and
signedness, but here we are.  In dsa.h I made a special typedef for
the atomic version of something else, but that's because the size of
that thing varied depending on the build, whereas our LSNs are of a
fixed width that ought to be en... <trails off>.

> I didn't understand the purpose of inc_counter() as written.  Why not
> just pg_atomic_fetch_add_u64(..., 1)?

I didn't want counters that wrap at ~4 billion, but I did want to be
able to read and write concurrently without tearing.  Instructions
like "lock xadd" would provide more guarantees that I don't need,
since only one thread is doing all the writing and there's no ordering
requirement.  It's basically just counter++, but some platforms need a
spinlock to perform atomic read and write of 64 bit wide numbers, so
more hoop jumping is required.
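A minimal standalone sketch of that idea using C11 atomics (PostgreSQL's pg_atomic_read_u64/pg_atomic_write_u64 wrappers play the same role; the function name mirrors the patch's inc_counter, but the body here is an illustration, not the patch's code):

```c
#include <stdatomic.h>
#include <stdint.h>

/*
 * Single-writer counter increment: a plain load, add, store on atomic
 * storage.  Readers in other processes always see untorn 64-bit values,
 * but we skip the full read-modify-write (e.g. "lock xadd") that
 * atomic_fetch_add would emit, since only one process ever writes and
 * there is no ordering requirement.
 */
static inline void
inc_counter(atomic_uint_least64_t *counter)
{
    atomic_store_explicit(counter,
                          atomic_load_explicit(counter,
                                               memory_order_relaxed) + 1,
                          memory_order_relaxed);
}
```

On platforms where plain 64-bit loads and stores are naturally atomic this compiles down to ordinary instructions; older 32-bit platforms fall back to whatever the atomics implementation provides, which is the "hoop jumping" referred to above.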

> >  /*
> >   *   smgrprefetch() -- Initiate asynchronous read of the specified block of a relation.
> > + *
> > + *           In recovery only, this can return false to indicate that a file
> > + *           doesn't exist (presumably it has been dropped by a later WAL
> > + *           record).
> >   */
> > -void
> > +bool
> >  smgrprefetch(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum)
>
> I think this API, where the behavior of a low-level module changes
> depending on InRecovery, is confusingly crazy.  I'd rather have the
> callers specifying whether they're OK with a file that doesn't exist.

Hmm.  But... md.c has other code like that.  It's true that I'm adding
InRecovery awareness to a function that didn't previously have it, but
that's just because we previously had no reason to prefetch stuff in
recovery.

> > +extern PrefetchBufferResult SharedPrefetchBuffer(SMgrRelation smgr_reln,
> > +                                                                                              ForkNumber forkNum,
> > +                                                                                              BlockNumber blockNum);
> >  extern void PrefetchBuffer(Relation reln, ForkNumber forkNum,
> >                                                  BlockNumber blockNum);
>
> Umm, I would keep the return values of both these functions in sync.
> It's really strange that PrefetchBuffer does not return
> PrefetchBufferResult, don't you think?

Agreed, and changed.  I suspect that other users of the main
PrefetchBuffer() call will eventually want that, to do a better job of
keeping the request queue full, for example bitmap heap scan and
(hypothetical) btree scan with prefetch.

0001-Allow-PrefetchBuffer-to-be-called-with-a-SMgrRela-v4.patch (9K)
0002-Rename-GetWalRcvWriteRecPtr-to-GetWalRcvFlushRecP-v4.patch (14K)
0003-Add-WalRcvGetWriteRecPtr-new-definition-v4.patch (5K)
0004-Allow-PrefetchBuffer-to-report-what-happened-v4.patch (13K)
0005-Prefetch-referenced-blocks-during-recovery-v4.patch (63K)

Re: WIP: WAL prefetch (another approach)

Alvaro Herrera-9
On 2020-Mar-17, Thomas Munro wrote:

Hi Thomas

> On Sat, Mar 14, 2020 at 10:15 AM Alvaro Herrera
> <[hidden email]> wrote:

> > I didn't manage to go over 0005 though, but I agree with Tomas that
> > having this be configurable in terms of bytes of WAL is not very
> > user-friendly.
>
> The primary control is now maintenance_io_concurrency, which is
> basically what Tomas suggested.

> The byte-based control is just a cap to prevent it reading a crazy
> distance ahead, that also functions as the on/off switch for the
> feature.  In this version I've added "max" to the name, to make that
> clearer.

Mumble.  I guess I should wait to comment on this after reading 0005
more in depth.

> > First of all, let me join the crowd chanting that this is badly needed;
> > I don't need to repeat what Chittenden's talk showed.  "WAL recovery is
> now 10x-20x faster" would be a good item for pg13 press release,
> > I think.
>
> We should be careful about over-promising here: Sean basically had a
> best case scenario for this type of technology, partly due to his 16kB
> filesystem blocks.  Common results may be a lot more pedestrian,
> though it could get more interesting if we figure out how to get rid
> of FPWs...

Well, in my mind it's an established fact that our WAL replay uses far
too little of the available I/O speed.  I guess if the system is
generating little WAL, then this change will show no benefit, but that's
not the kind of system that cares about this anyway -- for the others,
the parallelisation gains will be substantial, I'm sure.

> > > From a61b4e00c42ace5db1608e02165f89094bf86391 Mon Sep 17 00:00:00 2001
> > > From: Thomas Munro <[hidden email]>
> > > Date: Tue, 3 Dec 2019 17:13:40 +1300
> > > Subject: [PATCH 1/5] Allow PrefetchBuffer() to be called with a SMgrRelation.
> > >
> > > Previously a Relation was required, but it's annoying to have
> > > to create a "fake" one in recovery.

> While staring at this, I decided that SharedPrefetchBuffer() was a
> weird word order, so I changed it to PrefetchSharedBuffer().  Then, by
> analogy, I figured I should also change the pre-existing function
> LocalPrefetchBuffer() to PrefetchLocalBuffer().  Do you think this is
> an improvement?

Looks good.  I doubt you'll break anything by renaming that routine.

> > > From acbff1444d0acce71b0218ce083df03992af1581 Mon Sep 17 00:00:00 2001
> > > From: Thomas Munro <[hidden email]>
> > > Date: Mon, 9 Dec 2019 17:10:17 +1300
> > > Subject: [PATCH 2/5] Rename GetWalRcvWriteRecPtr() to GetWalRcvFlushRecPtr().
> > >
> > > The new name better reflects the fact that the value it returns
> > > is updated only when received data has been flushed to disk.
> > >
> > > An upcoming patch will make use of the latest data that was
> > > written without waiting for it to be flushed, so use more
> > > precise function names.
> >
> > Ugh.  (Not for your patch -- I mean for the existing naming convention).
> > It would make sense to rename WalRcvData->receivedUpto in this commit,
> > maybe to flushedUpto.
>
> Ok, I renamed that variable and a related one.  There are more things
> you could rename if you pull on that thread some more, including
> pg_stat_wal_receiver's received_lsn column, but I didn't do that in
> this patch.

+1 for that approach.  Maybe we'll want to rename the SQL-visible name,
but I wouldn't burden this patch with that, lest we lose the entire
series to that :-)

> > > +     pg_atomic_uint64 writtenUpto;
> >
> > Are we already using uint64s for XLogRecPtrs anywhere?  This seems
> > novel.  Given this, I wonder if the comment near "mutex" needs an
> > update ("except where atomics are used"), or perhaps just move the
> > member to after the line with mutex.
>
> Moved.

LGTM.

> We use [u]int64 in various places in the replication code.  Ideally
> I'd have a magic way to say atomic<XLogRecPtr> so I didn't have to
> assume that pg_atomic_uint64 is the right atomic integer width and
> signedness, but here we are.  In dsa.h I made a special typedef for
> the atomic version of something else, but that's because the size of
> that thing varied depending on the build, whereas our LSNs are of a
> fixed width that ought to be en... <trails off>.

Let's rewrite Postgres in Rust ...

> > I didn't understand the purpose of inc_counter() as written.  Why not
> > just pg_atomic_fetch_add_u64(..., 1)?
>
> I didn't want counters that wrap at ~4 billion, but I did want to be
> able to read and write concurrently without tearing.  Instructions
> like "lock xadd" would provide more guarantees that I don't need,
> since only one thread is doing all the writing and there's no ordering
> requirement.  It's basically just counter++, but some platforms need a
> spinlock to perform atomic read and write of 64 bit wide numbers, so
> more hoop jumping is required.

Ah, I see, you don't want lock xadd ... That's non-obvious.  I suppose
the function could use more commentary on *why* you're doing it that way
then.

> > >  /*
> > >   *   smgrprefetch() -- Initiate asynchronous read of the specified block of a relation.
> > > + *
> > > + *           In recovery only, this can return false to indicate that a file
> > > + *           doesn't exist (presumably it has been dropped by a later WAL
> > > + *           record).
> > >   */
> > > -void
> > > +bool
> > >  smgrprefetch(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum)
> >
> > I think this API, where the behavior of a low-level module changes
> > depending on InRecovery, is confusingly crazy.  I'd rather have the
> > callers specifying whether they're OK with a file that doesn't exist.
>
> Hmm.  But... md.c has other code like that.  It's true that I'm adding
> InRecovery awareness to a function that didn't previously have it, but
> that's just because we previously had no reason to prefetch stuff in
> recovery.

True.  I'm uncomfortable about it anyway.  I also noticed that
_mdfd_getseg() already has InRecovery-specific behavior flags.
Clearly that ship has sailed.  Consider my objection^W comment withdrawn.

> > Umm, I would keep the return values of both these functions in sync.
> > It's really strange that PrefetchBuffer does not return
> > PrefetchBufferResult, don't you think?
>
> Agreed, and changed.  I suspect that other users of the main
> PrefetchBuffer() call will eventually want that, to do a better job of
> keeping the request queue full, for example bitmap heap scan and
> (hypothetical) btree scan with prefetch.

LGTM.

As before, I didn't get to reading 0005 in depth.

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: WIP: WAL prefetch (another approach)

Thomas Munro-5
On Wed, Mar 18, 2020 at 2:47 PM Alvaro Herrera <[hidden email]> wrote:

> On 2020-Mar-17, Thomas Munro wrote:
> > I didn't want counters that wrap at ~4 billion, but I did want to be
> > able to read and write concurrently without tearing.  Instructions
> > like "lock xadd" would provide more guarantees that I don't need,
> > since only one thread is doing all the writing and there's no ordering
> > requirement.  It's basically just counter++, but some platforms need a
> > spinlock to perform atomic read and write of 64 bit wide numbers, so
> > more hoop jumping is required.
>
> Ah, I see, you don't want lock xadd ... That's non-obvious.  I suppose
> the function could use more commentary on *why* you're doing it that way
> then.

I updated the comment:

+/*
+ * On modern systems this is really just *counter++.  On some older systems
+ * there might be more to it, due to inability to read and write 64 bit values
+ * atomically.  The counters will only be written to by one process, and there
+ * is no ordering requirement, so there's no point in using higher overhead
+ * pg_atomic_fetch_add_u64().
+ */
+static inline void inc_counter(pg_atomic_uint64 *counter)

> > > Umm, I would keep the return values of both these functions in sync.
> > > It's really strange that PrefetchBuffer does not return
> > > PrefetchBufferResult, don't you think?
> >
> > Agreed, and changed.  I suspect that other users of the main
> > PrefetchBuffer() call will eventually want that, to do a better job of
> > keeping the request queue full, for example bitmap heap scan and
> > (hypothetical) btree scan with prefetch.
>
> LGTM.

Here's a new version that changes that part just a bit more, after a
brief chat with Andres about his async I/O plans.  It seems clear that
returning an enum isn't very extensible, so I decided to try making
PrefetchBufferResult a struct whose contents can be extended in the
future.  In this patch set it's still just used to distinguish 3 cases
(hit, miss, no file), but it's now expressed as a buffer and a flag to
indicate whether I/O was initiated.  You could imagine that the second
thing might be replaced by a pointer to an async I/O handle you can
wait on or some other magical thing from the future.

The concept here is that eventually we'll have just one XLogReader for
both read ahead and recovery, and we could attach the prefetch results
to the decoded records, and then recovery would try to use already
looked up buffers to avoid a bit of work (and then recheck).  In other
words, the WAL would be decoded only once, and the buffers would
hopefully be looked up only once, so you'd claw back all of the
overheads of this patch.  For now that's not done, and the buffer in
the result is only compared with InvalidBuffer to check if there was a
hit or not.

Similar things could be done for bitmap heap scan and btree prefetch
with this interface: their prefetch machinery could hold onto these
results in their block arrays and try to avoid a more expensive
ReadBuffer() call if they already have a buffer (though as before,
there's a small chance it turns out to be the wrong one and they need
to fall back to ReadBuffer()).
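The recheck-then-fallback pattern described above can be modelled in miniature. This is a hypothetical sketch, not the patch's code: only PrefetchBufferResult's shape (a buffer plus an initiated_io flag) comes from the thread; the tiny buffer pool and all other names are made up for illustration.

```c
#include <stdbool.h>

#define NBUFFERS 4
#define INVALID_BUFFER (-1)

typedef struct { int rel; int block; bool valid; } BufferDesc;
typedef struct { int buffer; bool initiated_io; } PrefetchBufferResult;

static BufferDesc pool[NBUFFERS];   /* toy buffer pool, zero-initialized */

static int
lookup_buffer(int rel, int block)
{
    for (int i = 0; i < NBUFFERS; i++)
        if (pool[i].valid && pool[i].rel == rel && pool[i].block == block)
            return i;
    return INVALID_BUFFER;
}

/* Prefetch: report a hit with the (unpinned!) buffer id, or start I/O. */
static PrefetchBufferResult
prefetch_buffer(int rel, int block)
{
    PrefetchBufferResult r = { INVALID_BUFFER, false };
    int buf = lookup_buffer(rel, block);

    if (buf >= 0)
        r.buffer = buf;            /* hit: caller must recheck before use */
    else
        r.initiated_io = true;     /* miss: stand-in for issuing fadvise */
    return r;
}

/* Later consumer: trust the hint only if the tag still matches. */
static int
read_buffer(int rel, int block, PrefetchBufferResult hint)
{
    if (hint.buffer != INVALID_BUFFER &&
        pool[hint.buffer].valid &&
        pool[hint.buffer].rel == rel &&
        pool[hint.buffer].block == block)
        return hint.buffer;        /* fast path: hint still accurate */

    return lookup_buffer(rel, block);   /* slow path: full lookup/read */
}
```

Because the hinted buffer is never pinned, it can be evicted and reused between the prefetch and the read, which is exactly why the tag recheck (and the ReadBuffer() fallback in the real code) is mandatory.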

> As before, I didn't get to reading 0005 in depth.

Updated to account for the above-mentioned change, and with a couple
of elog() calls changed to ereport().

0001-Allow-PrefetchBuffer-to-be-called-with-a-SMgrRela-v5.patch (9K)
0002-Rename-GetWalRcvWriteRecPtr-to-GetWalRcvFlushRecP-v5.patch (14K)
0003-Add-WalRcvGetWriteRecPtr-new-definition-v5.patch (5K)
0004-Allow-PrefetchBuffer-to-report-what-happened-v5.patch (15K)
0005-Prefetch-referenced-blocks-during-recovery-v5.patch (63K)

Re: WIP: WAL prefetch (another approach)

Andres Freund
Hi,

On 2020-03-18 18:18:44 +1300, Thomas Munro wrote:

> From 1b03eb5ada24c3b23ab8ca6db50e0c5d90d38259 Mon Sep 17 00:00:00 2001
> From: Thomas Munro <[hidden email]>
> Date: Mon, 9 Dec 2019 17:22:07 +1300
> Subject: [PATCH 3/5] Add WalRcvGetWriteRecPtr() (new definition).
>
> A later patch will read received WAL to prefetch referenced blocks,
> without waiting for the data to be flushed to disk.  To do that, it
> needs to be able to see the write pointer advancing in shared memory.
>
> The function formerly bearing this name was recently renamed to
> WalRcvGetFlushRecPtr(), which better described what it does.

Hm. I'm a bit wary of reusing the name with a different meaning.  If
there are any external references, this'll hide that they need to
adapt. Perhaps, even if it's a bit clunky, name it GetUnflushedRecPtr?


> From c62fde23f70ff06833d743a1c85716e15f3c813c Mon Sep 17 00:00:00 2001
> From: Thomas Munro <[hidden email]>
> Date: Tue, 17 Mar 2020 17:26:41 +1300
> Subject: [PATCH 4/5] Allow PrefetchBuffer() to report what happened.
>
> Report whether a prefetch was actually initiated due to a cache miss, so
> that callers can limit the number of concurrent I/Os they try to issue,
> without counting the prefetch calls that did nothing because the page
> was already in our buffers.
>
> If the requested block was already cached, return a valid buffer.  This
> might enable future code to avoid a buffer mapping lookup, though it
> will need to recheck the buffer before using it because it's not pinned
> so could be reclaimed at any time.
>
> Report neither hit nor miss when a relation's backing file is missing,
> to prepare for use during recovery.  This will be used to handle cases
> of relations that are referenced in the WAL but have been unlinked
> already due to actions covered by WAL records that haven't been replayed
> yet, after a crash.

We probably should take this into account in nodeBitmapHeapscan.c


> diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
> index d30aed6fd9..4ceb40a856 100644
> --- a/src/backend/storage/buffer/bufmgr.c
> +++ b/src/backend/storage/buffer/bufmgr.c
> @@ -469,11 +469,13 @@ static int ts_ckpt_progress_comparator(Datum a, Datum b, void *arg);
>  /*
>   * Implementation of PrefetchBuffer() for shared buffers.
>   */
> -void
> +PrefetchBufferResult
>  PrefetchSharedBuffer(struct SMgrRelationData *smgr_reln,
>   ForkNumber forkNum,
>   BlockNumber blockNum)
>  {
> + PrefetchBufferResult result = { InvalidBuffer, false };
> +
>  #ifdef USE_PREFETCH
>   BufferTag newTag; /* identity of requested block */
>   uint32 newHash; /* hash value for newTag */
> @@ -497,7 +499,23 @@ PrefetchSharedBuffer(struct SMgrRelationData *smgr_reln,
>  
>   /* If not in buffers, initiate prefetch */
>   if (buf_id < 0)
> - smgrprefetch(smgr_reln, forkNum, blockNum);
> + {
> + /*
> + * Try to initiate an asynchronous read.  This returns false in
> + * recovery if the relation file doesn't exist.
> + */
> + if (smgrprefetch(smgr_reln, forkNum, blockNum))
> + result.initiated_io = true;
> + }
> + else
> + {
> + /*
> + * Report the buffer it was in at that time.  The caller may be able
> + * to avoid a buffer table lookup, but it's not pinned and it must be
> + * rechecked!
> + */
> + result.buffer = buf_id + 1;

Perhaps it'd be better to name this "last_buffer" or such, to make it
clearer that it may be outdated?


> -void
> +PrefetchBufferResult
>  PrefetchBuffer(Relation reln, ForkNumber forkNum, BlockNumber blockNum)
>  {
>  #ifdef USE_PREFETCH
> @@ -540,13 +564,17 @@ PrefetchBuffer(Relation reln, ForkNumber forkNum, BlockNumber blockNum)
>   errmsg("cannot access temporary tables of other sessions")));
>  
>   /* pass it off to localbuf.c */
> - PrefetchLocalBuffer(reln->rd_smgr, forkNum, blockNum);
> + return PrefetchLocalBuffer(reln->rd_smgr, forkNum, blockNum);
>   }
>   else
>   {
>   /* pass it to the shared buffer version */
> - PrefetchSharedBuffer(reln->rd_smgr, forkNum, blockNum);
> + return PrefetchSharedBuffer(reln->rd_smgr, forkNum, blockNum);
>   }
> +#else
> + PrefetchBufferResult result = { InvalidBuffer, false };
> +
> + return result;
>  #endif /* USE_PREFETCH */
>  }

Hm. Now that results are returned indicating whether the buffer is in
s_b - shouldn't the return value be accurate regardless of USE_PREFETCH?



> +/*
> + * Type returned by PrefetchBuffer().
> + */
> +typedef struct PrefetchBufferResult
> +{
> + Buffer buffer; /* If valid, a hit (recheck needed!) */

I assume there's no user of this yet? Even if there's not, I wonder if
it still is worth adding and referencing a helper to do so correctly?


> From 42ba0a89260d46230ac0df791fae18bfdca0092f Mon Sep 17 00:00:00 2001
> From: Thomas Munro <[hidden email]>
> Date: Wed, 18 Mar 2020 16:35:27 +1300
> Subject: [PATCH 5/5] Prefetch referenced blocks during recovery.
>
> Introduce a new GUC max_wal_prefetch_distance.  If it is set to a
> positive number of bytes, then read ahead in the WAL at most that
> distance, and initiate asynchronous reading of referenced blocks.  The
> goal is to avoid I/O stalls and benefit from concurrent I/O.  The number
> of concurrency asynchronous reads is capped by the existing
> maintenance_io_concurrency GUC.  The feature is disabled by default.
>
> Reviewed-by: Tomas Vondra <[hidden email]>
> Reviewed-by: Alvaro Herrera <[hidden email]>
> Discussion:
> https://postgr.es/m/CA%2BhUKGJ4VJN8ttxScUFM8dOKX0BrBiboo5uz1cq%3DAovOddfHpA%40mail.gmail.com

Why is it disabled by default? Just for "risk management"?


> +     <varlistentry id="guc-max-wal-prefetch-distance" xreflabel="max_wal_prefetch_distance">
> +      <term><varname>max_wal_prefetch_distance</varname> (<type>integer</type>)
> +      <indexterm>
> +       <primary><varname>max_wal_prefetch_distance</varname> configuration parameter</primary>
> +      </indexterm>
> +      </term>
> +      <listitem>
> +       <para>
> +        The maximum distance to look ahead in the WAL during recovery, to find
> +        blocks to prefetch.  Prefetching blocks that will soon be needed can
> +        reduce I/O wait times.  The number of concurrent prefetches is limited
> +        by this setting as well as <xref linkend="guc-maintenance-io-concurrency"/>.
> +        If this value is specified without units, it is taken as bytes.
> +        The default is -1, meaning that WAL prefetching is disabled.
> +       </para>
> +      </listitem>
> +     </varlistentry>

Is it worth noting that a too large distance could hurt, because the
buffers might get evicted again?


> +     <varlistentry id="guc-wal-prefetch-fpw" xreflabel="wal_prefetch_fpw">
> +      <term><varname>wal_prefetch_fpw</varname> (<type>boolean</type>)
> +      <indexterm>
> +       <primary><varname>wal_prefetch_fpw</varname> configuration parameter</primary>
> +      </indexterm>
> +      </term>
> +      <listitem>
> +       <para>
> +        Whether to prefetch blocks with full page images during recovery.
> +        Usually this doesn't help, since such blocks will not be read.  However,
> +        on file systems with a block size larger than
> +        <productname>PostgreSQL</productname>'s, prefetching can avoid a costly
> +        read-before-write when blocks are later written.
> +        This setting has no effect unless
> +        <xref linkend="guc-max-wal-prefetch-distance"/> is set to a positive number.
> +        The default is off.
> +       </para>
> +      </listitem>
> +     </varlistentry>

Hm. I think this needs more details - it's not clear enough what this
actually controls. I assume it's about prefetching for WAL records that
contain the FPW, but it also could be read to be about not prefetching
any pages that had FPWs before, or such?


>       </variablelist>
>       </sect2>
>       <sect2 id="runtime-config-wal-archiving">
> diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
> index 987580d6df..df4291092b 100644
> --- a/doc/src/sgml/monitoring.sgml
> +++ b/doc/src/sgml/monitoring.sgml
> @@ -320,6 +320,13 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
>        </entry>
>       </row>
>  
> +     <row>
> +      <entry><structname>pg_stat_wal_prefetcher</structname><indexterm><primary>pg_stat_wal_prefetcher</primary></indexterm></entry>
> +      <entry>Only one row, showing statistics about blocks prefetched during recovery.
> +       See <xref linkend="pg-stat-wal-prefetcher-view"/> for details.
> +      </entry>
> +     </row>
> +

'prefetcher' somehow sounds odd to me. I also suspect that we'll want to
have additional prefetching stat tables going forward. Perhaps
'pg_stat_prefetch_wal'?


> +    <row>
> +     <entry><structfield>distance</structfield></entry>
> +     <entry><type>integer</type></entry>
> +     <entry>How far ahead of recovery the WAL prefetcher is currently reading, in bytes</entry>
> +    </row>
> +    <row>
> +     <entry><structfield>queue_depth</structfield></entry>
> +     <entry><type>integer</type></entry>
> +     <entry>How many prefetches have been initiated but are not yet known to have completed</entry>
> +    </row>
> +   </tbody>
> +   </tgroup>
> +  </table>

Is there a way we could have a "historical" version of at least some of
these? An average queue depth, or such?

It'd be useful to somewhere track the time spent initiating prefetch
requests. Otherwise it's quite hard to evaluate whether the queue is too
deep (and just blocks in the OS).

I think it'd be good to have a 'reset time' column.


> +  <para>
> +   The <structname>pg_stat_wal_prefetcher</structname> view will contain only
> +   one row.  It is filled with nulls if recovery is not running or WAL
> +   prefetching is not enabled.  See <xref linkend="guc-max-wal-prefetch-distance"/>
> +   for more information.  The counters in this view are reset whenever the
> +   <xref linkend="guc-max-wal-prefetch-distance"/>,
> +   <xref linkend="guc-wal-prefetch-fpw"/> or
> +   <xref linkend="guc-maintenance-io-concurrency"/> setting is changed and
> +   the server configuration is reloaded.
> +  </para>
> +

So pg_stat_reset_shared() cannot be used? If so, why?

It sounds like the counters aren't persisted via the stats system - if
so, why?



> @@ -7105,6 +7114,31 @@ StartupXLOG(void)
>   /* Handle interrupt signals of startup process */
>   HandleStartupProcInterrupts();
>  
> + /*
> + * The first time through, or if any relevant settings or the
> + * WAL source changes, we'll restart the prefetching machinery
> + * as appropriate.  This is simpler than trying to handle
> + * various complicated state changes.
> + */
> + if (unlikely(reset_wal_prefetcher))
> + {
> + /* If we had one already, destroy it. */
> + if (prefetcher)
> + {
> + XLogPrefetcherFree(prefetcher);
> + prefetcher = NULL;
> + }
> + /* If we want one, create it. */
> + if (max_wal_prefetch_distance > 0)
> + prefetcher = XLogPrefetcherAllocate(xlogreader->ReadRecPtr,
> + currentSource == XLOG_FROM_STREAM);
> + reset_wal_prefetcher = false;
> + }

Do we really need all of this code in StartupXLOG() itself? Could it be
in HandleStartupProcInterrupts() or at least a helper routine called
here?


> + /* Peform WAL prefetching, if enabled. */
> + if (prefetcher)
> + XLogPrefetcherReadAhead(prefetcher, xlogreader->ReadRecPtr);
> +
>   /*
>   * Pause WAL replay, if requested by a hot-standby session via
>   * SetRecoveryPause().

Personally, I'd rather have the if () be in
XLogPrefetcherReadAhead(). With an inline wrapper doing the check, if
the call bothers you (but I don't think it needs to).


> +/*-------------------------------------------------------------------------
> + *
> + * xlogprefetcher.c
> + * Prefetching support for PostgreSQL write-ahead log manager
> + *

An architectural overview here would be good.


> +struct XLogPrefetcher
> +{
> + /* Reader and current reading state. */
> + XLogReaderState *reader;
> + XLogReadLocalOptions options;
> + bool have_record;
> + bool shutdown;
> + int next_block_id;
> +
> + /* Book-keeping required to avoid accessing non-existing blocks. */
> + HTAB   *filter_table;
> + dlist_head filter_queue;
> +
> + /* Book-keeping required to limit concurrent prefetches. */
> + XLogRecPtr   *prefetch_queue;
> + int prefetch_queue_size;
> + int prefetch_head;
> + int prefetch_tail;
> +
> + /* Details of last prefetch to skip repeats and seq scans. */
> + SMgrRelation last_reln;
> + RelFileNode last_rnode;
> + BlockNumber last_blkno;

Do you have a comment somewhere explaining why you want to avoid
seqscans (I assume it's about avoiding regressions in linux, but only
because I recall chatting with you about it).


> +/*
> + * On modern systems this is really just *counter++.  On some older systems
> + * there might be more to it, due to inability to read and write 64 bit values
> + * atomically.  The counters will only be written to by one process, and there
> + * is no ordering requirement, so there's no point in using higher overhead
> + * pg_atomic_fetch_add_u64().
> + */
> +static inline void inc_counter(pg_atomic_uint64 *counter)
> +{
> + pg_atomic_write_u64(counter, pg_atomic_read_u64(counter) + 1);
> +}

Could be worthwhile to add to the atomics infrastructure itself - on the
platforms where this needs spinlocks this will lead to two acquisitions,
rather than one.


> +/*
> + * Create a prefetcher that is ready to begin prefetching blocks referenced by
> + * WAL that is ahead of the given lsn.
> + */
> +XLogPrefetcher *
> +XLogPrefetcherAllocate(XLogRecPtr lsn, bool streaming)
> +{
> + static HASHCTL hash_table_ctl = {
> + .keysize = sizeof(RelFileNode),
> + .entrysize = sizeof(XLogPrefetcherFilter)
> + };
> + XLogPrefetcher *prefetcher = palloc0(sizeof(*prefetcher));
> +
> + prefetcher->options.nowait = true;
> + if (streaming)
> + {
> + /*
> + * We're only allowed to read as far as the WAL receiver has written.
> + * We don't have to wait for it to be flushed, though, as recovery
> + * does, so that gives us a chance to get a bit further ahead.
> + */
> + prefetcher->options.read_upto_policy = XLRO_WALRCV_WRITTEN;
> + }
> + else
> + {
> + /* We're allowed to read as far as we can. */
> + prefetcher->options.read_upto_policy = XLRO_LSN;
> + prefetcher->options.lsn = (XLogRecPtr) -1;
> + }
> + prefetcher->reader = XLogReaderAllocate(wal_segment_size,
> + NULL,
> + read_local_xlog_page,
> + &prefetcher->options);
> + prefetcher->filter_table = hash_create("PrefetchFilterTable", 1024,
> +   &hash_table_ctl,
> +   HASH_ELEM | HASH_BLOBS);
> + dlist_init(&prefetcher->filter_queue);
> +
> + /*
> + * The size of the queue is based on the maintenance_io_concurrency
> + * setting.  In theory we might have a separate queue for each tablespace,
> + * but it's not clear how that should work, so for now we'll just use the
> + * general GUC to rate-limit all prefetching.
> + */
> + prefetcher->prefetch_queue_size = maintenance_io_concurrency;
> + prefetcher->prefetch_queue = palloc0(sizeof(XLogRecPtr) * prefetcher->prefetch_queue_size);
> + prefetcher->prefetch_head = prefetcher->prefetch_tail = 0;
> +
> + /* Prepare to read at the given LSN. */
> + ereport(LOG,
> + (errmsg("WAL prefetch started at %X/%X",
> + (uint32) (lsn >> 32), (uint32) lsn)));
> + XLogBeginRead(prefetcher->reader, lsn);
> +
> + XLogPrefetcherResetMonitoringStats();
> +
> + return prefetcher;
> +}
> +
> +/*
> + * Destroy a prefetcher and release all resources.
> + */
> +void
> +XLogPrefetcherFree(XLogPrefetcher *prefetcher)
> +{
> + double avg_distance = 0;
> + double avg_queue_depth = 0;
> +
> + /* Log final statistics. */
> + if (prefetcher->samples > 0)
> + {
> + avg_distance = prefetcher->distance_sum / prefetcher->samples;
> + avg_queue_depth = prefetcher->queue_depth_sum / prefetcher->samples;
> + }
> + ereport(LOG,
> + (errmsg("WAL prefetch finished at %X/%X; "
> + "prefetch = " UINT64_FORMAT ", "
> + "skip_hit = " UINT64_FORMAT ", "
> + "skip_new = " UINT64_FORMAT ", "
> + "skip_fpw = " UINT64_FORMAT ", "
> + "skip_seq = " UINT64_FORMAT ", "
> + "avg_distance = %f, "
> + "avg_queue_depth = %f",
> + (uint32) (prefetcher->reader->EndRecPtr >> 32),
> + (uint32) (prefetcher->reader->EndRecPtr),
> + pg_atomic_read_u64(&MonitoringStats->prefetch),
> + pg_atomic_read_u64(&MonitoringStats->skip_hit),
> + pg_atomic_read_u64(&MonitoringStats->skip_new),
> + pg_atomic_read_u64(&MonitoringStats->skip_fpw),
> + pg_atomic_read_u64(&MonitoringStats->skip_seq),
> + avg_distance,
> + avg_queue_depth)));
> + XLogReaderFree(prefetcher->reader);
> + hash_destroy(prefetcher->filter_table);
> + pfree(prefetcher->prefetch_queue);
> + pfree(prefetcher);
> +
> + XLogPrefetcherResetMonitoringStats();
> +}

It's possibly overkill, but I think it'd be a good idea to do all the
allocations within a prefetch specific memory context. That makes
detecting potential leaks or such easier.



> + /* Can we drop any filters yet, due to problem records begin replayed? */

Odd grammar.


> + XLogPrefetcherCompleteFilters(prefetcher, replaying_lsn);

Hm, why isn't this part of the loop below?


> + /* Main prefetch loop. */
> + for (;;)
> + {

This kind of looks like a separate process' main loop. The name
indicates similar. And there's no architecture documentation
disinclining one from that view...


The loop body is quite long. I think it should be split into a number of
helper functions. Perhaps one to ensure a block is read, one to maintain
stats, and then one to process block references?


> + /*
> + * Scan the record for block references.  We might already have been
> + * partway through processing this record when we hit maximum I/O
> + * concurrency, so start where we left off.
> + */
> + for (int i = prefetcher->next_block_id; i <= reader->max_block_id; ++i)
> + {

Super pointless nitpickery: For a loop-body this big I'd rather name 'i'
'blockid' or such.

Greetings,

Andres Freund



Re: WIP: WAL prefetch (another approach)

Thomas Munro-5
Hi,

Thanks for all that feedback.  It's been a strange couple of weeks,
but I finally have a new version that addresses most of that feedback
(but punts on a couple of suggestions for later development, due to
lack of time).

It also fixes a couple of other problems I found with the previous version:

1.  While streaming, whenever it hit the end of available data (ie LSN
written by WAL receiver), it would close and then reopen the WAL
segment.  Fixed by the machinery in 0007 which allows for "would
block" as distinct from other errors.

2.  During crash recovery, there were some edge cases where it would
try to read the next WAL segment when there isn't one.  Also fixed by
0007.

3.  It was maxing out at maintenance_io_concurrency - 1 due to a silly
circular buffer fence post bug.

Note that 0006 is just for illustration, it's not proposed for commit.

On Wed, Mar 25, 2020 at 11:31 AM Andres Freund <[hidden email]> wrote:

> On 2020-03-18 18:18:44 +1300, Thomas Munro wrote:
> > From 1b03eb5ada24c3b23ab8ca6db50e0c5d90d38259 Mon Sep 17 00:00:00 2001
> > From: Thomas Munro <[hidden email]>
> > Date: Mon, 9 Dec 2019 17:22:07 +1300
> > Subject: [PATCH 3/5] Add WalRcvGetWriteRecPtr() (new definition).
> >
> > A later patch will read received WAL to prefetch referenced blocks,
> > without waiting for the data to be flushed to disk.  To do that, it
> > needs to be able to see the write pointer advancing in shared memory.
> >
> > The function formerly bearing name was recently renamed to
> > WalRcvGetFlushRecPtr(), which better described what it does.
>
> Hm. I'm a bit weary of reusing the name with a different meaning. If
> there's any external references, this'll hide that they need to
> adapt. Perhaps, even if it's a bit clunky, name it GetUnflushedRecPtr?
Well, at least external code won't compile due to the change in arguments:

extern XLogRecPtr GetWalRcvWriteRecPtr(XLogRecPtr *latestChunkStart,
TimeLineID *receiveTLI);
extern XLogRecPtr GetWalRcvWriteRecPtr(void);

Anyone who is using that for some kind of data integrity purposes
should hopefully be triggered to investigate, no?  I tried to think of
a better naming scheme but...

> > From c62fde23f70ff06833d743a1c85716e15f3c813c Mon Sep 17 00:00:00 2001
> > From: Thomas Munro <[hidden email]>
> > Date: Tue, 17 Mar 2020 17:26:41 +1300
> > Subject: [PATCH 4/5] Allow PrefetchBuffer() to report what happened.
> >
> > Report whether a prefetch was actually initiated due to a cache miss, so
> > that callers can limit the number of concurrent I/Os they try to issue,
> > without counting the prefetch calls that did nothing because the page
> > was already in our buffers.
> >
> > If the requested block was already cached, return a valid buffer.  This
> > might enable future code to avoid a buffer mapping lookup, though it
> > will need to recheck the buffer before using it because it's not pinned
> > so could be reclaimed at any time.
> >
> > Report neither hit nor miss when a relation's backing file is missing,
> > to prepare for use during recovery.  This will be used to handle cases
> > of relations that are referenced in the WAL but have been unlinked
> > already due to actions covered by WAL records that haven't been replayed
> > yet, after a crash.
>
> We probably should take this into account in nodeBitmapHeapscan.c
Indeed.  The naive version would be something like:

diff --git a/src/backend/executor/nodeBitmapHeapscan.c b/src/backend/executor/nodeBitmapHeapscan.c
index 726d3a2d9a..3cd644d0ac 100644
--- a/src/backend/executor/nodeBitmapHeapscan.c
+++ b/src/backend/executor/nodeBitmapHeapscan.c
@@ -484,13 +484,11 @@ BitmapPrefetch(BitmapHeapScanState *node, TableScanDesc scan)
 					node->prefetch_iterator = NULL;
 					break;
 				}
-				node->prefetch_pages++;
 
 				/*
 				 * If we expect not to have to actually read this heap page,
 				 * skip this prefetch call, but continue to run the prefetch
-				 * logic normally.  (Would it be better not to increment
-				 * prefetch_pages?)
+				 * logic normally.
 				 *
 				 * This depends on the assumption that the index AM will
 				 * report the same recheck flag for this future heap page as
@@ -504,7 +502,13 @@ BitmapPrefetch(BitmapHeapScanState *node, TableScanDesc scan)
 												  &node->pvmbuffer));
 
 				if (!skip_fetch)
-					PrefetchBuffer(scan->rs_rd, MAIN_FORKNUM, tbmpre->blockno);
+				{
+					PrefetchBufferResult prefetch;
+
+					prefetch = PrefetchBuffer(scan->rs_rd, MAIN_FORKNUM, tbmpre->blockno);
+					if (prefetch.initiated_io)
+						node->prefetch_pages++;
+				}
 			}
 		}

... but that might get arbitrarily far ahead, so it probably needs
some kind of cap, and the parallel version is a bit more complicated.
Something for later, along with more prefetching opportunities.

> > diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
> > index d30aed6fd9..4ceb40a856 100644
> > --- a/src/backend/storage/buffer/bufmgr.c
> > +++ b/src/backend/storage/buffer/bufmgr.c
> > @@ -469,11 +469,13 @@ static int      ts_ckpt_progress_comparator(Datum a, Datum b, void *arg);
> >  /*
> >   * Implementation of PrefetchBuffer() for shared buffers.
> >   */
> > -void
> > +PrefetchBufferResult
> >  PrefetchSharedBuffer(struct SMgrRelationData *smgr_reln,
> >                                        ForkNumber forkNum,
> >                                        BlockNumber blockNum)
> >  {
> > +     PrefetchBufferResult result = { InvalidBuffer, false };
> > +
> >  #ifdef USE_PREFETCH
> >       BufferTag       newTag;         /* identity of requested block */
> >       uint32          newHash;        /* hash value for newTag */
> > @@ -497,7 +499,23 @@ PrefetchSharedBuffer(struct SMgrRelationData *smgr_reln,
> >
> >       /* If not in buffers, initiate prefetch */
> >       if (buf_id < 0)
> > -             smgrprefetch(smgr_reln, forkNum, blockNum);
> > +     {
> > +             /*
> > +              * Try to initiate an asynchronous read.  This returns false in
> > +              * recovery if the relation file doesn't exist.
> > +              */
> > +             if (smgrprefetch(smgr_reln, forkNum, blockNum))
> > +                     result.initiated_io = true;
> > +     }
> > +     else
> > +     {
> > +             /*
> > +              * Report the buffer it was in at that time.  The caller may be able
> > +              * to avoid a buffer table lookup, but it's not pinned and it must be
> > +              * rechecked!
> > +              */
> > +             result.buffer = buf_id + 1;
>
> Perhaps it'd be better to name this "last_buffer" or such, to make it
> clearer that it may be outdated?
OK.  Renamed to "recent_buffer".

> > -void
> > +PrefetchBufferResult
> >  PrefetchBuffer(Relation reln, ForkNumber forkNum, BlockNumber blockNum)
> >  {
> >  #ifdef USE_PREFETCH
> > @@ -540,13 +564,17 @@ PrefetchBuffer(Relation reln, ForkNumber forkNum, BlockNumber blockNum)
> >                                        errmsg("cannot access temporary tables of other sessions")));
> >
> >               /* pass it off to localbuf.c */
> > -             PrefetchLocalBuffer(reln->rd_smgr, forkNum, blockNum);
> > +             return PrefetchLocalBuffer(reln->rd_smgr, forkNum, blockNum);
> >       }
> >       else
> >       {
> >               /* pass it to the shared buffer version */
> > -             PrefetchSharedBuffer(reln->rd_smgr, forkNum, blockNum);
> > +             return PrefetchSharedBuffer(reln->rd_smgr, forkNum, blockNum);
> >       }
> > +#else
> > +     PrefetchBufferResult result = { InvalidBuffer, false };
> > +
> > +     return result;
> >  #endif                                                       /* USE_PREFETCH */
> >  }
>
> Hm. Now that results are returned indicating whether the buffer is in
> s_b - shouldn't the return value be accurate regardless of USE_PREFETCH?
Yeah.  Done.

> > +/*
> > + * Type returned by PrefetchBuffer().
> > + */
> > +typedef struct PrefetchBufferResult
> > +{
> > +     Buffer          buffer;                 /* If valid, a hit (recheck needed!) */
>
> I assume there's no user of this yet? Even if there's not, I wonder if
> it still is worth adding and referencing a helper to do so correctly?

It *is* used, but only to see if it's valid.  0006 is a not-for-commit
patch to show how you might use it later to read a buffer.  To
actually use this for something like bitmap heap scan, you'd first
need to fix the modularity violations in that code (I mean we have
PrefetchBuffer() in nodeBitmapHeapscan.c, but the corresponding
[ReleaseAnd]ReadBuffer() in heapam.c, and you'd need to get these into
the same module and/or to communicate in some graceful way).

> > From 42ba0a89260d46230ac0df791fae18bfdca0092f Mon Sep 17 00:00:00 2001
> > From: Thomas Munro <[hidden email]>
> > Date: Wed, 18 Mar 2020 16:35:27 +1300
> > Subject: [PATCH 5/5] Prefetch referenced blocks during recovery.
> >
> > Introduce a new GUC max_wal_prefetch_distance.  If it is set to a
> > positive number of bytes, then read ahead in the WAL at most that
> > distance, and initiate asynchronous reading of referenced blocks.  The
> > goal is to avoid I/O stalls and benefit from concurrent I/O.  The number
> > of concurrency asynchronous reads is capped by the existing
> > maintenance_io_concurrency GUC.  The feature is disabled by default.
> >
> > Reviewed-by: Tomas Vondra <[hidden email]>
> > Reviewed-by: Alvaro Herrera <[hidden email]>
> > Discussion:
> > https://postgr.es/m/CA%2BhUKGJ4VJN8ttxScUFM8dOKX0BrBiboo5uz1cq%3DAovOddfHpA%40mail.gmail.com
>
> Why is it disabled by default? Just for "risk management"?
Well, it's not free, and might not help you, so not everyone would
want it on.  I think the overheads can be mostly removed with more
work in a later release.  Perhaps we could commit it enabled by
default, and then discuss it before release after looking at some more
data?  On that basis I have now made it default to on, with
max_wal_prefetch_distance = 256kB, if your build has USE_PREFETCH.
Obviously this number can be discussed.

> > +     <varlistentry id="guc-max-wal-prefetch-distance" xreflabel="max_wal_prefetch_distance">
> > +      <term><varname>max_wal_prefetch_distance</varname> (<type>integer</type>)
> > +      <indexterm>
> > +       <primary><varname>max_wal_prefetch_distance</varname> configuration parameter</primary>
> > +      </indexterm>
> > +      </term>
> > +      <listitem>
> > +       <para>
> > +        The maximum distance to look ahead in the WAL during recovery, to find
> > +        blocks to prefetch.  Prefetching blocks that will soon be needed can
> > +        reduce I/O wait times.  The number of concurrent prefetches is limited
> > +        by this setting as well as <xref linkend="guc-maintenance-io-concurrency"/>.
> > +        If this value is specified without units, it is taken as bytes.
> > +        The default is -1, meaning that WAL prefetching is disabled.
> > +       </para>
> > +      </listitem>
> > +     </varlistentry>
>
> Is it worth noting that a too large distance could hurt, because the
> buffers might get evicted again?
OK, I tried to explain that.

> > +     <varlistentry id="guc-wal-prefetch-fpw" xreflabel="wal_prefetch_fpw">
> > +      <term><varname>wal_prefetch_fpw</varname> (<type>boolean</type>)
> > +      <indexterm>
> > +       <primary><varname>wal_prefetch_fpw</varname> configuration parameter</primary>
> > +      </indexterm>
> > +      </term>
> > +      <listitem>
> > +       <para>
> > +        Whether to prefetch blocks with full page images during recovery.
> > +        Usually this doesn't help, since such blocks will not be read.  However,
> > +        on file systems with a block size larger than
> > +        <productname>PostgreSQL</productname>'s, prefetching can avoid a costly
> > +        read-before-write when a blocks are later written.
> > +        This setting has no effect unless
> > +        <xref linkend="guc-max-wal-prefetch-distance"/> is set to a positive number.
> > +        The default is off.
> > +       </para>
> > +      </listitem>
> > +     </varlistentry>
>
> Hm. I think this needs more details - it's not clear enough what this
> actually controls. I assume it's about prefetching for WAL records that
> contain the FPW, but it also could be read to be about not prefetching
> any pages that had FPWs before, or such?
Ok, I have elaborated.

> >       </variablelist>
> >       </sect2>
> >       <sect2 id="runtime-config-wal-archiving">
> > diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
> > index 987580d6df..df4291092b 100644
> > --- a/doc/src/sgml/monitoring.sgml
> > +++ b/doc/src/sgml/monitoring.sgml
> > @@ -320,6 +320,13 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
> >        </entry>
> >       </row>
> >
> > +     <row>
> > +      <entry><structname>pg_stat_wal_prefetcher</structname><indexterm><primary>pg_stat_wal_prefetcher</primary></indexterm></entry>
> > +      <entry>Only one row, showing statistics about blocks prefetched during recovery.
> > +       See <xref linkend="pg-stat-wal-prefetcher-view"/> for details.
> > +      </entry>
> > +     </row>
> > +
>
> 'prefetcher' somehow sounds odd to me. I also suspect that we'll want to
> have additional prefetching stat tables going forward. Perhaps
> 'pg_stat_prefetch_wal'?
Works for me, though while thinking about this I realised that the
"WAL" part was bothering me.  It sounds like we're prefetching WAL
itself, which would be a different thing.  So I renamed this view to
pg_stat_prefetch_recovery.

Then I renamed the main GUCs that control this thing to:

  max_recovery_prefetch_distance
  recovery_prefetch_fpw

> > +    <row>
> > +     <entry><structfield>distance</structfield></entry>
> > +     <entry><type>integer</type></entry>
> > +     <entry>How far ahead of recovery the WAL prefetcher is currently reading, in bytes</entry>
> > +    </row>
> > +    <row>
> > +     <entry><structfield>queue_depth</structfield></entry>
> > +     <entry><type>integer</type></entry>
> > +     <entry>How many prefetches have been initiated but are not yet known to have completed</entry>
> > +    </row>
> > +   </tbody>
> > +   </tgroup>
> > +  </table>
>
> Is there a way we could have a "historical" version of at least some of
> these? An average queue depth, or such?
Ok, I added simple online averages for distance and queue depth that
take a sample every time recovery advances by 256kB.

> It'd be useful to somewhere track the time spent initiating prefetch
> requests. Otherwise it's quite hard to evaluate whether the queue is too
> deep (and just blocks in the OS).

I agree that that sounds useful, and I thought about various ways to
do that that involved new views, until I eventually found myself
wondering: why isn't recovery's I/O already tracked via the existing
stats views?  For example, why can't I see blks_read, blks_hit,
blk_read_time etc moving in pg_stat_database due to recovery activity?

It seems like if you made that work first, or created a new
pgstatio view for that, then you could add prefetching counters and
timing (if track_io_timing is on) to the existing machinery so that
bufmgr.c would automatically capture it, and then not only recovery
but also stuff like bitmap heap scan could also be measured the same
way.

However, time is short, so I'm not attempting to do anything like that
now.  You can measure the posix_fadvise() times with OS facilities in
the meantime.

> I think it'd be good to have a 'reset time' column.

Done, as stats_reset following other examples.

> > +  <para>
> > +   The <structname>pg_stat_wal_prefetcher</structname> view will contain only
> > +   one row.  It is filled with nulls if recovery is not running or WAL
> > +   prefetching is not enabled.  See <xref linkend="guc-max-wal-prefetch-distance"/>
> > +   for more information.  The counters in this view are reset whenever the
> > +   <xref linkend="guc-max-wal-prefetch-distance"/>,
> > +   <xref linkend="guc-wal-prefetch-fpw"/> or
> > +   <xref linkend="guc-maintenance-io-concurrency"/> setting is changed and
> > +   the server configuration is reloaded.
> > +  </para>
> > +
>
> So pg_stat_reset_shared() cannot be used? If so, why?
Hmm.  OK, I made pg_stat_reset_shared('prefetch_recovery') work.

> It sounds like the counters aren't persisted via the stats system - if
> so, why?

Ok, I made it persist the simple counters by sending them to the stats
collector periodically.  The view still shows data straight out of
shmem though, not out of the stats file.  Now I'm wondering if I
should have the view show it from the stats file, more like other
things, now that I understand that a bit better...  hmm.

> > @@ -7105,6 +7114,31 @@ StartupXLOG(void)
> >                               /* Handle interrupt signals of startup process */
> >                               HandleStartupProcInterrupts();
> >
> > +                             /*
> > +                              * The first time through, or if any relevant settings or the
> > +                              * WAL source changes, we'll restart the prefetching machinery
> > +                              * as appropriate.  This is simpler than trying to handle
> > +                              * various complicated state changes.
> > +                              */
> > +                             if (unlikely(reset_wal_prefetcher))
> > +                             {
> > +                                     /* If we had one already, destroy it. */
> > +                                     if (prefetcher)
> > +                                     {
> > +                                             XLogPrefetcherFree(prefetcher);
> > +                                             prefetcher = NULL;
> > +                                     }
> > +                                     /* If we want one, create it. */
> > +                                     if (max_wal_prefetch_distance > 0)
> > +                                                     prefetcher = XLogPrefetcherAllocate(xlogreader->ReadRecPtr,
> > +                                                                                                                             currentSource == XLOG_FROM_STREAM);
> > +                                     reset_wal_prefetcher = false;
> > +                             }
>
> Do we really need all of this code in StartupXLOG() itself? Could it be
> in HandleStartupProcInterrupts() or at least a helper routine called
> here?
It's now done differently, so that StartupXLOG() only has three new
lines: XLogPrefetchBegin() before the loop, XLogPrefetch() in the
loop, and XLogPrefetchEnd() after the loop.

> > +                             /* Peform WAL prefetching, if enabled. */
> > +                             if (prefetcher)
> > +                                     XLogPrefetcherReadAhead(prefetcher, xlogreader->ReadRecPtr);
> > +
> >                               /*
> >                                * Pause WAL replay, if requested by a hot-standby session via
> >                                * SetRecoveryPause().
>
> Personally, I'd rather have the if () be in
> XLogPrefetcherReadAhead(). With an inline wrapper doing the check, if
> the call bothers you (but I don't think it needs to).
Done.

> > +/*-------------------------------------------------------------------------
> > + *
> > + * xlogprefetcher.c
> > + *           Prefetching support for PostgreSQL write-ahead log manager
> > + *
>
> An architectural overview here would be good.

OK, added.

> > +struct XLogPrefetcher
> > +{
> > +     /* Reader and current reading state. */
> > +     XLogReaderState *reader;
> > +     XLogReadLocalOptions options;
> > +     bool                    have_record;
> > +     bool                    shutdown;
> > +     int                             next_block_id;
> > +
> > +     /* Book-keeping required to avoid accessing non-existing blocks. */
> > +     HTAB               *filter_table;
> > +     dlist_head              filter_queue;
> > +
> > +     /* Book-keeping required to limit concurrent prefetches. */
> > +     XLogRecPtr         *prefetch_queue;
> > +     int                             prefetch_queue_size;
> > +     int                             prefetch_head;
> > +     int                             prefetch_tail;
> > +
> > +     /* Details of last prefetch to skip repeats and seq scans. */
> > +     SMgrRelation    last_reln;
> > +     RelFileNode             last_rnode;
> > +     BlockNumber             last_blkno;
>
> Do you have a comment somewhere explaining why you want to avoid
> seqscans (I assume it's about avoiding regressions in linux, but only
> because I recall chatting with you about it).
I've added a note to the new architectural comments.

> > +/*
> > + * On modern systems this is really just *counter++.  On some older systems
> > + * there might be more to it, due to inability to read and write 64 bit values
> > + * atomically.  The counters will only be written to by one process, and there
> > + * is no ordering requirement, so there's no point in using higher overhead
> > + * pg_atomic_fetch_add_u64().
> > + */
> > +static inline void inc_counter(pg_atomic_uint64 *counter)
> > +{
> > +     pg_atomic_write_u64(counter, pg_atomic_read_u64(counter) + 1);
> > +}
>
> Could be worthwhile to add to the atomics infrastructure itself - on the
> platforms where this needs spinlocks this will lead to two acquisitions,
> rather than one.
Ok, I added pg_atomic_unlocked_add_fetch_XXX().  (Could also be
"fetch_add", I don't care, I don't use the result).

> > +/*
> > + * Create a prefetcher that is ready to begin prefetching blocks referenced by
> > + * WAL that is ahead of the given lsn.
> > + */
> > +XLogPrefetcher *
> > +XLogPrefetcherAllocate(XLogRecPtr lsn, bool streaming)
> > +{
> > +     static HASHCTL hash_table_ctl = {
> > +             .keysize = sizeof(RelFileNode),
> > +             .entrysize = sizeof(XLogPrefetcherFilter)
> > +     };
> > +     XLogPrefetcher *prefetcher = palloc0(sizeof(*prefetcher));
> > +
> > +     prefetcher->options.nowait = true;
> > +     if (streaming)
> > +     {
> > +             /*
> > +              * We're only allowed to read as far as the WAL receiver has written.
> > +              * We don't have to wait for it to be flushed, though, as recovery
> > +              * does, so that gives us a chance to get a bit further ahead.
> > +              */
> > +             prefetcher->options.read_upto_policy = XLRO_WALRCV_WRITTEN;
> > +     }
> > +     else
> > +     {
> > +             /* We're allowed to read as far as we can. */
> > +             prefetcher->options.read_upto_policy = XLRO_LSN;
> > +             prefetcher->options.lsn = (XLogRecPtr) -1;
> > +     }
> > +     prefetcher->reader = XLogReaderAllocate(wal_segment_size,
> > +                                                                                     NULL,
> > +                                                                                     read_local_xlog_page,
> > +                                                                                     &prefetcher->options);
> > +     prefetcher->filter_table = hash_create("PrefetchFilterTable", 1024,
> > +                                                                                &hash_table_ctl,
> > +                                                                                HASH_ELEM | HASH_BLOBS);
> > +     dlist_init(&prefetcher->filter_queue);
> > +
> > +     /*
> > +      * The size of the queue is based on the maintenance_io_concurrency
> > +      * setting.  In theory we might have a separate queue for each tablespace,
> > +      * but it's not clear how that should work, so for now we'll just use the
> > +      * general GUC to rate-limit all prefetching.
> > +      */
> > +     prefetcher->prefetch_queue_size = maintenance_io_concurrency;
> > +     prefetcher->prefetch_queue = palloc0(sizeof(XLogRecPtr) * prefetcher->prefetch_queue_size);
> > +     prefetcher->prefetch_head = prefetcher->prefetch_tail = 0;
> > +
> > +     /* Prepare to read at the given LSN. */
> > +     ereport(LOG,
> > +                     (errmsg("WAL prefetch started at %X/%X",
> > +                                     (uint32) (lsn >> 32), (uint32) lsn)));
> > +     XLogBeginRead(prefetcher->reader, lsn);
> > +
> > +     XLogPrefetcherResetMonitoringStats();
> > +
> > +     return prefetcher;
> > +}
> > +
> > +/*
> > + * Destroy a prefetcher and release all resources.
> > + */
> > +void
> > +XLogPrefetcherFree(XLogPrefetcher *prefetcher)
> > +{
> > +     double          avg_distance = 0;
> > +     double          avg_queue_depth = 0;
> > +
> > +     /* Log final statistics. */
> > +     if (prefetcher->samples > 0)
> > +     {
> > +             avg_distance = prefetcher->distance_sum / prefetcher->samples;
> > +             avg_queue_depth = prefetcher->queue_depth_sum / prefetcher->samples;
> > +     }
> > +     ereport(LOG,
> > +                     (errmsg("WAL prefetch finished at %X/%X; "
> > +                                     "prefetch = " UINT64_FORMAT ", "
> > +                                     "skip_hit = " UINT64_FORMAT ", "
> > +                                     "skip_new = " UINT64_FORMAT ", "
> > +                                     "skip_fpw = " UINT64_FORMAT ", "
> > +                                     "skip_seq = " UINT64_FORMAT ", "
> > +                                     "avg_distance = %f, "
> > +                                     "avg_queue_depth = %f",
> > +                      (uint32) (prefetcher->reader->EndRecPtr >> 32),
> > +                      (uint32) (prefetcher->reader->EndRecPtr),
> > +                      pg_atomic_read_u64(&MonitoringStats->prefetch),
> > +                      pg_atomic_read_u64(&MonitoringStats->skip_hit),
> > +                      pg_atomic_read_u64(&MonitoringStats->skip_new),
> > +                      pg_atomic_read_u64(&MonitoringStats->skip_fpw),
> > +                      pg_atomic_read_u64(&MonitoringStats->skip_seq),
> > +                      avg_distance,
> > +                      avg_queue_depth)));
> > +     XLogReaderFree(prefetcher->reader);
> > +     hash_destroy(prefetcher->filter_table);
> > +     pfree(prefetcher->prefetch_queue);
> > +     pfree(prefetcher);
> > +
> > +     XLogPrefetcherResetMonitoringStats();
> > +}
>
> It's possibly overkill, but I think it'd be a good idea to do all the
> allocations within a prefetch specific memory context. That makes
> detecting potential leaks or such easier.
I looked into that, but in fact it's already pretty clear how much
memory this thing is using, if you call
MemoryContextStats(TopMemoryContext), because it's almost all in a
named hash table:

TopMemoryContext: 155776 total in 6 blocks; 18552 free (8 chunks); 137224 used
  XLogPrefetcherFilterTable: 16384 total in 2 blocks; 4520 free (3
chunks); 11864 used
  SP-GiST temporary context: 8192 total in 1 blocks; 7928 free (0
chunks); 264 used
  GiST temporary context: 8192 total in 1 blocks; 7928 free (0 chunks); 264 used
  GIN recovery temporary context: 8192 total in 1 blocks; 7928 free (0
chunks); 264 used
  Btree recovery temporary context: 8192 total in 1 blocks; 7928 free
(0 chunks); 264 used
  RecoveryLockLists: 8192 total in 1 blocks; 2584 free (0 chunks); 5608 used
  PrivateRefCount: 8192 total in 1 blocks; 2584 free (0 chunks); 5608 used
  MdSmgr: 8192 total in 1 blocks; 7928 free (0 chunks); 264 used
  Pending ops context: 8192 total in 1 blocks; 7928 free (0 chunks); 264 used
  LOCALLOCK hash: 8192 total in 1 blocks; 512 free (0 chunks); 7680 used
  Timezones: 104128 total in 2 blocks; 2584 free (0 chunks); 101544 used
  ErrorContext: 8192 total in 1 blocks; 7928 free (4 chunks); 264 used
Grand total: 358208 bytes in 20 blocks; 86832 free (15 chunks); 271376 used

The XLogPrefetcher struct itself is not measured separately, but I
don't think that's a problem: it's small and there's only ever one at
a time.  It's the XLogPrefetcherFilterTable that is of variable size
(though it's often empty).  While thinking about this, I made
prefetch_queue into a flexible array rather than a pointer to palloc'd
memory, which seemed a bit tidier.

> > +     /* Can we drop any filters yet, due to problem records begin replayed? */
>
> Odd grammar.

Rewritten.

> > +     XLogPrefetcherCompleteFilters(prefetcher, replaying_lsn);
>
> Hm, why isn't this part of the loop below?

It only needs to run when replaying_lsn has advanced (ie when records
have been replayed).  I hope the new comment makes that clearer.

> > +     /* Main prefetch loop. */
> > +     for (;;)
> > +     {
>
> This kind of looks like a separate process' main loop. The name
> indicates similar. And there's no architecture documentation
> disinclining one from that view...

OK, I have updated the comment.

> The loop body is quite long. I think it should be split into a number of
> helper functions. Perhaps one to ensure a block is read, one to maintain
> stats, and then one to process block references?

I've broken the function up.  It's now:

StartupXLOG()
 -> XLogPrefetch()
     -> XLogPrefetcherReadAhead()
         -> XLogPrefetcherScanRecords()
             -> XLogPrefetcherScanBlocks()

> > +             /*
> > +              * Scan the record for block references.  We might already have been
> > +              * partway through processing this record when we hit maximum I/O
> > +              * concurrency, so start where we left off.
> > +              */
> > +             for (int i = prefetcher->next_block_id; i <= reader->max_block_id; ++i)
> > +             {
>
> Super pointless nitpickery: For a loop-body this big I'd rather name 'i'
> 'blockid' or such.
Done.

v6-0001-Allow-PrefetchBuffer-to-be-called-with-a-SMgrRela.patch (9K) Download Attachment
v6-0002-Rename-GetWalRcvWriteRecPtr-to-GetWalRcvFlushRecP.patch (14K) Download Attachment
v6-0003-Add-GetWalRcvWriteRecPtr-new-definition.patch (5K) Download Attachment
v6-0004-Add-pg_atomic_unlocked_add_fetch_XXX.patch (4K) Download Attachment
v6-0005-Allow-PrefetchBuffer-to-report-what-happened.patch (16K) Download Attachment
v6-0006-Add-ReadBufferPrefetched-POC-only.patch (4K) Download Attachment
v6-0007-Allow-XLogReadRecord-to-be-non-blocking.patch (13K) Download Attachment
v6-0008-Prefetch-referenced-blocks-during-recovery.patch (82K) Download Attachment

Re: WIP: WAL prefetch (another approach)

Thomas Munro-5
On Wed, Apr 8, 2020 at 4:24 AM Thomas Munro <[hidden email]> wrote:
> Thanks for all that feedback.  It's been a strange couple of weeks,
> but I finally have a new version that addresses most of that feedback
> (but punts on a couple of suggestions for later development, due to
> lack of time).

Here's an executive summary of an off-list chat with Andres:

* he withdrew his objection to the new definition of
GetWalRcvWriteRecPtr() based on my argument that any external code
will fail to compile anyway

* he doesn't like the naive code that detects sequential access and
skips prefetching; I agreed to rip it out for now and revisit if/when
we have better evidence that that's worth bothering with; the code
path that does that and the pg_stat_recovery_prefetch.skip_seq counter
will remain, but be used only to skip prefetching of repeated access
to the *same* block for now

* he gave some feedback on the read_local_xlog_page() modifications: I
probably need to reconsider the change to logical.c that passes NULL
instead of cxt to the read_page callback; and the switch statement in
read_local_xlog_page() probably should have a case for the preexisting
mode

* he +1s the plan to commit with the feature enabled, and revisit before release

* he thinks the idea of a variant of ReadBuffer() that takes a
PrefetchBufferResult (as sketched by the v6 0006 patch) broadly makes
sense as a stepping stone towards his asynchronous I/O proposal, but
there's no point in committing something like 0006 without a user

I'm going to go and commit the first few patches in this series, and
come back in a bit with a new version of the main patch to fix the
above and a compiler warning reported by cfbot.



Re: WIP: WAL prefetch (another approach)

Thomas Munro-5
On Wed, Apr 8, 2020 at 12:52 PM Thomas Munro <[hidden email]> wrote:
> * he gave some feedback on the read_local_xlog_page() modifications: I
> probably need to reconsider the change to logical.c that passes NULL
> instead of cxt to the read_page callback; and the switch statement in
> read_local_xlog_page() probably should have a case for the preexisting
> mode

So... logical.c wants to give its LogicalDecodingContext to any
XLogPageReadCB you give it, via "private_data"; that is, it really
only accepts XLogPageReadCB implementations that understand that (or
ignore it).  What I want to do is give every XLogPageReadCB the chance
to have its own state that it is in control of (to receive settings
specific to the implementation, or whatever), that you supply along
with it.  We can't do both kinds of things with private_data, so I
have added a second member read_page_data to XLogReaderState.  If you
pass in read_local_xlog_page as read_page, then you can optionally
install a pointer to XLogReadLocalOptions as reader->read_page_data,
to activate the new behaviours I added for prefetching purposes.

While working on that, I realised the readahead XLogReader was
breaking a rule expressed in XLogReadDetermineTimeLine().  Timelines
are really confusing and there were probably several subtle or
not-so-subtle bugs there.  So I added an option to skip all of that logic,
and just say "I command you to read only from TLI X".  It reads the
same TLI as recovery is reading, until it hits the end of readable
data and that causes prefetching to shut down.  Then the main recovery
loop resets the prefetching module when it sees a TLI switch, so then
it starts up again.  This seems to work reliably, but I've obviously
had limited time to test.  Does this scheme sound sane?

I think this is basically committable (though of course I wish I had
more time to test and review).  Ugh.  Feature freeze in half an hour.

v7-0001-Rationalize-GetWalRcv-Write-Flush-RecPtr.patch (16K) Download Attachment
v7-0002-Add-pg_atomic_unlocked_add_fetch_XXX.patch (4K) Download Attachment
v7-0003-Allow-XLogReadRecord-to-be-non-blocking.patch (19K) Download Attachment
v7-0004-Prefetch-referenced-blocks-during-recovery.patch (83K) Download Attachment

Re: WIP: WAL prefetch (another approach)

Thomas Munro-5
On Wed, Apr 8, 2020 at 11:27 PM Thomas Munro <[hidden email]> wrote:

> On Wed, Apr 8, 2020 at 12:52 PM Thomas Munro <[hidden email]> wrote:
> > * he gave some feedback on the read_local_xlog_page() modifications: I
> > probably need to reconsider the change to logical.c that passes NULL
> > instead of cxt to the read_page callback; and the switch statement in
> > read_local_xlog_page() probably should have a case for the preexisting
> > mode
>
> So... logical.c wants to give its LogicalDecodingContext to any
> XLogPageReadCB you give it, via "private_data"; that is, it really
> only accepts XLogPageReadCB implementations that understand that (or
> ignore it).  What I want to do is give every XLogPageReadCB the chance
> to have its own state that it is control of (to receive settings
> specific to the implementation, or whatever), that you supply along
> with it.  We can't do both kinds of things with private_data, so I
> have added a second member read_page_data to XLogReaderState.  If you
> pass in read_local_xlog_page as read_page, then you can optionally
> install a pointer to XLogReadLocalOptions as reader->read_page_data,
> to activate the new behaviours I added for prefetching purposes.
>
> While working on that, I realised the readahead XLogReader was
> breaking a rule expressed in XLogReadDetermineTimeLine().  Timelines
> are really confusing and there were probably several subtle or not to
> subtle bugs there.  So I added an option to skip all of that logic,
> and just say "I command you to read only from TLI X".  It reads the
> same TLI as recovery is reading, until it hits the end of readable
> data and that causes prefetching to shut down.  Then the main recovery
> loop resets the prefetching module when it sees a TLI switch, so then
> it starts up again.  This seems to work reliably, but I've obviously
> had limited time to test.  Does this scheme sound sane?
>
> I think this is basically committable (though of course I wish I had
> more time to test and review).  Ugh.  Feature freeze in half an hour.

Ok, so the following parts of this work have been committed:

b09ff536:  Simplify the effective_io_concurrency setting.
fc34b0d9:  Introduce a maintenance_io_concurrency setting.
3985b600:  Support PrefetchBuffer() in recovery.
d140f2f3:  Rationalize GetWalRcv{Write,Flush}RecPtr().

However, I didn't want to push the main patch into the tree at
(literally) the last minute after doing so much work on it in the
last few days, without more review from recovery code experts and some
independent testing.  Judging by the comments made in this thread and
elsewhere, I think the feature is in demand so I hope there is a way
we could get it into 13 in the next couple of days, but I totally
accept the release management team's prerogative on that.



Re: WIP: WAL prefetch (another approach)

David Steele
On 4/8/20 8:12 AM, Thomas Munro wrote:

>
> Ok, so the following parts of this work have been committed:
>
> b09ff536:  Simplify the effective_io_concurrency setting.
> fc34b0d9:  Introduce a maintenance_io_concurrency setting.
> 3985b600:  Support PrefetchBuffer() in recovery.
> d140f2f3:  Rationalize GetWalRcv{Write,Flush}RecPtr().
>
> However, I didn't want to push the main patch into the tree at
> (literally) the last minute after doing such much work on it in the
> last few days, without more review from recovery code experts and some
> independent testing.  

I definitely think that was the right call.

> Judging by the comments made in this thread and
> elsewhere, I think the feature is in demand so I hope there is a way
> we could get it into 13 in the next couple of days, but I totally
> accept the release management team's prerogative on that.

That's up to the RMT, of course, but we did already have an extra week.
Might be best to just get this in at the beginning of the PG14 cycle.
FWIW, I do think the feature is really valuable.

Looks like you'll need to rebase, so I'll move this to the next CF in
WoA state.

Regards,
--
-David
[hidden email]



Re: WIP: WAL prefetch (another approach)

Thomas Munro-5
On Thu, Apr 9, 2020 at 12:27 AM David Steele <[hidden email]> wrote:

> On 4/8/20 8:12 AM, Thomas Munro wrote:
> > Judging by the comments made in this thread and
> > elsewhere, I think the feature is in demand so I hope there is a way
> > we could get it into 13 in the next couple of days, but I totally
> > accept the release management team's prerogative on that.
>
> That's up to the RMT, of course, but we did already have an extra week.
> Might be best to just get this in at the beginning of the PG14 cycle.
> FWIW, I do think the feature is really valuable.
>
> Looks like you'll need to rebase, so I'll move this to the next CF in
> WoA state.
Thanks.  Here's a rebase.

v8-0001-Add-pg_atomic_unlocked_add_fetch_XXX.patch (4K) Download Attachment
v8-0002-Allow-XLogReadRecord-to-be-non-blocking.patch (19K) Download Attachment
v8-0003-Prefetch-referenced-blocks-during-recovery.patch (83K) Download Attachment

Re: WIP: WAL prefetch (another approach)

Dmitry Dolgov
> On Thu, Apr 09, 2020 at 09:55:25AM +1200, Thomas Munro wrote:
> Thanks.  Here's a rebase.

Thanks for working on this patch, it seems like a great feature. I'm
probably a bit late to the party, but still want to make a couple of
comments.

The patch indeed looks good, I couldn't find any significant issues so
far and almost all my questions I had while reading it were actually
answered in this thread. I'm still busy with benchmarking, mostly to see
how prefetching would work with different workload distributions and how
much the kernel will actually prefetch.

In the meantime I have a few questions:

> On Wed, Feb 12, 2020 at 07:52:42PM +1300, Thomas Munro wrote:
> > On Fri, Jan 3, 2020 at 7:10 AM Tomas Vondra
> > <[hidden email]> wrote:
> > > Could we instead specify the number of blocks to prefetch? We'd probably
> > > need to track additional details needed to determine number of blocks to
> > > prefetch (essentially LSN for all prefetch requests).
>
> Here is a new WIP version of the patch set that does that.  Changes:
>
> 1.  It now uses effective_io_concurrency to control how many
> concurrent prefetches to allow.  It's possible that we should have a
> different GUC to control "maintenance" users of concurrency I/O as
> discussed elsewhere[1], but I'm staying out of that for now; if we
> agree to do that for VACUUM etc, we can change it easily here.  Note
> that the value is percolated through the ComputeIoConcurrency()
> function which I think we should discuss, but again that's off topic,
> I just want to use the standard infrastructure here.

This totally makes sense. I believe the question "how much to prefetch"
eventually depends equally on the type of workload (which correlates with
how far ahead in the WAL to read) and how many resources are available
for prefetching (which correlates with queue depth). But in the
documentation it looks like maintenance_io_concurrency is just an
"unimportant" option, and I'm almost sure it will be overlooked by many
readers:

    The maximum distance to look ahead in the WAL during recovery, to find
    blocks to prefetch.  Prefetching blocks that will soon be needed can
    reduce I/O wait times.  The number of concurrent prefetches is limited
    by this setting as well as
    <xref linkend="guc-maintenance-io-concurrency"/>.  Setting it too high
    might be counterproductive, if it means that data falls out of the
    kernel cache before it is needed.  If this value is specified without
    units, it is taken as bytes.  A setting of -1 disables prefetching
    during recovery.

Maybe it also makes sense to emphasize that maintenance_io_concurrency
directly affects resource consumption and is a "primary control"?
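For illustration, the pair of settings under discussion might look like this in postgresql.conf (GUC names as used in this thread; they may change before commit):

```ini
# How far ahead to look in the WAL for blocks to prefetch; -1 disables.
recovery_prefetch_distance = '256kB'

# Caps the number of concurrent prefetches, and so the resources consumed.
maintenance_io_concurrency = 10
```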

> On Wed, Mar 18, 2020 at 06:18:44PM +1300, Thomas Munro wrote:
>
> Here's a new version that changes that part just a bit more, after a
> brief chat with Andres about his async I/O plans.  It seems clear that
> returning an enum isn't very extensible, so I decided to try making
> PrefetchBufferResult a struct whose contents can be extended in the
> future.  In this patch set it's still just used to distinguish 3 cases
> (hit, miss, no file), but it's now expressed as a buffer and a flag to
> indicate whether I/O was initiated.  You could imagine that the second
> thing might be replaced by a pointer to an async I/O handle you can
> wait on or some other magical thing from the future.

I like the idea of an extensible PrefetchBufferResult. Just one comment:
if I understand correctly, the way it is used together with
prefetch_queue assumes one IO operation at a time. This limits potential
extension of the underlying code; e.g., if one implemented some sort of
buffering of requests, submitting an iovec to a syscall, then
prefetch_queue would no longer correctly represent in-flight IO. Also,
taking into account that "we don't have any awareness of when I/O really
completes", maybe in the future it makes sense to reconsider having the
queue in the prefetcher itself and instead ask for this information from
the underlying code?

> On Wed, Apr 08, 2020 at 04:24:21AM +1200, Thomas Munro wrote:
> > Is there a way we could have a "historical" version of at least some of
> > these? An average queue depth, or such?
>
> Ok, I added simple online averages for distance and queue depth that
> take a sample every time recovery advances by 256kB.

Maybe it was discussed in the past in other threads, but if I understand
correctly, this implementation weights all the samples equally. Since at
the moment it depends directly on replaying speed (with a lot of IO
involved), couldn't a single outlier at the beginning skew this value
and make it less useful? Does it make sense to decay old values?



Re: WIP: WAL prefetch (another approach)

Thomas Munro-5
On Sun, Apr 19, 2020 at 11:46 PM Dmitry Dolgov <[hidden email]> wrote:
> Thanks for working on this patch, it seems like a great feature. I'm
> probably a bit late to the party, but still want to make couple of
> commentaries.

Hi Dmitry,

Thanks for your feedback and your interest in this work!

> The patch indeed looks good, I couldn't find any significant issues so
> far and almost all my questions I had while reading it were actually
> answered in this thread. I'm still busy with benchmarking, mostly to see
> how prefetching would work with different workload distributions and how
> much the kernel will actually prefetch.

Cool.

One report I heard recently said that if you get rid of I/O stalls,
pread() becomes cheap enough that the much higher frequency lseek()
calls I've complained about elsewhere[1] become the main thing
recovery is doing, at least on some systems, but I haven't pieced
together the conditions required yet.  I'd be interested to know if
you see that.

> In the meantime I have a few questions:
>
> > 1.  It now uses effective_io_concurrency to control how many
> > concurrent prefetches to allow.  It's possible that we should have a
> > different GUC to control "maintenance" users of concurrency I/O as
> > discussed elsewhere[1], but I'm staying out of that for now; if we
> > agree to do that for VACUUM etc, we can change it easily here.  Note
> > that the value is percolated through the ComputeIoConcurrency()
> > function which I think we should discuss, but again that's off topic,
> > I just want to use the standard infrastructure here.
>
> This totally makes sense, I believe the question "how much to prefetch"
> eventually depends equally on a type of workload (correlates with how
> far in WAL to read) and how much resources are available for prefetching
> (correlates with queue depth). But in the documentation it looks like
> maintenance-io-concurrency is just an "unimportant" option, and I'm
> almost sure will be overlooked by many readers:
>
>     The maximum distance to look ahead in the WAL during recovery, to find
>     blocks to prefetch.  Prefetching blocks that will soon be needed can
>     reduce I/O wait times.  The number of concurrent prefetches is limited
>     by this setting as well as
>     <xref linkend="guc-maintenance-io-concurrency"/>.  Setting it too high
>     might be counterproductive, if it means that data falls out of the
>     kernel cache before it is needed.  If this value is specified without
>     units, it is taken as bytes.  A setting of -1 disables prefetching
>     during recovery.
>
> Maybe it makes also sense to emphasize that maintenance-io-concurrency
> directly affects resource consumption and it's a "primary control"?

You're right.  I will add something in the next version to emphasise that.

> > On Wed, Mar 18, 2020 at 06:18:44PM +1300, Thomas Munro wrote:
> >
> > Here's a new version that changes that part just a bit more, after a
> > brief chat with Andres about his async I/O plans.  It seems clear that
> > returning an enum isn't very extensible, so I decided to try making
> > PrefetchBufferResult a struct whose contents can be extended in the
> > future.  In this patch set it's still just used to distinguish 3 cases
> > (hit, miss, no file), but it's now expressed as a buffer and a flag to
> > indicate whether I/O was initiated.  You could imagine that the second
> > thing might be replaced by a pointer to an async I/O handle you can
> > wait on or some other magical thing from the future.
>
> I like the idea of extensible PrefetchBufferResult. Just one commentary,
> if I understand correctly the way how it is being used together with
> prefetch_queue assumes one IO operation at a time. This limits potential
> extension of the underlying code, e.g. one can't implement some sort of
> buffering of requests and submitting an iovec to a sycall, then
> prefetch_queue will no longer correctly represent inflight IO. Also,
> taking into account that "we don't have any awareness of when I/O really
> completes", maybe in the future it makes to reconsider having queue in
> the prefetcher itself and rather ask for this information from the
> underlying code?

Yeah, you're right that it'd be good to be able to do some kind of
batching up of these requests to reduce system calls.  Of course
posix_fadvise() doesn't support that, but clearly in the AIO future[2]
it would indeed make sense to buffer up a few of these and then make a
single call to io_uring_enter() on Linux[3] or lio_listio() on a
hypothetical POSIX AIO implementation[4].  (I'm not sure if there is a
thing like that on Windows; at a glance, ReadFileScatter() is
asynchronous ("overlapped") but works only on a single handle so it's
like a hypothetical POSIX aio_readv(), not like POSIX lio_listio()).

Perhaps there could be an extra call PrefetchBufferSubmit() that you'd
call at appropriate times, but you obviously can't call it too
infrequently.

As for how to make the prefetch queue a reusable component, rather
than having a custom thing like that for each part of our system that
wants to support prefetching: that's a really good question.  I didn't
see how to do it, but maybe I didn't try hard enough.  I looked at the
three users I'm aware of, namely this patch, a btree prefetching patch
I haven't shared yet, and the existing bitmap heap scan code, and they
all needed to have their own custom book keeping for this, and I
couldn't figure out how to share more infrastructure.  In the case of
this patch, you currently need to do LSN based book keeping to
simulate "completion", and that doesn't make sense for other users.
Maybe it'll become clearer when we have support for completion
notification?

Some related questions are why all these parts of our system that know
how to prefetch are allowed to do so independently without any kind of
shared accounting, and why we don't give each tablespace (= our model
of a device?) its own separate queue.  I think it's OK to put these
questions off a bit longer until we have more infrastructure and
experience.  Our current non-answer is at least consistent with our
lack of an approach to system-wide memory and CPU accounting...  I
personally think that a better XLogReader that can be used for
prefetching AND recovery would be a higher priority than that.

> > On Wed, Apr 08, 2020 at 04:24:21AM +1200, Thomas Munro wrote:
> > > Is there a way we could have a "historical" version of at least some of
> > > these? An average queue depth, or such?
> >
> > Ok, I added simple online averages for distance and queue depth that
> > take a sample every time recovery advances by 256kB.
>
> Maybe it was discussed in the past in other threads. But if I understand
> correctly, this implementation weights all the samples. Since at the
> moment it depends directly on replaying speed (so a lot of IO involved),
> couldn't it lead to a single outlier at the beginning skewing this value
> and make it less useful? Does it make sense to decay old values?

Hmm.

I wondered about reporting one or perhaps three exponential moving
averages (like Unix 1/5/15 minute load averages), but I didn't propose
it because: (1) in crash recovery, you can't query it, you just get
the log message at the end, and an unweighted mean seems OK in that case,
no? (you are not more interested in the I/O saturation at the end of
recovery compared to the start of recovery, are you?), and (2) on a
streaming replica, if you want to sample the instantaneous depth and
compute an exponential moving average or some more exotic statistical
concoction in your monitoring tool, you're free to do so.  I suppose
(2) is an argument for removing the existing average completely from
the stat view; I put it in there at Andres's suggestion, but I'm not
sure I really believe in it.  Where is our average replication lag,
and why don't we compute the stddev of X, Y or Z?  I think we should
provide primary measurements and let people compute derived statistics
from those.

I suppose the reason for this request was the analogy with Linux
iostat -x's "aqu-sz", which is the primary way that people understand
device queue depth on that OS.  This number is actually computed by
iostat, not the kernel, so by analogy I could argue that a
hypothetical pg_iostat program could compute that for you from raw
ingredients.  AFAIK iostat computes the *unweighted* average queue
depth during the time between output lines, by observing changes in
the "aveq" ("the sum of how long all requests have spent in flight, in
milliseconds") and "use" ("how many milliseconds there has been at
least one IO in flight") fields of /proc/diskstats.  But it's OK that
it's unweighted, because it computes a new value for every line it
output (ie every 5 seconds or whatever you asked for).  It's not too
clear how to do something like that here, but all suggestions are
welcome.

Or maybe we'll have something more general that makes this more
specific thing irrelevant, in future AIO infrastructure work.

On a more superficial note, one thing I don't like about the last
version of the patch is the difference in the ordering of the words in
the GUC recovery_prefetch_distance and the view
pg_stat_prefetch_recovery.  Hrmph.

[1] https://www.postgresql.org/message-id/CA%2BhUKG%2BNPZeEdLXAcNr%2Bw0YOZVb0Un0_MwTBpgmmVDh7No2jbg%40mail.gmail.com
[2] https://anarazel.de/talks/2020-01-31-fosdem-aio/aio.pdf
[3] https://kernel.dk/io_uring.pdf
[4] https://pubs.opengroup.org/onlinepubs/009695399/functions/lio_listio.html



Re: WIP: WAL prefetch (another approach)

Dmitry Dolgov
> On Tue, Apr 21, 2020 at 05:26:52PM +1200, Thomas Munro wrote:
>
> One report I heard recently said that if  you get rid of I/O stalls,
> pread() becomes cheap enough that the much higher frequency lseek()
> calls I've complained about elsewhere[1] become the main thing
> recovery is doing, at least on some systems, but I haven't pieced
> together the conditions required yet.  I'd be interested to know if
> you see that.

At the moment I've performed a couple of tests for replication in the
case when almost everything is in memory (mostly by mistake; I was
expecting that a postgres replica within a badly memory-limited cgroup
would cause more IO, but it looks like the kernel does not evict pages
anyway). Not sure if that's what you mean by getting rid of IO stalls,
but in these tests profiling shows lseek & pread appearing in a similar
number of samples.

If I understand correctly, one can eventually measure prefetching's
influence by looking at the execution time of the various redo
functions (assuming the data they operate on is already prefetched,
they should be faster). I still have to clarify the exact reason, but
even in the situation described above (in memory) there is some
visible difference, e.g.

    # with prefetch
    Function = b'heap2_redo' [8064]
     nsecs               : count     distribution
      4096 -> 8191       : 1213     |                                        |
      8192 -> 16383      : 66639    |****************************************|
     16384 -> 32767      : 27846    |****************                        |
     32768 -> 65535      : 873      |                                        |

    # without prefetch
    Function = b'heap2_redo' [17980]
     nsecs               : count     distribution
      4096 -> 8191       : 1        |                                        |
      8192 -> 16383      : 66997    |****************************************|
     16384 -> 32767      : 30966    |******************                      |
     32768 -> 65535      : 1602     |                                        |

    # with prefetch
    Function = b'btree_redo' [8064]
     nsecs               : count     distribution
      2048 -> 4095       : 0        |                                        |
      4096 -> 8191       : 246      |****************************************|
      8192 -> 16383      : 5        |                                        |
     16384 -> 32767      : 2        |                                        |

    # without prefetch
    Function = b'btree_redo' [17980]
     nsecs               : count     distribution
      2048 -> 4095       : 0        |                                        |
      4096 -> 8191       : 82       |********************                    |
      8192 -> 16383      : 19       |****                                    |
     16384 -> 32767      : 160      |****************************************|

Of course this doesn't take into account the time we spend doing extra
syscalls for prefetching, but it can still give some interesting
information.
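As a back-of-the-envelope sanity check on the histograms above, one can
approximate the mean latency from the bucket counts (midpoint-weighted;
this is just arithmetic on the numbers already shown, not any tool's
output):

```python
def approx_mean_ns(buckets):
    """Approximate mean latency from a power-of-two histogram like the
    ones above, weighting each (low, high, count) bucket by its midpoint."""
    total = sum(count for _, _, count in buckets)
    return sum((lo + hi) / 2 * count for lo, hi, count in buckets) / total

# heap2_redo buckets (low_ns, high_ns, count) transcribed from above
with_prefetch = [(4096, 8191, 1213), (8192, 16383, 66639),
                 (16384, 32767, 27846), (32768, 65535, 873)]
without_prefetch = [(4096, 8191, 1), (8192, 16383, 66997),
                    (16384, 32767, 30966), (32768, 65535, 1602)]
```

This puts heap2_redo at roughly 16.1 us mean with prefetch versus
roughly 16.7 us without, consistent with the shift visible in the
distributions.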

> > I like the idea of extensible PrefetchBufferResult. Just one commentary,
> > if I understand correctly the way how it is being used together with
> > prefetch_queue assumes one IO operation at a time. This limits potential
> > extension of the underlying code, e.g. one can't implement some sort of
> > buffering of requests and submitting an iovec to a syscall, then
> > prefetch_queue will no longer correctly represent inflight IO. Also,
> > taking into account that "we don't have any awareness of when I/O really
> > completes", maybe in the future it makes sense to reconsider having the
> > queue in the prefetcher itself and rather ask for this information from
> > the underlying code?
>
> Yeah, you're right that it'd be good to be able to do some kind of
> batching up of these requests to reduce system calls.  Of course
> posix_fadvise() doesn't support that, but clearly in the AIO future[2]
> it would indeed make sense to buffer up a few of these and then make a
> single call to io_uring_enter() on Linux[3] or lio_listio() on a
> hypothetical POSIX AIO implementation[4].  (I'm not sure if there is a
> thing like that on Windows; at a glance, ReadFileScatter() is
> asynchronous ("overlapped") but works only on a single handle so it's
> like a hypothetical POSIX aio_readv(), not like POSIX lio_listio()).
>
> Perhaps there could be an extra call PrefetchBufferSubmit() that you'd
> call at appropriate times, but you obviously can't call it too
> infrequently.
>
> As for how to make the prefetch queue a reusable component, rather
> than having a custom thing like that for each part of our system that
> wants to support prefetching: that's a really good question.  I didn't
> see how to do it, but maybe I didn't try hard enough.  I looked at the
> three users I'm aware of, namely this patch, a btree prefetching patch
> I haven't shared yet, and the existing bitmap heap scan code, and they
> all needed to have their own custom book keeping for this, and I
> couldn't figure out how to share more infrastructure.  In the case of
> this patch, you currently need to do LSN based book keeping to
> simulate "completion", and that doesn't make sense for other users.
> Maybe it'll become clearer when we have support for completion
> notification?

Yes, definitely.
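For what it's worth, the deferred-submit idea Thomas describes could
look something like the following (a hypothetical sketch only: the
class and method names are mine, and since posix_fadvise() has no
vectored form, "submit" degenerates to a loop of syscalls here, whereas
an AIO backend could turn the whole list into one submission):

```python
import os

class PrefetchBatch:
    """Hypothetical sketch: queue block references, then submit all the
    POSIX_FADV_WILLNEED hints at once, roughly the way a
    PrefetchBufferSubmit() call might hand a batch to io_uring_enter()
    or lio_listio()."""
    def __init__(self, fd, block_size=8192):
        self.fd = fd
        self.block_size = block_size
        self.pending = []

    def prefetch(self, blkno):
        # Cheap: just remember the block, no syscall yet.
        self.pending.append(blkno)

    def submit(self):
        # With posix_fadvise this is one syscall per block; the point of
        # the interface is that a future backend could batch them.
        for blkno in self.pending:
            os.posix_fadvise(self.fd, blkno * self.block_size,
                             self.block_size, os.POSIX_FADV_WILLNEED)
        self.pending.clear()
```

(os.posix_fadvise is only available on platforms that provide the
underlying syscall, e.g. Linux.)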

> Some related questions are why all these parts of our system that know
> how to prefetch are allowed to do so independently without any kind of
> shared accounting, and why we don't give each tablespace (= our model
> of a device?) its own separate queue.  I think it's OK to put these
> questions off a bit longer until we have more infrastructure and
> experience.  Our current non-answer is at least consistent with our
> lack of an approach to system-wide memory and CPU accounting...  I
> personally think that a better XLogReader that can be used for
> prefetching AND recovery would be a higher priority than that.

Sure, this patch is quite valuable as it is, and those questions I've
mentioned are targeting mostly future development.

> > Maybe it was discussed in the past in other threads. But if I understand
> > correctly, this implementation weights all the samples. Since at the
> > moment it depends directly on replaying speed (so a lot of IO involved),
> > couldn't it lead to a single outlier at the beginning skewing this value
> > and make it less useful? Does it make sense to decay old values?
>
> Hmm.
>
> I wondered about a reporting one or perhaps three exponential moving
> averages (like Unix 1/5/15 minute load averages), but I didn't propose
> it because: (1) In crash recovery, you can't query it, you just get
> the log message at the end, and mean unweighted seems OK in that case,
> no? (you are not more interested in the I/O saturation at the end of
> the recovery compared to the start of recovery are you?), and (2) on a
> streaming replica, if you want to sample the instantaneous depth and
> compute an exponential moving average or some more exotic statistical
> concoction in your monitoring tool, you're free to do so.  I suppose
> (2) is an argument for removing the existing average completely from
> the stat view; I put it in there at Andres's suggestion, but I'm not
> sure I really believe in it.  Where is our average replication lag,
> and why don't we compute the stddev of X, Y or Z?  I think we should
> provide primary measurements and let people compute derived statistics
> from those.

For once I disagree, since I believe this very approach, widely
applied, leads to a slightly chaotic situation with monitoring. But of
course you're right, it has nothing to do with the patch itself. I
would also be in favour of removing the existing averages, unless
Andres has more arguments to keep them.
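For reference, the 1/5/15-style exponentially decayed average Thomas
mentions would be something like this sketch (the period and sampling
constants are illustrative, not anything in the patch):

```python
import math

def decayed_average(samples, period_s=60.0, interval_s=5.0):
    """Exponentially decayed average in the style of the Unix load
    averages: each sample, taken every interval_s seconds, is blended
    in with weight 1 - exp(-interval_s / period_s), so an early outlier
    fades away instead of skewing the mean forever."""
    alpha = 1.0 - math.exp(-interval_s / period_s)
    avg = 0.0
    for x in samples:
        avg += alpha * (x - avg)
    return avg
```

A single large sample at the start of recovery thus contributes almost
nothing to the value reported at the end, unlike with an unweighted
mean over the whole run.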


Re: WIP: WAL prefetch (another approach)

Dmitry Dolgov
> On Sat, Apr 25, 2020 at 09:19:35PM +0200, Dmitry Dolgov wrote:
> > On Tue, Apr 21, 2020 at 05:26:52PM +1200, Thomas Munro wrote:
> >
> > One report I heard recently said that if you get rid of I/O stalls,
> > pread() becomes cheap enough that the much higher frequency lseek()
> > calls I've complained about elsewhere[1] become the main thing
> > recovery is doing, at least on some systems, but I haven't pieced
> > together the conditions required yet.  I'd be interested to know if
> > you see that.
>
> At the moment I've performed a couple of tests of replication in the
> case when almost everything is in memory (mostly by mistake: I was
> expecting that a postgres replica within a badly memory-limited cgroup
> would cause more IO, but it looks like the kernel does not evict pages
> anyway). Not sure if that's what you mean by getting rid of IO stalls,
> but in these tests profiling shows lseek & pread appearing in a
> similar number of samples.
>
> If I understand correctly, one can eventually measure prefetching's
> influence by looking at the execution time of the various redo
> functions (assuming the data they operate on is already prefetched,
> they should be faster). I still have to clarify the exact reason, but
> even in the situation described above (in memory) there is some
> visible difference, e.g.

I've finally performed a couple of tests involving more IO: a
not-that-big dataset of 1.5 GB for the replica, with enough memory to
fit ~1/6 of it, default prefetching parameters, and an update workload
with uniform distribution. Rather a small setup, but it causes stable
reading into the page cache on the replica and makes the patch's
influence visible (more measurement samples tend to happen at lower
latencies):

    # with patch
    Function = b'heap_redo' [206]
     nsecs               : count     distribution
      1024 -> 2047       : 0        |                                        |
      2048 -> 4095       : 32833    |**********************                  |
      4096 -> 8191       : 59476    |****************************************|
      8192 -> 16383      : 18617    |************                            |
     16384 -> 32767      : 3992     |**                                      |
     32768 -> 65535      : 425      |                                        |
     65536 -> 131071     : 5        |                                        |
    131072 -> 262143     : 326      |                                        |
    262144 -> 524287     : 6        |                                        |

    # without patch
    Function = b'heap_redo' [130]
     nsecs               : count     distribution
      1024 -> 2047       : 0        |                                        |
      2048 -> 4095       : 20062    |***********                             |
      4096 -> 8191       : 70662    |****************************************|
      8192 -> 16383      : 12895    |*******                                 |
     16384 -> 32767      : 9123     |*****                                   |
     32768 -> 65535      : 560      |                                        |
     65536 -> 131071     : 1        |                                        |
    131072 -> 262143     : 460      |                                        |
    262144 -> 524287     : 3        |                                        |

Not that there were any doubts, but at the same time it was surprising
to me how well linux readahead works in this situation. The results
above are shown with readahead disabled for the filesystem and device;
without that there was almost no difference, since a lot of IO was
avoided by readahead (which in fact accounted for the majority of all
reads):

    # with patch
    flags = Read
         usecs               : count     distribution
            16 -> 31         : 0        |                                        |
            32 -> 63         : 1        |********                                |
            64 -> 127        : 5        |****************************************|

    flags = ReadAhead-Read
         usecs               : count     distribution
            32 -> 63         : 0        |                                        |
            64 -> 127        : 131      |****************************************|
           128 -> 255        : 12       |***                                     |
           256 -> 511        : 6        |*                                       |

    # without patch
    flags = Read
         usecs               : count     distribution
            16 -> 31         : 0        |                                        |
            32 -> 63         : 0        |                                        |
            64 -> 127        : 4        |****************************************|

    flags = ReadAhead-Read
         usecs               : count     distribution
            32 -> 63         : 0        |                                        |
            64 -> 127        : 143      |****************************************|
           128 -> 255        : 20       |*****                                   |

The number of reads in this case was similar with and without the
patch, which means the difference couldn't be attributed to a
situation where a page was read too early, evicted, and read again
later.

