WAL prefetch


WAL prefetch

konstantin knizhnik
There was a very interesting presentation at PGCon about pg_prefaulter:

http://www.pgcon.org/2018/schedule/events/1204.en.html

But it is implemented in Go and relies on pg_waldump.
I tried to do the same, but using the built-in Postgres WAL traversal functions.
I have implemented it as an extension for simplicity of integration.
In principle it could be started as a background worker.

First of all, I tried to estimate the effect of preloading data.
I have implemented a prefetch utility, which is also attached to this mail
(a rough sketch of its approach follows the command list below).
It performs random reads of blocks of some large file and spawns some
number of prefetch threads:

Just a normal read without prefetch:
./prefetch -n 0 SOME_BIG_FILE

One prefetch thread which uses pread:
./prefetch SOME_BIG_FILE

One prefetch thread which uses posix_fadvise:
./prefetch -f SOME_BIG_FILE

Four prefetch threads which use posix_fadvise:
./prefetch -f -n 4 SOME_BIG_FILE
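
For reference, here is a rough sketch of what the utility does (the attached
prefetch.c is the authoritative version; the block size, read count, lead
distance, pseudo-random sequence and the minimal error handling below are
simplifications assumed for illustration). The main thread pread()s a long
sequence of pseudo-random blocks, while the prefetch threads replay the same
sequence a little ahead of it, either pread()ing each block or calling
posix_fadvise(WILLNEED):

#define _GNU_SOURCE
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/stat.h>

#define BLCKSZ  8192
#define NREADS  100000
#define LEAD    128                     /* how far ahead prefetchers may run */

static int  fd;
static long nblocks;
static int  use_fadvise;                /* -f: posix_fadvise instead of pread */
static volatile long reader_pos;        /* blocks already read by main thread */

static long
block_at(long i)                        /* deterministic pseudo-random block */
{
    return (long) (((unsigned long) i * 2654435761UL) % (unsigned long) nblocks);
}

static void *
prefetcher(void *arg)
{
    char buf[BLCKSZ];
    long i;

    for (i = 0; i < NREADS; i++)
    {
        while (i > reader_pos + LEAD)   /* stay only a little ahead */
            usleep(100);
        if (use_fadvise)
            posix_fadvise(fd, (off_t) block_at(i) * BLCKSZ, BLCKSZ,
                          POSIX_FADV_WILLNEED);
        else if (pread(fd, buf, BLCKSZ, (off_t) block_at(i) * BLCKSZ) < 0)
            perror("pread");
    }
    return NULL;
}

int
main(int argc, char **argv)
{
    int         nthreads = 1;           /* -n N: number of prefetch threads */
    int         i;
    char        buf[BLCKSZ];
    struct stat st;
    pthread_t   threads[64];

    for (i = 1; i < argc - 1; i++)
    {
        if (strcmp(argv[i], "-f") == 0)
            use_fadvise = 1;
        else if (strcmp(argv[i], "-n") == 0)
            nthreads = atoi(argv[++i]);
    }
    fd = open(argv[argc - 1], O_RDONLY);
    fstat(fd, &st);
    nblocks = st.st_size / BLCKSZ;

    for (i = 0; i < nthreads && i < 64; i++)
        pthread_create(&threads[i], NULL, prefetcher, NULL);

    for (i = 0; i < NREADS; i++)        /* the measured reader */
    {
        if (pread(fd, buf, BLCKSZ, (off_t) block_at(i) * BLCKSZ) < 0)
            perror("pread");
        reader_pos = i;
    }
    return 0;
}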

Based on these experiments (on my desktop), I drew the following conclusions:

1. Prefetch on HDD doesn't give any positive effect.
2. Using posix_fadvise speeds up random reads on SSD by up to 2 times.
3. posix_fadvise(WILLNEED) is more efficient than performing normal reads.
4. Calling posix_fadvise from more than one thread makes no sense.

I have tested wal_prefetch on two powerful servers with 24 cores, a 3Tb
NVMe RAID 10 storage device and 256Gb of RAM each, connected using InfiniBand.
The speed of synchronous replication between the two nodes increased from
56k TPS to 60k TPS (on pgbench with scale 1000).

Usage:
1. At the master: create extension wal_prefetch
2. At the replica: call the pg_wal_prefetch() function; it will not return
until you interrupt it.

The pg_wal_prefetch function will traverse the WAL indefinitely and prefetch
the block references in WAL records
using the posix_fadvise(WILLNEED) system call.

It is possible to explicitly specify a start LSN for the pg_wal_prefetch()
function; otherwise, the WAL redo position will be used as the start LSN.
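
For anyone who wants the gist without opening the tarball, the core of the
extension can be sketched roughly as follows. This is only an illustration of
the approach, not the actual code: the names wal_prefetch_loop and
prefetch_record_blocks are made up, and relation segment files larger than
1GB, fd caching, error handling and end-of-WAL handling are all ignored. It
walks the WAL with the built-in xlogreader (using read_local_xlog_page) and
issues posix_fadvise(WILLNEED) for every block reference it decodes:

#include "postgres.h"

#include <fcntl.h>
#include <unistd.h>

#include "access/xlog.h"
#include "access/xlog_internal.h"
#include "access/xlogreader.h"
#include "access/xlogutils.h"
#include "common/relpath.h"
#include "miscadmin.h"

/* Issue posix_fadvise(WILLNEED) for every block referenced by one record. */
static void
prefetch_record_blocks(XLogReaderState *record)
{
    int         block_id;

    for (block_id = 0; block_id <= XLogRecMaxBlockId(record); block_id++)
    {
        RelFileNode rnode;
        ForkNumber  forknum;
        BlockNumber blkno;
        char       *path;
        int         fd;

        if (!XLogRecGetBlockTag(record, block_id, &rnode, &forknum, &blkno))
            continue;           /* no block reference with this id */

        /* path is relative to the data directory; segmentation ignored */
        path = relpathperm(rnode, forknum);
        fd = open(path, O_RDONLY | PG_BINARY);
        if (fd >= 0)
        {
            posix_fadvise(fd, (off_t) blkno * BLCKSZ, BLCKSZ,
                          POSIX_FADV_WILLNEED);
            close(fd);
        }
        pfree(path);
    }
}

/* Walk the WAL from start_lsn (or the current replay position) and prefetch. */
void
wal_prefetch_loop(XLogRecPtr start_lsn)
{
    XLogReaderState *reader;
    XLogRecord *record;
    char       *errormsg;
    TimeLineID  tli;

    if (XLogRecPtrIsInvalid(start_lsn))
        start_lsn = GetXLogReplayRecPtr(&tli);

    reader = XLogReaderAllocate(wal_segment_size, &read_local_xlog_page, NULL);

    record = XLogReadRecord(reader, start_lsn, &errormsg);
    while (record != NULL)
    {
        prefetch_record_blocks(reader);
        CHECK_FOR_INTERRUPTS();
        /* InvalidXLogRecPtr means "continue from the previous record" */
        record = XLogReadRecord(reader, InvalidXLogRecPtr, &errormsg);
    }

    XLogReaderFree(reader);
}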


--

Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Attachment: prefetch.c (2K)
Attachment: wal_prefetch.tgz (9K)

Re: WAL prefetch

akapila
On Wed, Jun 13, 2018 at 6:39 PM, Konstantin Knizhnik
<[hidden email]> wrote:
> There was very interesting presentation at pgconf about pg_prefaulter:
>
> http://www.pgcon.org/2018/schedule/events/1204.en.html
>
> But it is implemented in GO and using pg_waldump.
> I tried to do the same but using built-on Postgres WAL traverse functions.
> I have implemented it as extension for simplicity of integration.
> In principle it can be started as BG worker.
>

Right, or in other words, it could do something like autoprewarm [1],
which would allow a more user-friendly interface for this utility if we
decide to include it.

> First of all I tried to estimate effect of preloading data.
> I have implemented prefetch utility with is also attached to this mail.
> It performs random reads of blocks of some large file and spawns some number
> of prefetch threads:
>
> Just normal read without prefetch:
> ./prefetch -n 0 SOME_BIG_FILE
>
> One prefetch thread which uses pread:
> ./prefetch SOME_BIG_FILE
>
> One prefetch thread which uses posix_fadvise:
> ./prefetch -f SOME_BIG_FILE
>
> 4 prefetch thread which uses posix_fadvise:
> ./prefetch -f -n 4 SOME_BIG_FILE
>
> Based on this experiments (on my desktop), I made the following conclusions:
>
> 1. Prefetch at HDD doesn't give any positive effect.
> 2. Using posix_fadvise allows to speed-up random read speed at SSD up to 2
> times.
> 3. posix_fadvise(WILLNEED) is more efficient than performing normal reads.
> 4. Calling posix_fadvise in more than one thread has no sense.
>
> I have tested wal_prefetch at two powerful servers with 24 cores, 3Tb NVME
> RAID 10 storage device and 256Gb of RAM connected using InfiniBand.
> The speed of synchronous replication between two nodes is increased from 56k
> TPS to 60k TPS (on pgbench with scale 1000).
>

That's a reasonable improvement.

> Usage:
> 1. At master: create extension wal_prefetch
> 2. At replica: Call pg_wal_prefetch() function: it will not return until you
> interrupt it.
>

I think it is not a very user-friendly interface, but the idea sounds
good to me; it can help some other workloads.  I think this can help
in recovery as well.


[1] - https://www.postgresql.org/docs/devel/static/pgprewarm.html

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: WAL prefetch

Thomas Munro-3
In reply to this post by konstantin knizhnik
On Thu, Jun 14, 2018 at 1:09 AM, Konstantin Knizhnik
<[hidden email]> wrote:
> pg_wal_prefetch function will infinitely traverse WAL and prefetch block
> references in WAL records
> using posix_fadvise(WILLNEED) system call.

Hi Konstantin,

Why stop at the page cache...  what about shared buffers?

--
Thomas Munro
http://www.enterprisedb.com


Re: WAL prefetch

konstantin knizhnik


On 14.06.2018 09:52, Thomas Munro wrote:
> On Thu, Jun 14, 2018 at 1:09 AM, Konstantin Knizhnik
> <[hidden email]> wrote:
>> pg_wal_prefetch function will infinitely traverse WAL and prefetch block
>> references in WAL records
>> using posix_fadvise(WILLNEED) system call.
> Hi Konstantin,
>
> Why stop at the page cache...  what about shared buffers?
>

It is a good question. I thought a lot about prefetching directly into
shared buffers.
But the current reality with Postgres is that allocating too much
memory for shared buffers is not recommended,
for many different reasons: degradation of the clock replacement
algorithm, "write storms", ...

If your system has 1Tb of memory, almost no PostgreSQL administrator
will recommend using all of that 1Tb for shared buffers.
Moreover, there are recommendations to choose the shared buffers size based
on the size of the internal cache of the persistent storage device
(so that it is possible to flush changes without doing writes to
physical media). So on such a system with 1Tb of RAM, the size of shared
buffers will most likely be set to a few hundred gigabytes.

Also, PostgreSQL does not currently support dynamically changing the shared
buffers size. Without that, the only way of using Postgres in clouds and
other multiuser systems, where the system load is not fully controlled by the
user, is to choose a relatively small shared buffers size and rely on OS
caching.

Yes, access to a shared buffer is about two times faster than reading data
from the file system cache.
But that is still better than the situation where shared buffers are swapped
out and the effect of large shared buffers becomes negative.

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: WAL prefetch

Robert Haas
In reply to this post by akapila
On Wed, Jun 13, 2018 at 11:45 PM, Amit Kapila <[hidden email]> wrote:
>> I have tested wal_prefetch at two powerful servers with 24 cores, 3Tb NVME
>> RAID 10 storage device and 256Gb of RAM connected using InfiniBand.
>> The speed of synchronous replication between two nodes is increased from 56k
>> TPS to 60k TPS (on pgbench with scale 1000).
>
> That's a reasonable improvement.

Somehow I would have expected more.  That's only a 7% speedup.

I am also surprised that HDD didn't show any improvement.  Since HDDs
are bad at random I/O, I would have expected prefetching to help more
in that case.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: WAL prefetch

akapila
On Thu, Jun 14, 2018 at 6:14 PM, Robert Haas <[hidden email]> wrote:

> On Wed, Jun 13, 2018 at 11:45 PM, Amit Kapila <[hidden email]> wrote:
>>> I have tested wal_prefetch at two powerful servers with 24 cores, 3Tb NVME
>>> RAID 10 storage device and 256Gb of RAM connected using InfiniBand.
>>> The speed of synchronous replication between two nodes is increased from 56k
>>> TPS to 60k TPS (on pgbench with scale 1000).
>>
>> That's a reasonable improvement.
>
> Somehow I would have expected more.  That's only a 7% speedup.
>

It might be that the synchronous mode of replication already has such a big
overhead that it didn't show a big speedup.  We might want to try recovery
(PITR) or maybe async replication to see if we get any better numbers.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: WAL prefetch

konstantin knizhnik
In reply to this post by Robert Haas


On 14.06.2018 15:44, Robert Haas wrote:
> On Wed, Jun 13, 2018 at 11:45 PM, Amit Kapila <[hidden email]> wrote:
>>> I have tested wal_prefetch at two powerful servers with 24 cores, 3Tb NVME
>>> RAID 10 storage device and 256Gb of RAM connected using InfiniBand.
>>> The speed of synchronous replication between two nodes is increased from 56k
>>> TPS to 60k TPS (on pgbench with scale 1000).
>> That's a reasonable improvement.
> Somehow I would have expected more.  That's only a 7% speedup.
>
> I am also surprised that HDD didn't show any improvement.

Maybe pgbench is not the best use case for prefetch. It updates more
or less random pages, and if the database is large enough and
full_page_writes is true (the default value),
then most pages will be updated only once since the last checkpoint and most
of the updates will be represented in the WAL by full-page records.
Such records do not require reading any data from disk.
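
To make that concrete: a prefetcher can simply skip block references that
carry a full-page image, along these lines (an illustrative helper I am
assuming for the discussion, not code from the patch):

#include "postgres.h"
#include "access/xlogreader.h"

/*
 * Decide whether a block reference in a decoded WAL record is worth
 * prefetching.  Blocks carrying a full-page image are skipped, because redo
 * overwrites them without reading the old page from disk.
 */
static bool
block_needs_prefetch(XLogReaderState *record, int block_id)
{
    RelFileNode rnode;
    ForkNumber  forknum;
    BlockNumber blkno;

    /* not a valid block reference in this record */
    if (!XLogRecGetBlockTag(record, block_id, &rnode, &forknum, &blkno))
        return false;

    /* full-page image: redo restores the page wholesale, no read needed */
    if (XLogRecHasBlockImage(record, block_id))
        return false;

    return true;
}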

>   Since HDD's
> are bad at random I/O, I would have expected prefetching to help more
> in that case.
>
The speed of random HDD access is limited by the speed of disk head movement.
By running several IO requests in parallel we just increase the amount
of head movement, so parallel access to an HDD may actually decrease
IO speed rather than increase it.
In theory, given several concurrent IO requests, the driver can execute them
in an optimal order, trying to minimize head movement. But if the accesses
really are to random pages,
then the probability that we can win something by such optimization is very
small.


--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: WAL prefetch

Robert Haas
On Thu, Jun 14, 2018 at 9:23 AM, Konstantin Knizhnik
<[hidden email]> wrote:
> Speed of random HDD access is limited by speed of disk head movement.
> By running several IO requests in parallel we just increase probability of
> head movement, so actually parallel access to HDD may even decrease IO speed
> rather than increase it.
> In theory, given several concurrent IO requests, driver can execute them in
> optimal order, trying to minimize head movement. But if there are really
> access to random pages,
> then probability that we can win something by such optimization is very
> small.

You might be right, but I feel like I've heard previous reports of
significant speedups from prefetching on HDDs.  Perhaps I am
mis-remembering.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: WAL prefetch

konstantin knizhnik


On 14.06.2018 16:25, Robert Haas wrote:

> On Thu, Jun 14, 2018 at 9:23 AM, Konstantin Knizhnik
> <[hidden email]> wrote:
>> Speed of random HDD access is limited by speed of disk head movement.
>> By running several IO requests in parallel we just increase probability of
>> head movement, so actually parallel access to HDD may even decrease IO speed
>> rather than increase it.
>> In theory, given several concurrent IO requests, driver can execute them in
>> optimal order, trying to minimize head movement. But if there are really
>> access to random pages,
>> then probability that we can win something by such optimization is very
>> small.
> You might be right, but I feel like I've heard previous reports of
> significant speedups from prefetching on HDDs.  Perhaps I am
> mis-remembering.
>

It is true for RAID arrays of HDDs, which can really win by issuing parallel
IO operations.

But there are so many different factors that I would not be surprised
by any result :)

The last problem I observed with an NVMe device on one of our
customers' systems was a huge performance degradation (more than 10 times:
from 500Mb/sec to 50Mb/sec write speed)
after space exhaustion on the device. There is a 3Tb NVMe RAID device with
a 1.5Gb database. ext4 was mounted without the "discard" option.
After an incorrect execution of rsync, space was exhausted. Then I removed
all data and copied the database from the master node.
Then I observed huge lags in async replication between master and
replica: the wal_receiver was saving received data too slowly, with a write
speed of about ~50Mb/sec vs. 0.5Gb/sec at the master.
All my attempts to use fstrim or e4defrag didn't help. The problem was
solved only after deleting all database files, performing fstrim and
copying the database once again.
After that the wal_receiver writes data at the normal speed of ~0.5Gb/sec and
there is no lag between master and replica.


--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: WAL prefetch

Stephen Frost
In reply to this post by konstantin knizhnik
Greetings,

* Konstantin Knizhnik ([hidden email]) wrote:
> There was very interesting presentation at pgconf about pg_prefaulter:
>
> http://www.pgcon.org/2018/schedule/events/1204.en.html

I agree and I've chatted a bit w/ Sean further about it.

> But it is implemented in GO and using pg_waldump.

Yeah, that's not too good if we want it in core.

> I tried to do the same but using built-on Postgres WAL traverse functions.
> I have implemented it as extension for simplicity of integration.
> In principle it can be started as BG worker.

I don't think this needs to be, or should be, an extension..  If this is
worthwhile (and it certainly appears to be) then we should just do it in
core.

> First of all I tried to estimate effect of preloading data.
> I have implemented prefetch utility with is also attached to this mail.
> It performs random reads of blocks of some large file and spawns some number
> of prefetch threads:
>
> Just normal read without prefetch:
> ./prefetch -n 0 SOME_BIG_FILE
>
> One prefetch thread which uses pread:
> ./prefetch SOME_BIG_FILE
>
> One prefetch thread which uses posix_fadvise:
> ./prefetch -f SOME_BIG_FILE
>
> 4 prefetch thread which uses posix_fadvise:
> ./prefetch -f -n 4 SOME_BIG_FILE
>
> Based on this experiments (on my desktop), I made the following conclusions:
>
> 1. Prefetch at HDD doesn't give any positive effect.
> 2. Using posix_fadvise allows to speed-up random read speed at SSD up to 2
> times.
> 3. posix_fadvise(WILLNEED) is more efficient than performing normal reads.
> 4. Calling posix_fadvise in more than one thread has no sense.
Ok.

> I have tested wal_prefetch at two powerful servers with 24 cores, 3Tb NVME
> RAID 10 storage device and 256Gb of RAM connected using InfiniBand.
> The speed of synchronous replication between two nodes is increased from 56k
> TPS to 60k TPS (on pgbench with scale 1000).

I'm also surprised that it wasn't a larger improvement.

Seems like it would make sense to implement in core using
posix_fadvise(), perhaps in the wal receiver and in RestoreArchivedFile
or nearby..  At least, that's the thinking I had when I was chatting w/
Sean.

Thanks!

Stephen


Re: WAL prefetch

akapila
On Fri, Jun 15, 2018 at 12:16 AM, Stephen Frost <[hidden email]> wrote:

>
>> I have tested wal_prefetch at two powerful servers with 24 cores, 3Tb NVME
>> RAID 10 storage device and 256Gb of RAM connected using InfiniBand.
>> The speed of synchronous replication between two nodes is increased from 56k
>> TPS to 60k TPS (on pgbench with scale 1000).
>
> I'm also surprised that it wasn't a larger improvement.
>
> Seems like it would make sense to implement in core using
> posix_fadvise(), perhaps in the wal receiver and in RestoreArchivedFile
> or nearby..  At least, that's the thinking I had when I was chatting w/
> Sean.
>

Doing it in core certainly has some advantages, such as being able to easily
reuse the existing xlog code rather than making a copy as is currently
done in the patch, but I think it also depends on whether this is
really a win in a number of common cases or just a win in some
limited cases.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: WAL prefetch

konstantin knizhnik


On 15.06.2018 07:36, Amit Kapila wrote:

> On Fri, Jun 15, 2018 at 12:16 AM, Stephen Frost <[hidden email]> wrote:
>>> I have tested wal_prefetch at two powerful servers with 24 cores, 3Tb NVME
>>> RAID 10 storage device and 256Gb of RAM connected using InfiniBand.
>>> The speed of synchronous replication between two nodes is increased from 56k
>>> TPS to 60k TPS (on pgbench with scale 1000).
>> I'm also surprised that it wasn't a larger improvement.
>>
>> Seems like it would make sense to implement in core using
>> posix_fadvise(), perhaps in the wal receiver and in RestoreArchivedFile
>> or nearby..  At least, that's the thinking I had when I was chatting w/
>> Sean.
>>
> Doing in-core certainly has some advantage such as it can easily reuse
> the existing xlog code rather trying to make a copy as is currently
> done in the patch, but I think it also depends on whether this is
> really a win in a number of common cases or is it just a win in some
> limited cases.
>
I completely agree. That was my main concern: for which use cases will this
prefetch be efficient?
If "full_page_writes" is on (which is the safe and default value), then the
first update of a page since the last checkpoint will be written to the WAL
as a full page, and applying it will not require reading any data from disk.
If the page is updated multiple times in subsequent transactions, then
most likely it will still be present in the OS file cache, unless the
checkpoint interval exceeds the OS cache size (the amount of free memory in
the system). So if these conditions are satisfied, it looks like prefetch is
not needed.
And that seems to be true for most real configurations: the checkpoint
interval is rarely set larger than a hundred gigabytes and modern
servers usually have more RAM.

But once this condition is not satisfied and the lag is larger than the size
of the OS cache, prefetch can be inefficient, because prefetched pages
may be thrown out of the OS cache before they are actually accessed by the
redo process. In that case extra synchronization between the prefetch and
replay processes is needed, so that prefetch does not move too far ahead
of the replayed LSN.
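
A very simple form of that synchronization could look like the sketch below
(just an illustration of the idea, not code from the extension; max_lead and
the polling interval are arbitrary, and a real implementation would probably
wait on a latch rather than sleep):

#include "postgres.h"

#include "access/xlog.h"
#include "miscadmin.h"

/*
 * Before prefetching the record at prefetch_lsn, wait until replay has come
 * within max_lead bytes of it, so prefetched pages are not evicted from the
 * OS cache before the redo process touches them.
 */
static void
throttle_prefetch(XLogRecPtr prefetch_lsn, uint64 max_lead)
{
    TimeLineID  tli;

    while (prefetch_lsn > GetXLogReplayRecPtr(&tli) + max_lead)
    {
        pg_usleep(10000L);      /* 10ms */
        CHECK_FOR_INTERRUPTS();
    }
}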

It is not a problem to integrate this code into Postgres core and run it
in a background worker. I do not think that performing the prefetch in the
WAL receiver process itself is a good idea: it may slow down the speed of
receiving changes from the master. And in that case I really could throw away
the cut-and-pasted code. But it is easier to experiment with an extension
than with a patch to Postgres core,
and I have published this extension to make it possible to perform
experiments and check whether it is useful on real workloads.


--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: WAL prefetch

akapila
On Fri, Jun 15, 2018 at 1:08 PM, Konstantin Knizhnik
<[hidden email]> wrote:

>
>
> On 15.06.2018 07:36, Amit Kapila wrote:
>>
>> On Fri, Jun 15, 2018 at 12:16 AM, Stephen Frost <[hidden email]>
>> wrote:
>>>>
>>>> I have tested wal_prefetch at two powerful servers with 24 cores, 3Tb
>>>> NVME
>>>> RAID 10 storage device and 256Gb of RAM connected using InfiniBand.
>>>> The speed of synchronous replication between two nodes is increased from
>>>> 56k
>>>> TPS to 60k TPS (on pgbench with scale 1000).
>>>
>>> I'm also surprised that it wasn't a larger improvement.
>>>
>>> Seems like it would make sense to implement in core using
>>> posix_fadvise(), perhaps in the wal receiver and in RestoreArchivedFile
>>> or nearby..  At least, that's the thinking I had when I was chatting w/
>>> Sean.
>>>
>> Doing in-core certainly has some advantage such as it can easily reuse
>> the existing xlog code rather trying to make a copy as is currently
>> done in the patch, but I think it also depends on whether this is
>> really a win in a number of common cases or is it just a win in some
>> limited cases.
>>
> I am completely agree. It was my mail concern: on which use cases this
> prefetch will be efficient.
> If "full_page_writes" is on (and it is safe and default value), then first
> update of a page since last checkpoint will be written in WAL as full page
> and applying it will not require reading any data from disk.
>

What exactly do you mean by the above?  AFAIU, it needs to read WAL to apply
a full-page image.  See the code below:

XLogReadBufferForRedoExtended()
{
..
    /* If it has a full-page image and it should be restored, do it. */
    if (XLogRecBlockImageApply(record, block_id))
    {
        Assert(XLogRecHasBlockImage(record, block_id));
        *buf = XLogReadBufferExtended(rnode, forknum, blkno,
                                      get_cleanup_lock ? RBM_ZERO_AND_CLEANUP_LOCK : RBM_ZERO_AND_LOCK);
        page = BufferGetPage(*buf);
        if (!RestoreBlockImage(record, block_id, page))
..
}


--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: WAL prefetch

konstantin knizhnik


On 15.06.2018 18:03, Amit Kapila wrote:

> On Fri, Jun 15, 2018 at 1:08 PM, Konstantin Knizhnik
> <[hidden email]> wrote:
>>
>> On 15.06.2018 07:36, Amit Kapila wrote:
>>> On Fri, Jun 15, 2018 at 12:16 AM, Stephen Frost <[hidden email]>
>>> wrote:
>>>>> I have tested wal_prefetch at two powerful servers with 24 cores, 3Tb
>>>>> NVME
>>>>> RAID 10 storage device and 256Gb of RAM connected using InfiniBand.
>>>>> The speed of synchronous replication between two nodes is increased from
>>>>> 56k
>>>>> TPS to 60k TPS (on pgbench with scale 1000).
>>>> I'm also surprised that it wasn't a larger improvement.
>>>>
>>>> Seems like it would make sense to implement in core using
>>>> posix_fadvise(), perhaps in the wal receiver and in RestoreArchivedFile
>>>> or nearby..  At least, that's the thinking I had when I was chatting w/
>>>> Sean.
>>>>
>>> Doing in-core certainly has some advantage such as it can easily reuse
>>> the existing xlog code rather trying to make a copy as is currently
>>> done in the patch, but I think it also depends on whether this is
>>> really a win in a number of common cases or is it just a win in some
>>> limited cases.
>>>
>> I am completely agree. It was my mail concern: on which use cases this
>> prefetch will be efficient.
>> If "full_page_writes" is on (and it is safe and default value), then first
>> update of a page since last checkpoint will be written in WAL as full page
>> and applying it will not require reading any data from disk.
>>
> What exactly you mean by above?  AFAIU, it needs to read WAL to apply
> full page image.  See below code:
>
> XLogReadBufferForRedoExtended()
> {
> ..
> /* If it has a full-page image and it should be restored, do it. */
> if (XLogRecBlockImageApply(record, block_id))
> {
> Assert(XLogRecHasBlockImage(record, block_id));
> *buf = XLogReadBufferExtended(rnode, forknum, blkno,
>    get_cleanup_lock ? RBM_ZERO_AND_CLEANUP_LOCK : RBM_ZERO_AND_LOCK);
> page = BufferGetPage(*buf);
> if (!RestoreBlockImage(record, block_id, page))
> ..
> }
>
>

Sorry for my confusing statement.
We definitely need to read the page from the WAL.
I mean that in the case of a "full page write" we do not need to read the
updated page from the database.
It can just be overwritten.

pg_prefaulter and my wal_prefetch do not prefetch the WAL pages themselves.
There is no point in doing that, because those pages have just been written
by the wal_receiver and so should be present in the file system cache.
wal_prefetch prefetches the blocks referenced by WAL records. But in the
case of "full page writes" such prefetch is not needed and is even harmful.

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: WAL prefetch

Andres Freund
In reply to this post by konstantin knizhnik
On 2018-06-14 10:13:44 +0300, Konstantin Knizhnik wrote:

>
>
> On 14.06.2018 09:52, Thomas Munro wrote:
> > On Thu, Jun 14, 2018 at 1:09 AM, Konstantin Knizhnik
> > <[hidden email]> wrote:
> > > pg_wal_prefetch function will infinitely traverse WAL and prefetch block
> > > references in WAL records
> > > using posix_fadvise(WILLNEED) system call.
> > Hi Konstantin,
> >
> > Why stop at the page cache...  what about shared buffers?
> >
>
> It is good question. I thought a lot about prefetching directly to shared
> buffers.

I think that's definitely how this should work.  I'm pretty strongly
opposed to a prefetching implementation that doesn't read into s_b.


> But the current c'est la vie with Postgres is that allocating too large
> memory for shared buffers is not recommended.
> Due to many different reasons: degradation of clock replacement algorithm,
> "write storm",...

I think a lot of that fear is overplayed. And we've fixed a number of
issues.  We don't really generate write storms in the default config
anymore in most scenarios, and if it's an issue you can turn on
backend_flush_after.


> If your system has 1Tb of memory,  almost none of Postgresql administrators
> will recommend to use all this 1Tb for shared buffers.

I've used 1TB successfully.


> Also PostgreSQL is not currently supporting dynamic changing of shared
> buffers size. Without it, the only way of using Postgres in clouds and
> another multiuser systems where system load is not fully controlled by  user
> is to choose relatively small shared buffer size and rely on OS caching.

That seems largely unrelated to the replay case, because there the data
will be read into shared buffers anyway. And it'll be dirtied therein.

Greetings,

Andres Freund


Re: WAL prefetch

akapila
In reply to this post by konstantin knizhnik
On Fri, Jun 15, 2018 at 8:45 PM, Konstantin Knizhnik
<[hidden email]> wrote:
>
> On 15.06.2018 18:03, Amit Kapila wrote:
>
> wal_prefetch is prefetching blocks referenced by WAL records. But in case of
> "full page writes" such prefetch is not needed and even is harmful.
>

Okay.  IIUC, the basic idea is to prefetch recently modified data
pages so that they can be referenced.  If so, isn't there some
overlap with the autoprewarm functionality, which dumps recently modified
blocks and can then prefetch them on recovery?

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: WAL prefetch

akapila
In reply to this post by Andres Freund
On Fri, Jun 15, 2018 at 11:31 PM, Andres Freund <[hidden email]> wrote:

> On 2018-06-14 10:13:44 +0300, Konstantin Knizhnik wrote:
>>
>>
>> On 14.06.2018 09:52, Thomas Munro wrote:
>> > On Thu, Jun 14, 2018 at 1:09 AM, Konstantin Knizhnik
>> > <[hidden email]> wrote:
>> > > pg_wal_prefetch function will infinitely traverse WAL and prefetch block
>> > > references in WAL records
>> > > using posix_fadvise(WILLNEED) system call.
>> > Hi Konstantin,
>> >
>> > Why stop at the page cache...  what about shared buffers?
>> >
>>
>> It is good question. I thought a lot about prefetching directly to shared
>> buffers.
>
> I think that's definitely how this should work.  I'm pretty strongly
> opposed to a prefetching implementation that doesn't read into s_b.
>

We can think of supporting two modes: (a) read into shared
buffers, or (b) read into the OS page cache.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: WAL prefetch

konstantin knizhnik
In reply to this post by akapila


On 16.06.2018 06:30, Amit Kapila wrote:

> On Fri, Jun 15, 2018 at 8:45 PM, Konstantin Knizhnik
> <[hidden email]> wrote:
>> On 15.06.2018 18:03, Amit Kapila wrote:
>>
>> wal_prefetch is prefetching blocks referenced by WAL records. But in case of
>> "full page writes" such prefetch is not needed and even is harmful.
>>
> Okay, IIUC, the basic idea is to prefetch recently modified data
> pages, so that they can be referenced.  If so, isn't there some
> overlap with autoprewarm functionality which dumps recently modified
> blocks and then on recovery, it can prefetch those?
>
Sorry, I do not see any overlap with the autoprewarm functionality:
WAL prefetch is performed at the replica, where the data has not yet been
modified; the goal of WAL prefetch is in fact to make this update more
efficient. WAL prefetch can also be done on a standalone server to speed up
recovery after a crash, but that seems to be a much more exotic use case.



Re: WAL prefetch

konstantin knizhnik
In reply to this post by akapila


On 16.06.2018 06:33, Amit Kapila wrote:

> On Fri, Jun 15, 2018 at 11:31 PM, Andres Freund <[hidden email]> wrote:
>> On 2018-06-14 10:13:44 +0300, Konstantin Knizhnik wrote:
>>>
>>> On 14.06.2018 09:52, Thomas Munro wrote:
>>>> On Thu, Jun 14, 2018 at 1:09 AM, Konstantin Knizhnik
>>>> <[hidden email]> wrote:
>>>>> pg_wal_prefetch function will infinitely traverse WAL and prefetch block
>>>>> references in WAL records
>>>>> using posix_fadvise(WILLNEED) system call.
>>>> Hi Konstantin,
>>>>
>>>> Why stop at the page cache...  what about shared buffers?
>>>>
>>> It is good question. I thought a lot about prefetching directly to shared
>>> buffers.
>> I think that's definitely how this should work.  I'm pretty strongly
>> opposed to a prefetching implementation that doesn't read into s_b.
>>
> We can think of supporting two modes (a) allows to read into shared
> buffers or (b)  allows to read into OS page cache.
>
Unfortunately I am afraid that (a) requires a different approach: unlike
posix_fadvise, reading data into a shared buffer is a blocking operation. If
we do it with one worker, it will read the data at the same speed as the redo
process. So to make prefetch really efficient, in this case we have to
spawn multiple workers to perform the prefetch in parallel (as pg_prefaulter
does).
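
To illustrate why it is blocking: the natural way for a worker to pull one
block into shared buffers looks roughly like the sketch below (my assumption
of how it could be done, not code from the patch or from pg_prefaulter; note
that ReadBufferWithoutRelcache is intended for use during XLOG replay, so a
real implementation would probably need its own entry point). Each call waits
for the physical read to complete, so a single worker can run no faster than
redo itself, and several workers consuming block references from a shared
queue would be needed:

#include "postgres.h"
#include "storage/bufmgr.h"

/* Blocking prefetch of one block reference into shared buffers (sketch). */
static void
prefetch_block_into_shared_buffers(RelFileNode rnode, ForkNumber forknum,
                                   BlockNumber blkno)
{
    /* Waits for the read to finish if the page is not already cached. */
    Buffer      buf = ReadBufferWithoutRelcache(rnode, forknum, blkno,
                                                RBM_NORMAL, NULL);

    ReleaseBuffer(buf);         /* keep the page cached, drop our pin */
}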

Another concern I have about prefetching into shared buffers is that it may
evict from the cache pages which are most frequently used by read-only
queries on a hot standby replica.



Re: WAL prefetch

akapila
On Sat, Jun 16, 2018 at 10:47 AM, Konstantin Knizhnik
<[hidden email]> wrote:

>
>
> On 16.06.2018 06:33, Amit Kapila wrote:
>>
>> On Fri, Jun 15, 2018 at 11:31 PM, Andres Freund <[hidden email]>
>> wrote:
>>>
>>> On 2018-06-14 10:13:44 +0300, Konstantin Knizhnik wrote:
>>>>
>>>>
>>>> On 14.06.2018 09:52, Thomas Munro wrote:
>>>>>
>>>>> On Thu, Jun 14, 2018 at 1:09 AM, Konstantin Knizhnik
>>>>> <[hidden email]> wrote:
>>>>>>
>>>>>> pg_wal_prefetch function will infinitely traverse WAL and prefetch
>>>>>> block
>>>>>> references in WAL records
>>>>>> using posix_fadvise(WILLNEED) system call.
>>>>>
>>>>> Hi Konstantin,
>>>>>
>>>>> Why stop at the page cache...  what about shared buffers?
>>>>>
>>>> It is good question. I thought a lot about prefetching directly to
>>>> shared
>>>> buffers.
>>>
>>> I think that's definitely how this should work.  I'm pretty strongly
>>> opposed to a prefetching implementation that doesn't read into s_b.
>>>
>> We can think of supporting two modes (a) allows to read into shared
>> buffers or (b)  allows to read into OS page cache.
>>
> Unfortunately I afraid that a) requires different approach: unlike
> posix_fadvise,  reading data to shared buffer is blocking operation. If we
> do it by one worker, then it will read it with the same speed as redo
> process. So to make prefetch really efficient,  in this case we have to
> spawn multiple workers to perform prefetch in parallel (as pg_prefaulter
> does).
>
> Another my concern against prefetching to shared buffers is that it may
> flush away from cache pages which are most frequently used by read only
> queries at hot standby replica.
>

Okay, but I am suggesting making it optional so that it can be
enabled when helpful (say, when the user has enough shared buffers to
hold the data).


--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
