[PoC] Non-volatile WAL buffer


[PoC] Non-volatile WAL buffer

Takashi Menjo-2
Dear hackers,

I propose "non-volatile WAL buffer," a proof-of-concept new feature.  It
keeps WAL records durable without writing them out to WAL segment files,
by placing the WAL buffer on persistent memory (PMEM) instead of DRAM.  It
improves database performance by reducing copies of WAL records and by
shortening the time write transactions take.

I attach the first patchset, which can be applied to PostgreSQL 12.0 (refs/
tags/REL_12_0).  Please see README.nvwal (added by patch 0003) to use the
new feature.

PMEM [1] is fast, non-volatile, byte-addressable memory installed into
DIMM slots.  Such products are already available.  For example, an
NVDIMM-N is a type of PMEM module that contains both DRAM and NAND flash.
It can be accessed like regular DRAM, but on power loss it saves its
contents into the flash area.  On power restore, it performs the reverse:
the contents are copied back into DRAM.  PMEM is also already supported
by major operating systems such as Linux and Windows, and by open-source
libraries such as the Persistent Memory Development Kit (PMDK) [2].
Furthermore, several DBMSes have started to support PMEM.
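
To give a feel for the programming model, here is a minimal libpmem
sketch that maps a file on a DAX filesystem and makes a store durable
without write()/fsync().  The path is just an example; build with -lpmem:

#include <libpmem.h>
#include <string.h>

int
main(void)
{
    size_t  mapped_len;
    int     is_pmem;
    char   *buf = pmem_map_file("/mnt/pmem0/example", 4096,
                                PMEM_FILE_CREATE, 0600,
                                &mapped_len, &is_pmem);

    if (buf == NULL)
        return 1;

    strcpy(buf, "hello, pmem");        /* an ordinary store ... */

    if (is_pmem)
        pmem_persist(buf, mapped_len); /* ... made durable by CPU cache
                                        * flush + fence, no syscall */
    else
        pmem_msync(buf, mapped_len);   /* fallback: msync() */

    pmem_unmap(buf, mapped_len);
    return 0;
}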

It's time for PostgreSQL.  PMEM is faster than a solid state disk and can
naively be used as block storage.  However, we cannot gain much
performance that way, because PMEM is so fast that the overhead of the
traditional software stack, such as user-space buffers, filesystems, and
the block layer, becomes non-negligible.  Non-volatile WAL buffer is an
effort to make PostgreSQL PMEM-aware, that is, to access PMEM directly as
RAM, bypassing that overhead and achieving the maximum possible benefit.
I believe WAL is one of the most important modules to redesign for PMEM,
because its design assumes slow disks such as HDDs and SSDs, and PMEM is
not one.

This work is inspired by "Non-volatile Memory Logging," presented at PGCon
2016 [3], and aims to gain more benefit from PMEM than my and Yoshimi's
previous work did [4][5].  I submitted a talk proposal for PGCon this
year, and have measured and analyzed the performance of my PostgreSQL with
non-volatile WAL buffer, compared with the original one that uses PMEM as
"a faster-than-SSD storage."  I will talk about the results if accepted.

Best regards,
Takashi Menjo

[1] Persistent Memory (SNIA)
      https://www.snia.org/PM
[2] Persistent Memory Development Kit (pmem.io)
      https://pmem.io/pmdk/ 
[3] Non-volatile Memory Logging (PGCon 2016)
      https://www.pgcon.org/2016/schedule/track/Performance/945.en.html
[4] Introducing PMDK into PostgreSQL (PGCon 2018)
      https://www.pgcon.org/2018/schedule/events/1154.en.html
[5] Applying PMDK to WAL operations for persistent memory (pgsql-hackers)
      https://www.postgresql.org/message-id/C20D38E97BCB33DAD59E3A1@...

--
Takashi Menjo <[hidden email]>
NTT Software Innovation Center



0001-Support-GUCs-for-external-WAL-buffer.patch (30K)
0002-Non-volatile-WAL-buffer.patch (52K)
0003-README-for-non-volatile-WAL-buffer.patch (7K)

Re: [PoC] Non-volatile WAL buffer

Fabien COELHO-3

Hello,

+1 on the idea.

By quickly looking at the patch, I notice that there are no tests.

Is it possible to emulate something without the actual hardware, at least
for testing purposes?

--
Fabien.



Re: [PoC] Non-volatile WAL buffer

Heikki Linnakangas
In reply to this post by Takashi Menjo-2
On 24/01/2020 10:06, Takashi Menjo wrote:
> I propose "non-volatile WAL buffer," a proof-of-concept new feature.  It
> enables WAL records to be durable without output to WAL segment files by
> residing on persistent memory (PMEM) instead of DRAM.  It improves database
> performance by reducing copies of WAL and shortening the time of write
> transactions.
>
> I attach the first patchset that can be applied to PostgreSQL 12.0 (refs/
> tags/REL_12_0).  Please see README.nvwal (added by the patch 0003) to use
> the new feature.

I have the same comments on this that I had on the previous patch, see:

https://www.postgresql.org/message-id/2aec6e2a-6a32-0c39-e4e2-aad854543aa8%40iki.fi

- Heikki



RE: [PoC] Non-volatile WAL buffer

Takashi Menjo-2
In reply to this post by Fabien COELHO-3
Hello Fabien,

Thank you for your +1 :)

> Is it possible to emulate somthing without the actual hardware, at least
> for testing purposes?

Yes, you can emulate PMEM using DRAM on Linux, via the "memmap=nnG!ssG" kernel
parameter.  Please see [1] and [2] for emulation details.  If your emulation
does not work well, please check whether the kernel configuration options
(like CONFIG_FOOBAR) for PMEM and DAX (in [1] and [3]) are set up properly.
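
Once the emulated device is formatted with ext4 and mounted with "-o dax",
you can also check that DAX really works by asking for a MAP_SYNC mapping,
which the kernel refuses on non-DAX filesystems.  A sketch (the path is an
example; needs Linux 4.15+ and recent glibc):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int
main(void)
{
    int   fd = open("/mnt/pmem0/probe", O_CREAT | O_RDWR, 0600);
    void *p;

    if (fd < 0 || ftruncate(fd, 4096) < 0)
        return 1;
    p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
             MAP_SHARED_VALIDATE | MAP_SYNC, fd, 0);
    if (p == MAP_FAILED)
        perror("mmap(MAP_SYNC)");   /* EOPNOTSUPP means no DAX here */
    else
        puts("MAP_SYNC mapping works: DAX is effective");
    close(fd);
    return 0;
}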

Best regards,
Takashi

[1] How to Emulate Persistent Memory Using Dynamic Random-access Memory (DRAM)
 https://software.intel.com/en-us/articles/how-to-emulate-persistent-memory-on-an-intel-architecture-server
[2] how_to_choose_the_correct_memmap_kernel_parameter_for_pmem_on_your_system
 https://nvdimm.wiki.kernel.org/how_to_choose_the_correct_memmap_kernel_parameter_for_pmem_on_your_system
[3] Persistent Memory Wiki
 https://nvdimm.wiki.kernel.org/

--
Takashi Menjo <[hidden email]>
NTT Software Innovation Center






RE: [PoC] Non-volatile WAL buffer

Takashi Menjo-2
In reply to this post by Heikki Linnakangas
Hello Heikki,

> I have the same comments on this that I had on the previous patch, see:
>
> https://www.postgresql.org/message-id/2aec6e2a-6a32-0c39-e4e2-aad854543aa8%40iki.fi

Thanks.  I re-read your messages [1][2].  What you meant, AFAIU, is to use
memory-mapped WAL segment files as WAL buffers, switching between CPU
cache-flush instructions and msync() to sync inserted WAL records,
depending on whether the segment files are on PMEM or not.
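
Written down as a sketch, such a program might look like this (names, the
path, and offsets are illustrative, not from any patch; build with -lpmem):

#include <libpmem.h>
#include <string.h>

#define SEG_SIZE (16 * 1024 * 1024)    /* wal_segment_size */

int
main(void)
{
    size_t  len;
    int     is_pmem;
    char   *seg = pmem_map_file("pg_wal/000000010000000000000001",
                                SEG_SIZE, PMEM_FILE_CREATE, 0600,
                                &len, &is_pmem);

    if (seg == NULL)
        return 1;

    /* "insert" a WAL record directly into the mapped segment */
    memcpy(seg + 8192, "a WAL record", 12);

    if (is_pmem)
        pmem_persist(seg + 8192, 12);  /* CLWB/CLFLUSHOPT + fence */
    else
        pmem_msync(seg + 8192, 12);    /* page-aligning msync()   */

    pmem_unmap(seg, len);
    return 0;
}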

It sounds reasonable, but I'm sorry that I haven't tested such a program
yet.  I'll try it and compare it with my non-volatile WAL buffer.  For
now, I'm a little worried about the overhead of mmap()/munmap() for each
WAL segment file.

You also mentioned a SIGBUS problem with memory-mapped I/O.  I think it is
true for reads from bad memory blocks, as you said, and also true for
writes to such blocks [3].  Handling SIGBUS properly, or working around
it, is future work.
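
One conceivable workaround, just as a sketch: catch SIGBUS with
sigsetjmp()/siglongjmp() so that a load from a bad block turns into an
ordinary error instead of a crash.  A real server would need much more
care about signal safety and shared state:

#include <setjmp.h>
#include <signal.h>
#include <stdio.h>

static sigjmp_buf on_sigbus;

static void
sigbus_handler(int signo)
{
    (void) signo;
    siglongjmp(on_sigbus, 1);
}

int
main(void)
{
    struct sigaction sa = { .sa_handler = sigbus_handler };

    sigaction(SIGBUS, &sa, NULL);

    if (sigsetjmp(on_sigbus, 1) == 0)
    {
        /* ... access PMEM that may hit a bad block here ... */
    }
    else
        fprintf(stderr, "bad block: fall back to a backup copy\n");
    return 0;
}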

Best regards,
Takashi

[1] https://www.postgresql.org/message-id/83eafbfd-d9c5-6623-2423-7cab1be3888c%40iki.fi
[2] https://www.postgresql.org/message-id/2aec6e2a-6a32-0c39-e4e2-aad854543aa8%40iki.fi
[3] https://pmem.io/2018/11/26/bad-blocks.htm

--
Takashi Menjo <[hidden email]>
NTT Software Innovation Center






Re: [PoC] Non-volatile WAL buffer

Robert Haas
On Mon, Jan 27, 2020 at 2:01 AM Takashi Menjo
<[hidden email]> wrote:
> It sounds reasonable, but I'm sorry that I haven't tested such a program
> yet.  I'll try it to compare with my non-volatile WAL buffer.  For now, I'm
> a little worried about the overhead of mmap()/munmap() for each WAL segment
> file.

I guess the question here is how the cost of one mmap() and munmap()
pair per WAL segment (normally 16MB) compares to the cost of one
write() per block (normally 8kB) - that is, up to 2048 write() calls
per segment. It could be that mmap() is a more expensive call than
write(), but by a small enough margin that the vastly reduced number
of system calls makes it a winner. But that's just speculation, because
I don't know how heavy mmap() actually is.

I have a different concern. I think that, right now, when we reuse a
WAL segment, we write entire blocks at a time, so the old contents of
the WAL segment are overwritten without ever being read. But that
behavior might not be maintained when using mmap(). It might be that
as soon as we write the first byte to a mapped page, the old contents
have to be faulted into memory. Indeed, it's unclear how it could be
otherwise, since the VM page must be made read-write at that point and
the system cannot know that we will overwrite the whole page. But
reading in the old contents of a recycled WAL file just to overwrite
them seems like it would be disastrously expensive.

A related, but more minor, concern is whether there are any
differences in the write-back behavior when modifying a mapped
region vs. using write(). Either way, the same pages of the same file
will get dirtied, but the kernel might not have the same idea in
either case about when the changed pages should be written back down
to disk, and that could make a big difference to performance.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



RE: [PoC] Non-volatile WAL buffer

Takashi Menjo-2
Hello Robert,

I think our concerns are roughly classified into two:

 (1) Performance
 (2) Consistency

And your "different concern" falls rather under (2), I think.

I'm also worried about it, but I have no good answer for now.  I suppose mmap(flags|=MAP_SHARED) called by multiple backend processes on the same file works consistently for both PMEM and non-PMEM devices.  However, I have not found any evidence, such as specification documents, yet.

I also made a tiny program calling memcpy() and msync() in parallel on the same mmap()-ed file but on mutually distinct address ranges (a sketch of such a test follows), and found no corrupted data.  However, that result does not ensure the consistency I'm worried about.  I could give it up if there *were* corrupted data...
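
The tiny program was roughly like the following sketch (not the exact
code; memset() stands in for memcpy() of a record): two processes write
and msync() disjoint halves of the same MAP_SHARED file in parallel.

#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

#define PAGE 4096

int
main(void)
{
    int   fd = open("/tmp/consistency-test", O_CREAT | O_RDWR, 0600);
    char *p;

    if (fd < 0 || ftruncate(fd, 2 * PAGE) < 0)
        return 1;
    p = mmap(NULL, 2 * PAGE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED)
        return 1;

    for (int i = 0; i < 2; i++)
    {
        if (fork() == 0)
        {
            char *half = p + i * PAGE;   /* mutually distinct ranges */

            memset(half, 'A' + i, PAGE);
            msync(half, PAGE, MS_SYNC);
            _exit(0);
        }
    }
    while (wait(NULL) > 0)
        ;
    /* then check the file: page 0 all 'A', page 1 all 'B' */
    return 0;
}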

So I will address (1) first.  I will test the approach Heikki suggested, to answer whether the cost of mmap() and munmap() per WAL segment, etc., is reasonable or not.  If it is, then I will move on to (2).

Best regards,
Takashi

--
Takashi Menjo <[hidden email]>
NTT Software Innovation Center






Re: [PoC] Non-volatile WAL buffer

Robert Haas
On Tue, Jan 28, 2020 at 3:28 AM Takashi Menjo
<[hidden email]> wrote:
> I think our concerns are roughly classified into two:
>
>  (1) Performance
>  (2) Consistency
>
> And your "different concern" is rather into (2), I think.

Actually, I think it was mostly a performance concern (writes
triggering lots of reading) but there might be a consistency issue as
well.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [PoC] Non-volatile WAL buffer

Andres Freund
In reply to this post by Robert Haas
Hi,

On 2020-01-27 13:54:38 -0500, Robert Haas wrote:

> On Mon, Jan 27, 2020 at 2:01 AM Takashi Menjo
> <[hidden email]> wrote:
> > It sounds reasonable, but I'm sorry that I haven't tested such a program
> > yet.  I'll try it to compare with my non-volatile WAL buffer.  For now, I'm
> > a little worried about the overhead of mmap()/munmap() for each WAL segment
> > file.
>
> I guess the question here is how the cost of one mmap() and munmap()
> pair per WAL segment (normally 16MB) compares to the cost of one
> write() per block (normally 8kB). It could be that mmap() is a more
> expensive call than read(), but by a small enough margin that the
> vastly reduced number of system calls makes it a winner. But that's
> just speculation, because I don't know how heavy mmap() actually is.

Frequent mmap()/munmap() does have pretty bad scalability
impacts. I don't think they'd fully hit us, however, because we're not
in a threaded world.


My issue with the proposal to go towards mmap()/munmap() is that I think
doing so forecloses a lot of improvements. Even today, on fast storage,
using open_datasync is faster (at least when somehow hitting the
O_DIRECT path, which isn't that easy these days) - and that's despite it
being really unoptimized.  I think our WAL scalability is a serious
issue. There's a fair bit that we can improve just by fixing things,
without really changing the way we do IO:

- Split WALWriteLock into one lock for writing and one for flushing the
  WAL. Right now we prevent other sessions from writing out WAL - even
  to other segments - when one session is doing a WAL flush. But there's
  absolutely no need for that.
- Stop increasing the size of the flush request to the max when flushing
  WAL (cf "try to write/flush later additions to XLOG as well" in
  XLogFlush()) - that currently reduces throughput in OLTP workloads
  quite noticeably. It made some sense in the spinning-disk times, but I
  don't think it does for a halfway decent SSD. By writing the maximum
  amount ready to write, we hold the lock for longer, increasing latency
  for the committing transaction *and* preventing more WAL from being
  written.
- We should immediately ask the OS to start flushing full XLOG pages
  back to disk (e.g. with sync_file_range(); a sketch is below). Right
  now that IO will never be started before the commit comes around in an
  OLTP workload, which means that we just waste the time between the
  XLogWrite() and the commit.
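
To illustrate the third point: on Linux that could be as simple as a
sync_file_range() right after the write.  A sketch (the path is made up,
and the commit would still need an fdatasync() later):

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

int
main(void)
{
    char page[8192] = {0};
    int  fd = open("/tmp/xlog-page-demo", O_CREAT | O_WRONLY, 0600);

    if (fd < 0 || write(fd, page, sizeof(page)) != (ssize_t) sizeof(page))
        return 1;
    /* queue the dirty pages for writeback; returns without waiting and
     * without flushing the device cache */
    sync_file_range(fd, 0, sizeof(page), SYNC_FILE_RANGE_WRITE);
    close(fd);
    return 0;
}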

That'll gain us 2-3x, I think. But after that I think we're going to
have to actually change more fundamentally how we do IO for WAL
writes. Using async IO I can do like 18k individual durable 8kb writes
(using O_DSYNC) a second, at a queue depth of 32. On my laptop. If I
make it 4k writes, it's 22k.
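
For illustration, the kind of loop behind numbers like that might look as
follows - a liburing sketch of individually durable 8kB writes via O_DSYNC
at queue depth 32 (Linux 5.1+, build with -luring; the path and sizes are
examples, and this is not the exact program I used):

#include <fcntl.h>
#include <liburing.h>
#include <unistd.h>

#define QD      32
#define BLKSZ   8192
#define NWRITES 10000

int
main(void)
{
    static char     buf[BLKSZ];
    struct io_uring ring;
    int             fd = open("/tmp/io-uring-bench",
                              O_CREAT | O_WRONLY | O_DSYNC, 0600);

    if (fd < 0 || io_uring_queue_init(QD, &ring, 0) < 0)
        return 1;

    for (int i = 0, inflight = 0; i < NWRITES || inflight > 0;)
    {
        /* keep up to QD durable writes in flight */
        while (i < NWRITES && inflight < QD)
        {
            struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);

            io_uring_prep_write(sqe, fd, buf, BLKSZ, (off_t) i * BLKSZ);
            i++;
            inflight++;
        }
        io_uring_submit(&ring);

        struct io_uring_cqe *cqe;

        if (io_uring_wait_cqe(&ring, &cqe) < 0 || cqe->res != BLKSZ)
            return 1;
        io_uring_cqe_seen(&ring, cqe);
        inflight--;
    }
    io_uring_queue_exit(&ring);
    close(fd);
    return 0;
}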

That's not directly comparable with postgres WAL flushes, of course, as
it's all separate blocks, whereas WAL will often end up overwriting the
last block. But it doesn't at all account for group commits either,
which we *constantly* end up doing.

Postgres manages somewhere between ~450 (multiple users) and ~800 (single
user) individually durable WAL writes / sec on the same hardware. Yes,
that's more than an order of magnitude less. Of course some of that is
just that postgres does more than just IO - but that does not account for
an order-of-magnitude difference.

So, why am I bringing this up in this thread? Only because I do not see
a way to actually utilize non-pmem hardware to a much higher degree than
we are doing now by using mmap(). Doing so requires using direct IO,
which is fundamentally incompatible with using mmap().



> I have a different concern. I think that, right now, when we reuse a
> WAL segment, we write entire blocks at a time, so the old contents of
> the WAL segment are overwritten without ever being read. But that
> behavior might not be maintained when using mmap(). It might be that
> as soon as we write the first byte to a mapped page, the old contents
> have to be faulted into memory. Indeed, it's unclear how it could be
> otherwise, since the VM page must be made read-write at that point and
> the system cannot know that we will overwrite the whole page. But
> reading in the old contents of a recycled WAL file just to overwrite
> them seems like it would be disastrously expensive.

Yea, that's a serious concern.


> A related, but more minor, concern is whether there are any
> differences in in the write-back behavior when modifying a mapped
> region vs. using write(). Either way, the same pages of the same file
> will get dirtied, but the kernel might not have the same idea in
> either case about when the changed pages should be written back down
> to disk, and that could make a big difference to performance.

I don't think there's a significant difference in the case of Linux - no
idea about others. And either way we probably should force the kernel's
hand to start flushing much sooner.

Greetings,

Andres Freund



RE: [PoC] Non-volatile WAL buffer

Takashi Menjo-2
In reply to this post by Robert Haas
Dear hackers,

I made another WIP patchset to mmap WAL segments as WAL buffers.  Note that this is not a non-volatile WAL buffer patchset but its competitor.  I am measuring and analyzing the performance of this patchset to compare with my N.V.WAL buffer.

Please wait several more days for the result report...

Best regards,
Takashi

--
Takashi Menjo <[hidden email]>
NTT Software Innovation Center

> -----Original Message-----
> From: Robert Haas <[hidden email]>
> Sent: Wednesday, January 29, 2020 6:00 AM
> To: Takashi Menjo <[hidden email]>
> Cc: Heikki Linnakangas <[hidden email]>; [hidden email]
> Subject: Re: [PoC] Non-volatile WAL buffer
>
> On Tue, Jan 28, 2020 at 3:28 AM Takashi Menjo <[hidden email]> wrote:
> > I think our concerns are roughly classified into two:
> >
> >  (1) Performance
> >  (2) Consistency
> >
> > And your "different concern" is rather into (2), I think.
>
> Actually, I think it was mostly a performance concern (writes triggering lots of reading) but there might be a
> consistency issue as well.
>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company

0001-Preallocate-more-WAL-segments.patch (3K)
0002-Use-WAL-segments-as-WAL-buffers.patch (40K)
0003-Lazy-unmap-WAL-segments.patch (2K)
0004-Speculative-map-WAL-segments.patch (1K)
0005-Allocate-WAL-segments-to-utilize-hugepage.patch (1K)

RE: [PoC] Non-volatile WAL buffer

Takashi Menjo-2
In reply to this post by Robert Haas
Dear hackers,

I applied my patchset that mmap()-s WAL segments as WAL buffers to refs/tags/REL_12_0, and measured and analyzed its performance with pgbench.  Roughly speaking, when I used *SSD and ext4* to store WAL, it was "obviously worse" than the original REL_12_0.  VTune told me that the CPU time of memcpy() called by CopyXLogRecordToWAL() got larger than before.  When I used *NVDIMM-N and ext4 with filesystem DAX* to store WAL, however, it achieved "not bad" performance compared with our previous patchset and with non-volatile WAL buffer.  The CPU time of each of XLogInsert() and XLogFlush() was reduced, as with non-volatile WAL buffer.

So I think mmap()-ing WAL segments as WAL buffers is not such a bad idea, as long as we use PMEM, at least NVDIMM-N.

Excuse me, but for now I'd rather not talk about exactly how much the performance was, because the mmap()-ing patchset is WIP, so there might be bugs which wrongfully "improve" or "degrade" performance.  Also, we need to understand persistent memory programming and related features, such as filesystem DAX, huge page faults, and WAL persistence with cache-flush and memory-barrier instructions, to explain why the performance improved.  I'd like to talk about all the details at the appropriate time and place (the conference, or here later...).

Best regards,
Takashi

--
Takashi Menjo <[hidden email]>
NTT Software Innovation Center

> -----Original Message-----
> From: Takashi Menjo <[hidden email]>
> Sent: Monday, February 10, 2020 6:30 PM
> To: 'Robert Haas' <[hidden email]>; 'Heikki Linnakangas' <[hidden email]>
> Cc: '[hidden email]' <[hidden email]>
> Subject: RE: [PoC] Non-volatile WAL buffer
>
> Dear hackers,
>
> I made another WIP patchset to mmap WAL segments as WAL buffers.  Note that this is not a non-volatile WAL
> buffer patchset but its competitor.  I am measuring and analyzing the performance of this patchset to compare
> with my N.V.WAL buffer.
>
> Please wait for a several more days for the result report...
>
> Best regards,
> Takashi
>
> --
> Takashi Menjo <[hidden email]> NTT Software Innovation Center
>
> > -----Original Message-----
> > From: Robert Haas <[hidden email]>
> > Sent: Wednesday, January 29, 2020 6:00 AM
> > To: Takashi Menjo <[hidden email]>
> > Cc: Heikki Linnakangas <[hidden email]>; [hidden email]
> > Subject: Re: [PoC] Non-volatile WAL buffer
> >
> > On Tue, Jan 28, 2020 at 3:28 AM Takashi Menjo <[hidden email]> wrote:
> > > I think our concerns are roughly classified into two:
> > >
> > >  (1) Performance
> > >  (2) Consistency
> > >
> > > And your "different concern" is rather into (2), I think.
> >
> > Actually, I think it was mostly a performance concern (writes
> > triggering lots of reading) but there might be a consistency issue as well.
> >
> > --
> > Robert Haas
> > EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL
> > Company





Re: [PoC] Non-volatile WAL buffer

Amit Langote
Menjo-san,

On Mon, Feb 17, 2020 at 1:13 PM Takashi Menjo
<[hidden email]> wrote:
> I applied my patchset that mmap()-s WAL segments as WAL buffers to refs/tags/REL_12_0, and measured and analyzed its performance with pgbench.  Roughly speaking, When I used *SSD and ext4* to store WAL, it was "obviously worse" than the original REL_12_0.

I apologize for not having any opinion on the patches themselves, but
let me point out that it's better to base these patches on HEAD
(master branch) than REL_12_0, because all new code is committed to
the master branch, whereas stable branches such as REL_12_0 only
receive bug fixes.  Do you have any specific reason to be working on
REL_12_0?

Thanks,
Amit



RE: [PoC] Non-volatile WAL buffer

Takashi Menjo-2
Hello Amit,

> I apologize for not having any opinion on the patches themselves, but let me point out that it's better to base these
> patches on HEAD (master branch) than REL_12_0, because all new code is committed to the master branch,
> whereas stable branches such as REL_12_0 only receive bug fixes.  Do you have any specific reason to be working
> on REL_12_0?

Yes, because I think it makes performance measurements easier to reproduce and discuss.  Of course I know that all newly accepted patches are merged into master's HEAD, not into stable branches and not even into release tags, so I'm aware I should rebase my patchset onto master sooner or later.  However, if someone, including me, says that s/he applied my patchset to "master" and measured its performance, we have to pay attention to which commit "master" really pointed to.  Although we have SHA-1 hashes to specify a commit, we would have to check whether that specific commit on master carries patches affecting performance, because master's HEAD gets new patches day by day.  A release tag, on the other hand, clearly points to a commit that we all probably know.  We can also check its features and improvements more easily, by using the release notes and user manuals.

Best regards,
Takashi

--
Takashi Menjo <[hidden email]>
NTT Software Innovation Center

> -----Original Message-----
> From: Amit Langote <[hidden email]>
> Sent: Monday, February 17, 2020 1:39 PM
> To: Takashi Menjo <[hidden email]>
> Cc: Robert Haas <[hidden email]>; Heikki Linnakangas <[hidden email]>; PostgreSQL-development
> <[hidden email]>
> Subject: Re: [PoC] Non-volatile WAL buffer
>
> Menjo-san,
>
> On Mon, Feb 17, 2020 at 1:13 PM Takashi Menjo <[hidden email]> wrote:
> > I applied my patchset that mmap()-s WAL segments as WAL buffers to refs/tags/REL_12_0, and measured and
> analyzed its performance with pgbench.  Roughly speaking, When I used *SSD and ext4* to store WAL, it was
> "obviously worse" than the original REL_12_0.
>
> I apologize for not having any opinion on the patches themselves, but let me point out that it's better to base these
> patches on HEAD (master branch) than REL_12_0, because all new code is committed to the master branch,
> whereas stable branches such as REL_12_0 only receive bug fixes.  Do you have any specific reason to be working
> on REL_12_0?
>
> Thanks,
> Amit





Re: [PoC] Non-volatile WAL buffer

Amit Langote
Hello,

On Mon, Feb 17, 2020 at 4:16 PM Takashi Menjo
<[hidden email]> wrote:
> Hello Amit,
>
> > I apologize for not having any opinion on the patches themselves, but let me point out that it's better to base these
> > patches on HEAD (master branch) than REL_12_0, because all new code is committed to the master branch,
> > whereas stable branches such as REL_12_0 only receive bug fixes.  Do you have any specific reason to be working
> > on REL_12_0?
>
> Yes, because I think it's human-friendly to reproduce and discuss performance measurement.  Of course I know all new accepted patches are merged into master's HEAD, not stable branches and not even release tags, so I'm aware of rebasing my patchset onto master sooner or later.  However, if someone, including me, says that s/he applies my patchset to "master" and measures its performance, we have to pay attention to which commit the "master" really points to.  Although we have sha1 hashes to specify which commit, we should check whether the specific commit on master has patches affecting performance or not because master's HEAD gets new patches day by day.  On the other hand, a release tag clearly points the commit all we probably know.  Also we can check more easily the features and improvements by using release notes and user manuals.

Thanks for clarifying. I see where you're coming from.

While I do sometimes see people reporting numbers with the latest
stable release's branch, that's normally just one of the baselines.
The more important baseline for ongoing development is the master
branch's HEAD, which is also what people volunteering to test your
patches would use.  Anyone who reports would have to give at least two
numbers -- performance with a branch's HEAD without the patch applied
and with it applied -- which can be enough in most cases to see the
difference the patch makes.  Sure, the numbers might change on each
report, but that's fine, I'd think.  If you continue to develop
against the stable branch, you might fail to notice impact from
relevant developments in the master branch, even developments which
possibly require rethinking the architecture of your own changes,
although maybe that rarely occurs.

Thanks,
Amit



Re: [PoC] Non-volatile WAL buffer

Andres Freund
In reply to this post by Takashi Menjo-2
Hi,

On 2020-02-17 13:12:37 +0900, Takashi Menjo wrote:
> I applied my patchset that mmap()-s WAL segments as WAL buffers to
> refs/tags/REL_12_0, and measured and analyzed its performance with
> pgbench.  Roughly speaking, When I used *SSD and ext4* to store WAL,
> it was "obviously worse" than the original REL_12_0.  VTune told me
> that the CPU time of memcpy() called by CopyXLogRecordToWAL() got
> larger than before.

FWIW, this might largely be because of page faults. In contrast to
before, we wouldn't reuse the same pages (because they've been
munmap()/mmap()ed), so the first time they're touched, we incur page
faults.  Did you try mmap()ing with MAP_POPULATE? It's probably also
worthwhile to try MAP_HUGETLB.
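
For concreteness, a sketch of the MAP_POPULATE variant (file and sizes
are examples).  Note that MAP_HUGETLB only applies to anonymous or
hugetlbfs-backed mappings, so for WAL segment files on a regular
filesystem MAP_POPULATE is the readily applicable flag:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define SEG_SIZE (16 * 1024 * 1024)

int
main(void)
{
    int   fd = open("/tmp/wal-segment-demo", O_CREAT | O_RDWR, 0600);
    void *p;

    if (fd < 0 || ftruncate(fd, SEG_SIZE) < 0)
        return 1;

    /* MAP_POPULATE pre-faults every page at mmap() time, moving the
     * fault cost out of the WAL-insertion path */
    p = mmap(NULL, SEG_SIZE, PROT_READ | PROT_WRITE,
             MAP_SHARED | MAP_POPULATE, fd, 0);
    if (p == MAP_FAILED)
    {
        perror("mmap");
        return 1;
    }
    munmap(p, SEG_SIZE);
    close(fd);
    return 0;
}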

Still doubtful it's the right direction, but I'd rather have good
numbers to back me up :)

Greetings,

Andres Freund



RE: [PoC] Non-volatile WAL buffer

Takashi Menjo-2
In reply to this post by Amit Langote
Dear Amit,

Thank you for your advice.  Exactly, it's so to speak "do as the hackers do when in pgsql"...

I'm rebasing my branch onto master.  I'll submit an updated patchset and performance report later.

Best regards,
Takashi

--
Takashi Menjo <[hidden email]>
NTT Software Innovation Center

> -----Original Message-----
> From: Amit Langote <[hidden email]>
> Sent: Monday, February 17, 2020 5:21 PM
> To: Takashi Menjo <[hidden email]>
> Cc: Robert Haas <[hidden email]>; Heikki Linnakangas <[hidden email]>; PostgreSQL-development
> <[hidden email]>
> Subject: Re: [PoC] Non-volatile WAL buffer
>
> Hello,
>
> On Mon, Feb 17, 2020 at 4:16 PM Takashi Menjo <[hidden email]> wrote:
> > Hello Amit,
> >
> > > I apologize for not having any opinion on the patches themselves,
> > > but let me point out that it's better to base these patches on HEAD
> > > (master branch) than REL_12_0, because all new code is committed to
> > > the master branch, whereas stable branches such as REL_12_0 only receive bug fixes.  Do you have any
> specific reason to be working on REL_12_0?
> >
> > Yes, because I think it's human-friendly to reproduce and discuss performance measurement.  Of course I know
> all new accepted patches are merged into master's HEAD, not stable branches and not even release tags, so I'm
> aware of rebasing my patchset onto master sooner or later.  However, if someone, including me, says that s/he
> applies my patchset to "master" and measures its performance, we have to pay attention to which commit the
> "master" really points to.  Although we have sha1 hashes to specify which commit, we should check whether the
> specific commit on master has patches affecting performance or not because master's HEAD gets new patches day
> by day.  On the other hand, a release tag clearly points the commit all we probably know.  Also we can check more
> easily the features and improvements by using release notes and user manuals.
>
> Thanks for clarifying. I see where you're coming from.
>
> While I do sometimes see people reporting numbers with the latest stable release' branch, that's normally just one
> of the baselines.
> The more important baseline for ongoing development is the master branch's HEAD, which is also what people
> volunteering to test your patches would use.  Anyone who reports would have to give at least two numbers --
> performance with a branch's HEAD without patch applied and that with patch applied -- which can be enough in
> most cases to see the difference the patch makes.  Sure, the numbers might change on each report, but that's fine
> I'd think.  If you continue to develop against the stable branch, you might miss to notice impact from any relevant
> developments in the master branch, even developments which possibly require rethinking the architecture of your
> own changes, although maybe that rarely occurs.
>
> Thanks,
> Amit





RE: [PoC] Non-volatile WAL buffer

Takashi Menjo-2
In reply to this post by Amit Langote
Dear hackers,

I rebased my non-volatile WAL buffer's patchset onto master.  A new v2 patchset is attached to this mail.

I also measured performance before and after applying the patchset, varying the -c/--client and -j/--jobs options of pgbench, for each scaling factor s = 50 or 1000.  The results are presented in the following tables and the attached charts.  Conditions, steps, and other details are given below.


Results (s=50)
==============
         Throughput [10^3 TPS]  Average latency [ms]
( c, j)  before  after          before  after
-------  ---------------------  ---------------------
( 8, 8)  35.7    37.1 (+3.9%)   0.224   0.216 (-3.6%)
(18,18)  70.9    74.7 (+5.3%)   0.254   0.241 (-5.1%)
(36,18)  76.0    80.8 (+6.3%)   0.473   0.446 (-5.7%)
(54,18)  75.5    81.8 (+8.3%)   0.715   0.660 (-7.7%)


Results (s=1000)
================
         Throughput [10^3 TPS]  Average latency [ms]
( c, j)  before  after          before  after
-------  ---------------------  ---------------------
( 8, 8)  37.4    40.1 (+7.3%)   0.214   0.199 (-7.0%)
(18,18)  79.3    86.7 (+9.3%)   0.227   0.208 (-8.4%)
(36,18)  87.2    95.5 (+9.5%)   0.413   0.377 (-8.7%)
(54,18)  86.8    94.8 (+9.3%)   0.622   0.569 (-8.5%)


Both throughput and average latency improved for each scaling factor.  Throughput seemed to almost reach its upper limit at (c,j)=(36,18).

The improvement percentages for s=1000 look larger than those for s=50.  I think a larger scaling factor leads to less contention on the same tables and/or indexes, that is, fewer lock and unlock operations.  In such a situation, write-ahead logging accounts for a larger share of performance.


Conditions
==========
- Use one physical server having 2 NUMA nodes (node 0 and 1)
  - Pin postgres (server processes) to node 0 and pgbench to node 1
  - 18 cores and 192GiB DRAM per node
- Use an NVMe SSD for PGDATA and an interleaved 6-in-1 NVDIMM-N set for pg_wal
  - Both are installed on the server-side node, that is, node 0
  - Both are formatted with ext4
  - NVDIMM-N is mounted with "-o dax" option to enable Direct Access (DAX)
- Use the attached postgresql.conf
  - Two new items, nvwal_path and nvwal_size, are used only after the patch is applied (an example follows this list)
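
For reference, the two items would look something like this in
postgresql.conf (the values here are illustrative, not the exact ones
I used):

# used only after the patch; see README.nvwal
nvwal_path = '/mnt/pmem0/nvwal'   # WAL buffer file on a DAX-mounted filesystem
nvwal_size = 80GB                 # fixed size of that file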


Steps
=====
For each (c,j) pair, I did the following steps three times, then took the median of the three as the final result shown in the tables above.

(1) Run initdb with proper -D and -X options; and also give --nvwal-path and --nvwal-size options after patch
(2) Start postgres and create a database for pgbench tables
(3) Run "pgbench -i -s ___" to create tables (s = 50 or 1000)
(4) Stop postgres, remount filesystems, and start postgres again
(5) Execute pg_prewarm extension for all the four pgbench tables
(6) Run pgbench during 30 minutes


pgbench command line
====================
$ pgbench -h /tmp -p 5432 -U username -r -M prepared -T 1800 -c ___ -j ___ dbname

I gave no -b option, so pgbench ran the default built-in "TPC-B (sort-of)" script.


Software
========
- Distro: Ubuntu 18.04
- Kernel: Linux 5.4 (vanilla kernel)
- C Compiler: gcc 7.4.0
- PMDK: 1.7
- PostgreSQL: d677550 (master on Mar 3, 2020)


Hardware
========
- System: HPE ProLiant DL380 Gen10
- CPU: Intel Xeon Gold 6154 (Skylake) x 2sockets
- DRAM: DDR4 2666MHz {32GiB/ch x 6ch}/socket x 2sockets
- NVDIMM-N: DDR4 2666MHz {16GiB/ch x 6ch}/socket x 2sockets
- NVMe SSD: Intel Optane DC P4800X Series SSDPED1K750GA


Best regards,
Takashi

--
Takashi Menjo <[hidden email]>
NTT Software Innovation Center

> -----Original Message-----
> From: Takashi Menjo <[hidden email]>
> Sent: Thursday, February 20, 2020 6:30 PM
> To: 'Amit Langote' <[hidden email]>
> Cc: 'Robert Haas' <[hidden email]>; 'Heikki Linnakangas' <[hidden email]>; 'PostgreSQL-development'
> <[hidden email]>
> Subject: RE: [PoC] Non-volatile WAL buffer
>
> Dear Amit,
>
> Thank you for your advice.  Exactly, it's so to speak "do as the hackers do when in pgsql"...
>
> I'm rebasing my branch onto master.  I'll submit an updated patchset and performance report later.
>
> Best regards,
> Takashi
>
> --
> Takashi Menjo <[hidden email]> NTT Software Innovation Center
>
> > -----Original Message-----
> > From: Amit Langote <[hidden email]>
> > Sent: Monday, February 17, 2020 5:21 PM
> > To: Takashi Menjo <[hidden email]>
> > Cc: Robert Haas <[hidden email]>; Heikki Linnakangas
> > <[hidden email]>; PostgreSQL-development
> > <[hidden email]>
> > Subject: Re: [PoC] Non-volatile WAL buffer
> >
> > Hello,
> >
> > On Mon, Feb 17, 2020 at 4:16 PM Takashi Menjo <[hidden email]> wrote:
> > > Hello Amit,
> > >
> > > > I apologize for not having any opinion on the patches themselves,
> > > > but let me point out that it's better to base these patches on
> > > > HEAD (master branch) than REL_12_0, because all new code is
> > > > committed to the master branch, whereas stable branches such as
> > > > REL_12_0 only receive bug fixes.  Do you have any
> > specific reason to be working on REL_12_0?
> > >
> > > Yes, because I think it's human-friendly to reproduce and discuss
> > > performance measurement.  Of course I know
> > all new accepted patches are merged into master's HEAD, not stable
> > branches and not even release tags, so I'm aware of rebasing my
> > patchset onto master sooner or later.  However, if someone, including
> > me, says that s/he applies my patchset to "master" and measures its
> > performance, we have to pay attention to which commit the "master"
> > really points to.  Although we have sha1 hashes to specify which
> > commit, we should check whether the specific commit on master has patches affecting performance or not
> because master's HEAD gets new patches day by day.  On the other hand, a release tag clearly points the commit
> all we probably know.  Also we can check more easily the features and improvements by using release notes and
> user manuals.
> >
> > Thanks for clarifying. I see where you're coming from.
> >
> > While I do sometimes see people reporting numbers with the latest
> > stable release' branch, that's normally just one of the baselines.
> > The more important baseline for ongoing development is the master
> > branch's HEAD, which is also what people volunteering to test your
> > patches would use.  Anyone who reports would have to give at least two
> > numbers -- performance with a branch's HEAD without patch applied and
> > that with patch applied -- which can be enough in most cases to see
> > the difference the patch makes.  Sure, the numbers might change on
> > each report, but that's fine I'd think.  If you continue to develop against the stable branch, you might miss to
> notice impact from any relevant developments in the master branch, even developments which possibly require
> rethinking the architecture of your own changes, although maybe that rarely occurs.
> >
> > Thanks,
> > Amit

v2-0001-Support-GUCs-for-external-WAL-buffer.patch (36K)
v2-0002-Non-volatile-WAL-buffer.patch (53K)
v2-0003-README-for-non-volatile-WAL-buffer.patch (7K)
nvwal-performance-s50.png (39K)
nvwal-performance-s1000.png (40K)
postgresql.conf (1K)

RE: [PoC] Non-volatile WAL buffer

Takashi Menjo-2
In reply to this post by Andres Freund
Dear Andres,

Thank you for your advice about the MAP_POPULATE flag.  I rebased my msync patchset onto master and added a commit to pass that flag
to mmap().  A new v2 patchset is attached to this mail.  Note that this patchset is NOT the non-volatile WAL buffer one.

I also measured the performance of the following three versions, varying the -c/--client and -j/--jobs options of pgbench, for each
scaling factor s = 50 or 1000.

- Before patchset (say "before")
- After patchset except patch 0005 not to use MAP_POPULATE ("after (no populate)")
- After full patchset to use MAP_POPULATE ("after (populate)")

The results are presented in the following tables and the attached charts.  Conditions, steps, and other details are given below.
Note that, unlike the measurement of non-volatile WAL buffer I sent recently [1], I used an NVMe SSD for pg_wal, to evaluate this
patchset with traditional mmap()-ed files, that is, without direct access (DAX) support and with page caches.


Results (s=50)
==============
         Throughput [10^3 TPS]
( c, j)  before  after           after
                 (no populate)   (populate)
-------  -------------------------------------
( 8, 8)  30.9    28.1 (- 9.2%)   28.3 (- 8.6%)
(18,18)  61.5    46.1 (-25.0%)   47.7 (-22.3%)
(36,18)  67.0    45.9 (-31.5%)   48.4 (-27.8%)
(54,18)  68.3    47.0 (-31.3%)   49.6 (-27.5%)

         Average Latency [ms]
( c, j)  before  after           after
                 (no populate)   (populate)
-------  --------------------------------------
( 8, 8)  0.259   0.285 (+10.0%)  0.283 (+ 9.3%)
(18,18)  0.293   0.391 (+33.4%)  0.377 (+28.7%)
(36,18)  0.537   0.784 (+46.0%)  0.744 (+38.5%)
(54,18)  0.790   1.149 (+45.4%)  1.090 (+38.0%)


Results (s=1000)
================
         Throughput [10^3 TPS]
( c, j)  before  after           after
                 (no populate)   (populate)
-------  ------------------------------------
( 8, 8)  32.0    29.6 (- 7.6%)   29.1 (- 9.0%)
(18,18)  66.1    49.2 (-25.6%)   50.4 (-23.7%)
(36,18)  76.4    51.0 (-33.3%)   53.4 (-30.1%)
(54,18)  80.1    54.3 (-32.2%)   57.2 (-28.6%)

         Average latency [ms]
( c, j)  before  after           after
                 (no populate)   (populate)
-------  --------------------------------------
( 8, 8)  0.250   0.271 (+ 8.4%)  0.275 (+10.0%)
(18,18)  0.272   0.366 (+34.6%)  0.357 (+31.3%)
(36,18)  0.471   0.706 (+49.9%)  0.674 (+43.1%)
(54,18)  0.674   0.995 (+47.6%)  0.944 (+40.1%)


I'd say MAP_POPULATE made performance a little better in the large-#clients cases, comparing "populate" with "no populate".  However,
comparing "after" with "before", I found that both throughput and average latency degraded.  VTune told me that "after (populate)"
still spent more CPU time memcpy-ing WAL records into mmap()-ed segments than "before" did.

I also made a microbenchmark to see the behavior of mmap() and msync() (a sketch of it follows).  I found that:

- A major fault occurred at mmap() with MAP_POPULATE, instead of at the first access to the mmap()-ed space.
- Some minor faults also occurred at mmap() with MAP_POPULATE, and no additional fault occurred when I loaded from the mmap()-ed space.
But once I stored to that space, a minor fault occurred.
- When I stored to a page that had been msync()-ed, a minor fault occurred again.

So I think one of the remaining causes of the performance degradation is minor faults when mmap()-ed pages get dirtied.  And as far as
I can see, it is not solved by MAP_POPULATE alone.
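
A rough sketch of such a microbenchmark (not my exact program) that
counts faults with getrusage() around each step:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/resource.h>
#include <unistd.h>

static void
report(const char *step)
{
    struct rusage ru;

    getrusage(RUSAGE_SELF, &ru);
    printf("%-22s majflt=%ld minflt=%ld\n", step, ru.ru_majflt,
           ru.ru_minflt);
}

int
main(void)
{
    int            fd = open("/tmp/fault-demo", O_CREAT | O_RDWR, 0600);
    volatile char *p;

    if (fd < 0 || ftruncate(fd, 4096) < 0)
        return 1;
    report("before mmap");
    p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
             MAP_SHARED | MAP_POPULATE, fd, 0);
    if ((void *) p == MAP_FAILED)
        return 1;
    report("after mmap(POPULATE)");
    (void) p[0];                   /* load: no additional fault     */
    report("after load");
    p[0] = 1;                      /* store: minor fault (dirtying) */
    report("after store");
    msync((void *) p, 4096, MS_SYNC);
    p[0] = 2;                      /* store after msync(): again    */
    report("after msync+store");
    return 0;
}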


Conditions
==========
- Use one physical server having 2 NUMA nodes (node 0 and 1)
  - Pin postgres (server processes) to node 0 and pgbench to node 1
  - 18 cores and 192GiB DRAM per node
- Use two NVMe SSDs; one for PGDATA, another for pg_wal
  - Both are installed on the server-side node, that is, node 0
  - Both are formatted with ext4
- Use the attached postgresql.conf


Steps
=====
For each (c,j) pair, I did the following steps three times, then took the median of the three as the final result shown in the
tables above.

(1) Run initdb with proper -D and -X options
(2) Start postgres and create a database for pgbench tables
(3) Run "pgbench -i -s ___" to create tables (s = 50 or 1000)
(4) Stop postgres, remount filesystems, and start postgres again
(5) Execute pg_prewarm extension for all the four pgbench tables
(6) Run pgbench during 30 minutes


pgbench command line
====================
$ pgbench -h /tmp -p 5432 -U username -r -M prepared -T 1800 -c ___ -j ___ dbname

I gave no -b option, so pgbench ran the default built-in "TPC-B (sort-of)" script.


Software
========
- Distro: Ubuntu 18.04
- Kernel: Linux 5.4 (vanilla kernel)
- C Compiler: gcc 7.4.0
- PMDK: 1.7
- PostgreSQL: d677550 (master on Mar 3, 2020)


Hardware
========
- System: HPE ProLiant DL380 Gen10
- CPU: Intel Xeon Gold 6154 (Skylake) x 2sockets
- DRAM: DDR4 2666MHz {32GiB/ch x 6ch}/socket x 2sockets
- NVMe SSD: Intel Optane DC P4800X Series SSDPED1K750GA x2


Best regards,
Takashi


[1] https://www.postgresql.org/message-id/002701d5fd03$6e1d97a0$4a58c6e0$@hco.ntt.co.jp_1

--
Takashi Menjo <[hidden email]>
NTT Software Innovation Center

> -----Original Message-----
> From: Andres Freund <[hidden email]>
> Sent: Thursday, February 20, 2020 2:04 PM
> To: Takashi Menjo <[hidden email]>
> Cc: 'Robert Haas' <[hidden email]>; 'Heikki Linnakangas' <[hidden email]>;
> [hidden email]
> Subject: Re: [PoC] Non-volatile WAL buffer
>
> Hi,
>
> On 2020-02-17 13:12:37 +0900, Takashi Menjo wrote:
> > I applied my patchset that mmap()-s WAL segments as WAL buffers to
> > refs/tags/REL_12_0, and measured and analyzed its performance with
> > pgbench.  Roughly speaking, When I used *SSD and ext4* to store WAL,
> > it was "obviously worse" than the original REL_12_0.  VTune told me
> > that the CPU time of memcpy() called by CopyXLogRecordToWAL() got
> > larger than before.
>
> FWIW, this might largely be because of page faults. In contrast to before we wouldn't reuse the same pages
> (because they've been munmap()/mmap()ed), so the first time they're touched, we'll incur page faults.  Did you
> try mmap()ing with MAP_POPULATE? It's probably also worthwhile to try to use MAP_HUGETLB.
>
> Still doubtful it's the right direction, but I'd rather have good numbers to back me up :)
>
> Greetings,
>
> Andres Freund

v2-0001-Preallocate-more-WAL-segments.patch (3K)
v2-0002-Use-WAL-segments-as-WAL-buffers.patch (44K)
v2-0003-Lazy-unmap-WAL-segments.patch (2K)
v2-0004-Speculative-map-WAL-segments.patch (1K)
v2-0005-Map-WAL-segments-with-MAP_POPULATE-if-non-DAX.patch (948 bytes)
msync-performance-s50.png (43K)
msync-performance-s1000.png (43K)
postgresql.conf (1K)

RE: [PoC] Non-volatile WAL buffer

Takashi Menjo-2
In reply to this post by Amit Langote
Dear hackers,

I updated my non-volatile WAL buffer patchset to v3.  Now it can be used in streaming replication mode.

Updates from v2:

- walreceiver supports non-volatile WAL buffer
Now walreceiver stores received records directly into the non-volatile WAL buffer when applicable.

- pg_basebackup supports non-volatile WAL buffer
Now pg_basebackup copies received WAL segments onto non-volatile WAL buffer when run in "nvwal" mode (-Fn).
You should specify a new NVWAL path with the --nvwal-path option.  The path will be written to postgresql.auto.conf or recovery.conf.  The size of the new NVWAL is the same as the master's.


Best regards,
Takashi

--
Takashi Menjo <[hidden email]>
NTT Software Innovation Center

> -----Original Message-----
> From: Takashi Menjo <[hidden email]>
> Sent: Wednesday, March 18, 2020 5:59 PM
> To: 'PostgreSQL-development' <[hidden email]>
> Cc: 'Robert Haas' <[hidden email]>; 'Heikki Linnakangas' <[hidden email]>; 'Amit Langote'
> <[hidden email]>
> Subject: RE: [PoC] Non-volatile WAL buffer
>
> Dear hackers,
>
> I rebased my non-volatile WAL buffer's patchset onto master.  A new v2 patchset is attached to this mail.
>
> I also measured performance before and after patchset, varying -c/--client and -j/--jobs options of pgbench, for
> each scaling factor s = 50 or 1000.  The results are presented in the following tables and the attached charts.
> Conditions, steps, and other details will be shown later.
>
>
> Results (s=50)
> ==============
>          Throughput [10^3 TPS]  Average latency [ms]
> ( c, j)  before  after          before  after
> -------  ---------------------  ---------------------
> ( 8, 8)  35.7    37.1 (+3.9%)   0.224   0.216 (-3.6%)
> (18,18)  70.9    74.7 (+5.3%)   0.254   0.241 (-5.1%)
> (36,18)  76.0    80.8 (+6.3%)   0.473   0.446 (-5.7%)
> (54,18)  75.5    81.8 (+8.3%)   0.715   0.660 (-7.7%)
>
>
> Results (s=1000)
> ================
>          Throughput [10^3 TPS]  Average latency [ms]
> ( c, j)  before  after          before  after
> -------  ---------------------  ---------------------
> ( 8, 8)  37.4    40.1 (+7.3%)   0.214   0.199 (-7.0%)
> (18,18)  79.3    86.7 (+9.3%)   0.227   0.208 (-8.4%)
> (36,18)  87.2    95.5 (+9.5%)   0.413   0.377 (-8.7%)
> (54,18)  86.8    94.8 (+9.3%)   0.622   0.569 (-8.5%)
>
>
> Both throughput and average latency are improved for each scaling factor.  Throughput seemed to almost reach
> the upper limit when (c,j)=(36,18).
>
> The percentage in s=1000 case looks larger than in s=50 case.  I think larger scaling factor leads to less
> contentions on the same tables and/or indexes, that is, less lock and unlock operations.  In such a situation,
> write-ahead logging appears to be more significant for performance.
>
>
> Conditions
> ==========
> - Use one physical server having 2 NUMA nodes (node 0 and 1)
>   - Pin postgres (server processes) to node 0 and pgbench to node 1
>   - 18 cores and 192GiB DRAM per node
> - Use an NVMe SSD for PGDATA and an interleaved 6-in-1 NVDIMM-N set for pg_wal
>   - Both are installed on the server-side node, that is, node 0
>   - Both are formatted with ext4
>   - NVDIMM-N is mounted with "-o dax" option to enable Direct Access (DAX)
> - Use the attached postgresql.conf
>   - Two new items nvwal_path and nvwal_size are used only after patch
>
>
> Steps
> =====
> For each (c,j) pair, I did the following steps three times then I found the median of the three as a final result shown
> in the tables above.
>
> (1) Run initdb with proper -D and -X options; and also give --nvwal-path and --nvwal-size options after patch
> (2) Start postgres and create a database for pgbench tables
> (3) Run "pgbench -i -s ___" to create tables (s = 50 or 1000)
> (4) Stop postgres, remount filesystems, and start postgres again
> (5) Execute pg_prewarm extension for all the four pgbench tables
> (6) Run pgbench during 30 minutes
>
>
> pgbench command line
> ====================
> $ pgbench -h /tmp -p 5432 -U username -r -M prepared -T 1800 -c ___ -j ___ dbname
>
> I gave no -b option to use the built-in "TPC-B (sort-of)" query.
>
>
> Software
> ========
> - Distro: Ubuntu 18.04
> - Kernel: Linux 5.4 (vanilla kernel)
> - C Compiler: gcc 7.4.0
> - PMDK: 1.7
> - PostgreSQL: d677550 (master on Mar 3, 2020)
>
>
> Hardware
> ========
> - System: HPE ProLiant DL380 Gen10
> - CPU: Intel Xeon Gold 6154 (Skylake) x 2sockets
> - DRAM: DDR4 2666MHz {32GiB/ch x 6ch}/socket x 2sockets
> - NVDIMM-N: DDR4 2666MHz {16GiB/ch x 6ch}/socket x 2sockets
> - NVMe SSD: Intel Optane DC P4800X Series SSDPED1K750GA
>
>
> Best regards,
> Takashi
>
> --
> Takashi Menjo <[hidden email]> NTT Software Innovation Center
>
> > -----Original Message-----
> > From: Takashi Menjo <[hidden email]>
> > Sent: Thursday, February 20, 2020 6:30 PM
> > To: 'Amit Langote' <[hidden email]>
> > Cc: 'Robert Haas' <[hidden email]>; 'Heikki Linnakangas' <[hidden email]>;
> 'PostgreSQL-development'
> > <[hidden email]>
> > Subject: RE: [PoC] Non-volatile WAL buffer
> >
> > Dear Amit,
> >
> > Thank you for your advice.  Exactly, it's so to speak "do as the hackers do when in pgsql"...
> >
> > I'm rebasing my branch onto master.  I'll submit an updated patchset and performance report later.
> >
> > Best regards,
> > Takashi
> >
> > --
> > Takashi Menjo <[hidden email]> NTT Software
> > Innovation Center
> >
> > > -----Original Message-----
> > > From: Amit Langote <[hidden email]>
> > > Sent: Monday, February 17, 2020 5:21 PM
> > > To: Takashi Menjo <[hidden email]>
> > > Cc: Robert Haas <[hidden email]>; Heikki Linnakangas
> > > <[hidden email]>; PostgreSQL-development
> > > <[hidden email]>
> > > Subject: Re: [PoC] Non-volatile WAL buffer
> > >
> > > Hello,
> > >
> > > On Mon, Feb 17, 2020 at 4:16 PM Takashi Menjo <[hidden email]> wrote:
> > > > Hello Amit,
> > > >
> > > > > I apologize for not having any opinion on the patches
> > > > > themselves, but let me point out that it's better to base these
> > > > > patches on HEAD (master branch) than REL_12_0, because all new
> > > > > code is committed to the master branch, whereas stable branches
> > > > > such as
> > > > > REL_12_0 only receive bug fixes.  Do you have any
> > > specific reason to be working on REL_12_0?
> > > >
> > > > Yes, because I think it's human-friendly to reproduce and discuss
> > > > performance measurement.  Of course I know
> > > all new accepted patches are merged into master's HEAD, not stable
> > > branches and not even release tags, so I'm aware of rebasing my
> > > patchset onto master sooner or later.  However, if someone,
> > > including me, says that s/he applies my patchset to "master" and
> > > measures its performance, we have to pay attention to which commit the "master"
> > > really points to.  Although we have sha1 hashes to specify which
> > > commit, we should check whether the specific commit on master has
> > > patches affecting performance or not
> > because master's HEAD gets new patches day by day.  On the other hand,
> > a release tag clearly points the commit all we probably know.  Also we
> > can check more easily the features and improvements by using release notes and user manuals.
> > >
> > > Thanks for clarifying. I see where you're coming from.
> > >
> > > While I do sometimes see people reporting numbers with the latest
> > > stable release' branch, that's normally just one of the baselines.
> > > The more important baseline for ongoing development is the master
> > > branch's HEAD, which is also what people volunteering to test your
> > > patches would use.  Anyone who reports would have to give at least
> > > two numbers -- performance with a branch's HEAD without patch
> > > applied and that with patch applied -- which can be enough in most
> > > cases to see the difference the patch makes.  Sure, the numbers
> > > might change on each report, but that's fine I'd think.  If you
> > > continue to develop against the stable branch, you might miss to
> > notice impact from any relevant developments in the master branch,
> > even developments which possibly require rethinking the architecture of your own changes, although maybe that
> rarely occurs.
> > >
> > > Thanks,
> > > Amit

v3-0001-Support-GUCs-for-external-WAL-buffer.patch (36K)
v3-0002-Non-volatile-WAL-buffer.patch (63K)
v3-0003-walreceiver-supports-non-volatile-WAL-buffer.patch (6K)
v3-0004-pg_basebackup-supports-non-volatile-WAL-buffer.patch (22K)
v3-0005-README-for-non-volatile-WAL-buffer.patch (7K)

Re: [PoC] Non-volatile WAL buffer

Takashi Menjo-3
Rebased.


On Wed, Jun 24, 2020 at 4:44 PM Takashi Menjo <[hidden email]> wrote:
Dear hackers,

I update my non-volatile WAL buffer's patchset to v3.  Now we can use it in streaming replication mode.

Updates from v2:

- walreceiver supports non-volatile WAL buffer
Now walreceiver stores received records directly to non-volatile WAL buffer if applicable.

- pg_basebackup supports non-volatile WAL buffer
Now pg_basebackup copies received WAL segments onto non-volatile WAL buffer if you run it with "nvwal" mode (-Fn).
You should specify a new NVWAL path with --nvwal-path option.  The path will be written to postgresql.auto.conf or recovery.conf.  The size of the new NVWAL is same as the master's one.


Best regards,
Takashi

--
Takashi Menjo <[hidden email]>
NTT Software Innovation Center

> -----Original Message-----
> From: Takashi Menjo <[hidden email]>
> Sent: Wednesday, March 18, 2020 5:59 PM
> To: 'PostgreSQL-development' <[hidden email]>
> Cc: 'Robert Haas' <[hidden email]>; 'Heikki Linnakangas' <[hidden email]>; 'Amit Langote'
> <[hidden email]>
> Subject: RE: [PoC] Non-volatile WAL buffer
>
> Dear hackers,
>
> I rebased my non-volatile WAL buffer's patchset onto master.  A new v2 patchset is attached to this mail.
>
> I also measured performance before and after patchset, varying -c/--client and -j/--jobs options of pgbench, for
> each scaling factor s = 50 or 1000.  The results are presented in the following tables and the attached charts.
> Conditions, steps, and other details will be shown later.
>
>
> Results (s=50)
> ==============
>          Throughput [10^3 TPS]  Average latency [ms]
> ( c, j)  before  after          before  after
> -------  ---------------------  ---------------------
> ( 8, 8)  35.7    37.1 (+3.9%)   0.224   0.216 (-3.6%)
> (18,18)  70.9    74.7 (+5.3%)   0.254   0.241 (-5.1%)
> (36,18)  76.0    80.8 (+6.3%)   0.473   0.446 (-5.7%)
> (54,18)  75.5    81.8 (+8.3%)   0.715   0.660 (-7.7%)
>
>
> Results (s=1000)
> ================
>          Throughput [10^3 TPS]  Average latency [ms]
> ( c, j)  before  after          before  after
> -------  ---------------------  ---------------------
> ( 8, 8)  37.4    40.1 (+7.3%)   0.214   0.199 (-7.0%)
> (18,18)  79.3    86.7 (+9.3%)   0.227   0.208 (-8.4%)
> (36,18)  87.2    95.5 (+9.5%)   0.413   0.377 (-8.7%)
> (54,18)  86.8    94.8 (+9.3%)   0.622   0.569 (-8.5%)
>
>
> Both throughput and average latency are improved for each scaling factor.  Throughput seemed to almost reach
> the upper limit when (c,j)=(36,18).
>
> The percentage in s=1000 case looks larger than in s=50 case.  I think larger scaling factor leads to less
> contentions on the same tables and/or indexes, that is, less lock and unlock operations.  In such a situation,
> write-ahead logging appears to be more significant for performance.
>
>
> Conditions
> ==========
> - Use one physical server having 2 NUMA nodes (node 0 and 1)
>   - Pin postgres (server processes) to node 0 and pgbench to node 1
>   - 18 cores and 192GiB DRAM per node
> - Use an NVMe SSD for PGDATA and an interleaved 6-in-1 NVDIMM-N set for pg_wal
>   - Both are installed on the server-side node, that is, node 0
>   - Both are formatted with ext4
>   - NVDIMM-N is mounted with "-o dax" option to enable Direct Access (DAX)
> - Use the attached postgresql.conf
>   - Two new items nvwal_path and nvwal_size are used only after patch
>
>
> Steps
> =====
> For each (c,j) pair, I did the following steps three times then I found the median of the three as a final result shown
> in the tables above.
>
> (1) Run initdb with proper -D and -X options; and also give --nvwal-path and --nvwal-size options after patch
> (2) Start postgres and create a database for pgbench tables
> (3) Run "pgbench -i -s ___" to create tables (s = 50 or 1000)
> (4) Stop postgres, remount filesystems, and start postgres again
> (5) Execute pg_prewarm extension for all the four pgbench tables
> (6) Run pgbench during 30 minutes
>
>
> pgbench command line
> ====================
> $ pgbench -h /tmp -p 5432 -U username -r -M prepared -T 1800 -c ___ -j ___ dbname
>
> I gave no -b option to use the built-in "TPC-B (sort-of)" query.
>
>
> Software
> ========
> - Distro: Ubuntu 18.04
> - Kernel: Linux 5.4 (vanilla kernel)
> - C Compiler: gcc 7.4.0
> - PMDK: 1.7
> - PostgreSQL: d677550 (master on Mar 3, 2020)
>
>
> Hardware
> ========
> - System: HPE ProLiant DL380 Gen10
> - CPU: Intel Xeon Gold 6154 (Skylake) x 2sockets
> - DRAM: DDR4 2666MHz {32GiB/ch x 6ch}/socket x 2sockets
> - NVDIMM-N: DDR4 2666MHz {16GiB/ch x 6ch}/socket x 2sockets
> - NVMe SSD: Intel Optane DC P4800X Series SSDPED1K750GA
>
>
> Best regards,
> Takashi
>
> --
> Takashi Menjo <[hidden email]> NTT Software Innovation Center
>
> > -----Original Message-----
> > From: Takashi Menjo <[hidden email]>
> > Sent: Thursday, February 20, 2020 6:30 PM
> > To: 'Amit Langote' <[hidden email]>
> > Cc: 'Robert Haas' <[hidden email]>; 'Heikki Linnakangas' <[hidden email]>;
> 'PostgreSQL-development'
> > <[hidden email]>
> > Subject: RE: [PoC] Non-volatile WAL buffer
> >
> > Dear Amit,
> >
> > Thank you for your advice.  Exactly, it's so to speak "do as the hackers do when in pgsql"...
> >
> > I'm rebasing my branch onto master.  I'll submit an updated patchset and performance report later.
> >
> > Best regards,
> > Takashi
> >
> > --
> > Takashi Menjo <[hidden email]> NTT Software
> > Innovation Center
> >
> > > -----Original Message-----
> > > From: Amit Langote <[hidden email]>
> > > Sent: Monday, February 17, 2020 5:21 PM
> > > To: Takashi Menjo <[hidden email]>
> > > Cc: Robert Haas <[hidden email]>; Heikki Linnakangas
> > > <[hidden email]>; PostgreSQL-development
> > > <[hidden email]>
> > > Subject: Re: [PoC] Non-volatile WAL buffer
> > >
> > > Hello,
> > >
> > > On Mon, Feb 17, 2020 at 4:16 PM Takashi Menjo <[hidden email]> wrote:
> > > > Hello Amit,
> > > >
> > > > > I apologize for not having any opinion on the patches
> > > > > themselves, but let me point out that it's better to base these
> > > > > patches on HEAD (master branch) than REL_12_0, because all new
> > > > > code is committed to the master branch, whereas stable branches
> > > > > such as
> > > > > REL_12_0 only receive bug fixes.  Do you have any
> > > specific reason to be working on REL_12_0?
> > > >
> > > > Yes, because I think it's human-friendly to reproduce and discuss
> > > > performance measurement.  Of course I know
> > > all new accepted patches are merged into master's HEAD, not stable
> > > branches and not even release tags, so I'm aware of rebasing my
> > > patchset onto master sooner or later.  However, if someone,
> > > including me, says that s/he applies my patchset to "master" and
> > > measures its performance, we have to pay attention to which commit the "master"
> > > really points to.  Although we have sha1 hashes to specify which
> > > commit, we should check whether the specific commit on master has
> > > patches affecting performance or not
> > because master's HEAD gets new patches day by day.  On the other hand,
> > a release tag clearly points the commit all we probably know.  Also we
> > can check more easily the features and improvements by using release notes and user manuals.
> > >
> > > Thanks for clarifying. I see where you're coming from.
> > >
> > > While I do sometimes see people reporting numbers with the latest
> > > stable release' branch, that's normally just one of the baselines.
> > > The more important baseline for ongoing development is the master
> > > branch's HEAD, which is also what people volunteering to test your
> > > patches would use.  Anyone who reports would have to give at least
> > > two numbers -- performance with a branch's HEAD without patch
> > > applied and that with patch applied -- which can be enough in most
> > > cases to see the difference the patch makes.  Sure, the numbers
> > > might change on each report, but that's fine I'd think.  If you
> > > continue to develop against the stable branch, you might miss to
> > notice impact from any relevant developments in the master branch,
> > even developments which possibly require rethinking the architecture of your own changes, although maybe that
> rarely occurs.
> > >
> > > Thanks,
> > > Amit


--
Takashi Menjo <[hidden email]>

v4-0001-Support-GUCs-for-external-WAL-buffer.patch (42K)
v4-0002-Non-volatile-WAL-buffer.patch (73K)
v4-0003-walreceiver-supports-non-volatile-WAL-buffer.patch (7K)
v4-0004-pg_basebackup-supports-non-volatile-WAL-buffer.patch (25K)
v4-0005-README-for-non-volatile-WAL-buffer.patch (9K)