Changing WAL Header to reduce contention during ReserveXLogInsertLocation()

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
15 messages Options
Reply | Threaded
Open this post in threaded view
|

Changing WAL Header to reduce contention during ReserveXLogInsertLocation()

Simon Riggs
Comments in ReserveXLogInsertLocation() says
"* This is the performance critical part of XLogInsert that must be serialized
  * across backends. The rest can happen mostly in parallel. Try to keep this
  * section as short as possible, insertpos_lck can be heavily contended on a
  * busy system."

We've worked out a way of reducing contention during
ReserveXLogInsertLocation() by using atomic operations.

The mechanism requires that we remove the xl_prev field in each WAL
record. This also has the effect of reducing WAL volume by a few %.

Currently, we store the start location of the previous WAL record in
the xl_prev field of the WAL record header. Typically, redo recovery
is a forward moving process and hence we don't ever need to consult
xl_prev and read WAL backwards (there is one exception, more on that
later [1]). So in theory, we should be able to remove this field
completely without compromising any functionality or correctness.

But the presence of xl_prev field enables us to guard against torn WAL
pages, when a WAL record starts on a sector boundary. In case of a
torn page, even though the WAL page looks sane, the WAL record could
actually be a stale record retained from the older, recycled WAL file.
The system usually guards against this by comparing xl_prev field
stored in the WAL record header with the WAL location of the previous
record read. Any mismatch is treated as end-of-WAL-stream.

So we can't completely remove xl_prev field, without giving up some
functionality. But we don't really need to store the 8-byte previous
WAL pointer in order to detect torn pages. Something else which can
tell us that the WAL record does not belong to current WAL segno would
be enough as well. I propose that we replace it with a much smaller
2-byte field (let's call it xl_walid). The "xl_walid" (or whatever we
decide to call it) is the low order 16-bits of the WAL segno to which
the WAL record belongs. While reading WAL, we always match that the
"xl_walid" value stored in the WAL record matches with the current WAL
segno's lower order 16-bits and if not, then consider that as the end
of the stream.

For this to work, we must ensure that WAL files are either recycled in
such a way that the "xl_walid" of the previous (to be recycled) WAL
differs from the new WAL or we zero-out the new WAL file. Seems quite
easy to do with the existing infrastructure.

Because of padding and alignment, replacing 8-byte xl_prev with 2-byte
xl_walid effectively reduces the WAL record header by a full 8-bytes
on a 64-bit machine. Obviously, this reduces the amount of WAL
produced and transferred to the standby. On a pgbench test, we see
about 3-5% reduction in WAL traffic, though in some tests higher -
depending on workload.

There is yet another important benefit of removing the xl_prev field.
We no longer need to track PrevBytePos field in XLogCtlInsert. The
insertpos_lck spinlock is now only guarding CurrBytePos. So we can
replace that with an atomic 64-bit integer and completely remove the
spinlock. The comment at the top of ReserveXLogInsertLocation()
clearly mentions the importance of keeping the critical section as
small as possible and this patch achieves that by using atomic
variables.

Pavan ran some micro-benchmarks to measure the effectiveness of the
approach. I (Pavan) wrote a wrapper on top of
ReserveXLogInsertLocation() and exposed that as a SQL-callable
function. I then used pgbench with 1-16 clients where each client
effectively calls ReserveXLogInsertLocation() 1M times. Following are
the results from the master and the patched code, averaged across 5
runs. The tests are done on a i2.2xlarge AWS instance.

HEAD
 1 ... 24.24 tps
 2 ... 18.12 tps
 4 ... 10.95 tps
 8 ...  9.05 tps
16 ... 8.44 tps

As you would notice, the spinlock contention is immediately evident
even when running with just 2 clients and gets worse with 4 or more
clients.

Patched
 1 ... 35.08 tps
 2 ... 31.99 tps
 4 ... 30.48 tps
 8 ... 40.44 tps
16 ... 50.14 tps

The patched code on the other hand scales to higher numbers of clients
much better.

Those are microbenchmarks. You need to run a multi-CPU workload with
heavy WAL inserts to show benefits.

[1] pg_rewind is the only exception which uses xl_prev to find the
previous checkpoint record. But we can always start from the beginning
of the WAL segment and read forward until we find the checkpoint
record. The patch does just the same and passes pg_rewind's tap tests.

Patch credit: Simon Riggs and Pavan Deolasee

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

pg_wal_header_reduction_v1.patch (35K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Changing WAL Header to reduce contention during ReserveXLogInsertLocation()

Alexander Kuzmenkov
Hi Simon, Pavan,

I tried benchmarking your patch. I ran pgbench on a 72-core machine with
a simple append-only script:

BEGIN;
INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid,
:bid, :aid, :delta, CURRENT_TIMESTAMP);
---- the insert repeated 20 times total -------
END;

I can see an increase in transaction throughput of about 5%. Please see
the attached graph 'append-only.png'
On the default tpcb-like benchmark, there is no meaningful change in
performance, as shown in 'tpcb.png'

Looking at the code, your changes seem reasonable to me. Some nitpicks:

XLogSegNoLowOrderMask -- this macro is not really needed,
XLogSegNoToSegID is enough (pg_resetwal should be changed to use the
latter, too).


The loop in findLastCheckpoint looks complex to me. It would be easier
to read with two loops, the outer one iterating over the segments and
the inner one iterating over the records.


 >+    uint16        xl_walid;        /* lowest 16 bits of the WAL file
to which this
 >+                                   record belongs */

I'd put a high-level description here, the low-level details are hidden
behind macros anyway. Something like this: /* identifier of the WAL
segment to which this record belongs. Used to detect stale WAL records
in a reused segment. */


 >+     * Make sure the above heavily-contended spinlock and byte
positions are
 >+     * on their own cache line. In particular, the RedoRecPtr and
full page

This comment in the definition of XLogCtlInsert mentions the spinlock
that is no more.


The signature of InstallXLogFileSegment looks overloaded now, not sure
how to simplify it. We could have two versions, one that looks for the
free slot and another that reuses the existing file. Also, it seems to
me that the lock doesn't have to be acquired inside this function, it
can be done by the caller.

--
Alexander Kuzmenkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


tpcb.png (17K) Download Attachment
append-only.png (20K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Changing WAL Header to reduce contention during ReserveXLogInsertLocation()

Robert Haas
In reply to this post by Simon Riggs
On Sat, Dec 30, 2017 at 5:32 AM, Simon Riggs <[hidden email]> wrote:

> So we can't completely remove xl_prev field, without giving up some
> functionality. But we don't really need to store the 8-byte previous
> WAL pointer in order to detect torn pages. Something else which can
> tell us that the WAL record does not belong to current WAL segno would
> be enough as well. I propose that we replace it with a much smaller
> 2-byte field (let's call it xl_walid). The "xl_walid" (or whatever we
> decide to call it) is the low order 16-bits of the WAL segno to which
> the WAL record belongs. While reading WAL, we always match that the
> "xl_walid" value stored in the WAL record matches with the current WAL
> segno's lower order 16-bits and if not, then consider that as the end
> of the stream.
>
> For this to work, we must ensure that WAL files are either recycled in
> such a way that the "xl_walid" of the previous (to be recycled) WAL
> differs from the new WAL or we zero-out the new WAL file. Seems quite
> easy to do with the existing infrastructure.

I have some reservations about whether this makes the mechanism less
reliable.  It seems that an 8-byte field would have almost no chance
of matching by accident even if the location was filled with random
bytes, but with a 2-byte field it's not that unlikely.  Of course, we
also have xl_crc, so I'm not sure whether there's any chance of real
harm...

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Reply | Threaded
Open this post in threaded view
|

Re: Changing WAL Header to reduce contention during ReserveXLogInsertLocation()

Tom Lane-2
Robert Haas <[hidden email]> writes:

> On Sat, Dec 30, 2017 at 5:32 AM, Simon Riggs <[hidden email]> wrote:
>> So we can't completely remove xl_prev field, without giving up some
>> functionality. But we don't really need to store the 8-byte previous
>> WAL pointer in order to detect torn pages. Something else which can
>> tell us that the WAL record does not belong to current WAL segno would
>> be enough as well. I propose that we replace it with a much smaller
>> 2-byte field (let's call it xl_walid). The "xl_walid" (or whatever we
>> decide to call it) is the low order 16-bits of the WAL segno to which
>> the WAL record belongs. While reading WAL, we always match that the
>> "xl_walid" value stored in the WAL record matches with the current WAL
>> segno's lower order 16-bits and if not, then consider that as the end
>> of the stream.
>>
>> For this to work, we must ensure that WAL files are either recycled in
>> such a way that the "xl_walid" of the previous (to be recycled) WAL
>> differs from the new WAL or we zero-out the new WAL file. Seems quite
>> easy to do with the existing infrastructure.

> I have some reservations about whether this makes the mechanism less
> reliable.

Yeah, it scares me too.  The xl_prev field is our only way of detecting
that we're looking at old WAL data when we cross a sector boundary.
I have no faith that we can prevent old WAL data from reappearing in the
file system across an OS crash, so I find Simon's assertion that we can
dodge the problem through file manipulation to be simply unbelievable.

If we could be sure that the WAL page size was no larger than the file
system's write quantum, then checking xlp_pageaddr would be sufficient
to detect stale WAL data.  But I'm afraid 8K is way too big for that;
so we need to be able to recognize page tears within-page.

> Of course, we also have xl_crc, so I'm not sure whether there's any
> chance of real harm...

The CRC only tells you that you have a valid WAL record, it won't clue
you in that it's old data you shouldn't replay.  If the previous WAL
record crossed the torn-page boundary, then you should have gotten a
CRC failure on that record --- but if the previous record ended at a
sector boundary, recognizing that the new record has an old xl_prev is
our ONLY defense against replaying stale data.

                        regards, tom lane

Reply | Threaded
Open this post in threaded view
|

Re: Changing WAL Header to reduce contention during ReserveXLogInsertLocation()

Tom Lane-2
I wrote:
> Robert Haas <[hidden email]> writes:
>> On Sat, Dec 30, 2017 at 5:32 AM, Simon Riggs <[hidden email]> wrote:
>>> For this to work, we must ensure that WAL files are either recycled in
>>> such a way that the "xl_walid" of the previous (to be recycled) WAL
>>> differs from the new WAL or we zero-out the new WAL file. Seems quite
>>> easy to do with the existing infrastructure.

> I have no faith that we can prevent old WAL data from reappearing in the
> file system across an OS crash, so I find Simon's assertion that we can
> dodge the problem through file manipulation to be simply unbelievable.

Forgot to say: if we *could* do it reliably, it would likely add
significant I/O traffic, eg an extra zeroing pass for every WAL segment.
That's a pretty heavy performance price to set against whatever win
we might get from contention reduction.

                        regards, tom lane

Reply | Threaded
Open this post in threaded view
|

Re: Changing WAL Header to reduce contention during ReserveXLogInsertLocation()

Andres Freund
In reply to this post by Tom Lane-2
On 2018-01-12 10:45:54 -0500, Tom Lane wrote:
> Robert Haas <[hidden email]> writes:
> > On Sat, Dec 30, 2017 at 5:32 AM, Simon Riggs <[hidden email]> wrote:
> > I have some reservations about whether this makes the mechanism less
> > reliable.
>
> Yeah, it scares me too.

Same here.


> The xl_prev field is our only way of detecting that we're looking at
> old WAL data when we cross a sector boundary.

Right. I wonder if it be reasonable to move that to a page's header
instead of individual records?  To avoid torn page issues we'd have to
reduce the page size to a sector size, but I'm not sure that's that bad?


> > Of course, we also have xl_crc, so I'm not sure whether there's any
> > chance of real harm...
>
> The CRC only tells you that you have a valid WAL record, it won't clue
> you in that it's old data you shouldn't replay.

Yea, I don't like relying on the CRC alone at all.

Greetings,

Andres Freund

Reply | Threaded
Open this post in threaded view
|

Re: Changing WAL Header to reduce contention during ReserveXLogInsertLocation()

Tom Lane-2
Andres Freund <[hidden email]> writes:
> On 2018-01-12 10:45:54 -0500, Tom Lane wrote:
>> The xl_prev field is our only way of detecting that we're looking at
>> old WAL data when we cross a sector boundary.

> Right. I wonder if it be reasonable to move that to a page's header
> instead of individual records?  To avoid torn page issues we'd have to
> reduce the page size to a sector size, but I'm not sure that's that bad?

Giving up a dozen or two bytes out of every 512 sounds like quite an
overhead.  Also, this'd mean that a much larger fraction of WAL records
need to be split across page boundaries, which I'd expect to produce a
performance hit in itself --- a page crossing has to complicate figuring
out how much space we need for the record.

                        regards, tom lane

Reply | Threaded
Open this post in threaded view
|

Re: Changing WAL Header to reduce contention during ReserveXLogInsertLocation()

Andres Freund
On 2018-01-12 17:24:54 -0500, Tom Lane wrote:

> Andres Freund <[hidden email]> writes:
> > On 2018-01-12 10:45:54 -0500, Tom Lane wrote:
> >> The xl_prev field is our only way of detecting that we're looking at
> >> old WAL data when we cross a sector boundary.
>
> > Right. I wonder if it be reasonable to move that to a page's header
> > instead of individual records?  To avoid torn page issues we'd have to
> > reduce the page size to a sector size, but I'm not sure that's that bad?
>
> Giving up a dozen or two bytes out of every 512 sounds like quite an
> overhead.

It's not nothing, that's true. But if it avoids 8 bytes in every record,
that'd probably at least as much in most usecases.


> Also, this'd mean that a much larger fraction of WAL records
> need to be split across page boundaries, which I'd expect to produce a
> performance hit in itself --- a page crossing has to complicate figuring
> out how much space we need for the record.

It does increase the computation a bit, see XLogBytePosToRecPtr(). I'd
guess that more of the overhead would come from the xlog buffer
management though.

Greetings,

Andres Freund

Reply | Threaded
Open this post in threaded view
|

Re: Changing WAL Header to reduce contention during ReserveXLogInsertLocation()

Tom Lane-2
Andres Freund <[hidden email]> writes:
> On 2018-01-12 17:24:54 -0500, Tom Lane wrote:
>> Andres Freund <[hidden email]> writes:
>>> Right. I wonder if it be reasonable to move that to a page's header
>>> instead of individual records?  To avoid torn page issues we'd have to
>>> reduce the page size to a sector size, but I'm not sure that's that bad?

>> Giving up a dozen or two bytes out of every 512 sounds like quite an
>> overhead.

> It's not nothing, that's true. But if it avoids 8 bytes in every record,
> that'd probably at least as much in most usecases.

Fair point.  I don't have a very good handle on what "typical" WAL record
sizes are, but we might be fine with that --- some quick counting on the
fingers says we'd break even with an average record size of ~160 bytes,
and be ahead below that.

We'd need to investigate the page-crossing overhead carefully though.

                        regards, tom lane

Reply | Threaded
Open this post in threaded view
|

Re: Changing WAL Header to reduce contention during ReserveXLogInsertLocation()

Andres Freund
On 2018-01-12 17:43:00 -0500, Tom Lane wrote:

> Andres Freund <[hidden email]> writes:
> > On 2018-01-12 17:24:54 -0500, Tom Lane wrote:
> >> Andres Freund <[hidden email]> writes:
> >>> Right. I wonder if it be reasonable to move that to a page's header
> >>> instead of individual records?  To avoid torn page issues we'd have to
> >>> reduce the page size to a sector size, but I'm not sure that's that bad?
>
> >> Giving up a dozen or two bytes out of every 512 sounds like quite an
> >> overhead.
>
> > It's not nothing, that's true. But if it avoids 8 bytes in every record,
> > that'd probably at least as much in most usecases.
>
> Fair point.  I don't have a very good handle on what "typical" WAL record
> sizes are, but we might be fine with that --- some quick counting on the
> fingers says we'd break even with an average record size of ~160 bytes,
> and be ahead below that.

This is far from a definitive answer, but here's some data:

pgbench -i -s 100 -q:
Type                                           N      (%)          Record size      (%)             FPI size      (%)        Combined size      (%)
----                                           -      ---          -----------      ---             --------      ---        -------------      ---
Total                                     308958                    1077269060 [84.19%]            202269468 [15.81%]           1279538528 [100%]

So here records are really large, which makes sense, given it's
largelyinitialization of data. With wal_compression that'd probably look
different, but still commonly spanning multiple pages.


pgbench -M prepared -c 16 -j 16 -T 100
Type                                           N      (%)          Record size      (%)             FPI size      (%)        Combined size      (%)
----                                           -      ---          -----------      ---             --------      ---        -------------      ---
Total                                   14228881                     947824170 [100.00%]                8192 [0.00%]             947832362 [100%]

Here we're at 66 bytes...


> We'd need to investigate the page-crossing overhead carefully though.

agreed.

Greetings,

Andres Freund

Reply | Threaded
Open this post in threaded view
|

Re: Changing WAL Header to reduce contention during ReserveXLogInsertLocation()

Claudio Freire
In reply to this post by Simon Riggs


On Sat, Dec 30, 2017 at 7:32 AM, Simon Riggs <[hidden email]> wrote:
So we can't completely remove xl_prev field, without giving up some
functionality. But we don't really need to store the 8-byte previous
WAL pointer in order to detect torn pages. Something else which can
tell us that the WAL record does not belong to current WAL segno would
be enough as well. I propose that we replace it with a much smaller
2-byte field (let's call it xl_walid). The "xl_walid" (or whatever we
decide to call it) is the low order 16-bits of the WAL segno to which
the WAL record belongs. While reading WAL, we always match that the
"xl_walid" value stored in the WAL record matches with the current WAL
segno's lower order 16-bits and if not, then consider that as the end
of the stream.

For this to work, we must ensure that WAL files are either recycled in
such a way that the "xl_walid" of the previous (to be recycled) WAL
differs from the new WAL or we zero-out the new WAL file. Seems quite
easy to do with the existing infrastructure.

Or, you can use the lower 16-bits of the previous record's CRC


Reply | Threaded
Open this post in threaded view
|

Re: Changing WAL Header to reduce contention during ReserveXLogInsertLocation()

Claudio Freire


On Fri, Jan 12, 2018 at 8:43 PM, Claudio Freire <[hidden email]> wrote:


On Sat, Dec 30, 2017 at 7:32 AM, Simon Riggs <[hidden email]> wrote:
So we can't completely remove xl_prev field, without giving up some
functionality. But we don't really need to store the 8-byte previous
WAL pointer in order to detect torn pages. Something else which can
tell us that the WAL record does not belong to current WAL segno would
be enough as well. I propose that we replace it with a much smaller
2-byte field (let's call it xl_walid). The "xl_walid" (or whatever we
decide to call it) is the low order 16-bits of the WAL segno to which
the WAL record belongs. While reading WAL, we always match that the
"xl_walid" value stored in the WAL record matches with the current WAL
segno's lower order 16-bits and if not, then consider that as the end
of the stream.

For this to work, we must ensure that WAL files are either recycled in
such a way that the "xl_walid" of the previous (to be recycled) WAL
differs from the new WAL or we zero-out the new WAL file. Seems quite
easy to do with the existing infrastructure.

Or, you can use the lower 16-bits of the previous record's CRC



Sorry, missed the whole point. Of the *first* record's CRC

Reply | Threaded
Open this post in threaded view
|

Re: Changing WAL Header to reduce contention during ReserveXLogInsertLocation()

Tom Lane-2
In reply to this post by Claudio Freire
Claudio Freire <[hidden email]> writes:
> On Sat, Dec 30, 2017 at 7:32 AM, Simon Riggs <[hidden email]> wrote:
>> So we can't completely remove xl_prev field, without giving up some
>> functionality.

> Or, you can use the lower 16-bits of the previous record's CRC

Hmm ... that is an interesting idea, but I'm not sure it helps much
towards Simon's actual objective.  AIUI the core problem here is the
contention involved in retrieving the previous WAL record's address.
Changing things so that we need the previous record's CRC isn't really
gonna improve that --- if anything, it'll be worse, because the
record's address can (in principle) be known sooner than its CRC.

Still, if we were just looking to shave some bits off of WAL record
headers, it might be possible to do something with this idea.

                        regards, tom lane

Reply | Threaded
Open this post in threaded view
|

Re: Changing WAL Header to reduce contention during ReserveXLogInsertLocation()

Claudio Freire

On Fri, Jan 12, 2018 at 10:49 PM, Tom Lane <[hidden email]> wrote:
Claudio Freire <[hidden email]> writes:
> On Sat, Dec 30, 2017 at 7:32 AM, Simon Riggs <[hidden email]> wrote:
>> So we can't completely remove xl_prev field, without giving up some
>> functionality.

> Or, you can use the lower 16-bits of the previous record's CRC

Hmm ... that is an interesting idea, but I'm not sure it helps much
towards Simon's actual objective.  AIUI the core problem here is the
contention involved in retrieving the previous WAL record's address.
Changing things so that we need the previous record's CRC isn't really
gonna improve that --- if anything, it'll be worse, because the
record's address can (in principle) be known sooner than its CRC.

Still, if we were just looking to shave some bits off of WAL record
headers, it might be possible to do something with this idea.

                        regards, tom lane

I later realized. That's why I corrected myself to the first record, not the previous.

Now, that assumes there's enough entropy in CRC values to actually make good use of those 16 bits... there may not. WAL segments are highly compressible after all.

So, maybe a hash of the LSN of the first record instead? That should be guaranteed to have good entropy (given a good hash).

In any case, there are many rather good alternatives to the segment number that should be reasonably safe from consistent collisions with garbage data.

Reply | Threaded
Open this post in threaded view
|

Re: Changing WAL Header to reduce contention during ReserveXLogInsertLocation()

Simon Riggs
In reply to this post by Tom Lane-2
On 12 January 2018 at 15:45, Tom Lane <[hidden email]> wrote:

>> I have some reservations about whether this makes the mechanism less
>> reliable.
>
> Yeah, it scares me too.  The xl_prev field is our only way of detecting
> that we're looking at old WAL data when we cross a sector boundary.
> I have no faith that we can prevent old WAL data from reappearing in the
> file system across an OS crash, so I find Simon's assertion that we can
> dodge the problem through file manipulation to be simply unbelievable.

Not really sure what you mean by "file manipulation". Maybe the
proposal wasn't clear.

We need a way of detecting that we are looking at old WAL data. More
specifically, we need to know whether we are looking at a current file
or an older file. My main assertion here is that the detection only
needs to happen at file-level, not at record level, so it is OK to
lose some bits of information without changing our ability to protect
data - they were not being used productively.

Let's do the math to see if it is believable, or not.

The new two byte value is protected by CRC. The 2 byte value repeats
every 32768 WAL files. Any bit error in that value that made it appear
to be a current value would need to have a rare set of circumstances.

1. We would need to suffer a bit error that did not get caught by the CRC.

2. An old WAL record would need to occur right on the boundary of the
last WAL record.

3. The bit error would need to occur within the 2 byte value. WAL
records are usually fairly long, but so this has a Probability of
<1/16

4. The bit error would need to change an old value to the current
value of the new 2 byte field. If the current value is N, and the
previous value is M, then a single bit error that takes M -> N can
only happen if N-M is divisible by 2. The maximum probability of an
issue would occur when we reuse WAL every 3 files, so probability of
such a change would be 1/16. If the distance between M and N is not a
power of two then a single bit error cannot change M into N. So what
probability do we assign to the situation that M and N are exactly a
power of two apart?

So the probability of this occurring requires a single undetectable
bit error and would then happen less than 1 in 256 times, but arguably
much less. Notice that this probability is therefore at least 2 orders
of magnitude smaller than the chance that a single bit error occurs
and simply corrupts data, a mere rounding error in risk.

I don't find that unbelievable at all.

If you still do, then I would borrow Andres' idea of using the page
header. If we copy the new 2 byte value into the page header, we can
use that to match against in the case of error. XLogPageHeaderData can
be extended by 2 bytes without increasing its size when using 8 byte
alignment. The new 2 byte value is the same anywhere in the file, so
that works quickly and easily. And it doesn't increase the size of the
header.

So with that change it looks completely viable.

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Previous Thread Next Thread