Enable data checksums by default

Christoph Berg-2
Lately, PostgreSQL has moved many defaults from the "bare minimum" side
toward "user friendly by default", e.g. hot_standby and replication in
the default configuration, parallelism, and generally higher defaults
for resource knobs like *_mem, autovacuum_* and so on.

I think the next step in that direction would be to enable data
checksums by default. They make sense in most setups, and people who
plan to run very performance-critical systems where checksums might be
too expensive need to tune many knobs anyway and can just as well
disable them manually, instead of everyone else having to enable them
manually. Also, disabling is much easier than enabling.

One argument against checksums used to be that we lacked tools to deal
with the problems they detect. But ignore_checksum_failure and the
pg_checksums tool address that.
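
(For reference, a sketch of those escape hatches, assuming the v12
syntax; pg_checksums operates on a cleanly stopped cluster, and the
data directory path is illustrative:

    SET ignore_checksum_failure = on;          -- session-level, may return bad data
    pg_checksums --check -D /path/to/data      # verify all blocks offline
    pg_checksums --disable -D /path/to/data    # turn checksums off again
)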

The attached patch flips the default in initdb. It also adds a new
option -K / --no-data-checksums that wasn't present previously. Docs
are updated to state the new default, and the testsuite exercises
the -K option.
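
With the patch applied, that would look like this (illustrative data
directory path):

    initdb -D /srv/pg/data                        # checksums now on by default
    initdb -D /srv/pg/data --no-data-checksums    # new -K switch opts out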

Christoph

Attachment: 0001-Enable-data-checksums-by-default.patch (6K)

Re: Enable data checksums by default

Tom Lane-2
Christoph Berg <[hidden email]> writes:
> I think the next step in that direction would be to enable data
> checksums by default. They make sense in most setups,

Well, that is exactly the point that needs some proof, not just
an unfounded assertion.

IMO, the main value of checksums is that they allow the Postgres
project to deflect blame.  That's nice for us but I'm not sure
that it's a benefit for users.  I've seen little if any data to
suggest that checksums actually catch enough problems to justify
the extra CPU costs and the risk of false positives.

                        regards, tom lane


Re: Enable data checksums by default

Andres Freund
Hi,

On 2019-03-22 12:07:22 -0400, Tom Lane wrote:

> Christoph Berg <[hidden email]> writes:
> > I think the next step in that direction would be to enable data
> > checksums by default. They make sense in most setups,
>
> Well, that is exactly the point that needs some proof, not just
> an unfounded assertion.
>
> IMO, the main value of checksums is that they allow the Postgres
> project to deflect blame.  That's nice for us but I'm not sure
> that it's a benefit for users.  I've seen little if any data to
> suggest that checksums actually catch enough problems to justify
> the extra CPU costs and the risk of false positives.

IDK, being able to verify in some form that backups aren't corrupted on
an IO level is mighty nice. That often allows detecting the issue
while one still has older backups around.

My problem is more that I'm not confident the checks are mature
enough. The basebackup checks are at the moment not able to detect
random data, and neither basebackup nor backend checks detect
zeroed-out files/file ranges.

Greetings,

Andres Freund


Re: Enable data checksums by default

Tomas Vondra-4


On 3/22/19 5:10 PM, Andres Freund wrote:

> Hi,
>
> On 2019-03-22 12:07:22 -0400, Tom Lane wrote:
>> Christoph Berg <[hidden email]> writes:
>>> I think the next step in that direction would be to enable data
>>> checksums by default. They make sense in most setups,
>>
>> Well, that is exactly the point that needs some proof, not just
>> an unfounded assertion.
>>
>> IMO, the main value of checksums is that they allow the Postgres
>> project to deflect blame.  That's nice for us but I'm not sure
>> that it's a benefit for users.  I've seen little if any data to
>> suggest that checksums actually catch enough problems to justify
>> the extra CPU costs and the risk of false positives.
>

I'm not sure about checksums being an effective tool to deflect blame.
Consider the recent fsync retry issues: due to the assumption that we
could just retry fsync, we may have lost some writes, resulting in
torn pages and checksum failures. I'm sure we could argue about how
much sense the fsync behavior made, but I doubt checksum failures are
enough to deflect blame here.

> IDK, being able to verify in some form that backups aren't corrupted on
> an IO level is mighty nice. That often allows detecting the issue
> while one still has older backups around.
>

Yeah, I agree that's a valuable capability. I think the question is how
effective it actually is, considering how much storage has changed over
the past few years (which necessarily affects the type of failures
people have to deal with).

> My problem is more that I'm not confident the checks are mature
> enough. The basebackup checks are atm not able to detect random data,
> and neither basebackup nor backend checks detect zeroed out files/file
> ranges.
>

Yep :-( The pg_basebackup vulnerability to random garbage in a page
header is unfortunate; we'd better improve that.

It's not clear to me what checksums can do about zeroed pages (and/or
truncated files), though.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Enable data checksums by default

Andres Freund
Hi,

On 2019-03-22 17:32:10 +0100, Tomas Vondra wrote:

> On 3/22/19 5:10 PM, Andres Freund wrote:
> > IDK, being able to verify in some form that backups aren't corrupted on
> > an IO level is mighty nice. That often allows detecting the issue
> > while one still has older backups around.
> >
>
> Yeah, I agree that's a valuable capability. I think the question is how
> effective it actually is, considering how much storage has changed over
> the past few years (which necessarily affects the type of failures
> people have to deal with).

I'm not sure I understand? How do the changes around storage
meaningfully affect the need to have some trust in backups and
the benefit of earlier detection?


> It's not clear to me what checksums can do about zeroed pages (and/or
> truncated files), though.

Well, there's nothing fundamental about needing added pages to be
zeroes. We could extend files with pages carrying actual valid
checksums instead of
                /* new buffers are zero-filled */
                MemSet((char *) bufBlock, 0, BLCKSZ);
                /* don't set checksum for all-zero page */
                smgrextend(smgr, forkNum, blockNum, (char *) bufBlock, false);

the problem is that it's hard to do so safely without adding a lot of
additional WAL logging. A lot of filesystems will journal metadata
changes (like the size of the file), but not contents. So after a crash
the tail end might appear zeroed out, even if we never wrote
zeroes. That's obviously solvable by WAL logging, but that's not cheap.

It might still be a good idea to just write a page with an initialized
header / checksum at that point, as that ought to still detect a number
of problems we can't detect right now.
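
Concretely, something like this hypothetical variant, using the
existing bufpage.h helpers (on its own this is unsafe for exactly the
WAL reason above, so take it as a sketch, not a proposal):

                /* hypothetical: extend with an initialized, checksummed page */
                PageInit((Page) bufBlock, BLCKSZ, 0);
                PageSetChecksumInplace((Page) bufBlock, blockNum);
                smgrextend(smgr, forkNum, blockNum, (char *) bufBlock, false);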

Greetings,

Andres Freund


Re: Enable data checksums by default

Tomas Vondra-4
On 3/22/19 5:41 PM, Andres Freund wrote:

> Hi,
>
> On 2019-03-22 17:32:10 +0100, Tomas Vondra wrote:
>> On 3/22/19 5:10 PM, Andres Freund wrote:
>>> IDK, being able to verify in some form that backups aren't corrupted on
>>> an IO level is mighty nice. That often allows detecting the issue
>>> while one still has older backups around.
>>>
>>
>> Yeah, I agree that's a valuable capability. I think the question is how
>> effective it actually is, considering how much storage has changed over
>> the past few years (which necessarily affects the type of failures
>> people have to deal with).
>
> I'm not sure I understand? How do the changes around storage
> meaningfully affect the need to have some trust in backups and
> the benefit of earlier detection?
>

Having trust in backups is still desirable - nothing changes that,
obviously. The question I was posing was rather "Are checksums still
effective on current storage systems?"

I'm wondering if the storage systems people use nowadays may be failing
in ways that are not reliably detectable by checksums. I don't have any
data to either support or reject that hypothesis, though.

>
>> It's not clear to me what checksums can do about zeroed pages (and/or
>> truncated files), though.
>
> Well, there's nothing fundamental about needing added pages to be
> zeroes. We could extend files with pages carrying actual valid
> checksums instead of
> /* new buffers are zero-filled */
> MemSet((char *) bufBlock, 0, BLCKSZ);
> /* don't set checksum for all-zero page */
> smgrextend(smgr, forkNum, blockNum, (char *) bufBlock, false);
>
> the problem is that it's hard to do so safely without adding a lot of
> additional WAL logging. A lot of filesystems will journal metadata
> changes (like the size of the file), but not contents. So after a crash
> the tail end might appear zeroed out, even if we never wrote
> zeroes. That's obviously solvable by WAL logging, but that's not cheap.
>

Hmmm. I'd say a filesystem that does not guarantee having all the data
after an fsync is outright broken, but maybe that's what checksums are
meant to protect against.

> It might still be a good idea to just write a page with an initialized
> header / checksum at that point, as that ought to still detect a number
> of problems we can't detect right now.
>

Sounds reasonable.

cheers

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Enable data checksums by default

Andres Freund
On 2019-03-22 18:01:32 +0100, Tomas Vondra wrote:

> On 3/22/19 5:41 PM, Andres Freund wrote:
> > Hi,
> >
> > On 2019-03-22 17:32:10 +0100, Tomas Vondra wrote:
> >> On 3/22/19 5:10 PM, Andres Freund wrote:
> >>> IDK, being able to verify in some form that backups aren't corrupted on
> >>> an IO level is mighty nice. That often allows detecting the issue
> >>> while one still has older backups around.
> >>>
> >>
> >> Yeah, I agree that's a valuable capability. I think the question is how
> >> effective it actually is, considering how much storage has changed over
> >> the past few years (which necessarily affects the type of failures
> >> people have to deal with).
> >
> > I'm not sure I understand? How do the changes around storage
> > meaningfully affect the need to have some trust in backups and
> > the benefit of earlier detection?
> >
>
> Having trust in backups is still desirable - nothing changes that,
> obviously. The question I was posing was rather "Are checksums still
> effective on current storage systems?"
>
> I'm wondering if the storage systems people use nowadays may be failing
> in ways that are not reliably detectable by checksums. I don't have any
> data to either support or reject that hypothesis, though.

I don't think it's useful to paint unsubstantiated doom-and-gloom
pictures.


> >> It's not clear to me what checksums can do about zeroed pages (and/or
> >> truncated files), though.
> >
> > Well, there's nothing fundamental about needing added pages to be
> > zeroes. We could extend files with pages carrying actual valid
> > checksums instead of
> > /* new buffers are zero-filled */
> > MemSet((char *) bufBlock, 0, BLCKSZ);
> > /* don't set checksum for all-zero page */
> > smgrextend(smgr, forkNum, blockNum, (char *) bufBlock, false);
> >
> > the problem is that it's hard to do so safely without adding a lot of
> > additional WAL logging. A lot of filesystems will journal metadata
> > changes (like the size of the file), but not contents. So after a crash
> > the tail end might appear zeroed out, even if we never wrote
> > zeroes. That's obviously solvable by WAL logging, but that's not cheap.
> >
>
> Hmmm. I'd say a filesystem that does not guarantee having all the data
> after an fsync is outright broken, but maybe that's what checksums are
> meant to protect against.

There's no fsync here: smgrextend(with-valid-checksum); crash. The OS
will probably have journalled the file size change, but not the
contents. After a crash it's thus likely that the data page will
appear zeroed.  That prevents us from erroring out when encountering a
zeroed page, even though doing so would be very good for error
detection capabilities, because storage systems show corrupted data as
zeroes in a number of cases.

Greetings,

Andres Freund


Re: Enable data checksums by default

Christoph Berg-2
In reply to this post by Tom Lane-2
Re: Tom Lane 2019-03-22 <[hidden email]>
> Christoph Berg <[hidden email]> writes:
> > I think the next step in that direction would be to enable data
> > checksums by default. They make sense in most setups,
>
> Well, that is exactly the point that needs some proof, not just
> an unfounded assertion.

I ran a benchmark with checksums disabled/enabled. shared_buffers is
512kB to make sure almost every read fetches the page from the OS
cache (checksums are verified whenever a page is read into shared
buffers); scale factor is 50 (~750MB) to make sure the whole cluster
fits into RAM.

model name: Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (8 threads)
alter system set shared_buffers = '512kB';
pgbench -s 50 -i
pgbench -P 5 -M prepared -c 8 -j 8 -T 60 --select-only

without checksums:
tps = 96893.627255 (including connections establishing)
tps = 97570.587793 (including connections establishing)
tps = 97455.484419 (including connections establishing)
tps = 97533.668801 (including connections establishing)
average: 97363

with checksums:
tps = 91942.502487 (including connections establishing)
tps = 92390.556925 (including connections establishing)
tps = 92956.923271 (including connections establishing)
tps = 92914.205047 (including connections establishing)
average: 92551

select 92551.0/97363;
0.9506

So the cost is 5% in this very contrived case. In almost any other
setting, the cost would be lower, I'd think.

Christoph


Re: Enable data checksums by default

Peter Geoghegan-4
In reply to this post by Tom Lane-2
On Fri, Mar 22, 2019 at 9:07 AM Tom Lane <[hidden email]> wrote:
> IMO, the main value of checksums is that they allow the Postgres
> project to deflect blame.  That's nice for us but I'm not sure
> that it's a benefit for users.  I've seen little if any data to
> suggest that checksums actually catch enough problems to justify
> the extra CPU costs and the risk of false positives.

I share your concern.

Some users have a peculiar kind of cognitive dissonance around
corruption, at least in my experience. It's very difficult for them to
make a choice on whether or not to fail hard. Perhaps that needs to be
taken into account, without being indulged.

--
Peter Geoghegan



Re: Enable data checksums by default

Christoph Berg-2
In reply to this post by Christoph Berg-2
Re: To Tom Lane 2019-03-26 <[hidden email]>
> I ran a benchmark with checksums disabled/enabled. shared_buffers is
> 512kB to make sure almost every read fetches the page from the OS
> cache; scale factor is 50 (~750MB) to make sure the whole cluster
> fits into RAM.
[...]
> So the cost is 5% in this very contrived case. In almost any other
> setting, the cost would be lower, I'd think.

(That was on 12devel, btw.)

That was about the most extreme OLTP read-only workload. After
thinking about it some more, I realized that exercising large seqscans
might be an even better test, because there is less per-query
overhead.

Same setup again, shared_buffers = 16 (128kB), jit = off,
max_parallel_workers_per_gather = 0:

select count(bid) from pgbench_accounts;

no checksums: ~456ms
with checksums: ~489ms

456.0/489 = 0.9325

The cost of checksums is about 6.75% here.

Christoph



Re: Enable data checksums by default

Ants Aasma
On Wed, Mar 27, 2019, 15:57 Christoph Berg <[hidden email]> wrote:
> Same setup again, shared_buffers = 16 (128kB), jit = off,
> max_parallel_workers_per_gather = 0:
>
> select count(bid) from pgbench_accounts;
>
> no checksums: ~456ms
> with checksums: ~489ms
>
> The cost of checksums is about 6.75% here.

Can you try with postgres compiled with CFLAGS="-O2 -march=native"? There's a bit of low-hanging fruit there to use a runtime CPU check to pick a better-optimized checksum function.

Regards,
Ants Aasma

Re: Enable data checksums by default

Christoph Berg-2
Re: Ants Aasma 2019-03-27 <[hidden email]>
> Can you try with postgres compiled with CFLAGS="-O2 -march=native"? There's
> a bit of low-hanging fruit there to use a runtime CPU check to pick a
> better-optimized checksum function.

Frankly, no. This is with the apt.pg.o packages, which are supposed to
be usable by everyone. If there is a better per-CPU checksum function,
PG should pick it at runtime. Special compiler flags are a no-go here.

CPPFLAGS = -Wdate-time -D_FORTIFY_SOURCE=2 -D_GNU_SOURCE -I/usr/include/libxml2 -I/usr/include/mit-krb5
CFLAGS = -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Werror=vla -Wendif-labels -Wmissing-format-attribute -Wformat-security -fno-strict-aliasing -fwrapv -fexcess-precision=standard -Wno-format-truncation -Wno-stringop-truncation -g -g -O2 -fstack-protector-strong -Wformat -Werror=format-security -fno-omit-frame-pointer

Christoph



Re: Enable data checksums by default

Peter Eisentraut-6
In reply to this post by Christoph Berg-2
On 2019-03-22 16:16, Christoph Berg wrote:
> I think the next step in that direction would be to enable data
> checksums by default. They make sense in most setups, and people who
> plan to run very performance-critical systems where checksums might be
> too expensive need to tune many knobs anyway and can just as well
> disable them manually, instead of everyone else having to enable them
> manually. Also, disabling is much easier than enabling.

It would also enable pg_rewind to work by default.

--
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Enable data checksums by default

Bernd Helmle
In reply to this post by Christoph Berg-2
On Tuesday, 2019-03-26 at 16:14 +0100, Christoph Berg wrote:
> select 92551.0/97363;
> 0.9506
>
> So the cost is 5% in this very contrived case. In almost any other
> setting, the cost would be lower, I'd think.

Well, my machine (Intel(R) Core(TM) i7-6770HQ CPU @ 2.60GHz, 32 GByte
RAM) tells me this:

pgbench -s 50 -i pgbench
pg_ctl -o "--shared-buffers=128kB" restart
pgbench -r -P4 -Mprepared -T60 -c $clients -j $clients -n -S

...prewarm...

Clients    checksums    no checksums    ratio
      1        20110           21716    0.926
      2        35338           38543    0.917
      4        67207           72118    0.932
      8        96627          117545    0.822
     16       110091          121415    0.907

So between ~7% and ~18% impact with checksums in this specific case.

        Bernd





Re: Enable data checksums by default

Ants Aasma
In reply to this post by Christoph Berg-2
On Thu, Mar 28, 2019 at 10:38 AM Christoph Berg <[hidden email]> wrote:
> If there is a better per-CPU checksum function, PG should pick it at
> runtime. Special compiler flags are a no-go here.

I went ahead and tested it on the count(*) test, same settings as upthread. Median of 5 runs of 20 transactions each on an Intel i5-2500K @ 4GHz.

No checksum: 344ms
Checksums: 384ms (+12%)
No checksum march=native: 344ms
Checksums march=native: 369ms (+7%)

The checksum code was written to be easily auto-vectorized by the compiler. So if we just compile the same function with different compiler flags and pick between them at runtime, the overhead can be approximately halved. Not saying that this needs to be done before enabling checksums by default, just that when considering the overhead, we can foresee it being much lower in future versions.
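
A minimal sketch of such runtime dispatch, assuming GCC 6+ (or a
recent clang) on x86-64 with glibc ifunc support; the kernel below
mirrors the shape of PostgreSQL's FNV-1a-style pg_checksum_block, but
the seeding and naming are illustrative, not the tree's actual code:

    #include <stdint.h>
    #include <stdio.h>

    #define BLCKSZ    8192
    #define N_SUMS    32
    #define FNV_PRIME 16777619

    /* target_clones emits one clone per listed target plus an ifunc
     * resolver that picks the best supported one at load time */
    __attribute__((target_clones("avx2", "default")))
    uint32_t
    checksum_block(const uint32_t *data)
    {
        uint32_t sums[N_SUMS];
        uint32_t result = 0;

        for (int j = 0; j < N_SUMS; j++)
            sums[j] = 0x811C9DC5;    /* per-lane seed (FNV offset basis) */

        /* N_SUMS independent accumulators; the avx2 clone can keep them
         * in vector registers and process several lanes per instruction */
        for (int i = 0; i < BLCKSZ / 4 / N_SUMS; i++)
            for (int j = 0; j < N_SUMS; j++)
            {
                uint32_t tmp = sums[j] ^ data[i * N_SUMS + j];

                sums[j] = tmp * FNV_PRIME ^ (tmp >> 17);
            }

        for (int j = 0; j < N_SUMS; j++)
            result ^= sums[j];
        return result;
    }

    int
    main(void)
    {
        static uint32_t page[BLCKSZ / 4];

        page[0] = 42;
        printf("checksum: %u\n", checksum_block(page));
        return 0;
    }

Compiled once with plain -O2, the resolver transparently selects the
AVX2 clone on CPUs that have it, which gets the effect of
-march=native without per-CPU packages.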

Regards,
Ants Aasma

Re: Enable data checksums by default

Michael Paquier-2
In reply to this post by Bernd Helmle
On Fri, Mar 29, 2019 at 11:16:11AM +0100, Bernd Helmle wrote:
> So between ~7% and ~18% impact with checksums in this specific case.

I can't really believe that many people set up shared_buffers at
128kB, which would cause such a large number of page evictions, but I
can believe that many users have shared_buffers set to its default
value, and that we are going to get complaints about "performance drop
after upgrade to v12" if we switch data checksums on by default.
--
Michael


Re: Enable data checksums by default

Bernd Helmle
On Friday, 2019-03-29 at 23:10 +0900, Michael Paquier wrote:
> I can't really believe that many people set up shared_buffers at
> 128kB, which would cause such a large number of page evictions, but I
> can believe that many users have shared_buffers set to its default
> value, and that we are going to get complaints about "performance drop
> after upgrade to v12" if we switch data checksums on by default.

Yeah, I think Christoph's benchmark is based on this thinking. I assume
this very unrealistic scenario is meant to emulate the worst case (many
buffer reads, high checksum calculation load).

        Bernd





Re: Enable data checksums by default

Christoph Berg-2
Re: Bernd Helmle 2019-03-29 <[hidden email]>

> On Friday, 2019-03-29 at 23:10 +0900, Michael Paquier wrote:
> > I can't really believe that many people set up shared_buffers at
> > 128kB, which would cause such a large number of page evictions, but I
> > can believe that many users have shared_buffers set to its default
> > value, and that we are going to get complaints about "performance drop
> > after upgrade to v12" if we switch data checksums on by default.
>
> Yeah, I think Christoph's benchmark is based on this thinking. I assume
> this very unrealistic scenario is meant to emulate the worst case (many
> buffer reads, high checksum calculation load).

It's not unrealistic to have large seqscans that are all buffer
misses; the table just has to be big enough. The idea in my benchmark
was that if I make shared buffers really small and the table still
fits into RAM, I should be seeing only buffer misses, but without any
delay for actually reading from disk.

Christoph



Re: Enable data checksums by default

Tomas Vondra-4
On Fri, Mar 29, 2019 at 08:35:26PM +0100, Christoph Berg wrote:

>Re: Bernd Helmle 2019-03-29 <[hidden email]>
>> On Friday, 2019-03-29 at 23:10 +0900, Michael Paquier wrote:
>> > I can't really believe that many people set up shared_buffers at
>> > 128kB, which would cause such a large number of page evictions, but I
>> > can believe that many users have shared_buffers set to its default
>> > value, and that we are going to get complaints about "performance drop
>> > after upgrade to v12" if we switch data checksums on by default.
>>
>> Yeah, I think Christoph's benchmark is based on this thinking. I assume
>> this very unrealistic scenario is meant to emulate the worst case (many
>> buffer reads, high checksum calculation load).
>
>It's not unrealistic to have large seqscans that are all buffer
>misses; the table just has to be big enough. The idea in my benchmark
>was that if I make shared buffers really small and the table still
>fits into RAM, I should be seeing only buffer misses, but without any
>delay for actually reading from disk.
>
>Christoph
>

FWIW I think it's a mistake to focus solely on CPU utilization, which
all the benchmarks performed on this thread do, because they look at
tps of in-memory read-only workloads. Checksums have other costs too,
not just the additional CPU time. Most importantly, they force the
equivalent of wal_log_hints (which people may or may not want anyway).

I've done a simple benchmark that does read-only (-S) and read-write
(-N) pgbench runs with different scales, but also measures the duration
of the pgbench init and the amount of WAL produced during the tests.
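
For concreteness, one cell of such a run might look roughly like this
(a sketch assuming stock pgbench and pg_current_wal_lsn() /
pg_wal_lsn_diff(), not the exact script used):

    start_lsn=$(psql -Atc "SELECT pg_current_wal_lsn()")
    time pgbench -i -s 200                        # init duration
    pgbench -M prepared -S -T 300 -c 16 -j 16     # read-only tps
    end_lsn=$(psql -Atc "SELECT pg_current_wal_lsn()")
    psql -Atc "SELECT pg_size_pretty(pg_wal_lsn_diff('$end_lsn', '$start_lsn'))"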

On a small machine (i5, 8GB RAM, SSD RAID) the results are these:

         scale    config     |    init           tps        wal
    =========================|==================================
    ro      10    no-hints   |       2        117038        130
                  hints      |       2        116378        146
                  checksums  |       2        115619        147
          -------------------|----------------------------------
           200    no-hints   |      32         88340        2407
                  hints      |      37         86154        2628
                  checksums  |      36         83336        2624
          -------------------|----------------------------------
          2000    no-hints   |     365         38680        1967
                  hints      |     423         38670        2123
                  checksums  |     504         37510        2046
    -------------------------|----------------------------------
    rw      10    no-hints   |       2         19691        437
                  hints      |       2         19712        437
                  checksums  |       2         19654        437
          -------------------|----------------------------------
           200    no-hints   |      32         15839        2745
                  hints      |      37         15735        2783
                  checksums  |      36         15646        2775
          -------------------|----------------------------------
          2000    no-hints   |     365         5371         3721
                  hints      |     423         5270         3671
                  checksums  |     504         5094         3574

The no-hints config is default (wal_log_hints=off, data_checksums=off),
hints sets wal_log_hints=on and checksums enables data checksums. All
the configs were somewhat tuned (1GB shared buffers, max_wal_size high
enough not to hit checkpoints very often, etc.).

I've also done the tests on a larger machine (2x E5-2620v4, 32GB of
RAM, NVMe SSD), and the general pattern is about the same - while the
tps and amount of WAL (not covering the init) do not change, the time
for initialization increases significantly (by 20-40%).

This effect is even clearer when using slower storage (SATA-based RAID).
The results then look like this:

         scale    config     |    init           tps        wal
    =========================|==================================
    ro     100    no-hints   |      49        229459        122
                  hints      |     101        167983        190
                  checksums  |     103        156307        190
          -------------------|----------------------------------
          1000    no-hints   |     580        152167        109
                  hints      |    1047        122814        142
                  checksums  |    1080        118586        141
          -------------------|----------------------------------
          6000    no-hints   |    4035           508          1
                  hints      |   11193           502          1
                  checksums  |   11376           506          1
    -------------------------|----------------------------------
    rw     100    no-hints   |      49           279        192
                  hints      |     101           275        190
                  checksums  |     103           275        190
          -------------------|----------------------------------
          1000    no-hints   |     580           237        210
                  hints      |    1047           225        201
                  checksums  |    1080           224        200
          -------------------|----------------------------------
          6000    no-hints   |    4035           135        123
                  hints      |   11193           133        122
                  checksums  |   11376           132        121

and when expressed as relative to no-hints:

            scale       config  |     init       tps        wal
    ============================|===============================
    ro        100        hints  |     206%       73%        155%
                     checksums  |     210%       68%        155%
             -------------------|--------------------------------
             1000        hints  |     181%       81%        131%
                     checksums  |     186%       78%        129%
             -------------------|--------------------------------
             6000        hints  |     277%       99%        100%
                     checksums  |     282%       100%       104%
    ----------------------------|--------------------------------
    rw        100        hints  |     206%       99%        99%
                     checksums  |     210%       99%        99%
             -------------------|--------------------------------
             1000        hints  |     181%       95%        96%
                     checksums  |     186%       95%        95%
             -------------------|--------------------------------
             6000        hints  |     277%       99%        99%
                     checksums  |     282%       98%        98%

I have not investigated the exact reasons, but my hypothesis is that
it's about the amount of WAL generated during the initial CREATE INDEX
(because it probably ends up setting the hint bits), which puts
additional pressure on the storage.
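
(That hypothesis could be checked directly; a sketch using psql's
\gset, with an illustrative index name:

    SELECT pg_current_wal_lsn() AS before \gset
    CREATE INDEX test_idx ON pgbench_accounts (abalance);
    SELECT pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), :'before'));

Comparing that output between the checksums and no-hints configs would
show how much extra WAL the index build generates.)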

Unfortunately, this additional cost is unlikely to go away :-(

Now, maybe we want to enable checksums by default anyway, but we should
not pretend the only cost related to checksums is CPU usage.


regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Enable data checksums by default

Andres Freund


On March 30, 2019 3:25:43 PM EDT, Tomas Vondra <[hidden email]> wrote:

>[...]
>
>Now, maybe we want to enable checksums by default anyway, but we should
>not pretend the only cost related to checksums is CPU usage.

Thanks for running these, very helpful.
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.

