Online checksums verification in the backend


Re: Online checksums verification in the backend

Julien Rouhaud
On Wed, Sep 9, 2020 at 2:37 PM Michael Paquier <[hidden email]> wrote:

>
> Another thing that was itching me is the introduction of 3 GUCs with
> one new category for the sake of two SQL functions.  For VACUUM we
> have many things relying on the GUC delays, with autovacuum and manual
> vacuum.  Perhaps it would make sense to have these if we someday have
> a background worker doing checksum verifications, though that could
> perfectly well live in contrib/, and it would be kind of hard to tune
> as well.  The patch enabling checksums on-the-fly could also be a good
> enough reason.  Another thing we could consider is to pass down those
> parameters as function arguments, at the cost of not being able to
> reload them.

I'm not terribly happy with adding that for now, but it's quite clear
that there'll eventually be a lot of new stuff added that will benefit
from either the category or the GUCs.  FTR, once we reach an agreement
on how to do this check (I'm wondering if it'll stay an SQL function
or become a plain backend command, in which case GUCs would be
mandatory), I'll also be happy to work on a background worker to help
people run the check regularly.  So in my opinion it's better to
add them now, so we won't have to change the SQL function definition
later when other facilities rely on them.



Re: Online checksums verification in the backend

Julien Rouhaud
In reply to this post by Julien Rouhaud
On Wed, Sep 09, 2020 at 11:25:29AM +0200, Julien Rouhaud wrote:
>
> I'll do some benchmarking and see if I can get some figures, but it'll probably
> take some time.  In the meantime I'm attaching v11 of the patch that should
> address all other comments.

I just realized that I forgot to update one of the Makefiles when moving the TAP
test folder.  Attached v12 should fix that.

Attachment: v12-0001-Add-a-pg_check_relation-function.patch (45K)

Re: Online checksums verification in the backend

Julien Rouhaud
On Wed, Sep 09, 2020 at 03:41:30PM +0200, Julien Rouhaud wrote:
> On Wed, Sep 09, 2020 at 11:25:29AM +0200, Julien Rouhaud wrote:
> >
> > I'll do some benchmarking and see if I can get some figures, but it'll probably
> > take some time.  In the meantime I'm attaching v11 of the patch that should
> > address all other comments.
>
> I just realized that I forgot to update one of the Makefiles when moving the TAP
> test folder.  Attached v12 should fix that.


And the cfbot just reported a new error in the Windows build.  Attached v13
should fix that.

Attachment: v13-0001-Add-a-pg_check_relation-function.patch (45K)

Re: Online checksums verification in the backend

Julien Rouhaud
On Thu, Sep 10, 2020 at 09:47:23AM +0200, Julien Rouhaud wrote:

> On Wed, Sep 09, 2020 at 03:41:30PM +0200, Julien Rouhaud wrote:
> > On Wed, Sep 09, 2020 at 11:25:29AM +0200, Julien Rouhaud wrote:
> > >
> > > I'll do some benchmarking and see if I can get some figures, but it'll probably
> > > take some time.  In the meantime I'm attaching v11 of the patch that should
> > > address all other comments.
> >
> > I just realized that I forgot to update one of the Makefiles when moving the TAP
> > test folder.  Attached v12 should fix that.
>
>
> And the cfbot just reported a new error in the Windows build.  Attached v13
> should fix that.


I did some benchmarking using the following environment:

- 16MB shared buffers
- 490MB table (10M rows)
- synchronized_seqscan to off
- table in OS cache

I don't have a big machine, so I went with a very small shared_buffers and a
small table, making sure that all data fits in the OS cache while the table is
more than an order of magnitude larger than shared_buffers, to simulate a
plausible environment.
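As a quick sanity check of that sizing, the ratio between the table and shared_buffers can be computed directly from the numbers above (the computation itself is only illustrative):

```python
# Sizing check for the benchmark setup described above:
# a 490MB table against 16MB of shared_buffers.
table_mb = 490
shared_buffers_mb = 16

ratio = table_mb / shared_buffers_mb
print(f"table is {ratio:.1f}x larger than shared_buffers")

# "More than an order of magnitude larger" holds comfortably.
assert ratio > 10
```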

I used a simple read-only query that performs a sequential scan of the table
(a plain SELECT * FROM table), run using 10 concurrent connections, for 5 runs
of 700 seconds each.  I did that with no other activity, then with the original
pg_check_relation function running under \watch .1, and finally with a modified
version of that function without the optimisation, also under \watch .1.

The TPS is obviously overall extremely bad, but I can see that the submitted
version added an overhead of ~3.9% (average of 5 runs), while the version
without the optimisation added an overhead of ~6.57%.

This is supposed to be a relatively fair benchmark, as all the data is cached
on the OS side, so the IO done while holding the bufmapping lock isn't too
long, but we can see that we already get a non-negligible benefit from this
optimisation.  Should I do additional benchmarking, like dropping the OS cache
and/or adding some write activity?  That would probably only make the
unoptimised version perform even worse.



Re: Online checksums verification in the backend

Michael Paquier-2
On Thu, Sep 10, 2020 at 08:06:10PM +0200, Julien Rouhaud wrote:
> The TPS is obviously overall extremely bad, but I can see that the submitted
> version added an overhead of ~3.9% (average of 5 runs), while the version
> without the optimisation added an overhead of ~6.57%.

Okay, so that stands as a difference.  I am planning to run some
benchmarks on my end as well, and see if I can see a clear
difference.

> This is supposed to be a relatively fair benchmark, as all the data is cached
> on the OS side, so the IO done while holding the bufmapping lock isn't too
> long, but we can see that we already get a non-negligible benefit from this
> optimisation.  Should I do additional benchmarking, like dropping the OS cache
> and/or adding some write activity?  That would probably only make the
> unoptimised version perform even worse.

It would also be interesting to see the case where the pages are not
in the OS cache and see how bad it can get.  For the read-write case,
I am not sure, as we may have some different overhead that hides the
noise.  Also, did you run your tests with the functions scanning at
full speed, with (ChecksumCostDelay < 0) so that there is no throttling?
--
Michael


Re: Online checksums verification in the backend

Julien Rouhaud
On Fri, Sep 11, 2020 at 9:34 AM Michael Paquier <[hidden email]> wrote:
>
> On Thu, Sep 10, 2020 at 08:06:10PM +0200, Julien Rouhaud wrote:
> > The TPS is obviously overall extremely bad, but I can see that the submitted
> > version added an overhead of ~3.9% (average of 5 runs), while the version
> > without the optimisation added an overhead of ~6.57%.
>
> Okay, so that stands as a difference.  I am planning to run some
> benchmarks on my end as well, and see if I can see a clear
> difference.

Thanks!

> > This is supposed to be a relatively fair benchmark, as all the data is cached
> > on the OS side, so the IO done while holding the bufmapping lock isn't too
> > long, but we can see that we already get a non-negligible benefit from this
> > optimisation.  Should I do additional benchmarking, like dropping the OS cache
> > and/or adding some write activity?  That would probably only make the
> > unoptimised version perform even worse.
>
> It would also be interesting to see the case where the pages are not
> in the OS cache and see how bad it can get.  For the read-write case,
> I am not sure, as we may have some different overhead that hides the
> noise.  Also, did you run your tests with the functions scanning at
> full speed, with (ChecksumCostDelay < 0) so that there is no throttling?

I used all default settings, and by default checksum_cost_delay is 0,
so there shouldn't be any throttling.
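For illustration, here is a minimal sketch of how vacuum-style cost-based throttling of this kind behaves, including why a delay of 0 disables it entirely.  The class, parameter names and defaults are assumptions for the sake of the example, not the patch's actual API:

```python
import time

class CostThrottle:
    """Toy model of vacuum-style cost-based throttling, as a checksum
    check could use it.  Names and defaults are illustrative assumptions,
    not the patch's actual internals."""

    def __init__(self, cost_delay=0.0, cost_limit=200):
        self.cost_delay = cost_delay  # seconds to sleep once the budget is spent
        self.cost_limit = cost_limit  # accumulated cost that triggers a sleep
        self.balance = 0

    def charge(self, cost):
        """Account for the cost of one page check; sleep when over budget.
        Returns True if a sleep happened."""
        if self.cost_delay <= 0:
            return False              # a delay of 0 disables throttling entirely
        self.balance += cost
        if self.balance >= self.cost_limit:
            time.sleep(self.cost_delay)
            self.balance = 0
            return True
        return False

# With the default delay of 0, charging cost never sleeps.
t = CostThrottle(cost_delay=0)
assert not any(t.charge(10) for _ in range(100))

# With a positive delay, a sleep is triggered once the limit is reached.
t = CostThrottle(cost_delay=0.001, cost_limit=20)
assert [t.charge(10) for _ in range(3)] == [False, True, False]
```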



Re: Online checksums verification in the backend

Michael Paquier-2
On Fri, Sep 11, 2020 at 09:49:16AM +0200, Julien Rouhaud wrote:
> Thanks!

I got some numbers out of my pocket, using the following base
configuration:
wal_level = minimal
fsync = off
shared_buffers = '300MB' # also tested with 30MB and 60MB
checksum_cost_delay = 0 # default in patch

And for the test I have used pgbench initialized at scale 250, to
have close to 3.5GB of data, which gives us roughly 90% buffer
eviction, with all the data cached in the OS (we may want to look as
well at the case where the OS cache does not hold all the relation
pages).  I have also run some tests with 30MB and 60MB of shared
buffers, with similar results.
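A back-of-the-envelope check of the "roughly 90% buffer eviction" figure, using the numbers above:

```python
# ~3.5GB of pgbench data versus 300MB of shared_buffers: only a small
# fraction of the relation can be resident in shared buffers at any time.
data_mb = 3500
shared_buffers_mb = 300

resident_fraction = shared_buffers_mb / data_mb
evicted_fraction = 1 - resident_fraction
print(f"roughly {evicted_fraction:.0%} of the data cannot be resident")

assert 0.89 < evicted_fraction < 0.93
```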

I also applied some prewarming on the cluster:
create extension pg_prewarm;
select pg_prewarm(oid) from pg_class where oid > 16000;

Then, I have done five runs using that:
pgbench -S -M prepared -c 64/128/256 -n -T 60
In parallel, I had this stuff running the whole time:
select pg_check_relation('pgbench_accounts');
\watch 0.1

Here are some TPS numbers, with the execution time of pg_check_relation.
In the five runs, I removed the highest and lowest ones, then took the
average of the remaining three.  I have also tested two modes, with
and without the optimization, which requires a one-liner in checksum.c
as per your latest patch:
--- a/src/backend/storage/page/checksum.c
+++ b/src/backend/storage/page/checksum.c
@@ -84,7 +84,7 @@ check_one_block(Relation relation, ForkNumber forknum, BlockNumber blkno,
                 uint16 *chk_expected, uint16 *chk_found)
 {
     char        buffer[BLCKSZ];
-    bool        force_lock = false;
+    bool        force_lock = true;
     bool        found_in_sb;

In parentheses is the amount of time taken by pg_check_relation()
for a single check.  This is not completely exact and I saw some
variations; it is just to give an impression:
Conns                      64                 128               256
force_lock=true 60590 (7~8s)  55652 (10.2~10.5s)     46812 (9~10s)
force_lock=false   58637 (5s)        54131 (6~7s)  37091 (1.1~1.2s)

For connection counts higher than 128, I was kind of surprised to see
pg_check_relation being more aggressive without the optimization, with
much less contention on the buffer mapping LWLock actually, but that's
just an impression from looking at pg_stat_activity.
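For what it's worth, the relative TPS differences implied by the table can be computed directly from the reported numbers:

```python
# TPS numbers from the table above: force_lock=true vs force_lock=false,
# at 64, 128 and 256 connections.
tps_true = {64: 60590, 128: 55652, 256: 46812}
tps_false = {64: 58637, 128: 54131, 256: 37091}

for conns in (64, 128, 256):
    delta = (tps_true[conns] - tps_false[conns]) / tps_false[conns]
    print(f"{conns:>3} conns: force_lock=true is {delta:+.1%} vs force_lock=false")
# The gap is small at 64 and 128 connections (~3%) but jumps to roughly
# +26% at 256, matching the surprise noted above about the unoptimized
# path holding up better at high connection counts.
```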

Looking at the wait events for each run, I saw many more hiccups with
the buffer mapping LWLock when forcing the lock than when not, though
I was still able to see some contention when not forcing the lock.  Not
surprising, as this rejects a bunch of pages from shared buffers.

> I used all default settings, but by default checksum_cost_delay is 0
> so there shouldn't be any throttling.

Thanks, so did I.  From what I can see, there could also be benefits
in not using the optimization by default, since the relation check
then applies some natural throttling by making the checks actually
slower (there is a link between the individual runtime of
pg_check_relation and the TPS).  So it seems to me that the throttling
parameters would be beneficial, but there is also an argument for not
including any throttling in a first version if the optimization to go
full speed is not around.  Using three new GUCs for those function
calls is still too much IMO, so there is also the argument to move all
this stuff into a new contrib/ module, and have a bgworker
implementation as part of it, as it would need
shared_preload_libraries anyway.

Also, I have been putting some thought into the APIs able to fetch a
buffer without going through the shared buffers.  Neither checksum.c
nor checksumfuncs.c looks like the right place: the former because it
should be something those APIs depend on, containing only the
most-internal logic for the checksum algorithm and computation, and
the latter because we need to think about the case of a background
worker as well (one that could spawn a set of dynamic workers
connecting to different databases to do checksum verifications?).  It
would be good to keep the handling of the buffer mapping lock as well
as the calls to smgrread() in a single place.  ReadBuffer_common() is
a natural place for that, though for our use case it means the
addition of three new options:
- Being able to pass down directly a buffer pointer to save the page
contents into.
- Being able to not verify a page directly, leaving the verification
to the caller upthread.
- Addition of a new mode, that I am calling here RBM_PRIVATE, where we
actually read the page from disk if it is not yet in shared buffers,
except that we fill in the contents of the page using the pointer
given by the caller.  That's different from the use of local buffers,
as we don't need that much complication, like temporary tables, for
per-page checks.

Another idea would be to just let ReadBuffer_common() do the check by
itself, with a different mode, a kind of RBM_VALIDATE, where we just
return a verification state of the page that can be consumed by
callers.
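To make the two proposals concrete, here is a toy model of the read paths being discussed.  Everything here is an illustrative assumption: the dict-based "shared buffers" and "disk", the trivial checksum, and the function names merely mimic the shape of the proposed modes, not PostgreSQL's actual buffer-manager API:

```python
# Toy model of the two proposed read modes.  One dict stands in for shared
# buffers, another for on-disk pages; the checksum is deliberately trivial.
DISK = {0: b"page-zero", 1: b"page-one"}
SHARED_BUFFERS = {0: DISK[0]}        # block 0 is already cached

def page_checksum(page: bytes) -> int:
    return sum(page) & 0xFFFF        # stand-in for a real page checksum

def read_buffer_private(blkno: int, out: bytearray) -> bool:
    """RBM_PRIVATE-like read: fill the caller's buffer; on a cache miss,
    read from disk WITHOUT inserting the page into shared buffers.
    Returns whether the page was found in shared buffers."""
    if blkno in SHARED_BUFFERS:
        out[:] = SHARED_BUFFERS[blkno]
        return True
    out[:] = DISK[blkno]             # smgrread() equivalent, private copy only
    return False

def read_buffer_validate(blkno: int, expected: int) -> bool:
    """RBM_VALIDATE-like read: return a verification state of the page
    instead of the page itself, for the caller to consume."""
    page = bytearray()
    read_buffer_private(blkno, page)
    return page_checksum(bytes(page)) == expected

# Block 1 is read privately and does not pollute shared buffers.
buf = bytearray()
assert read_buffer_private(1, buf) is False
assert 1 not in SHARED_BUFFERS

# Validation reports a state rather than returning the page.
assert read_buffer_validate(0, page_checksum(DISK[0])) is True
assert read_buffer_validate(1, 0xBEEF) is False
```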

This also comes with some more advantages:
- Tracking of reads from disk with track_io_timing.
- Addition of some private stats dedicated to this private mode, with
new fields in pgBufferUsage, all in a single place.
--
Michael
