refactoring basebackup.c

14 messages

refactoring basebackup.c

Robert Haas
Hi,

I'd like to propose a fairly major refactoring of the server's
basebackup.c. The current code isn't horrific or anything, but the
base backup mechanism has grown quite a few features over the years
and all of the code knows about all of the features. This is going to
make it progressively more difficult to add additional features, and I
have a few in mind that I'd like to add, as discussed below and also
on several other recent threads.[1][2] The attached patch set shows
what I have in mind. It needs more work, but I believe that there's
enough here for someone to review the overall direction, and even some
of the specifics, and hopefully give me some useful feedback.

This patch set is built around the idea of creating two new
abstractions, a base backup sink -- or bbsink -- and a base backup
archiver -- or bbarchiver.  Each of these works like a foreign data
wrapper or custom scan or TupleTableSlot. That is, there's a table of
function pointers that act like method callbacks. Every implementation
can allocate a struct of sufficient size for its own bookkeeping data,
and the first member of the struct is always the same, and basically
holds the data that all implementations must store, including a
pointer to the table of function pointers. If we were using C++,
bbarchiver and bbsink would be abstract base classes.
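For illustration, here is a minimal, self-contained sketch of that pattern in C. The names (bbsink_ops, bbsink_counter, and so on) are hypothetical stand-ins rather than the actual structs in the patch set; the point is just the shape: a shared first member carrying the table of function pointers, with each implementation embedding it at the start of its own struct.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical method table; the patch's actual callback names may differ. */
typedef struct bbsink bbsink;

typedef struct bbsink_ops
{
	void		(*begin_archive) (bbsink *sink, const char *archive_name);
	void		(*archive_contents) (bbsink *sink, const char *data, size_t len);
	void		(*end_archive) (bbsink *sink);
} bbsink_ops;

/* Common first member: the data every implementation must store. */
struct bbsink
{
	const bbsink_ops *ops;		/* table of function pointers */
	bbsink	   *next;			/* next sink in the chain, if any */
};

/* A toy implementation that just counts the bytes it receives. */
typedef struct bbsink_counter
{
	bbsink		base;			/* must be first, so pointer casts work */
	size_t		total_bytes;
} bbsink_counter;

static void
counter_begin_archive(bbsink *sink, const char *archive_name)
{
	(void) sink;
	(void) archive_name;		/* nothing to set up for this toy sink */
}

static void
counter_archive_contents(bbsink *sink, const char *data, size_t len)
{
	(void) data;
	((bbsink_counter *) sink)->total_bytes += len;
}

static void
counter_end_archive(bbsink *sink)
{
	(void) sink;
}

static const bbsink_ops counter_ops = {
	counter_begin_archive,
	counter_archive_contents,
	counter_end_archive
};

bbsink_counter *
bbsink_counter_create(void)
{
	bbsink_counter *sink = calloc(1, sizeof(bbsink_counter));

	sink->base.ops = &counter_ops;
	return sink;
}
```

A caller drives the sink only through sink->base.ops, never by name, which is what lets the calling code stay ignorant of what any particular sink actually does with the data.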

They represent closely-related concepts, so much so that I initially
thought we could get by with just one new abstraction layer. I found
on experimentation that this did not work well, so I split it up into
two and that worked a lot better. The distinction is this: a bbsink is
something to which you can send a bunch of archives -- currently, each
would be a tarfile -- and also a backup manifest. A bbarchiver is
something to which you send every file in the data directory
individually, or at least the ones that are getting backed up, plus
any that are being injected into the backup (e.g. the backup_label).
Commonly, a bbsink will do something with the data and then forward it
to a subsequent bbsink, or a bbarchiver will do something with the
data and then forward it to a subsequent bbarchiver or bbsink. For
example, there's a bbarchiver_tar object which, like any bbarchiver,
sees all the files and their contents as input. The output is a
tarfile, which gets sent to a bbsink. As things stand in the patch set
now, the tar archives are ultimately sent to the "libpq" bbsink, which
sends them to the client.

In the future, we could have other bbarchivers. For example, we could
add a "pax", "zip", or "cpio" bbarchiver that produces archives of that
format, and any given backup could choose which one to use. Or, we
could have a bbarchiver that runs each individual file through a
compression algorithm and then forwards the resulting data to a
subsequent bbarchiver. That would make it easy to produce a tarfile of
individually compressed files, which is one possible way of creating a
seekable archive.[3] Likewise, we could have other bbsinks. For
example, we could have a "localdisk" bbsink that causes the server to
write the backup somewhere in the local filesystem instead of
streaming it out over libpq. Or, we could have an "s3" bbsink that
writes the archives to S3. We could also have bbsinks that compress
the input archives using some compressor (e.g. lz4, zstd, bzip2, ...)
and forward the resulting compressed archives to the next bbsink in
the chain. I'm not trying to pass judgement on whether any of these
particular things are things we want to do, nor am I saying that this
patch set solves all the problems with doing them. However, I believe
it will make such things a whole lot easier to implement, because all
of the knowledge about whatever new functionality is being added is
centralized in one place, rather than being spread across the entirety
of basebackup.c. As an example of this, look at how 0010 changes
basebackup.c and basebackup_tar.c: afterwards, basebackup.c no longer
knows anything that is tar-specific, whereas right now it knows about
tar-specific things in many places.

Here's an overview of this patch set:

0001-0003 are cleanup patches that I have posted for review on
separate threads.[4][5] They are included here to make it easy to
apply this whole series if someone wishes to do so.

0004 is a minor refactoring that reduces by 1 the number of functions
in basebackup.c that know about the specifics of tarfiles. It is just
a preparatory patch and probably not very interesting.

0005 invents the bbsink abstraction.

0006 creates basebackup_libpq.c and moves all code that knows about
the details of sending archives via libpq there. The functionality is
exposed for use by basebackup.c as a new type of bbsink, bbsink_libpq.

0007 creates basebackup_throttle.c and moves all code that knows about
throttling backups there. The functionality is exposed for use by
basebackup.c as a new type of bbsink, bbsink_throttle. This means that
the throttling logic could be reused to throttle output to any final
destination. Essentially, this is a bbsink that just passes everything
it gets through to the next bbsink, but with a rate limit. If
throttling's not enabled, no bbsink_throttle object is created, so all
of the throttling code is completely out of the execution pipeline.
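To illustrate the idea behind such a pass-through rate limiter (this is a hypothetical sketch, not the patch's actual throttling code): all it fundamentally needs is to compare how long the bytes sent so far *should* have taken at the target rate against how long they actually took, and sleep for the difference.

```c
#include <assert.h>

/*
 * Given the bytes sent so far, the configured rate limit (bytes/sec), and
 * the microseconds elapsed since the backup started, compute how long to
 * sleep before forwarding the next chunk.  Zero means we are under the
 * limit.  (Real throttling code is more careful; this just shows the idea.)
 */
long
throttle_delay_usec(long bytes_sent, long bytes_per_sec, long elapsed_usec)
{
	/* how long the bytes sent so far "should" have taken at the target rate */
	long		earliest_usec = (bytes_sent * 1000000L) / bytes_per_sec;

	return earliest_usec > elapsed_usec ? earliest_usec - elapsed_usec : 0;
}
```

Because the throttle is just another bbsink in the chain, this logic stays entirely out of the data path when throttling is disabled.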

0008 creates basebackup_progress.c and moves all code that knows about
progress reporting there. The functionality is exposed for use by
basebackup.c as a new type of bbsink, bbsink_progress. Since the
abstraction doesn't fit perfectly in this case, some extra functions
are added to work around the problem. This is not entirely elegant,
but I still think it's an improvement over what we have now, and
I don't have a better idea.

0009 invents the bbarchiver abstraction.

0010 invents two new bbarchivers, a tar bbarchiver and a tarsize
bbarchiver, and refactors basebackup.c to make use of them. The tar
bbarchiver puts the files it sees into tar archives and forwards the
resulting archives to a bbsink. The tarsize bbarchiver is used to
support the PROGRESS option to the BASE_BACKUP command. It just
estimates the size of the backup by summing up the file sizes without
reading them. This approach is good for a couple of reasons. First,
without something like this, it's impossible to keep basebackup.c from
knowing something about the tar format, because the PROGRESS option
doesn't just figure out how big the files to be backed up are: it
figures out how big it thinks the archives will be, and that involves
tar-specific considerations. This area needs more work, as the whole
idea of measuring progress by estimating the archive size is going to
break down as soon as server-side compression is in the picture.
Second, this makes the code path that we use for figuring out the
backup size details much more similar to the path we use for
performing the actual backup. For instance, with this patch, we
include the exact same files in the calculation that we will include
in the backup, and in the same order, something that's not true today.
The basebackup_tar.c file added by this patch is sadly lacking in
comments, which I will add in a future version of the patch set. I
think, though, that it will not be too unclear what's going on here.

0011 invents another new kind of bbarchiver. This bbarchiver just
eavesdrops on the stream of files to facilitate backup manifest
construction, and then forwards everything through to a subsequent
bbarchiver. Like bbsink_throttle, it can be entirely omitted if not
used. This patch is a bit clunky at the moment and needs some polish,
but it is another demonstration of how these abstractions can be used
to simplify basebackup.c, so that basebackup.c only has to worry about
determining what should be backed up and not have to worry much about
all the specific things that need to be done as part of that.

Although this patch set adds quite a bit of code on net, it makes
basebackup.c considerably smaller and simpler, removing more than 400
lines of code from that file, about 20% of the current total. There
are some gratifying changes vs. the status quo. For example, in
master, we have this:

sendDir(const char *path, int basepathlen, bool sizeonly, List *tablespaces,
                bool sendtblspclinks, backup_manifest_info *manifest,
                const char *spcoid)

Notably, the sizeonly flag makes the function not do what the name of
the function suggests that it does. Also, we've got to pass some extra
fields through to enable specific features. With the patch set, the
equivalent function looks like this:

archive_directory(bbarchiver *archiver, const char *path, int basepathlen,
                                  List *tablespaces, bool sendtblspclinks)

The question "what should I do with the directories and files we find
as we recurse?" is now answered by the choice of which bbarchiver to
pass to the function, rather than by the values of sizeonly, manifest,
and spcoid. That's not night and day, but I think it's better,
especially as you imagine adding more features in the future. The
really important part, for me, is that you can make the bbarchiver do
anything you like without needing to make any more changes to this
function. It just arranges to invoke your callbacks. You take it from
there.
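As a hypothetical sketch of why this helps (the names below are invented for illustration, not the patch's actual code): with the behavior behind a callback table, the same traversal loop can either size the backup or write it, depending purely on which archiver is passed in.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Slimmed-down bbarchiver: the traversal code below never needs to know
 * whether it is sizing the backup or actually writing it; that choice
 * lives entirely in the callback table it is handed.
 */
typedef struct bbarchiver bbarchiver;

typedef struct bbarchiver_ops
{
	void		(*archive_file) (bbarchiver *archiver, const char *name,
								 size_t size);
} bbarchiver_ops;

struct bbarchiver
{
	const bbarchiver_ops *ops;
};

/* Stand-in for archive_directory(): it just invokes the callbacks. */
static void
archive_directory_demo(bbarchiver *archiver, const char *names[],
					   const size_t sizes[], int nfiles)
{
	for (int i = 0; i < nfiles; i++)
		archiver->ops->archive_file(archiver, names[i], sizes[i]);
}

/* One archiver that only sums sizes, standing in for "tarsize". */
typedef struct sizing_archiver
{
	bbarchiver	base;
	size_t		total;
} sizing_archiver;

static void
sizing_archive_file(bbarchiver *archiver, const char *name, size_t size)
{
	(void) name;
	((sizing_archiver *) archiver)->total += size;
}

static const bbarchiver_ops sizing_ops = {sizing_archive_file};

size_t
estimate_backup_size(const char *names[], const size_t sizes[], int nfiles)
{
	sizing_archiver sizer = {{&sizing_ops}, 0};

	archive_directory_demo(&sizer.base, names, sizes, nfiles);
	return sizer.total;
}
```

A "tar" archiver plugged into the same loop would write real archives, guaranteeing that the size estimate visits exactly the same files, in the same order, as the actual backup.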

One pretty major question that this patch set doesn't address is what
the user interface for any of the hypothetical features mentioned
above ought to look like, or how basebackup.c ought to support them.
The syntax for the BASE_BACKUP command, like the contents of
basebackup.c, has grown organically, and doesn't seem to be very
scalable. Also, the wire protocol - a series of CopyData results which
the client is entirely responsible for knowing how to interpret and
about which the server provides only minimal information - doesn't
much lend itself to extensibility. Some careful design work is likely
needed in both areas, and this patch does not try to do any of it. I
am quite interested in discussing those questions, but I felt that
they weren't the most important problems to solve first.

What do you all think?

Thanks,

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

[1] http://postgr.es/m/CA+TgmoZubLXYR+Pd_gi3MVgyv5hQdLm-GBrVXkun-Lewaw12Kg@...
[2] http://postgr.es/m/CA+TgmoYr7+-0_vyQoHbTP5H3QGZFgfhnrn6ewDteF=kUqkG=Fw@...
[3] http://postgr.es/m/CA+TgmoZQCoCyPv6fGoovtPEZF98AXCwYDnSB0=p5XtxNY68r_A@...
and following
[4] http://postgr.es/m/CA+TgmoYq+59SJ2zBbP891ngWPA9fymOqntqYcweSDYXS2a620A@...
[5] http://postgr.es/m/CA+TgmobWbfReO9-XFk8urR1K4wTNwqoHx_v56t7=T8KaiEoKNw@...



Re: refactoring basebackup.c

Robert Haas
So it might be good if I'd remembered to attach the patches. Let's try
that again.

...Robert

v1-0005-Introduce-bbsink-abstraction.patch (12K)
v1-0002-Minor-code-cleanup-for-perform_base_backup.patch (3K)
v1-0003-Assorted-cleanup-of-tar-related-code.patch (22K)
v1-0004-Recast-_tarWriteDirectory-as-convert_link_to_dire.patch (4K)
v1-0001-Don-t-export-basebackup.c-s-sendTablespace.patch (9K)
v1-0006-Convert-libpq-related-code-to-a-bbsink.patch (43K)
v1-0009-Introduce-bbarchiver-abstraction.patch (15K)
v1-0008-Convert-progress-reporting-code-to-a-bbsink.patch (25K)
v1-0010-Create-and-use-bbarchiver-implementations-for-tar.patch (42K)
v1-0007-Convert-throttling-related-code-to-a-bbsink.patch (18K)
v1-0011-WIP-Convert-backup-manifest-generation-to-a-bbarc.patch (21K)

Re: refactoring basebackup.c

Andres Freund
In reply to this post by Robert Haas
Hi,

On 2020-05-08 16:53:09 -0400, Robert Haas wrote:

> They represent closely-related concepts, so much so that I initially
> thought we could get by with just one new abstraction layer. I found
> on experimentation that this did not work well, so I split it up into
> two and that worked a lot better. The distinction is this: a bbsink is
> something to which you can send a bunch of archives -- currently, each
> would be a tarfile -- and also a backup manifest. A bbarchiver is
> something to which you send every file in the data directory
> individually, or at least the ones that are getting backed up, plus
> any that are being injected into the backup (e.g. the backup_label).
> Commonly, a bbsink will do something with the data and then forward it
> to a subsequent bbsink, or a bbarchiver will do something with the
> data and then forward it to a subsequent bbarchiver or bbsink. For
> example, there's a bbarchiver_tar object which, like any bbarchiver,
> sees all the files and their contents as input. The output is a
> tarfile, which gets sent to a bbsink. As things stand in the patch set
> now, the tar archives are ultimately sent to the "libpq" bbsink, which
> sends them to the client.

Hm.

I wonder if there are cases where recursively forwarding like this will
cause noticeable performance effects. The only operation that seems
frequent enough to potentially be noticeable would be "chunks" of the
file. So perhaps it'd be good to make sure we read in large enough
chunks?

> 0010 invents two new bbarchivers, a tar bbarchiver and a tarsize
> bbarchiver, and refactors basebackup.c to make use of them. The tar
> bbarchiver puts the files it sees into tar archives and forwards the
> resulting archives to a bbsink. The tarsize bbarchiver is used to
> support the PROGRESS option to the BASE_BACKUP command. It just
> estimates the size of the backup by summing up the file sizes without
> reading them. This approach is good for a couple of reasons. First,
> without something like this, it's impossible to keep basebackup.c from
> knowing something about the tar format, because the PROGRESS option
> doesn't just figure out how big the files to be backed up are: it
> figures out how big it thinks the archives will be, and that involves
> tar-specific considerations.

ISTM that it's not actually good to have the progress calculations
include the tar overhead. As you say:

> This area needs more work, as the whole idea of measuring progress by
> estimating the archive size is going to break down as soon as
> server-side compression is in the picture.

This, to me, indicates that we should measure the progress solely based
on how much of the "source" data was processed. The overhead of tar, the
reduction due to compression, shouldn't be included.


> What do you all think?

I've not thought enough about the specifics, but I think it looks like
it's going roughly in a better direction.

One thing I wonder about is how stateful the interface is. Archivers
will pretty much always track which file is currently open etc. Somehow
such a repeating state machine seems a bit ugly - but I don't really
have a better answer.

Greetings,

Andres Freund



Re: refactoring basebackup.c

Robert Haas
On Fri, May 8, 2020 at 5:27 PM Andres Freund <[hidden email]> wrote:
> I wonder if there are cases where recursively forwarding like this will
> cause noticeable performance effects. The only operation that seems
> frequent enough to potentially be noticeable would be "chunks" of the
> file. So perhaps it'd be good to make sure we read in large enough
> chunks?

Yeah, that needs to be tested. Right now the chunk size is 32kB but it
might be a good idea to go larger. Another thing is that right now the
chunk size is tied to the protocol message size, and I'm not sure
whether the size that's optimal for disk reads is also optimal for
protocol messages.

> This, to me, indicates that we should measure the progress solely based
> on how much of the "source" data was processed. The overhead of tar, the
> reduction due to compression, shouldn't be included.

I don't think it's a particularly bad thing that we include a small
amount of progress for sending an empty file, a directory, or a
symlink. That could make the results more meaningful if you have a
database with lots of empty relations in it. However, I agree that the
effect of compression shouldn't be included. To get there, I think we
need to redesign the wire protocol. Right now, the server has no way
of letting the client know how many uncompressed bytes it's sent, and
the client has no way of figuring it out without uncompressing, which
seems like something we want to avoid.

There are some other problems with the current wire protocol, too:

1. The syntax for the BASE_BACKUP command is large and unwieldy. We
really ought to adopt an extensible options syntax, like COPY, VACUUM,
EXPLAIN, etc. do, rather than using a zillion ad-hoc bolt-ons, each
with bespoke lexer and parser support.

2. The client is sent a list of tablespaces and is supposed to use
that to expect an equal number of archives, computing the name for
each one on the client side from the tablespace info. However, I think
we should be able to support modes like "put all the tablespaces in a
single archive" or "send a separate archive for every 256GB" or "ship
it all to the cloud and don't send me any archives". To get there, I
think we should have the server send the archive name to the clients,
and the client should just keep receiving the next archive until it's
told that there are no more. Then if there's one archive or ten
archives or no archives, the client doesn't have to care. It just
receives what the server sends until it hears that there are no more.
It also doesn't know how the server is naming the archives; the server
can, for example, adjust the archive names based on which compression
format is being chosen, without knowledge of the server's naming
conventions needing to exist on the client side.

I think we should keep support for the current BASE_BACKUP command but
either add a new variant using extensible options, or else invent a
whole new command with a different name (BACKUP, SEND_BACKUP,
whatever) that takes extensible options. This command should send back
all the archives and the backup manifest using a single COPY stream
rather than multiple COPY streams. Within the COPY stream, we'll
invent a sub-protocol, e.g. based on the first letter of the message,
e.g.:

t = Tablespace boundary. No further message payload. Indicates, for
progress reporting purposes, that we are advancing to the next
tablespace.
f = Filename. The remainder of the message payload is the name of the
next file that will be transferred.
d = Data. The next four bytes contain the number of uncompressed bytes
covered by this message, for progress reporting purposes. The rest of
the message is payload, possibly compressed. Could be empty, if the
data is being shipped elsewhere, and these messages are only being
sent to update the client's notion of progress.
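A hypothetical sketch of how a client might decode one message of this proposed sub-protocol (the enum names are invented here, and the big-endian length field is an assumption, in line with how protocol integers are usually sent):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef enum
{
	MSG_TABLESPACE_BOUNDARY,	/* 't': no payload */
	MSG_FILENAME,				/* 'f': payload is the next file name */
	MSG_DATA,					/* 'd': 4-byte uncompressed length + payload */
	MSG_INVALID
} backup_msg_type;

/*
 * Decode one sub-protocol message from a CopyData payload.  For 'd'
 * messages, *uncompressed is set to the number of uncompressed bytes the
 * message represents, for progress-reporting purposes.
 */
backup_msg_type
decode_backup_message(const uint8_t *buf, size_t len, uint32_t *uncompressed)
{
	if (len < 1)
		return MSG_INVALID;

	switch (buf[0])
	{
		case 't':
			return MSG_TABLESPACE_BOUNDARY;
		case 'f':
			return MSG_FILENAME;
		case 'd':
			if (len < 5)
				return MSG_INVALID;
			/* assume big-endian, as protocol integers usually are */
			*uncompressed = ((uint32_t) buf[1] << 24) |
				((uint32_t) buf[2] << 16) |
				((uint32_t) buf[3] << 8) |
				(uint32_t) buf[4];
			return MSG_DATA;
		default:
			return MSG_INVALID;
	}
}
```

Note that a 'd' message whose payload is empty still advances the progress counter, which is what makes the "ship it all to the cloud, send me no archives" mode compatible with client-side progress reporting.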

> I've not thought enough about the specifics, but I think it looks like
> it's going roughly in a better direction.

Good to hear.

> One thing I wonder about is how stateful the interface is. Archivers
> will pretty much always track which file is currently open etc. Somehow
> such a repeating state machine seems a bit ugly - but I don't really
> have a better answer.

I thought about that a bit, too. There might be some way to unify that
by having some common context object that's defined by basebackup.c
and all archivers get it, so that they have some commonly-desired
details without needing bespoke code, but I'm not sure at this point
whether that will actually produce a nicer result. Even if we don't
have it initially, it seems like it wouldn't be very hard to add it
later, so I'm not too stressed about it.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: refactoring basebackup.c

Sumanta Mukherjee
Hi Robert,
Please see my comments inline below.

On Tue, May 12, 2020 at 12:33 AM Robert Haas <[hidden email]> wrote:
> Yeah, that needs to be tested. Right now the chunk size is 32kB but it
> might be a good idea to go larger. Another thing is that right now the
> chunk size is tied to the protocol message size, and I'm not sure
> whether the size that's optimal for disk reads is also optimal for
> protocol messages.

One needs to balance the number of packets to be sent across the network, so if
the disk read size and the network packet size could be unified, it might
provide a better optimization.
 

> I don't think it's a particularly bad thing that we include a small
> amount of progress for sending an empty file, a directory, or a
> symlink. That could make the results more meaningful if you have a
> database with lots of empty relations in it. However, I agree that the
> effect of compression shouldn't be included. To get there, I think we
> need to redesign the wire protocol. Right now, the server has no way
> of letting the client know how many uncompressed bytes it's sent, and
> the client has no way of figuring it out without uncompressing, which
> seems like something we want to avoid.


I agree here too, except that if we have too many tar files one might cringe,
but sending the extra amount from these tar files looks okay to me.
 
> There are some other problems with the current wire protocol, too:
>
> 1. The syntax for the BASE_BACKUP command is large and unwieldy. We
> really ought to adopt an extensible options syntax, like COPY, VACUUM,
> EXPLAIN, etc. do, rather than using a zillion ad-hoc bolt-ons, each
> with bespoke lexer and parser support.
>
> 2. The client is sent a list of tablespaces and is supposed to use
> that to expect an equal number of archives, computing the name for
> each one on the client side from the tablespace info. However, I think
> we should be able to support modes like "put all the tablespaces in a
> single archive" or "send a separate archive for every 256GB" or "ship
> it all to the cloud and don't send me any archives". To get there, I
> think we should have the server send the archive name to the clients,
> and the client should just keep receiving the next archive until it's
> told that there are no more. Then if there's one archive or ten
> archives or no archives, the client doesn't have to care. It just
> receives what the server sends until it hears that there are no more.
> It also doesn't know how the server is naming the archives; the server
> can, for example, adjust the archive names based on which compression
> format is being chosen, without knowledge of the server's naming
> conventions needing to exist on the client side.

One thing to remember here is that a trade-off would need to be made between
the number of options we provide and people coming back to say which
combinations do not work. For example, if a user script has "put all the
tablespaces in a single archive" and later on somebody makes some script
changes to break it down at "256 GB" and there is a conflict, then which one
takes precedence needs to be chosen. When the number of options like this
becomes very large, this could lead to some complications.
  
> I think we should keep support for the current BASE_BACKUP command but
> either add a new variant using extensible options, or else invent a
> whole new command with a different name (BACKUP, SEND_BACKUP,
> whatever) that takes extensible options. This command should send back
> all the archives and the backup manifest using a single COPY stream
> rather than multiple COPY streams. Within the COPY stream, we'll
> invent a sub-protocol, e.g. based on the first letter of the message,
> e.g.:
>
> t = Tablespace boundary. No further message payload. Indicates, for
> progress reporting purposes, that we are advancing to the next
> tablespace.
> f = Filename. The remainder of the message payload is the name of the
> next file that will be transferred.
> d = Data. The next four bytes contain the number of uncompressed bytes
> covered by this message, for progress reporting purposes. The rest of
> the message is payload, possibly compressed. Could be empty, if the
> data is being shipped elsewhere, and these messages are only being
> sent to update the client's notion of progress.
 
Here I support this.
 
> I thought about that a bit, too. There might be some way to unify that
> by having some common context object that's defined by basebackup.c
> and all archivers get it, so that they have some commonly-desired
> details without needing bespoke code, but I'm not sure at this point
> whether that will actually produce a nicer result. Even if we don't
> have it initially, it seems like it wouldn't be very hard to add it
> later, so I'm not too stressed about it.
 
--
Sumanta Mukherjee
The Enterprise PostgreSQL Company



Re: refactoring basebackup.c

Dilip Kumar-2
In reply to this post by Robert Haas
On Sat, May 9, 2020 at 2:23 AM Robert Haas <[hidden email]> wrote:

>
> Hi,
>
> I'd like to propose a fairly major refactoring of the server's
> basebackup.c. The current code isn't horrific or anything, but the
> base backup mechanism has grown quite a few features over the years
> and all of the code knows about all of the features. This is going to
> make it progressively more difficult to add additional features, and I
> have a few in mind that I'd like to add, as discussed below and also
> on several other recent threads.[1][2] The attached patch set shows
> what I have in mind. It needs more work, but I believe that there's
> enough here for someone to review the overall direction, and even some
> of the specifics, and hopefully give me some useful feedback.
>
> This patch set is built around the idea of creating two new
> abstractions, a base backup sink -- or bbsink -- and a base backup
> archiver -- or bbarchiver.  Each of these works like a foreign data
> wrapper or custom scan or TupleTableSlot. That is, there's a table of
> function pointers that act like method callbacks. Every implementation
> can allocate a struct of sufficient size for its own bookkeeping data,
> and the first member of the struct is always the same, and basically
> holds the data that all implementations must store, including a
> pointer to the table of function pointers. If we were using C++,
> bbarchiver and bbsink would be abstract base classes.
>
> They represent closely-related concepts, so much so that I initially
> thought we could get by with just one new abstraction layer. I found
> on experimentation that this did not work well, so I split it up into
> two and that worked a lot better. The distinction is this: a bbsink is
> something to which you can send a bunch of archives -- currently, each
> would be a tarfile -- and also a backup manifest. A bbarchiver is
> something to which you send every file in the data directory
> individually, or at least the ones that are getting backed up, plus
> any that are being injected into the backup (e.g. the backup_label).
> Commonly, a bbsink will do something with the data and then forward it
> to a subsequent bbsink, or a bbarchiver will do something with the
> data and then forward it to a subsequent bbarchiver or bbsink. For
> example, there's a bbarchiver_tar object which, like any bbarchiver,
> sees all the files and their contents as input. The output is a
> tarfile, which gets sent to a bbsink. As things stand in the patch set
> now, the tar archives are ultimately sent to the "libpq" bbsink, which
> sends them to the client.
>
> In the future, we could have other bbarchivers. For example, we could
> add a "pax", "zip", or "cpio" bbarchiver that produces archives of that
> format, and any given backup could choose which one to use. Or, we
> could have a bbarchiver that runs each individual file through a
> compression algorithm and then forwards the resulting data to a
> subsequent bbarchiver. That would make it easy to produce a tarfile of
> individually compressed files, which is one possible way of creating a
> seekable archive.[3] Likewise, we could have other bbsinks. For
> example, we could have a "localdisk" bbsink that causes the server to
> write the backup somewhere in the local filesystem instead of
> streaming it out over libpq. Or, we could have an "s3" bbsink that
> writes the archives to S3. We could also have bbsinks that compress
> the input archives using some compressor (e.g. lz4, zstd, bzip2, ...)
> and forward the resulting compressed archives to the next bbsink in
> the chain. I'm not trying to pass judgement on whether any of these
> particular things are things we want to do, nor am I saying that this
> patch set solves all the problems with doing them. However, I believe
> it will make such things a whole lot easier to implement, because all
> of the knowledge about whatever new functionality is being added is
> centralized in one place, rather than being spread across the entirety
> of basebackup.c. As an example of this, look at how 0010 changes
> basebackup.c and basebackup_tar.c: afterwards, basebackup.c no longer
> knows anything that is tar-specific, whereas right now it knows about
> tar-specific things in many places.
>
> Here's an overview of this patch set:
>
> 0001-0003 are cleanup patches that I have posted for review on
> separate threads.[4][5] They are included here to make it easy to
> apply this whole series if someone wishes to do so.
>
> 0004 is a minor refactoring that reduces by 1 the number of functions
> in basebackup.c that know about the specifics of tarfiles. It is just
> a preparatory patch and probably not very interesting.
>
> 0005 invents the bbsink abstraction.
>
> 0006 creates basebackup_libpq.c and moves all code that knows about
> the details of sending archives via libpq there. The functionality is
> exposed for use by basebackup.c as a new type of bbsink, bbsink_libpq.
>
> 0007 creates basebackup_throttle.c and moves all code that knows about
> throttling backups there. The functionality is exposed for use by
> basebackup.c as a new type of bbsink, bbsink_throttle. This means that
> the throttling logic could be reused to throttle output to any final
> destination. Essentially, this is a bbsink that just passes everything
> it gets through to the next bbsink, but with a rate limit. If
> throttling's not enabled, no bbsink_throttle object is created, so all
> of the throttling code is completely out of the execution pipeline.
>
> 0008 creates basebackup_progress.c and moves all code that knows about
> progress reporting there. The functionality is exposed for use by
> basebackup.c as a new type of bbsink, bbsink_progress. Since the
> abstraction doesn't fit perfectly in this case, some extra functions
> are added to work around the problem. This is not entirely elegant,
> but I still think it's an improvement over what we have now, and
> I don't have a better idea.
>
> 0009 invents the bbarchiver abstraction.
>
> 0010 invents two new bbarchivers, a tar bbarchiver and a tarsize
> bbarchiver, and refactors basebackup.c to make use of them. The tar
> bbarchiver puts the files it sees into tar archives and forwards the
> resulting archives to a bbsink. The tarsize bbarchiver is used to
> support the PROGRESS option to the BASE_BACKUP command. It just
> estimates the size of the backup by summing up the file sizes without
> reading them. This approach is good for a couple of reasons. First,
> without something like this, it's impossible to keep basebackup.c from
> knowing something about the tar format, because the PROGRESS option
> doesn't just figure out how big the files to be backed up are: it
> figures out how big it thinks the archives will be, and that involves
> tar-specific considerations. This area needs more work, as the whole
> idea of measuring progress by estimating the archive size is going to
> break down as soon as server-side compression is in the picture.
> Second, this makes the code path that we use for figuring out the
> backup size details much more similar to the path we use for
> performing the actual backup. For instance, with this patch, we
> include the exact same files in the calculation that we will include
> in the backup, and in the same order, something that's not true today.
> The basebackup_tar.c file added by this patch is sadly lacking in
> comments, which I will add in a future version of the patch set. I
> think, though, that it will not be too unclear what's going on here.
>
> 0011 invents another new kind of bbarchiver. This bbarchiver just
> eavesdrops on the stream of files to facilitate backup manifest
> construction, and then forwards everything through to a subsequent
> bbarchiver. Like bbsink_throttle, it can be entirely omitted if not
> used. This patch is a bit clunky at the moment and needs some polish,
> but it is another demonstration of how these abstractions can be used
> to simplify basebackup.c, so that basebackup.c only has to worry about
> determining what should be backed up and not have to worry much about
> all the specific things that need to be done as part of that.
>
> Although this patch set adds quite a bit of code on net, it makes
> basebackup.c considerably smaller and simpler, removing more than 400
> lines of code from that file, about 20% of the current total. There
> are some gratifying changes vs. the status quo. For example, in
> master, we have this:
>
> sendDir(const char *path, int basepathlen, bool sizeonly, List *tablespaces,
>                 bool sendtblspclinks, backup_manifest_info *manifest,
>                 const char *spcoid)
>
> Notably, the sizeonly flag makes the function not do what the name of
> the function suggests that it does. Also, we've got to pass some extra
> fields through to enable specific features. With the patch set, the
> equivalent function looks like this:
>
> archive_directory(bbarchiver *archiver, const char *path, int basepathlen,
>                                   List *tablespaces, bool sendtblspclinks)
>
> The question "what should I do with the directories and files we find
> as we recurse?" is now answered by the choice of which bbarchiver to
> pass to the function, rather than by the values of sizeonly, manifest,
> and spcoid. That's not night and day, but I think it's better,
> especially as you imagine adding more features in the future. The
> really important part, for me, is that you can make the bbarchiver do
> anything you like without needing to make any more changes to this
> function. It just arranges to invoke your callbacks. You take it from
> there.
>
> One pretty major question that this patch set doesn't address is what
> the user interface for any of the hypothetical features mentioned
> above ought to look like, or how basebackup.c ought to support them.
> The syntax for the BASE_BACKUP command, like the contents of
> basebackup.c, has grown organically, and doesn't seem to be very
> scalable. Also, the wire protocol - a series of CopyData results which
> the client is entirely responsible for knowing how to interpret and
> about which the server provides only minimal information - doesn't
> much lend itself to extensibility. Some careful design work is likely
> needed in both areas, and this patch does not try to do any of it. I
> am quite interested in discussing those questions, but I felt that
> they weren't the most important problems to solve first.
>
> What do you all think?

The overall idea looks quite nice.  I had a look at some of the
patches, at least 0005 and 0006, and at first look I have one comment.

+/*
+ * Each archive is set as a separate stream of COPY data, and thus begins
+ * with a CopyOutResponse message.
+ */
+static void
+bbsink_libpq_begin_archive(bbsink *sink, const char *archive_name)
+{
+ SendCopyOutResponse();
+}

Some of the bbsink_libpq_* functions call pq_* functions directly, e.g.
bbsink_libpq_begin_backup, whereas others call SendCopy* functions,
which in turn call the pq_* functions.  I think the bbsink_libpq_*
functions could call the pq_* functions directly instead of adding one
more level of function call.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: refactoring basebackup.c

Robert Haas
On Tue, May 12, 2020 at 4:32 AM Dilip Kumar <[hidden email]> wrote:
> Some of the bbsink_libpq_* functions are directly calling pq_* e.g.
> bbsink_libpq_begin_backup whereas others are calling SendCopy*
> functions and therein those are calling pq_* functions.  I think
> bbsink_libpq_* function can directly call pq_* functions instead of
> adding one more level of the function call.

I think all the helper functions have more than one caller, though.
That's why I created them - to avoid duplicating code.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: refactoring basebackup.c

Dilip Kumar-2
On Wed, May 13, 2020 at 1:56 AM Robert Haas <[hidden email]> wrote:

>
> On Tue, May 12, 2020 at 4:32 AM Dilip Kumar <[hidden email]> wrote:
> > Some of the bbsink_libpq_* functions are directly calling pq_* e.g.
> > bbsink_libpq_begin_backup whereas others are calling SendCopy*
> > functions and therein those are calling pq_* functions.  I think
> > bbsink_libpq_* function can directly call pq_* functions instead of
> > adding one more level of the function call.
>
> I think all the helper functions have more than one caller, though.
> That's why I created them - to avoid duplicating code.

You are right, somehow I missed that part.  Sorry for the noise.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: refactoring basebackup.c

Suraj Kharage-2
In reply to this post by Andres Freund
Hi,

Did some performance testing by varying TAR_SEND_SIZE with Robert's refactor patch and without the patch to check the impact.

Below are the details:

Backup type: local backup using pg_basebackup
Data size: around 200GB (200 tables, each around 1.05 GB)
TAR_SEND_SIZE values tested: 8kB, 32kB (the default), 128kB, 1MB (1024kB)

Server details:
RAM: 500 GB
CPU: x86_64, op-modes 32-bit/64-bit, Little Endian, 128 CPUs
Filesystem: ext4

                        8kB              32kB (default)   128kB            1024kB
Without refactor patch  real 10m22.718s  real 8m36.245s   real 6m54.299s   real 18m3.511s
                        user 1m23.629s   user 1m8.471s    user 0m55.690s   user 1m38.197s
                        sys  8m51.410s   sys  7m21.520s   sys  5m46.502s   sys  9m36.517s
With refactor patch     real 10m11.350s  real 8m56.226s   real 7m26.678s   real 18m17.230s
(Robert's patch)        user 1m25.038s   user 1m9.774s    user 0m54.833s   user 1m42.749s
                        sys  8m39.226s   sys  7m41.032s   sys  6m20.057s   sys  9m53.704s

The numbers above are the minimum of two runs of each scenario.

I can see that with TAR_SEND_SIZE at 32kB or 128kB we get good performance, whereas at 1MB it takes about 2.5x more time.

Please let me know your thoughts/suggestions on the same.

--

Thanks & Regards, 
Suraj kharage, 
EnterpriseDB Corporation, 
The Postgres Database Company.

Re: refactoring basebackup.c

Sumanta Mukherjee
Hi Suraj,

Two points I wanted to mention.

  1. The max rate at which the transfer is happening when the tar size is 128 kB is at most .48 GB/sec. Is there a possibility to understand what buffer size is being used? That could help explain some part of the puzzle.
  2. Secondly, the idea of taking just the min of two runs is a bit problematic: how do we justify the performance numbers and show that the differences are not just noise? It might be better to do a few experiments of each kind, then try to fit a basic linear model and report the standard deviation. Order statistics such as min(X1, X2, ..., Xn) are generally biased estimators. A variance calculation for a biased statistic is a bit tricky, so the results could be corrupted by noise.

With Regards,
Sumanta Mukherjee.



Re: refactoring basebackup.c

Robert Haas
In reply to this post by Suraj Kharage-2
On Wed, May 13, 2020 at 12:01 AM Suraj Kharage <[hidden email]> wrote:

The above numbers are taken from the minimum of two runs of each scenario.

I can see, when we have TAR_SEND_SIZE as 32kb or 128kb, it is giving us a good performance whereas, for 1Mb it is taking 2.5x more time.

Please let me know your thoughts/suggestions on the same.

So the patch came out slightly faster at 8kB and slightly slower in the other tests. That's kinda strange. I wonder if it's just noise. How much do the results vary run to run?

I would've expected (and I think Andres thought the same) that a smaller block size would be bad for the patch and a larger block size would be good, but that's not what these numbers show.

I wouldn't worry too much about the regression at 1MB. Probably what's happening there is that we're losing some concurrency - perhaps with smaller block sizes the OS can buffer the entire chunk in the pipe connecting pg_basebackup to the server and start on the next one, but when you go up to 1MB it doesn't fit and ends up spending a lot of time waiting for data to be read from the pipe. Wait event profiling might tell you more. Probably what this suggests is that you want the largest buffer size that doesn't cause you to overrun the network/pipe buffer and no larger. Unfortunately, I have no idea how we'd figure that out dynamically, and I don't see a reason to believe that everyone will have the same size buffers.
 
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: refactoring basebackup.c

Suraj Kharage-2
Hi,

On Wed, May 13, 2020 at 7:49 PM Robert Haas <[hidden email]> wrote:

So the patch came out slightly faster at 8kB and slightly slower in the other tests. That's kinda strange. I wonder if it's just noise. How much do the results vary run to run?
It is not varying much except for the 8kB run. Please see below details for both runs of each scenario.

                 8kB              32kB (default)   128kB            1024kB
Without refactor patch
1st run          real 10m50.924s  real 8m36.245s   real 7m8.690s    real 18m16.898s
                 user 1m29.774s   user 1m8.471s    user 0m54.840s   user 1m39.105s
                 sys  9m13.058s   sys  7m21.520s   sys  6m1.725s    sys  9m42.803s
2nd run          real 10m22.718s  real 8m44.455s   real 6m54.299s   real 18m3.511s
                 user 1m23.629s   user 1m7.896s    user 0m55.690s   user 1m38.197s
                 sys  8m51.410s   sys  7m28.909s   sys  5m46.502s   sys  9m36.517s
With refactor patch
1st run          real 10m11.350s  real 8m56.226s   real 7m26.678s   real 19m5.218s
                 user 1m25.038s   user 1m9.774s    user 0m54.833s   user 1m44.122s
                 sys  8m39.226s   sys  7m41.032s   sys  6m20.057s   sys  10m17.623s
2nd run          real 11m30.500s  real 9m4.103s    real 7m26.713s   real 18m17.230s
                 user 1m45.221s   user 1m6.893s    user 0m54.868s   user 1m42.749s
                 sys  9m37.815s   sys  7m49.393s   sys  6m19.652s   sys  9m53.704s
 

--

Thanks & Regards, 
Suraj kharage, 
EnterpriseDB Corporation, 
The Postgres Database Company.

Re: refactoring basebackup.c

Dipesh Pandit
Hi,

I have repeated the experiment with 8K block size and found that the results are not varying much after applying the patch.
Please find the details below.

Backup type: local backup using pg_basebackup
Data size: around 200GB (200 tables, each around 1.05 GB)
TAR_SEND_SIZE value: 8kB

Server details:
RAM: 500 GB
CPU: x86_64, op-modes 32-bit/64-bit, Little Endian, 128 CPUs
Filesystem: ext4

Results:

Iteration   Without refactor patch   With refactor patch
1st run     real 10m19.001s          real 9m45.291s
            user 1m37.895s           user 1m23.192s
            sys  8m33.008s           sys  8m14.993s
2nd run     real 9m33.970s           real 9m30.560s
            user 1m19.490s           user 1m22.124s
            sys  8m6.062s            sys  8m0.979s
3rd run     real 9m19.327s           real 8m59.241s
            user 1m21.772s           user 1m19.001s
            sys  7m50.613s           sys  7m32.645s
4th run     real 9m56.873s           real 9m52.290s
            user 1m22.370s           user 1m22.175s
            sys  8m27.054s           sys  8m23.052s
5th run     real 9m45.343s           real 9m49.633s
            user 1m23.113s           user 1m23.122s
            sys  8m15.418s           sys  8m19.240s

Later I connected with Suraj to validate the experiment details and found that the setup and steps followed are exactly the same as in the previous experiment.

Thanks,
Dipesh


Re: refactoring basebackup.c

Suraj Kharage-2

On Tue, Jun 30, 2020 at 10:45 AM Dipesh Pandit <[hidden email]> wrote:
Hi,

I have repeated the experiment with 8K block size and found that the results are not varying much after applying the patch.
Please find the details below.


Later I connected with Suraj to validate the experiment details and found that the setup and steps followed are exactly the same in this
experiment when compared with the previous experiment.


Thanks, Dipesh.
It looks like the results are not varying much in your runs since you followed the same steps.
One of my 8kB runs that took more time than the others was probably just noise at the time.

--

Thanks & Regards, 
Suraj kharage,