refactoring basebackup.c


refactoring basebackup.c

Robert Haas
Hi,

I'd like to propose a fairly major refactoring of the server's
basebackup.c. The current code isn't horrific or anything, but the
base backup mechanism has grown quite a few features over the years
and all of the code knows about all of the features. This is going to
make it progressively more difficult to add additional features, and I
have a few in mind that I'd like to add, as discussed below and also
on several other recent threads.[1][2] The attached patch set shows
what I have in mind. It needs more work, but I believe that there's
enough here for someone to review the overall direction, and even some
of the specifics, and hopefully give me some useful feedback.

This patch set is built around the idea of creating two new
abstractions, a base backup sink -- or bbsink -- and a base backup
archiver -- or bbarchiver.  Each of these works like a foreign data
wrapper or custom scan or TupleTableSlot. That is, there's a table of
function pointers that act like method callbacks. Every implementation
can allocate a struct of sufficient size for its own bookkeeping data,
and the first member of the struct is always the same, and basically
holds the data that all implementations must store, including a
pointer to the table of function pointers. If we were using C++,
bbarchiver and bbsink would be abstract base classes.
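To make the shape of that concrete, here is a minimal C sketch of how such an abstraction might be laid out. The names and callbacks are illustrative only, not the actual API in the patch set:

```c
#include <stddef.h>

struct bbsink;

/* Table of function pointers acting as method callbacks. */
typedef struct bbsink_ops
{
	void		(*begin_archive) (struct bbsink *sink, const char *name);
	void		(*archive_contents) (struct bbsink *sink,
									 const char *data, size_t len);
	void		(*end_archive) (struct bbsink *sink);
} bbsink_ops;

/* Common first member: the data every implementation must store. */
typedef struct bbsink
{
	const bbsink_ops *ops;
	struct bbsink *next;		/* next sink in the chain, if any */
} bbsink;

/*
 * A concrete implementation allocates a struct of sufficient size for
 * its own bookkeeping; because "base" is the first member, a pointer to
 * a bbsink_foo can safely be used wherever a bbsink pointer is expected.
 * (bbsink_foo is a made-up example, not a sink from the patch set.)
 */
typedef struct bbsink_foo
{
	bbsink		base;			/* must be first */
	long		bytes_seen;		/* implementation-specific state */
} bbsink_foo;
```

The same pattern, with a different callback table, would apply to bbarchiver.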

They represent closely-related concepts, so much so that I initially
thought we could get by with just one new abstraction layer. I found
on experimentation that this did not work well, so I split it up into
two and that worked a lot better. The distinction is this: a bbsink is
something to which you can send a bunch of archives -- currently, each
would be a tarfile -- and also a backup manifest. A bbarchiver is
something to which you send every file in the data directory
individually, or at least the ones that are getting backed up, plus
any that are being injected into the backup (e.g. the backup_label).
Commonly, a bbsink will do something with the data and then forward it
to a subsequent bbsink, or a bbarchiver will do something with the
data and then forward it to a subsequent bbarchiver or bbsink. For
example, there's a bbarchiver_tar object which, like any bbarchiver,
sees all the files and their contents as input. The output is a
tarfile, which gets sent to a bbsink. As things stand in the patch set
now, the tar archives are ultimately sent to the "libpq" bbsink, which
sends them to the client.

In the future, we could have other bbarchivers. For example, we could
add a "pax", "zip", or "cpio" bbarchiver that produces archives of that
format, and any given backup could choose which one to use. Or, we
could have a bbarchiver that runs each individual file through a
compression algorithm and then forwards the resulting data to a
subsequent bbarchiver. That would make it easy to produce a tarfile of
individually compressed files, which is one possible way of creating a
seekable archive.[3] Likewise, we could have other bbsinks. For
example, we could have a "localdisk" bbsink that causes the server to
write the backup somewhere in the local filesystem instead of
streaming it out over libpq. Or, we could have an "s3" bbsink that
writes the archives to S3. We could also have bbsinks that compress
the input archives using some compressor (e.g. lz4, zstd, bzip2, ...)
and forward the resulting compressed archives to the next bbsink in
the chain. I'm not trying to pass judgement on whether any of these
particular things are things we want to do, nor am I saying that this
patch set solves all the problems with doing them. However, I believe
it will make such things a whole lot easier to implement, because all
of the knowledge about whatever new functionality is being added is
centralized in one place, rather than being spread across the entirety
of basebackup.c. As an example of this, look at how 0010 changes
basebackup.c and basebackup_tar.c: afterwards, basebackup.c no longer
knows anything that is tar-specific, whereas right now it knows about
tar-specific things in many places.

Here's an overview of this patch set:

0001-0003 are cleanup patches that I have posted for review on
separate threads.[4][5] They are included here to make it easy to
apply this whole series if someone wishes to do so.

0004 is a minor refactoring that reduces by 1 the number of functions
in basebackup.c that know about the specifics of tarfiles. It is just
a preparatory patch and probably not very interesting.

0005 invents the bbsink abstraction.

0006 creates basebackup_libpq.c and moves all code that knows about
the details of sending archives via libpq there. The functionality is
exposed for use by basebackup.c as a new type of bbsink, bbsink_libpq.

0007 creates basebackup_throttle.c and moves all code that knows about
throttling backups there. The functionality is exposed for use by
basebackup.c as a new type of bbsink, bbsink_throttle. This means that
the throttling logic could be reused to throttle output to any final
destination. Essentially, this is a bbsink that just passes everything
it gets through to the next bbsink, but with a rate limit. If
throttling's not enabled, no bbsink_throttle object is created, so all
of the throttling code is completely out of the execution pipeline.
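As a rough sketch of that pass-through idea, with hypothetical names rather than the patch's actual callbacks, a throttling sink might look like this:

```c
#include <stddef.h>

struct tsink;

/* Simplified sink: one content callback plus a forwarding pointer. */
typedef struct tsink
{
	void		(*archive_contents) (struct tsink *sink,
									 const char *data, size_t len);
	struct tsink *next;			/* sink to forward to */
	long		bytes_seen;		/* throttle bookkeeping */
} tsink;

/*
 * Pass everything through unchanged, but account for the rate limit
 * first.  When throttling is disabled, this sink is simply never put
 * into the chain, so the hot path pays no cost at all.
 */
static void
throttle_archive_contents(tsink *sink, const char *data, size_t len)
{
	sink->bytes_seen += (long) len;
	/* ... sleep here if the configured rate has been exceeded ... */
	sink->next->archive_contents(sink->next, data, len);
}
```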

0008 creates basebackup_progress.c and moves all code that knows about
progress reporting there. The functionality is exposed for use by
basebackup.c as a new type of bbsink, bbsink_progress. Since the
abstraction doesn't fit perfectly in this case, some extra functions
are added to work around the problem. This is not entirely elegant,
but I still think it's an improvement over what we have now, and
I don't have a better idea.

0009 invents the bbarchiver abstraction.

0010 invents two new bbarchivers, a tar bbarchiver and a tarsize
bbarchiver, and refactors basebackup.c to make use of them. The tar
bbarchiver puts the files it sees into tar archives and forwards the
resulting archives to a bbsink. The tarsize bbarchiver is used to
support the PROGRESS option to the BASE_BACKUP command. It just
estimates the size of the backup by summing up the file sizes without
reading them. This approach is good for a couple of reasons. First,
without something like this, it's impossible to keep basebackup.c from
knowing something about the tar format, because the PROGRESS option
doesn't just figure out how big the files to be backed up are: it
figures out how big it thinks the archives will be, and that involves
tar-specific considerations. This area needs more work, as the whole
idea of measuring progress by estimating the archive size is going to
break down as soon as server-side compression is in the picture.
Second, this makes the code path that we use for figuring out the
backup size much more similar to the path we use for
performing the actual backup. For instance, with this patch, we
include the exact same files in the calculation that we will include
in the backup, and in the same order, something that's not true today.
The basebackup_tar.c file added by this patch is sadly lacking in
comments, which I will add in a future version of the patch set. I
think, though, that it will not be too unclear what's going on here.
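For reference, the tar-specific part of that size estimate is simple but easy to get wrong: each archive member costs a 512-byte header block plus its payload rounded up to a 512-byte boundary (and the archive ends with two zero blocks). A sketch of the per-file arithmetic, with a made-up helper name:

```c
#include <stdint.h>

#define TAR_BLCKSZ 512

/*
 * Hypothetical helper: the number of bytes a file of the given size
 * occupies as a tar archive member -- one 512-byte header block plus
 * the payload padded up to a block boundary.  Arithmetic like this is
 * exactly the tar knowledge that a tarsize bbarchiver can keep out of
 * basebackup.c proper.
 */
static uint64_t
tar_member_size(uint64_t file_size)
{
	uint64_t	padded;

	padded = (file_size + TAR_BLCKSZ - 1) / TAR_BLCKSZ * TAR_BLCKSZ;
	return TAR_BLCKSZ + padded;
}
```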

0011 invents another new kind of bbarchiver. This bbarchiver just
eavesdrops on the stream of files to facilitate backup manifest
construction, and then forwards everything through to a subsequent
bbarchiver. Like bbsink_throttle, it can be entirely omitted if not
used. This patch is a bit clunky at the moment and needs some polish,
but it is another demonstration of how these abstractions can be used
to simplify basebackup.c, so that basebackup.c only has to worry about
determining what should be backed up and not have to worry much about
all the specific things that need to be done as part of that.

Although this patch set adds quite a bit of code on net, it makes
basebackup.c considerably smaller and simpler, removing more than 400
lines of code from that file, about 20% of the current total. There
are some gratifying changes vs. the status quo. For example, in
master, we have this:

sendDir(const char *path, int basepathlen, bool sizeonly, List *tablespaces,
                bool sendtblspclinks, backup_manifest_info *manifest,
                const char *spcoid)

Notably, the sizeonly flag makes the function not do what the name of
the function suggests that it does. Also, we've got to pass some extra
fields through to enable specific features. With the patch set, the
equivalent function looks like this:

archive_directory(bbarchiver *archiver, const char *path, int basepathlen,
                                  List *tablespaces, bool sendtblspclinks)

The question "what should I do with the directories and files we find
as we recurse?" is now answered by the choice of which bbarchiver to
pass to the function, rather than by the values of sizeonly, manifest,
and spcoid. That's not night and day, but I think it's better,
especially as you imagine adding more features in the future. The
really important part, for me, is that you can make the bbarchiver do
anything you like without needing to make any more changes to this
function. It just arranges to invoke your callbacks. You take it from
there.

One pretty major question that this patch set doesn't address is what
the user interface for any of the hypothetical features mentioned
above ought to look like, or how basebackup.c ought to support them.
The syntax for the BASE_BACKUP command, like the contents of
basebackup.c, has grown organically, and doesn't seem to be very
scalable. Also, the wire protocol - a series of CopyData results which
the client is entirely responsible for knowing how to interpret and
about which the server provides only minimal information - doesn't
much lend itself to extensibility. Some careful design work is likely
needed in both areas, and this patch does not try to do any of it. I
am quite interested in discussing those questions, but I felt that
they weren't the most important problems to solve first.

What do you all think?

Thanks,

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

[1] http://postgr.es/m/CA+TgmoZubLXYR+Pd_gi3MVgyv5hQdLm-GBrVXkun-Lewaw12Kg@...
[2] http://postgr.es/m/CA+TgmoYr7+-0_vyQoHbTP5H3QGZFgfhnrn6ewDteF=kUqkG=Fw@...
[3] http://postgr.es/m/CA+TgmoZQCoCyPv6fGoovtPEZF98AXCwYDnSB0=p5XtxNY68r_A@...
and following
[4] http://postgr.es/m/CA+TgmoYq+59SJ2zBbP891ngWPA9fymOqntqYcweSDYXS2a620A@...
[5] http://postgr.es/m/CA+TgmobWbfReO9-XFk8urR1K4wTNwqoHx_v56t7=T8KaiEoKNw@...



Re: refactoring basebackup.c

Robert Haas
So it might be good if I'd remembered to attach the patches. Let's try
that again.

...Robert

v1-0005-Introduce-bbsink-abstraction.patch (12K)
v1-0002-Minor-code-cleanup-for-perform_base_backup.patch (3K)
v1-0003-Assorted-cleanup-of-tar-related-code.patch (22K)
v1-0004-Recast-_tarWriteDirectory-as-convert_link_to_dire.patch (4K)
v1-0001-Don-t-export-basebackup.c-s-sendTablespace.patch (9K)
v1-0006-Convert-libpq-related-code-to-a-bbsink.patch (43K)
v1-0009-Introduce-bbarchiver-abstraction.patch (15K)
v1-0008-Convert-progress-reporting-code-to-a-bbsink.patch (25K)
v1-0010-Create-and-use-bbarchiver-implementations-for-tar.patch (42K)
v1-0007-Convert-throttling-related-code-to-a-bbsink.patch (18K)
v1-0011-WIP-Convert-backup-manifest-generation-to-a-bbarc.patch (21K)

Re: refactoring basebackup.c

Andres Freund
In reply to this post by Robert Haas
Hi,

On 2020-05-08 16:53:09 -0400, Robert Haas wrote:

> They represent closely-related concepts, so much so that I initially
> thought we could get by with just one new abstraction layer. I found
> on experimentation that this did not work well, so I split it up into
> two and that worked a lot better. The distinction is this: a bbsink is
> something to which you can send a bunch of archives -- currently, each
> would be a tarfile -- and also a backup manifest. A bbarchiver is
> something to which you send every file in the data directory
> individually, or at least the ones that are getting backed up, plus
> any that are being injected into the backup (e.g. the backup_label).
> Commonly, a bbsink will do something with the data and then forward it
> to a subsequent bbsink, or a bbarchiver will do something with the
> data and then forward it to a subsequent bbarchiver or bbsink. For
> example, there's a bbarchiver_tar object which, like any bbarchiver,
> sees all the files and their contents as input. The output is a
> tarfile, which gets sent to a bbsink. As things stand in the patch set
> now, the tar archives are ultimately sent to the "libpq" bbsink, which
> sends them to the client.

Hm.

I wonder if there are cases where recursively forwarding like this will
cause noticeable performance effects. The only operation that seems
frequent enough to potentially be noticeable would be "chunks" of the
file. So perhaps it'd be good to make sure we read in large enough
chunks?

> 0010 invents two new bbarchivers, a tar bbarchiver and a tarsize
> bbarchiver, and refactors basebackup.c to make use of them. The tar
> bbarchiver puts the files it sees into tar archives and forwards the
> resulting archives to a bbsink. The tarsize bbarchiver is used to
> support the PROGRESS option to the BASE_BACKUP command. It just
> estimates the size of the backup by summing up the file sizes without
> reading them. This approach is good for a couple of reasons. First,
> without something like this, it's impossible to keep basebackup.c from
> knowing something about the tar format, because the PROGRESS option
> doesn't just figure out how big the files to be backed up are: it
> figures out how big it thinks the archives will be, and that involves
> tar-specific considerations.

ISTM that it's not actually good to have the progress calculations
include the tar overhead. As you say:

> This area needs more work, as the whole idea of measuring progress by
> estimating the archive size is going to break down as soon as
> server-side compression is in the picture.

This, to me, indicates that we should measure the progress solely based
on how much of the "source" data was processed. The overhead of tar, the
reduction due to compression, shouldn't be included.


> What do you all think?

I've not thought enough about the specifics, but I think it looks like
it's going roughly in a better direction.

One thing I wonder about is how stateful the interface is. Archivers
will pretty much always track which file is currently open etc. Somehow
such a repeating state machine seems a bit ugly - but I don't really
have a better answer.

Greetings,

Andres Freund



Re: refactoring basebackup.c

Robert Haas
On Fri, May 8, 2020 at 5:27 PM Andres Freund <[hidden email]> wrote:
> I wonder if there are cases where recursively forwarding like this will
> cause noticeable performance effects. The only operation that seems
> frequent enough to potentially be noticeable would be "chunks" of the
> file. So perhaps it'd be good to make sure we read in large enough
> chunks?

Yeah, that needs to be tested. Right now the chunk size is 32kB but it
might be a good idea to go larger. Another thing is that right now the
chunk size is tied to the protocol message size, and I'm not sure
whether the size that's optimal for disk reads is also optimal for
protocol messages.

> This, to me, indicates that we should measure the progress solely based
> on how much of the "source" data was processed. The overhead of tar, the
> reduction due to compression, shouldn't be included.

I don't think it's a particularly bad thing that we include a small
amount of progress for sending an empty file, a directory, or a
symlink. That could make the results more meaningful if you have a
database with lots of empty relations in it. However, I agree that the
effect of compression shouldn't be included. To get there, I think we
need to redesign the wire protocol. Right now, the server has no way
of letting the client know how many uncompressed bytes it's sent, and
the client has no way of figuring it out without uncompressing, which
seems like something we want to avoid.

There are some other problems with the current wire protocol, too:

1. The syntax for the BASE_BACKUP command is large and unwieldy. We
really ought to adopt an extensible options syntax, like COPY, VACUUM,
EXPLAIN, etc. do, rather than using a zillion ad-hoc bolt-ons, each
with bespoke lexer and parser support.

2. The client is sent a list of tablespaces and is supposed to use
that to expect an equal number of archives, computing the name for
each one on the client side from the tablespace info. However, I think
we should be able to support modes like "put all the tablespaces in a
single archive" or "send a separate archive for every 256GB" or "ship
it all to the cloud and don't send me any archives". To get there, I
think we should have the server send the archive name to the clients,
and the client should just keep receiving the next archive until it's
told that there are no more. Then if there's one archive or ten
archives or no archives, the client doesn't have to care. It just
receives what the server sends until it hears that there are no more.
It also doesn't know how the server is naming the archives; the server
can, for example, adjust the archive names based on which compression
format is being chosen, without knowledge of the server's naming
conventions needing to exist on the client side.

I think we should keep support for the current BASE_BACKUP command but
either add a new variant using extensible options, or else invent a
whole new command with a different name (BACKUP, SEND_BACKUP,
whatever) that takes extensible options. This command should send back
all the archives and the backup manifest using a single COPY stream
rather than multiple COPY streams. Within the COPY stream, we'll
invent a sub-protocol, e.g. based on the first letter of the message,
e.g.:

t = Tablespace boundary. No further message payload. Indicates, for
progress reporting purposes, that we are advancing to the next
tablespace.
f = Filename. The remainder of the message payload is the name of the
next file that will be transferred.
d = Data. The next four bytes contain the number of uncompressed bytes
covered by this message, for progress reporting purposes. The rest of
the message is payload, possibly compressed. Could be empty, if the
data is being shipped elsewhere, and these messages are only being
sent to update the client's notion of progress.
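To make the framing concrete, a client-side decoder for this hypothetical sub-protocol might look roughly like the following. Since none of this is specified yet, the enum names and the network-byte-order choice for the length field are assumptions:

```c
#include <stddef.h>
#include <stdint.h>

typedef enum
{
	MSG_TABLESPACE,				/* 't': advancing to the next tablespace */
	MSG_FILENAME,				/* 'f': name of the next file follows */
	MSG_DATA,					/* 'd': possibly-compressed payload */
	MSG_INVALID
} msg_type;

/*
 * Classify one CopyData payload under the proposed sub-protocol.  For
 * 'd' messages, also extract the four-byte uncompressed-length field
 * (assumed to be network byte order here) used for progress reporting.
 */
static msg_type
classify_message(const unsigned char *buf, size_t len,
				 uint32_t *uncompressed_len)
{
	if (len < 1)
		return MSG_INVALID;
	switch (buf[0])
	{
		case 't':
			return MSG_TABLESPACE;
		case 'f':
			return MSG_FILENAME;
		case 'd':
			if (len < 5)
				return MSG_INVALID;
			*uncompressed_len = ((uint32_t) buf[1] << 24) |
				((uint32_t) buf[2] << 16) |
				((uint32_t) buf[3] << 8) |
				(uint32_t) buf[4];
			return MSG_DATA;
		default:
			return MSG_INVALID;
	}
}
```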

> I've not thought enough about the specifics, but I think it looks like
> it's going roughly in a better direction.

Good to hear.

> One thing I wonder about is how stateful the interface is. Archivers
> will pretty much always track which file is currently open etc. Somehow
> such a repeating state machine seems a bit ugly - but I don't really
> have a better answer.

I thought about that a bit, too. There might be some way to unify that
by having some common context object that's defined by basebackup.c
and all archivers get it, so that they have some commonly-desired
details without needing bespoke code, but I'm not sure at this point
whether that will actually produce a nicer result. Even if we don't
have it initially, it seems like it wouldn't be very hard to add it
later, so I'm not too stressed about it.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: refactoring basebackup.c

Sumanta Mukherjee
Hi Robert,
Please see my comments inline below.

On Tue, May 12, 2020 at 12:33 AM Robert Haas <[hidden email]> wrote:
> Yeah, that needs to be tested. Right now the chunk size is 32kB but it
> might be a good idea to go larger. Another thing is that right now the
> chunk size is tied to the protocol message size, and I'm not sure
> whether the size that's optimal for disk reads is also optimal for
> protocol messages.

One needs to balance the number of packets sent across the network, so
if the disk read size and the network packet size could be unified, it
might provide a better optimization.
 

> I don't think it's a particularly bad thing that we include a small
> amount of progress for sending an empty file, a directory, or a
> symlink. That could make the results more meaningful if you have a
> database with lots of empty relations in it. However, I agree that the
> effect of compression shouldn't be included. To get there, I think we
> need to redesign the wire protocol. Right now, the server has no way
> of letting the client know how many uncompressed bytes it's sent, and
> the client has no way of figuring it out without uncompressing, which
> seems like something we want to avoid.


I agree here too, except that if we have very many tar files one might
cringe, but sending the extra amount from those tar files looks okay to me.
 
> There are some other problems with the current wire protocol, too:
>
> 1. The syntax for the BASE_BACKUP command is large and unwieldy. We
> really ought to adopt an extensible options syntax, like COPY, VACUUM,
> EXPLAIN, etc. do, rather than using a zillion ad-hoc bolt-ons, each
> with bespoke lexer and parser support.
>
> 2. The client is sent a list of tablespaces and is supposed to use
> that to expect an equal number of archives, computing the name for
> each one on the client side from the tablespace info. However, I think
> we should be able to support modes like "put all the tablespaces in a
> single archive" or "send a separate archive for every 256GB" or "ship
> it all to the cloud and don't send me any archives". To get there, I
> think we should have the server send the archive name to the clients,
> and the client should just keep receiving the next archive until it's
> told that there are no more. Then if there's one archive or ten
> archives or no archives, the client doesn't have to care. It just
> receives what the server sends until it hears that there are no more.
> It also doesn't know how the server is naming the archives; the server
> can, for example, adjust the archive names based on which compression
> format is being chosen, without knowledge of the server's naming
> conventions needing to exist on the client side.

One thing to remember here is the trade-off between the number of
options we provide and users coming back to report which combinations
do not work. For example, if a user script says "put all the
tablespaces in a single archive" and somebody later changes the script
to break the backup down at 256GB, then which setting takes precedence
needs to be decided. When the number of such options becomes very
large, this could lead to complications.
  
> I think we should keep support for the current BASE_BACKUP command but
> either add a new variant using extensible options, or else invent a
> whole new command with a different name (BACKUP, SEND_BACKUP,
> whatever) that takes extensible options. This command should send back
> all the archives and the backup manifest using a single COPY stream
> rather than multiple COPY streams. Within the COPY stream, we'll
> invent a sub-protocol, e.g. based on the first letter of the message,
> e.g.:
>
> t = Tablespace boundary. No further message payload. Indicates, for
> progress reporting purposes, that we are advancing to the next
> tablespace.
> f = Filename. The remainder of the message payload is the name of the
> next file that will be transferred.
> d = Data. The next four bytes contain the number of uncompressed bytes
> covered by this message, for progress reporting purposes. The rest of
> the message is payload, possibly compressed. Could be empty, if the
> data is being shipped elsewhere, and these messages are only being
> sent to update the client's notion of progress.
 
I support this.
 
> I thought about that a bit, too. There might be some way to unify that
> by having some common context object that's defined by basebackup.c
> and all archivers get it, so that they have some commonly-desired
> details without needing bespoke code, but I'm not sure at this point
> whether that will actually produce a nicer result. Even if we don't
> have it initially, it seems like it wouldn't be very hard to add it
> later, so I'm not too stressed about it.
 
--
Sumanta Mukherjee
The Enterprise PostgreSQL Company



Re: refactoring basebackup.c

Dilip Kumar
In reply to this post by Robert Haas
On Sat, May 9, 2020 at 2:23 AM Robert Haas <[hidden email]> wrote:

> [ Robert's original message quoted in full; trimmed ]
>                                   List *tablespaces, bool sendtblspclinks)
>
> The question "what should I do with the directories and files we find
> as we recurse?" is now answered by the choice of which bbarchiver to
> pass to the function, rather than by the values of sizeonly, manifest,
> and spcoid. That's not night and day, but I think it's better,
> especially as you imagine adding more features in the future. The
> really important part, for me, is that you can make the bbarchiver do
> anything you like without needing to make any more changes to this
> function. It just arranges to invoke your callbacks. You take it from
> there.
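
The callback-driven split described here can be illustrated with a toy sketch; every name below is invented for illustration rather than taken from the patch:

```c
#include <stddef.h>

/* The traversal code calls whatever callbacks the caller's bbarchiver
 * supplies, so "what to do with each file" is decided by the choice of
 * archiver.  All names here are invented; the real API will differ. */
typedef struct bbarchiver bbarchiver;

typedef struct bbarchiver_ops
{
	void		(*archive_file) (bbarchiver *archiver, const char *name,
								 size_t size);
} bbarchiver_ops;

struct bbarchiver
{
	const bbarchiver_ops *ops;
};

/* A "tarsize"-style archiver: sums file sizes without reading the files. */
typedef struct
{
	bbarchiver	base;
	size_t		total;
} sizing_archiver;

static void
sizing_archive_file(bbarchiver *archiver, const char *name, size_t size)
{
	(void) name;
	((sizing_archiver *) archiver)->total += size;
}

static const bbarchiver_ops sizing_ops = {sizing_archive_file};

/* Stand-in for archive_directory(): it only knows how to find files and
 * invoke the callbacks, not what the archiver does with them. */
static void
archive_directory_demo(bbarchiver *archiver)
{
	archiver->ops->archive_file(archiver, "base/1/1259", 8192);
	archiver->ops->archive_file(archiver, "base/1/1249", 16384);
}
```

To build a tar archive instead of estimating sizes, you would pass an archiver whose callback writes tar headers and file contents to a bbsink; archive_directory_demo itself would not change at all.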
>
> One pretty major question that this patch set doesn't address is what
> the user interface for any of the hypothetical features mentioned
> above ought to look like, or how basebackup.c ought to support them.
> The syntax for the BASE_BACKUP command, like the contents of
> basebackup.c, has grown organically, and doesn't seem to be very
> scalable. Also, the wire protocol - a series of CopyData results which
> the client is entirely responsible for knowing how to interpret and
> about which the server provides only minimal information - doesn't
> much lend itself to extensibility. Some careful design work is likely
> needed in both areas, and this patch does not try to do any of it. I
> am quite interested in discussing those questions, but I felt that
> they weren't the most important problems to solve first.
>
> What do you all think?

The overall idea looks quite nice.  I had a look at some of the
patches, at least 0005 and 0006.  On first look, I have one comment.

+/*
+ * Each archive is sent as a separate stream of COPY data, and thus begins
+ * with a CopyOutResponse message.
+ */
+static void
+bbsink_libpq_begin_archive(bbsink *sink, const char *archive_name)
+{
+ SendCopyOutResponse();
+}

Some of the bbsink_libpq_* functions are directly calling pq_* e.g.
bbsink_libpq_begin_backup whereas others are calling SendCopy*
functions and therein those are calling pq_* functions.  I think
bbsink_libpq_* function can directly call pq_* functions instead of
adding one more level of the function call.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: refactoring basebackup.c

Robert Haas
On Tue, May 12, 2020 at 4:32 AM Dilip Kumar <[hidden email]> wrote:
> Some of the bbsink_libpq_* functions are directly calling pq_* e.g.
> bbsink_libpq_begin_backup whereas others are calling SendCopy*
> functions and therein those are calling pq_* functions.  I think
> bbsink_libpq_* function can directly call pq_* functions instead of
> adding one more level of the function call.

I think all the helper functions have more than one caller, though.
That's why I created them - to avoid duplicating code.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: refactoring basebackup.c

Dilip Kumar-2
On Wed, May 13, 2020 at 1:56 AM Robert Haas <[hidden email]> wrote:

>
> On Tue, May 12, 2020 at 4:32 AM Dilip Kumar <[hidden email]> wrote:
> > Some of the bbsink_libpq_* functions are directly calling pq_* e.g.
> > bbsink_libpq_begin_backup whereas others are calling SendCopy*
> > functions and therein those are calling pq_* functions.  I think
> > bbsink_libpq_* function can directly call pq_* functions instead of
> > adding one more level of the function call.
>
> I think all the helper functions have more than one caller, though.
> That's why I created them - to avoid duplicating code.

You are right, somehow I missed that part.  Sorry for the noise.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: refactoring basebackup.c

Suraj Kharage-2
In reply to this post by Andres Freund
Hi,

Did some performance testing by varying TAR_SEND_SIZE with Robert's refactor patch and without the patch to check the impact.

Below are the details:

Backup type: local backup using pg_basebackup
Data size: around 200GB (200 tables, each around 1.05GB)
TAR_SEND_SIZE values tested: 8kB, 32kB (the default), 128kB, 1MB (1024kB)

Server details:
RAM: 500GB
CPU: x86_64, 32-bit/64-bit op-modes, little endian, 128 CPUs
Filesystem: ext4

                         8kB               32kB (default)    128kB             1MB
Without refactor patch   real 10m22.718s   real  8m36.245s   real  6m54.299s   real 18m3.511s
                         user  1m23.629s   user  1m8.471s    user  0m55.690s   user  1m38.197s
                         sys   8m51.410s   sys   7m21.520s   sys   5m46.502s   sys   9m36.517s
With refactor patch      real 10m11.350s   real  8m56.226s   real  7m26.678s   real 18m17.230s
                         user  1m25.038s   user  1m9.774s    user  0m54.833s   user  1m42.749s
                         sys   8m39.226s   sys   7m41.032s   sys   6m20.057s   sys   9m53.704s

The above numbers are the minimum of two runs of each scenario.

With TAR_SEND_SIZE at 32kB or 128kB we get good performance, whereas at 1MB the backup takes about 2.5x more time.

Please let me know your thoughts/suggestions on the same.

--
--

Thanks & Regards, 
Suraj kharage, 
EnterpriseDB Corporation, 
The Postgres Database Company.

Re: refactoring basebackup.c

Sumanta Mukherjee
Hi Suraj,

Two points I wanted to mention.

  1. The max rate at which the transfer happens when the tar send size is 128kB is at most 0.48 GB/sec. Is there a way to find out what buffer size is being used? That could help explain part of the puzzle.
  2. The idea of taking just the minimum of two runs is a bit problematic: how do we justify the performance numbers and establish that the differences are not just noise? It might be better to run each scenario several times, fit a basic linear model, and report the standard deviation. An "order statistic" such as min(X1, X2, ..., Xn) is generally a biased estimator, and calculating the variance of a biased statistic is tricky, so the results could be corrupted by noise.

With Regards,
Sumanta Mukherjee.


On Wed, May 13, 2020 at 9:31 AM Suraj Kharage <[hidden email]> wrote:
Did some performance testing by varying TAR_SEND_SIZE with Robert's refactor patch and without the patch to check the impact. [...]

Re: refactoring basebackup.c

Robert Haas
In reply to this post by Suraj Kharage-2
On Wed, May 13, 2020 at 12:01 AM Suraj Kharage <[hidden email]> wrote:

I can see, when we have TAR_SEND_SIZE as 32kb or 128kb, it is giving us a good performance whereas, for 1Mb it is taking 2.5x more time.

Please let me know your thoughts/suggestions on the same.

So the patch came out slightly faster at 8kB and slightly slower in the other tests. That's kinda strange. I wonder if it's just noise. How much do the results vary run to run?

I would've expected (and I think Andres thought the same) that a smaller block size would be bad for the patch and a larger block size would be good, but that's not what these numbers show.

I wouldn't worry too much about the regression at 1MB. Probably what's happening there is that we're losing some concurrency - perhaps with smaller block sizes the OS can buffer the entire chunk in the pipe connecting pg_basebackup to the server and start on the next one, but when you go up to 1MB it doesn't fit and ends up spending a lot of time waiting for data to be read from the pipe. Wait event profiling might tell you more. Probably what this suggests is that you want the largest buffer size that doesn't cause you to overrun the network/pipe buffer and no larger. Unfortunately, I have no idea how we'd figure that out dynamically, and I don't see a reason to believe that everyone will have the same size buffers.
 
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: refactoring basebackup.c

Suraj Kharage-2
Hi,

On Wed, May 13, 2020 at 7:49 PM Robert Haas <[hidden email]> wrote:

So the patch came out slightly faster at 8kB and slightly slower in the other tests. That's kinda strange. I wonder if it's just noise. How much do the results vary run to run?
It is not varying much except for the 8kB run. Please see below details for both runs of each scenario.

                                 8kB               32kB (default)    128kB             1MB
Without refactor patch, 1st run  real 10m50.924s   real  8m36.245s   real  7m8.690s    real 18m16.898s
                                 user  1m29.774s   user  1m8.471s    user  0m54.840s   user  1m39.105s
                                 sys   9m13.058s   sys   7m21.520s   sys   6m1.725s    sys   9m42.803s
Without refactor patch, 2nd run  real 10m22.718s   real  8m44.455s   real  6m54.299s   real 18m3.511s
                                 user  1m23.629s   user  1m7.896s    user  0m55.690s   user  1m38.197s
                                 sys   8m51.410s   sys   7m28.909s   sys   5m46.502s   sys   9m36.517s
With refactor patch, 1st run     real 10m11.350s   real  8m56.226s   real  7m26.678s   real 19m5.218s
                                 user  1m25.038s   user  1m9.774s    user  0m54.833s   user  1m44.122s
                                 sys   8m39.226s   sys   7m41.032s   sys   6m20.057s   sys  10m17.623s
With refactor patch, 2nd run     real 11m30.500s   real  9m4.103s    real  7m26.713s   real 18m17.230s
                                 user  1m45.221s   user  1m6.893s    user  0m54.868s   user  1m42.749s
                                 sys   9m37.815s   sys   7m49.393s   sys   6m19.652s   sys   9m53.704s
 

--
--

Thanks & Regards, 
Suraj kharage, 
EnterpriseDB Corporation, 
The Postgres Database Company.

Re: refactoring basebackup.c

Dipesh Pandit
Hi,

I have repeated the experiment with 8K block size and found that the results are not varying much after applying the patch.
Please find the details below.

Backup type: local backup using pg_basebackup
Data size: around 200GB (200 tables, each around 1.05GB)
TAR_SEND_SIZE value: 8kB

Server details:
RAM: 500GB
CPU: x86_64, 32-bit/64-bit op-modes, little endian, 128 CPUs
Filesystem: ext4

Results:

Iteration   Without refactor patch   With refactor patch
1st run     real 10m19.001s          real  9m45.291s
            user  1m37.895s          user  1m23.192s
            sys   8m33.008s          sys   8m14.993s
2nd run     real  9m33.970s          real  9m30.560s
            user  1m19.490s          user  1m22.124s
            sys   8m6.062s           sys   8m0.979s
3rd run     real  9m19.327s          real  8m59.241s
            user  1m21.772s          user  1m19.001s
            sys   7m50.613s          sys   7m32.645s
4th run     real  9m56.873s          real  9m52.290s
            user  1m22.370s          user  1m22.175s
            sys   8m27.054s          sys   8m23.052s
5th run     real  9m45.343s          real  9m49.633s
            user  1m23.113s          user  1m23.122s
            sys   8m15.418s          sys   8m19.240s

Later I connected with Suraj to validate the experiment details and found that the setup and steps followed are exactly the same in this
experiment when compared with the previous experiment.

Thanks,
Dipesh

On Thu, May 14, 2020 at 7:50 AM Suraj Kharage <[hidden email]> wrote:
Hi,

On Wed, May 13, 2020 at 7:49 PM Robert Haas <[hidden email]> wrote:

So the patch came out slightly faster at 8kB and slightly slower in the other tests. That's kinda strange. I wonder if it's just noise. How much do the results vary run to run?
It is not varying much except for 8kB run. Please see below details for both runs of each scenario.


Re: refactoring basebackup.c

Suraj Kharage-2

On Tue, Jun 30, 2020 at 10:45 AM Dipesh Pandit <[hidden email]> wrote:
Hi,

I have repeated the experiment with 8K block size and found that the results are not varying much after applying the patch.
Please find the details below.


Later I connected with Suraj to validate the experiment details and found that the setup and steps followed are exactly the same in this
experiment when compared with the previous experiment.


Thanks Dipesh.
It looks like the results are not varying much with your run, since you followed the same steps.
One of my runs with 8kB took more time than the others, which might have been because of noise at that time.

--
--

Thanks & Regards, 
Suraj kharage, 


Re: refactoring basebackup.c

Robert Haas
In reply to this post by Robert Haas
On Fri, May 8, 2020 at 4:55 PM Robert Haas <[hidden email]> wrote:
> So it might be good if I'd remembered to attach the patches. Let's try
> that again.

Here's an updated patch set. This is now rebased over master and
includes as 0001 the patch I posted separately at
http://postgr.es/m/CA+TgmobAczXDRO_Gr2euo_TxgzaH1JxbNxvFx=HYvBinefNH8Q@...
but drops some other patches that were committed meanwhile. 0002-0009
of this series are basically the same as 0004-0011 from the previous
series, except for rebasing and fixing a bug I discovered in what's
now 0006. 0010 does a refactoring of pg_basebackup along similar lines
to the server-side refactoring from patches earlier in the series.
0011 is a really terrible, hacky, awful demonstration of how this
infrastructure can support server-side compression. If you apply it
and take a tar-format backup without -R, you will get .tar files that
are actually .tar.gz files. You can rename them, decompress them, and
use pg_verifybackup to check that everything is OK. If you try to do
anything else with 0011 applied, everything will break.

In the process of working on this, I learned a lot about how
pg_basebackup actually works, and found out about a number of things
that, with the benefit of hindsight, seem like they might not have
been the best way to go.

1. pg_basebackup -R injects recovery.conf (on older versions) or
injects standby.signal and appends to postgresql.auto.conf (on newer
versions) by parsing the tar file sent by the server and editing it on
the fly. From the point of view of server-side compression, this is
not ideal, because if you want to make these kinds of changes when
server-side compression is in use, you'd have to decompress the stream
on the client side in order to figure out where in the stream you ought
to inject your changes. But having to do that is a major expense. If
the client instead told the server what to change when generating the
archive, and the server did it, this expense could be avoided. It
would have the additional advantage that the backup manifest could
reflect the effects of those changes; right now it doesn't, and
pg_verifybackup just knows to expect differences in those files.

2. According to the comments, some tar programs require two tar blocks
(i.e. 512-byte blocks) of zero bytes at the end of an archive. The
server does not generate these blocks of zero bytes, so it basically
creates a tar file that works fine with my copy of tar but might break
with somebody else's. Instead, the client appends 1024 zero bytes to
the end of every file it receives from the server. That is an odd way
of fixing this problem, and it makes things rather inflexible. If the
server sends you any kind of a file OTHER THAN a tar file with the
last 1024 zero bytes stripped off, then adding 1024 zero bytes will be
the wrong thing to do. It would be better if the server just generated
fully correct tar files (whatever we think that means) and the client
wrote out exactly what it got from the server. Then, we could have the
server generate cpio archives or zip files or gzip-compressed tar
files or lz4-compressed tar files or anything we like, and the client
wouldn't really need to care as long as it didn't need to extract
those archives. That seems a lot cleaner.
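
As a point of reference, a tar archive's end-of-archive marker is two 512-byte blocks of zero bytes, so the server-side fix amounts to something like this sketch (write_func is a stand-in for whatever output callback the sink provides, not the patch's actual API):

```c
#include <stddef.h>
#include <string.h>

#define TAR_BLOCK_SIZE 512

/* Append the two all-zero 512-byte blocks that terminate a tar archive.
 * If the server emitted this trailer itself, the client could write the
 * stream out verbatim, whatever the archive format. */
static void
tar_write_trailer(void (*write_func) (const char *data, size_t len))
{
	char		zeros[2 * TAR_BLOCK_SIZE];

	memset(zeros, 0, sizeof(zeros));
	write_func(zeros, sizeof(zeros));
}
```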

3. The way that progress reporting works relies on the server knowing
exactly how large the archive sent to the client is going to be.
Progress as reckoned by the client is equal to the number of archive
payload bytes the client has received. This works OK with a tar
because we know how big the tar file is going to be based on the size
of the input files we intend to send, but it's unsuitable for any sort
of compressed archive (tar.gz, zip, whatever) because the compression
ratio cannot be predicted in advance. It would be better if the server
sent the payload bytes (possibly compressed) interleaved with progress
indicators, so that the client could correctly indicate that, say, the
backup is 30% complete because 30GB of 100GB has been processed on the
server side, even though the amount of data actually received by the
client might be 25GB or 20GB or 10GB or whatever because it got
compressed before transmission.

4. A related consideration is that we might want to have an option to
do something with the backup other than send it to the client. For
example, it might be useful to have an option for pg_basebackup to
tell the server to write the backup files to some specified server
directory, or to, say, S3. There are security concerns there, and I'm
not proposing to do anything about this immediately, but it seems like
something we might eventually want to have. In such a case, we are not
going to send any payload to the client, but the client probably still
wants progress indicators, so the current system of coupling progress
to the number of bytes received by the client breaks down for that
reason also.

5. As things stand today, the client must know exactly how many
archives it should expect to receive from the server and what each one
is. It can do that, because it knows to expect one archive per
tablespace, and the archive must be an uncompressed tarfile, so there
is no ambiguity. But, if the server could send archives to other
places, or send other kinds of archives to the client, then this would
become more complex. There is no intrinsic reason why the logic on the
client side can't simply be made more complicated in order to cope,
but it doesn't seem like great design, because then every time you
enhance the server, you've also got to enhance the client, and that
limits cross-version compatibility, and also seems more fragile. I
would rather that the server advertise the number of archives and the
names of each archive to the client explicitly, allowing the client to
be dumb unless it needs to post-process (e.g. extract) those archives.

Putting all of the above together, what I propose - but have not yet
tried to implement - is a new COPY sub-protocol for taking base
backups. Instead of sending a COPY stream per archive, the server
would send a single COPY stream where the first byte of each message
is a type indicator, like we do with the replication sub-protocol
today. For example, if the first byte is 'a' that could indicate that
we're beginning a new archive and the rest of the message would
indicate the archive name and perhaps some flags or options. If the
first byte is 'p' that could indicate that we're sending archive
payload, perhaps with the first four bytes of the message being
progress, i.e. the number of newly-processed bytes on the server side
prior to any compression, and the remaining bytes being payload. On
receipt of such a message, the client would increment the progress
indicator by the value indicated in those first four bytes, and then
process the remaining bytes by writing them to a file or whatever
behavior the user selected via -Fp, -Ft, -Z, etc. To be clear, I'm not
saying that this specific thing is the right thing, just something of
this sort. The server would need to continue supporting the current
multi-copy protocol for compatibility with older pg_basebackup
versions, and pg_basebackup would need to continue to support it for
compatibility with older server versions, but we could use the new
approach going forward. (Or, we could break compatibility, but that
would probably be unpopular and seems unnecessary and even risky to me
at this point.)
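
For illustration only, a client-side handler for the hypothetical 'p' message might look like the sketch below; the layout (a 4-byte network-order progress delta followed by payload bytes) is the one floated above, not a committed wire format:

```c
#include <stddef.h>
#include <stdint.h>

/* Client-side bookkeeping for the hypothetical backup sub-protocol. */
typedef struct
{
	uint64_t	progress;		/* bytes processed on the server side */
	uint64_t	payload_bytes;	/* (possibly compressed) bytes received */
} backup_state;

/* Handle one 'p' message body: first four bytes are the server-side
 * progress delta, the rest is archive payload.  Returns the number of
 * payload bytes the caller should write out. */
static size_t
handle_payload_message(backup_state *state, const uint8_t *msg, size_t len)
{
	uint32_t	delta;

	if (len < 4)
		return 0;				/* malformed; real code would report an error */

	/* network byte order, as elsewhere in the PostgreSQL wire protocol */
	delta = ((uint32_t) msg[0] << 24) | ((uint32_t) msg[1] << 16) |
		((uint32_t) msg[2] << 8) | (uint32_t) msg[3];

	state->progress += delta;
	state->payload_bytes += len - 4;
	return len - 4;
}
```

Decoupling the progress counter from the payload byte count is what lets the client report "30GB of 100GB processed" even when compression makes the bytes on the wire much smaller.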

The ideas in the previous paragraph would address #3-#5 directly, but
they also indirectly address #2 because while we're switching
protocols we could easily move the padding with zero bytes to the
server side, and I think we should. #1 is a bit of a separate
consideration. To tackle #1 along the lines proposed above, the client
needs a way to send the recovery.conf contents to the server so that
the server can inject them into the tar file. It's not exactly clear
to me what the best way of permitting this is, and maybe there's a
totally different approach that would be altogether better. One thing
to consider is that we might well want the client to be able to send
*multiple* chunks of data to the server at the start of a backup. For
instance, suppose we want to support incremental backups. I think the
right approach is for the client to send the backup_manifest file from
the previous full backup to the server. What exactly the server does
with it afterward depends on your preferred approach, but the
necessary information is there. Maybe incremental backup is based on
comparing cryptographic checksums, so the server looks at all the
files and sends to the client those where the checksum (hopefully
SHA-something!) does not match. I wouldn't favor this approach myself,
but I know some people like it. Or maybe it's based on finding blocks
modified since the LSN of the previous backup; the manifest has enough
information for that to work, too. In such an approach, there can be
altogether new files with old LSNs, because files can be flat-copied
without changing block LSNs, so it's important to have the complete
list of files from the previous backup, and that too is in the
manifest. There are even timestamps for the bold among you. Anyway, my
point is to advocate for a design where the client says (1) I want a
backup with these options and then (2) here are 0, 1, or >1 files
(recovery parameters and/or backup manifest and/or other things) in
support of that and then the server hands back a stream of archives
which the client may or may not choose to post-process.

It's tempting to think about solving this problem by appealing to
CopyBoth, but I think that might be the wrong idea. The reason we use
CopyBoth for the replication subprotocol is because there's periodic
messages flowing in both directions that are only loosely coupled to
each other. Apart from reading frequently enough to avoid a deadlock
because both sides have full write buffers, each end of the connection
can kind of do whatever it wants. But for the kinds of use cases I'm
talking about here, that's not so. First the client talks and the
server acknowledges, then the reverse. That doesn't mean we couldn't
use CopyBoth, but maybe a CopyIn followed by a CopyOut would be more
straightforward. I am not sure of the details here and am happy to
hear the ideas of others.

One final thought is that the options framework for pg_basebackup is a
little unfortunate. As of today, what the client receives, always, is
a series of tar files. If you say -Fp, it doesn't change the backup
format; it just extracts the tar files. If you say -Ft, it doesn't. If
you say -Ft but also -Z, it compresses the tar files. Thinking just
about server-side compression and ignoring for the moment more remote
features like alternate archive formats (e.g. zip) or things like
storing the backup to an alternate location rather than returning it
to the client, you probably want the client to be able to specify at
least (1) server-side compression (perhaps with one of several
algorithms) and the client just writes the results, (2) server-side
compression (still with a choice of algorithm) and the client
decompresses but does not extract, (3) server-side compression (still
with a choice of algorithms) and the client decompresses and extracts,
(4) client-side compression (with a choice of algorithms), and (5)
client-side extraction. You might also want (6) server-side
compression (with a choice of algorithms) and client-side decompresses
and then re-compresses with a different algorithm (e.g. lz4 on the
server to save bandwidth at moderate CPU cost into parallel bzip2 on
the client for minimum archival storage). Or, as also discussed
upthread, you might want (7) server-side compression of each file
individually, so that you get a seekable archive of individually
compressed files (e.g. to support fast delta restore). I think that
with these refactoring patches - and the wire protocol redesign
mentioned above - it is very reasonable to offer as many of these
options as we believe users will find useful, but it is not very clear
how to extend the current command-line option framework to support
them. It probably would have been better if pg_basebackup, instead of
having -Fp and -Ft, just had an --extract option that you could either
specify or omit, because that would not have presumed anything about
the archive format, but the existing structure is well-baked at this
point.

Thanks,

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachments:
v2-0001-Flexible-options-for-BASE_BACKUP-and-CREATE_REPLI.patch (27K)
v2-0003-Introduce-bbsink-abstraction.patch (12K)
v2-0002-Recast-_tarWriteDirectory-as-convert_link_to_dire.patch (4K)
v2-0005-Convert-throttling-related-code-to-a-bbsink.patch (17K)
v2-0004-Convert-libpq-related-code-to-a-bbsink.patch (43K)
v2-0006-Convert-progress-reporting-code-to-a-bbsink.patch (25K)
v2-0009-WIP-Convert-backup-manifest-generation-to-a-bbarc.patch (21K)
v2-0008-Create-and-use-bbarchiver-implementations-for-tar.patch (43K)
v2-0007-Introduce-bbarchiver-abstraction.patch (15K)
v2-0010-WIP-Introduce-bbstreamer-abstration-and-adapt-pg_.patch (99K)
v2-0011-POC-Embarrassingly-bad-server-side-compression-pa.patch (11K)

Re: refactoring basebackup.c

Andres Freund
Hi,

On 2020-07-29 11:31:26 -0400, Robert Haas wrote:
> Here's an updated patch set. This is now rebased over master and
> includes as 0001 the patch I posted separately at
> http://postgr.es/m/CA+TgmobAczXDRO_Gr2euo_TxgzaH1JxbNxvFx=HYvBinefNH8Q@...
> but drops some other patches that were committed meanwhile. 0002-0009
> of this series are basically the same as 0004-0011 from the previous
> series, except for rebasing and fixing a bug I discovered in what's
> now 0006. 0010 does a refactoring of pg_basebackup along similar lines
> to the server-side refactoring from patches earlier in the series.

Have you tested whether this still works against older servers? Or do
you think we should not have that as a goal?


> 1. pg_basebackup -R injects recovery.conf (on older versions) or
> injects standby.signal and appends to postgresql.auto.conf (on newer
> versions) by parsing the tar file sent by the server and editing it on
> the fly. From the point of view of server-side compression, this is
> not ideal, because if you want to make these kinds of changes when
> server-side compression is in use, you'd have to decompress the stream
> on the client side in order to figure out where in the stream you ought
> to inject your changes. But having to do that is a major expense. If
> the client instead told the server what to change when generating the
> archive, and the server did it, this expense could be avoided. It
> would have the additional advantage that the backup manifest could
> reflect the effects of those changes; right now it doesn't, and
> pg_verifybackup just knows to expect differences in those files.

Hm. I don't think I terribly like the idea of things like -R having to
be processed server side. That'll be awfully annoying to keep working
across versions, for one. But perhaps the config file should just not be
in the main tar file going forward?

I think we should eventually be able to use one archive for multiple
purposes, e.g. to set up a standby as well as using it for a base
backup. Or multiple standbys with different tablespace remappings.


> 2. According to the comments, some tar programs require two tar blocks
> (i.e. 512-byte blocks) of zero bytes at the end of an archive. The
> server does not generate these blocks of zero bytes, so it basically
> creates a tar file that works fine with my copy of tar but might break
> with somebody else's. Instead, the client appends 1024 zero bytes to
> the end of every file it receives from the server. That is an odd way
> of fixing this problem, and it makes things rather inflexible. If the
> server sends you any kind of a file OTHER THAN a tar file with the
> last 1024 zero bytes stripped off, then adding 1024 zero bytes will be
> the wrong thing to do. It would be better if the server just generated
> fully correct tar files (whatever we think that means) and the client
> wrote out exactly what it got from the server. Then, we could have the
> server generate cpio archives or zip files or gzip-compressed tar
> files or lz4-compressed tar files or anything we like, and the client
> wouldn't really need to care as long as it didn't need to extract
> those archives. That seems a lot cleaner.
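
A "fully correct" tar stream in the sense described above pads each member to a 512-byte boundary and ends with two all-zero 512-byte blocks. A minimal sketch of what the server-side fix might look like (function names invented for illustration, not taken from the patch set):

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define TAR_BLOCK 512

/* How many zero bytes are needed to pad a member's payload out to a
 * 512-byte tar block boundary. */
static size_t
tar_padding_needed(size_t payload_len)
{
    size_t      partial = payload_len % TAR_BLOCK;

    return partial ? TAR_BLOCK - partial : 0;
}

/* Emit the two all-zero 512-byte blocks that mark end-of-archive, so the
 * stream is a complete tar file and the client can write it out verbatim
 * instead of appending 1024 zero bytes itself. */
static void
tar_finish_archive(FILE *out)
{
    char        zeros[2 * TAR_BLOCK];

    memset(zeros, 0, sizeof(zeros));
    fwrite(zeros, 1, sizeof(zeros), out);
}
```

With the trailer generated on the server, the client's behavior degenerates to "write exactly what was received", regardless of archive format.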

Yea.


> 5. As things stand today, the client must know exactly how many
> archives it should expect to receive from the server and what each one
> is. It can do that, because it knows to expect one archive per
> tablespace, and the archive must be an uncompressed tarfile, so there
> is no ambiguity. But, if the server could send archives to other
> places, or send other kinds of archives to the client, then this would
> become more complex. There is no intrinsic reason why the logic on the
> client side can't simply be made more complicated in order to cope,
> but it doesn't seem like great design, because then every time you
> enhance the server, you've also got to enhance the client, and that
> limits cross-version compatibility, and also seems more fragile. I
> would rather that the server advertise the number of archives and the
> names of each archive to the client explicitly, allowing the client to
> be dumb unless it needs to post-process (e.g. extract) those archives.

ISTM that that can help to some degree, but things like tablespace
remapping etc IMO aren't best done server side, so I think the client
will continue to need to know about the contents to a significant
degree?


> Putting all of the above together, what I propose - but have not yet
> tried to implement - is a new COPY sub-protocol for taking base
> backups. Instead of sending a COPY stream per archive, the server
> would send a single COPY stream where the first byte of each message
> is a type indicator, like we do with the replication sub-protocol
> today. For example, if the first byte is 'a' that could indicate that
> we're beginning a new archive and the rest of the message would
> indicate the archive name and perhaps some flags or options. If the
> first byte is 'p' that could indicate that we're sending archive
> payload, perhaps with the first four bytes of the message being
> progress, i.e. the number of newly-processed bytes on the server side
> prior to any compression, and the remaining bytes being payload. On
> receipt of such a message, the client would increment the progress
> indicator by the value indicated in those first four bytes, and then
> process the remaining bytes by writing them to a file or whatever
> behavior the user selected via -Fp, -Ft, -Z, etc.
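
Client-side handling of the 'p' message described in that proposal might look roughly like this (a sketch under the stated framing; all names are invented, not from the patch set):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Running total of pre-compression bytes processed by the server, i.e.
 * the client's progress indicator. */
static uint64_t progress_bytes = 0;

/* Parse a 'p' message body: the first four bytes are the number of
 * newly-processed bytes on the server side prior to any compression;
 * the remaining bytes are archive payload to be handed to whatever sink
 * the -Fp/-Ft/-Z options selected.  Returns the progress delta. */
static uint32_t
handle_payload_msg(const char *msg, size_t msglen,
                   const char **payload, size_t *payload_len)
{
    uint32_t    delta;

    assert(msglen >= sizeof(uint32_t));
    memcpy(&delta, msg, sizeof(uint32_t));
    progress_bytes += delta;            /* bump the progress indicator */
    *payload = msg + sizeof(uint32_t);  /* rest of message is payload */
    *payload_len = msglen - sizeof(uint32_t);
    return delta;
}
```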

Wonder if there's a way to get this to be less stateful. It seems a bit
ugly that the client has to remember which 'a' a given 'p' belongs to.
Perhaps we could actually make 'a' include an identifier for each
archive, and then 'p' would append to a specific archive? That would
then also allow for concurrent processing of those archives on the
server side.

I'd personally rather have a separate message type for progress and
payload. Seems odd to have to send payload messages with 0 payload just
because we want to update progress (in case of uploading to
e.g. S3). And I think it'd be nice if we could have a more extensible
progress measurement approach than a fixed length prefix. E.g. it might
be nice to allow it to report both the overall progress, as well as a
per archive progress. Or we might want to send progress when uploading
to S3, even when not having pre-calculated the total size of the data
directory.
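
The identifier-tagged variant sketched in the first paragraph — where 'p' names its archive explicitly instead of implicitly referring to the most recent 'a' — could be framed like this (a hedged sketch; the struct and field names are invented):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <stddef.h>

/* Each message carries a type byte followed by a 16-bit archive id, so a
 * 'p' message is self-describing and archives could in principle be
 * generated or processed concurrently. */
typedef struct BackupMsg
{
    char        type;       /* 'a' = new archive, 'p' = payload */
    uint16_t    archive_id; /* which archive this message refers to */
    const char *data;       /* archive name for 'a', raw bytes for 'p' */
    size_t      len;
} BackupMsg;

/* Decode a wire message into a BackupMsg; returns -1 if too short.
 * memcpy is used for the id to avoid unaligned reads. */
static int
parse_backup_msg(const char *buf, size_t buflen, BackupMsg *msg)
{
    if (buflen < 1 + sizeof(uint16_t))
        return -1;
    msg->type = buf[0];
    memcpy(&msg->archive_id, buf + 1, sizeof(uint16_t));
    msg->data = buf + 1 + sizeof(uint16_t);
    msg->len = buflen - 1 - sizeof(uint16_t);
    return 0;
}
```

A separate message type for progress (say 'P', likewise tagged with an archive id) would drop into the same framing without zero-length payload messages.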


Greetings,

Andres Freund



Re: refactoring basebackup.c

Robert Haas
On Fri, Jul 31, 2020 at 12:49 PM Andres Freund <[hidden email]> wrote:
> Have you tested whether this still works against older servers? Or do
> you think we should not have that as a goal?

I haven't tested that recently but I intended to keep it working. I'll
make sure to nail that down before I get to the point of committing
anything, but I don't expect big problems. It's kind of annoying to
have so much backward compatibility stuff here but I think ripping any
of that out should wait for another time.

> Hm. I don't think I terribly like the idea of things like -R having to
> be processed server side. That'll be awfully annoying to keep working
> across versions, for one. But perhaps the config file should just not be
> in the main tar file going forward?

That'd be a user-visible change, though, whereas what I'm proposing
isn't. Instead of directly injecting stuff, the client can just send
it to the server and have the server inject it, provided the server is
new enough. Cross-version issues don't seem to be any worse than now.
That being said, I don't love it, either. We could just suggest to
people that using -R together with server compression is not a great
combination.
> I think we should eventually be able to use one archive for multiple
> purposes, e.g. to set up a standby as well as using it for a base
> backup. Or multiple standbys with different tablespace remappings.

I don't think I understand your point here.

> ISTM that that can help to some degree, but things like tablespace
> remapping etc IMO aren't best done server side, so I think the client
> will continue to need to know about the contents to a significant
> degree?

If I'm not mistaken, those mappings are only applied with -Fp i.e. if
we're extracting. And it's no problem to jigger things in that case;
we can only do this if we understand the archive in the first place.
The problem is when you have to decompress and recompress to jigger
things.

> Wonder if there's a way to get this to be less stateful. It seems a bit
> ugly that the client has to remember which 'a' a given 'p' belongs to.
> Perhaps we could actually make 'a' include an identifier for each
> archive, and then 'p' would append to a specific archive? That would
> then also allow for concurrent processing of those archives on the
> server side.

...says the guy working on asynchronous I/O. I don't know, it's not a
bad idea, but I think we'd have to change a LOT of code to make it
actually do something useful. I feel like this could be added as a
later extension of the protocol, rather than being something that we
necessarily need to do now.

> I'd personally rather have a separate message type for progress and
> payload. Seems odd to have to send payload messages with 0 payload just
> because we want to update progress (in case of uploading to
> e.g. S3). And I think it'd be nice if we could have a more extensible
> progress measurement approach than a fixed length prefix. E.g. it might
> be nice to allow it to report both the overall progress, as well as a
> per archive progress. Or we might want to send progress when uploading
> to S3, even when not having pre-calculated the total size of the data
> directory.

I don't mind a separate message type here, but if you want merging of
short messages with adjacent longer messages to generate a minimal
number of system calls, that might have some implications for the
other thread where we're talking about how to avoid extra memory
copies when generating protocol messages. If you don't mind them going
out as separate network packets, then it doesn't matter.
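
For what it's worth, scatter/gather I/O is one way a short progress message and an adjacent payload message could share a single system call without an extra memory copy; a rough illustration (not from the patch set):

```c
#include <assert.h>
#include <string.h>
#include <sys/uio.h>
#include <unistd.h>

/* Send a small progress message and a larger payload message with one
 * writev() call.  Real code would loop on short writes; this sketch just
 * shows the two buffers going out together. */
static ssize_t
send_progress_and_payload(int sock,
                          const char *progress_msg, size_t progress_len,
                          const char *payload_msg, size_t payload_len)
{
    struct iovec iov[2];

    iov[0].iov_base = (void *) progress_msg;
    iov[0].iov_len = progress_len;
    iov[1].iov_base = (void *) payload_msg;
    iov[1].iov_len = payload_len;

    return writev(sock, iov, 2);    /* one syscall, two messages */
}
```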

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company