WIP/PoC for parallel backup


WIP/PoC for parallel backup

Asif Rehman
Hi Hackers,

I have been looking into adding a parallel backup feature to pg_basebackup. Currently, pg_basebackup sends a BASE_BACKUP command to take a full backup; the server scans PGDATA and sends the files to pg_basebackup. In general, the server takes the following steps on a BASE_BACKUP command:

- do pg_start_backup
- scan PGDATA, create and send a header containing tablespace information
- send each tablespace to pg_basebackup
- do pg_stop_backup

All these steps are executed sequentially by a single process. The idea I am working on is to separate these steps into multiple commands in the replication grammar, and to add worker processes to pg_basebackup so that they can copy the contents of PGDATA in parallel.

The command line interface syntax would be like:
pg_basebackup --jobs=WORKERS
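For example, an invocation might look like this (host, target directory, and job count here are only placeholders):

pg_basebackup -h primary.example.com -D /backups/base --jobs=4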


Replication commands:

- BASE_BACKUP [PARALLEL] - returns a list of files in PGDATA.
If the PARALLEL option is given, it will only do pg_start_backup, scan PGDATA, and send a list of file names.

- SEND_FILES_CONTENTS (file1, file2, ...) - returns the files in the given list.
pg_basebackup will send a list of filenames with this command. Each worker will issue this command and receive the requested files.

- STOP_BACKUP
When all workers finish, pg_basebackup will send the STOP_BACKUP command.

pg_basebackup can start by sending the "BASE_BACKUP PARALLEL" command and getting a list of filenames from the server in response. It should then divide this list according to the --jobs parameter (this division can be based on file sizes). Each worker process will issue a SEND_FILES_CONTENTS (file1, file2, ...) command, and in response the server will send the listed files back to the requesting worker process.

Once all the files are copied, pg_basebackup will send the STOP_BACKUP command. A similar idea was discussed by Robert on the incremental backup thread a while ago; this follows the same approach, except that instead of separate START_BACKUP and SEND_FILE_LIST commands, I have combined them into BASE_BACKUP PARALLEL.
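To make the intended flow concrete, here is a rough client-side sketch (illustration only: the quoting inside SEND_FILES_CONTENTS, the round-robin split, and the assumption that the first result column is the file name are placeholders, not code from the attached patch):

#include <stdio.h>
#include "libpq-fe.h"
#include "pqexpbuffer.h"

/* Illustrative only: start the backup, split the file list round-robin,
 * and hand each worker connection its share via SEND_FILES_CONTENTS. */
static void
parallel_backup_sketch(PGconn *coordinator, PGconn **workers, int nworkers)
{
    PGresult   *res;
    int         nfiles;
    int         i;

    /* BASE_BACKUP PARALLEL does pg_start_backup and returns the file list. */
    res = PQexec(coordinator, "BASE_BACKUP PARALLEL");
    if (PQresultStatus(res) != PGRES_TUPLES_OK)
    {
        fprintf(stderr, "BASE_BACKUP PARALLEL failed: %s",
                PQerrorMessage(coordinator));
        PQclear(res);
        return;
    }
    nfiles = PQntuples(res);

    /* Give every worker connection a subset of the file names. */
    for (i = 0; i < nworkers; i++)
    {
        PQExpBuffer cmd = createPQExpBuffer();
        int         j;

        appendPQExpBufferStr(cmd, "SEND_FILES_CONTENTS (");
        for (j = i; j < nfiles; j += nworkers)
            appendPQExpBuffer(cmd, "%s'%s'", (j == i) ? "" : ", ",
                              PQgetvalue(res, j, 0));
        appendPQExpBufferChar(cmd, ')');

        /* each worker then receives and writes out its files */
        PQsendQuery(workers[i], cmd->data);
        destroyPQExpBuffer(cmd);
    }
    PQclear(res);

    /* ... wait for every worker to finish receiving its files ... */

    PQclear(PQexec(coordinator, "STOP_BACKUP"));
}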

I have done a basic proof of concept (POC), which is also attached. I would appreciate some input on this. So far, I am simply dividing the list equally and assigning the pieces to worker processes. I intend to fine-tune this by taking file sizes into consideration. Further, to add tar format support, I am considering having each worker process handle all files belonging to one tablespace in its list (i.e. create and copy a tar file) before it processes the next tablespace. As a result, this will create tar files that are disjoint with respect to tablespace data. For example:
Say tablespace t1 has 20 files, tablespace t2 has 10, and we have 5 worker processes. Ignoring all other factors for the sake of this example, each worker process will get a group of 4 files from t1 and 2 files from t2. Each process will create 2 tar files: one for t1 containing 4 files and another for t2 containing 2 files.

Regards,
Asif

Attachment: 0001-Initial-POC-on-parallel-backup.patch (52K)

Re: WIP/PoC for parallel backup

Asim R P
Hi Asif

Interesting proposal.  The bulk of the work in a backup is transferring files from the source data directory to the destination.  Your patch breaks this task down into multiple sets of files and transfers each set in parallel.  This seems correct; however, your patch is also creating a new process to handle each set.  Is that necessary?  I think we should try to achieve this using multiple asynchronous libpq connections from a single basebackup process, that is, using the PQconnectStartParams() interface instead of PQconnectdbParams(), which is currently used by basebackup.  On the server side, it may still result in multiple backend processes per connection, and an attempt should be made to avoid that as well, but it seems complicated.
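A minimal sketch of the non-blocking connection setup I have in mind (connection parameters are placeholders, and a real implementation would wait on PQsocket() with select()/poll() instead of spinning on PQconnectPoll()):

#include <stdio.h>
#include "libpq-fe.h"

#define NCONN 4

int
main(void)
{
    const char *const keywords[] = {"host", "replication", NULL};
    const char *const values[]   = {"primaryhost", "true", NULL};
    PGconn     *conns[NCONN];
    int         i;

    /* Kick off all connection attempts; these calls do not block. */
    for (i = 0; i < NCONN; i++)
        conns[i] = PQconnectStartParams(keywords, values, 0);

    /* Drive each connection to completion with PQconnectPoll(). */
    for (i = 0; i < NCONN; i++)
    {
        PostgresPollingStatusType st = PGRES_POLLING_WRITING;

        if (conns[i] == NULL)
            continue;

        while (st != PGRES_POLLING_OK && st != PGRES_POLLING_FAILED)
            st = PQconnectPoll(conns[i]);   /* should wait on PQsocket() here */

        if (st == PGRES_POLLING_FAILED)
            fprintf(stderr, "connection %d failed: %s",
                    i, PQerrorMessage(conns[i]));
    }

    /* ... issue one backup-related command per connection here ... */

    for (i = 0; i < NCONN; i++)
        if (conns[i] != NULL)
            PQfinish(conns[i]);
    return 0;
}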

What do you think?

Asim

Re: WIP/PoC for parallel backup

Ibrar Ahmed-4


On Fri, Aug 23, 2019 at 3:18 PM Asim R P <[hidden email]> wrote:
Hi Asif

Interesting proposal.  Bulk of the work in a backup is transferring files from source data directory to destination.  Your patch is breaking this task down in multiple sets of files and transferring each set in parallel.  This seems correct, however, your patch is also creating a new process to handle each set.  Is that necessary?  I think we should try to achieve this using multiple asynchronous libpq connections from a single basebackup process.  That is to use PQconnectStartParams() interface instead of PQconnectdbParams(), wich is currently used by basebackup.  On the server side, it may still result in multiple backend processes per connection, and an attempt should be made to avoid that as well, but it seems complicated.

What do you think?

The main question is what we really want to solve here. What is the
bottleneck, and which hardware do we want to saturate? I ask because
there are multiple pieces of hardware involved in taking a backup
(network/CPU/disk). If we have already saturated the disk, then there is
no need to add parallelism because we will be blocked on disk I/O anyway.
I implemented parallel backup in a separate application and got wonderful
results. I just skimmed through the code and have some reservations that
creating a separate process only for copying data is overkill. There are
two options: one is non-blocking calls, or you can have some worker
threads. But before doing that we need to identify the pg_basebackup
bottleneck; after that, we can see what the best way to solve it is. Some
numbers may help to understand the actual benefit.


--
Ibrar Ahmed

Re: WIP/PoC for parallel backup

Asif Rehman
In reply to this post by Asim R P

On Fri, Aug 23, 2019 at 3:18 PM Asim R P <[hidden email]> wrote:
Hi Asif

Interesting proposal.  Bulk of the work in a backup is transferring files from source data directory to destination.  Your patch is breaking this task down in multiple sets of files and transferring each set in parallel.  This seems correct, however, your patch is also creating a new process to handle each set.  Is that necessary?  I think we should try to achieve this using multiple asynchronous libpq connections from a single basebackup process.  That is to use PQconnectStartParams() interface instead of PQconnectdbParams(), wich is currently used by basebackup.  On the server side, it may still result in multiple backend processes per connection, and an attempt should be made to avoid that as well, but it seems complicated.

What do you think?

Asim

Thanks Asim for the feedback. This is a good suggestion. The main idea I wanted to discuss is a design where we can open multiple backend connections to get the data instead of a single connection.
On the client side we can take multiple approaches: one is to use asynchronous APIs (as suggested by you), and another is to decide between multi-process and multi-threaded workers. The main point is that we can extract a lot of performance benefit by using multiple connections, and I built this POC to float the idea of how parallel backup can work, since the core logic of getting the files over multiple connections will remain the same whether we use asynchronous, multi-process, or multi-threaded clients.

I am going to address the division of files so that they are distributed evenly among multiple workers based on file sizes. That would let us get some concrete numbers and also allow us to gauge the benefits of the async versus multi-process/thread approaches on the client side.
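As a sketch of the kind of size-based division I have in mind (purely illustrative, not code from the POC): sort the files by size and always hand the next file to the worker with the smallest total assigned so far.

#include <stdint.h>
#include <stdlib.h>

typedef struct BackupFile
{
    const char *name;
    int64_t     size;
} BackupFile;

static int
cmp_size_desc(const void *a, const void *b)
{
    const BackupFile *fa = (const BackupFile *) a;
    const BackupFile *fb = (const BackupFile *) b;

    if (fa->size == fb->size)
        return 0;
    return (fa->size < fb->size) ? 1 : -1;
}

/*
 * Assign files[] to nworkers workers so that per-worker byte totals stay
 * close.  assign[i] receives the worker index for the i'th (sorted) file;
 * totals[] must have room for nworkers entries.
 */
static void
divide_by_size(BackupFile *files, int nfiles,
               int *assign, int64_t *totals, int nworkers)
{
    int         i;
    int         w;

    qsort(files, nfiles, sizeof(BackupFile), cmp_size_desc);

    for (w = 0; w < nworkers; w++)
        totals[w] = 0;

    for (i = 0; i < nfiles; i++)
    {
        int         best = 0;

        /* pick the worker with the fewest bytes assigned so far */
        for (w = 1; w < nworkers; w++)
            if (totals[w] < totals[best])
                best = w;

        assign[i] = best;
        totals[best] += files[i].size;
    }
}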

Regards,
Asif
 

Re: WIP/PoC for parallel backup

Stephen Frost
Greetings,

* Asif Rehman ([hidden email]) wrote:

> On Fri, Aug 23, 2019 at 3:18 PM Asim R P <[hidden email]> wrote:
> > Interesting proposal.  Bulk of the work in a backup is transferring files
> > from source data directory to destination.  Your patch is breaking this
> > task down in multiple sets of files and transferring each set in parallel.
> > This seems correct, however, your patch is also creating a new process to
> > handle each set.  Is that necessary?  I think we should try to achieve this
> > using multiple asynchronous libpq connections from a single basebackup
> > process.  That is to use PQconnectStartParams() interface instead of
> > PQconnectdbParams(), wich is currently used by basebackup.  On the server
> > side, it may still result in multiple backend processes per connection, and
> > an attempt should be made to avoid that as well, but it seems complicated.
>
> Thanks Asim for the feedback. This is a good suggestion. The main idea I
> wanted to discuss is the design where we can open multiple backend
> connections to get the data instead of a single connection.
> On the client side we can have multiple approaches, One is to use
> asynchronous APIs ( as suggested by you) and other could be to decide
> between multi-process and multi-thread. The main point was we can extract
> lot of performance benefit by using the multiple connections and I built
> this POC to float the idea of how the parallel backup can work, since the
> core logic of getting the files using multiple connections will remain the
> same, wether we use asynchronous, multi-process or multi-threaded.
>
> I am going to address the division of files to be distributed evenly among
> multiple workers based on file sizes, that would allow to get some concrete
> numbers as well as it will also us to gauge some benefits between async and
> multiprocess/thread approach on client side.
I would expect you to quickly want to support compression on the server
side, before the data is sent across the network, and possibly
encryption, and so it'd likely make sense to just have independent
processes and connections through which to do that.

Thanks,

Stephen


Re: WIP/PoC for parallel backup

Ibrar Ahmed-4


On Fri, Aug 23, 2019 at 10:26 PM Stephen Frost <[hidden email]> wrote:
Greetings,

* Asif Rehman ([hidden email]) wrote:
> On Fri, Aug 23, 2019 at 3:18 PM Asim R P <[hidden email]> wrote:
> > Interesting proposal.  Bulk of the work in a backup is transferring files
> > from source data directory to destination.  Your patch is breaking this
> > task down in multiple sets of files and transferring each set in parallel.
> > This seems correct, however, your patch is also creating a new process to
> > handle each set.  Is that necessary?  I think we should try to achieve this
> > using multiple asynchronous libpq connections from a single basebackup
> > process.  That is to use PQconnectStartParams() interface instead of
> > PQconnectdbParams(), wich is currently used by basebackup.  On the server
> > side, it may still result in multiple backend processes per connection, and
> > an attempt should be made to avoid that as well, but it seems complicated.
>
> Thanks Asim for the feedback. This is a good suggestion. The main idea I
> wanted to discuss is the design where we can open multiple backend
> connections to get the data instead of a single connection.
> On the client side we can have multiple approaches, One is to use
> asynchronous APIs ( as suggested by you) and other could be to decide
> between multi-process and multi-thread. The main point was we can extract
> lot of performance benefit by using the multiple connections and I built
> this POC to float the idea of how the parallel backup can work, since the
> core logic of getting the files using multiple connections will remain the
> same, wether we use asynchronous, multi-process or multi-threaded.
>
> I am going to address the division of files to be distributed evenly among
> multiple workers based on file sizes, that would allow to get some concrete
> numbers as well as it will also us to gauge some benefits between async and
> multiprocess/thread approach on client side.

I would expect you to quickly want to support compression on the server
side, before the data is sent across the network, and possibly
encryption, and so it'd likely make sense to just have independent
processes and connections through which to do that.

+1 for compression and encryption, but I think parallelism will give us
a benefit with and without compression.

Thanks,

Stephen


--
Ibrar Ahmed

Re: WIP/PoC for parallel backup

Ahsan Hadi-2
In reply to this post by Stephen Frost


On Fri, 23 Aug 2019 at 10:26 PM, Stephen Frost <[hidden email]> wrote:
Greetings,

* Asif Rehman ([hidden email]) wrote:
> On Fri, Aug 23, 2019 at 3:18 PM Asim R P <[hidden email]> wrote:
> > Interesting proposal.  Bulk of the work in a backup is transferring files
> > from source data directory to destination.  Your patch is breaking this
> > task down in multiple sets of files and transferring each set in parallel.
> > This seems correct, however, your patch is also creating a new process to
> > handle each set.  Is that necessary?  I think we should try to achieve this
> > using multiple asynchronous libpq connections from a single basebackup
> > process.  That is to use PQconnectStartParams() interface instead of
> > PQconnectdbParams(), wich is currently used by basebackup.  On the server
> > side, it may still result in multiple backend processes per connection, and
> > an attempt should be made to avoid that as well, but it seems complicated.
>
> Thanks Asim for the feedback. This is a good suggestion. The main idea I
> wanted to discuss is the design where we can open multiple backend
> connections to get the data instead of a single connection.
> On the client side we can have multiple approaches, One is to use
> asynchronous APIs ( as suggested by you) and other could be to decide
> between multi-process and multi-thread. The main point was we can extract
> lot of performance benefit by using the multiple connections and I built
> this POC to float the idea of how the parallel backup can work, since the
> core logic of getting the files using multiple connections will remain the
> same, wether we use asynchronous, multi-process or multi-threaded.
>
> I am going to address the division of files to be distributed evenly among
> multiple workers based on file sizes, that would allow to get some concrete
> numbers as well as it will also us to gauge some benefits between async and
> multiprocess/thread approach on client side.

I would expect you to quickly want to support compression on the server
side, before the data is sent across the network, and possibly
encryption, and so it'd likely make sense to just have independent
processes and connections through which to do that.

It would be interesting to see the benefits of compression (before the data is transferred over the network) on top of parallelism, since there is also some overhead associated with performing the compression. I agree with your suggestion of adding parallelism first and then trying compression before the data is sent across the network.



Thanks,

Stephen

Re: WIP/PoC for parallel backup

Stephen Frost
Greetings,

* Ahsan Hadi ([hidden email]) wrote:

> On Fri, 23 Aug 2019 at 10:26 PM, Stephen Frost <[hidden email]> wrote:
> > I would expect you to quickly want to support compression on the server
> > side, before the data is sent across the network, and possibly
> > encryption, and so it'd likely make sense to just have independent
> > processes and connections through which to do that.
>
> It would be interesting to see the benefits of compression (before the data
> is transferred over the network) on top of parallelism. Since there is also
> some overhead associated with performing the compression. I agree with your
> suggestion of trying to add parallelism first and then try compression
> before the data is sent across the network.
You're welcome to take a look at pgbackrest for insight into
compression-before-transfer, how best to split up the files and order
them, encryption, et al.  We've put quite a bit of effort into figuring
all of that out.

Thanks!

Stephen


Re: WIP/PoC for parallel backup

Robert Haas
In reply to this post by Asif Rehman
On Wed, Aug 21, 2019 at 9:53 AM Asif Rehman <[hidden email]> wrote:
> - BASE_BACKUP [PARALLEL] - returns a list of files in PGDATA
> If the parallel option is there, then it will only do pg_start_backup, scans PGDATA and sends a list of file names.

So IIUC, this would mean that BASE_BACKUP without PARALLEL returns
tarfiles, and BASE_BACKUP with PARALLEL returns a result set with a
list of file names. I don't think that's a good approach. It's too
confusing to have one replication command that returns totally
different things depending on whether some option is given.

> - SEND_FILES_CONTENTS (file1, file2,...) - returns the files in given list.
> pg_basebackup will then send back a list of filenames in this command. This commands will be send by each worker and that worker will be getting the said files.

Seems reasonable, but I think you should just pass one file name and
use the command multiple times, once per file.

> - STOP_BACKUP
> when all workers finish then, pg_basebackup will send STOP_BACKUP command.

This also seems reasonable, but surely the matching command should
then be called START_BACKUP, not BASEBACKUP PARALLEL.

> I have done a basic proof of concenpt (POC), which is also attached. I would appreciate some input on this. So far, I am simply dividing the list equally and assigning them to worker processes. I intend to fine tune this by taking into consideration file sizes. Further to add tar format support, I am considering that each worker process, processes all files belonging to a tablespace in its list (i.e. creates and copies tar file), before it processes the next tablespace. As a result, this will create tar files that are disjointed with respect tablespace data. For example:

Instead of doing this, I suggest that you should just maintain a list
of all the files that need to be fetched and have each worker pull a
file from the head of the list and fetch it when it finishes receiving
the previous file.  That way, if some connections go faster or slower
than others, the distribution of work ends up fairly even.  If you
instead pre-distribute the work, you're guessing what's going to
happen in the future instead of just waiting to see what actually does
happen. Guessing isn't intrinsically bad, but guessing when you could
be sure of doing the right thing *is* bad.

If you want to be really fancy, you could start by sorting the files
in descending order of size, so that big files are fetched before
small ones.  Since the largest possible file is 1GB and any database
where this feature is important is probably hundreds or thousands of
GB, this may not be very important. I suggest not worrying about it
for v1.
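A rough sketch of that pull model on the client side (the single-file SEND_FILES_CONTENTS spelling is only a placeholder for whatever the grammar ends up being, file names are assumed not to need quoting, nconns is assumed to fit MAX_CONNS, and the loop busy-waits rather than waiting on PQsocket() for brevity):

#include <stdio.h>
#include "libpq-fe.h"

#define MAX_CONNS 64

/* Hand the next file (if any) to this connection; returns 1 if one was sent. */
static int
dispatch_next(PGconn *conn, char **files, int nfiles, int *next)
{
    char        command[2048];

    if (*next >= nfiles)
        return 0;
    snprintf(command, sizeof(command),
             "SEND_FILES_CONTENTS ('%s')", files[(*next)++]);
    return PQsendQuery(conn, command);
}

static void
run_pull_loop(PGconn **conns, int nconns, char **files, int nfiles)
{
    int         busy[MAX_CONNS] = {0};  /* query in flight on conns[i]? */
    int         next = 0;
    int         remaining;
    int         i;

    /* prime every connection with one file */
    for (i = 0; i < nconns; i++)
        busy[i] = dispatch_next(conns[i], files, nfiles, &next);

    do
    {
        remaining = 0;
        for (i = 0; i < nconns; i++)
        {
            PGresult   *res;

            if (!busy[i])
                continue;

            if (PQconsumeInput(conns[i]) == 0)
                fprintf(stderr, "%s", PQerrorMessage(conns[i]));
            if (PQisBusy(conns[i]))
            {
                remaining++;    /* still receiving its current file */
                continue;
            }

            /* one file finished: drain the result, then pull the next one */
            while ((res = PQgetResult(conns[i])) != NULL)
                PQclear(res);

            busy[i] = dispatch_next(conns[i], files, nfiles, &next);
            remaining += busy[i];
        }
    } while (remaining > 0);
}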

> Say, tablespace t1 has 20 files and we have 5 worker processes and tablespace t2 has 10. Ignoring all other factors for the sake of this example, each worker process will get a group of 4 files of t1 and 2 files of t2. Each process will create 2 tar files, one for t1 containing 4 files and another for t2 containing 2 files.

This is one of several possible approaches. If we're doing a
plain-format backup in parallel, we can just write each file where it
needs to go and call it good. But, with a tar-format backup, what
should we do? I can see three options:

1. Error! Tar format parallel backups are not supported.

2. Write multiple tar files. The user might reasonably expect that
they're going to end up with the same files at the end of the backup
regardless of whether they do it in parallel. A user with this
expectation will be disappointed.

3. Write one tar file. In this design, the workers have to take turns
writing to the tar file, so you need some synchronization around that.
Perhaps you'd have N threads that read and buffer a file, and N+1
buffers.  Then you have one additional thread that reads the complete
files from the buffers and writes them to the tar file. There's
obviously some possibility that the writer won't be able to keep up
and writing the backup will therefore be slower than it would be with
approach (2).

There's probably also a possibility that approach (2) would thrash the
disk head back and forth between multiple files that are all being
written at the same time, and approach (3) will therefore win by not
thrashing the disk head. But, since spinning media are becoming less
and less popular and are likely to have multiple disk heads under the
hood when they are used, this is probably not too likely.

I think your choice to go with approach (2) is probably reasonable,
but I'm not sure whether everyone will agree.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: WIP/PoC for parallel backup

Asif Rehman
Hi Robert,

Thanks for the feedback. Please see the comments below:

On Tue, Sep 24, 2019 at 10:53 PM Robert Haas <[hidden email]> wrote:
On Wed, Aug 21, 2019 at 9:53 AM Asif Rehman <[hidden email]> wrote:
> - BASE_BACKUP [PARALLEL] - returns a list of files in PGDATA
> If the parallel option is there, then it will only do pg_start_backup, scans PGDATA and sends a list of file names.

So IIUC, this would mean that BASE_BACKUP without PARALLEL returns
tarfiles, and BASE_BACKUP with PARALLEL returns a result set with a
list of file names. I don't think that's a good approach. It's too
confusing to have one replication command that returns totally
different things depending on whether some option is given.

Sure. I will add a separate command (START_BACKUP) for parallel mode.


> - SEND_FILES_CONTENTS (file1, file2,...) - returns the files in given list.
> pg_basebackup will then send back a list of filenames in this command. This commands will be send by each worker and that worker will be getting the said files.

Seems reasonable, but I think you should just pass one file name and
use the command multiple times, once per file.

I considered this approach initially; however, I adopted the current strategy to avoid multiple round trips between the server and clients and to save on query processing time by issuing a single command rather than multiple ones. Furthermore, fetching multiple files at once will also aid in supporting the tar format, by utilising the existing ReceiveTarFile() function, and will make it possible to create one tarball per tablespace per worker.
  

> - STOP_BACKUP
> when all workers finish then, pg_basebackup will send STOP_BACKUP command.

This also seems reasonable, but surely the matching command should
then be called START_BACKUP, not BASEBACKUP PARALLEL.

> I have done a basic proof of concenpt (POC), which is also attached. I would appreciate some input on this. So far, I am simply dividing the list equally and assigning them to worker processes. I intend to fine tune this by taking into consideration file sizes. Further to add tar format support, I am considering that each worker process, processes all files belonging to a tablespace in its list (i.e. creates and copies tar file), before it processes the next tablespace. As a result, this will create tar files that are disjointed with respect tablespace data. For example:

Instead of doing this, I suggest that you should just maintain a list
of all the files that need to be fetched and have each worker pull a
file from the head of the list and fetch it when it finishes receiving
the previous file.  That way, if some connections go faster or slower
than others, the distribution of work ends up fairly even.  If you
instead pre-distribute the work, you're guessing what's going to
happen in the future instead of just waiting to see what actually does
happen. Guessing isn't intrinsically bad, but guessing when you could
be sure of doing the right thing *is* bad.

If you want to be really fancy, you could start by sorting the files
in descending order of size, so that big files are fetched before
small ones.  Since the largest possible file is 1GB and any database
where this feature is important is probably hundreds or thousands of
GB, this may not be very important. I suggest not worrying about it
for v1.

Ideally, I would like to support the tar format as well, which would be much easier to implement when fetching multiple files at once, since that would enable the existing functionality to be used without much change.

Your idea of sorting the files in descending order of size seems very appealing. I think we can do this and have the files divided among the workers one by one, i.e. the first file in the list goes to worker 1, the second to worker 2, and so on.
 

> Say, tablespace t1 has 20 files and we have 5 worker processes and tablespace t2 has 10. Ignoring all other factors for the sake of this example, each worker process will get a group of 4 files of t1 and 2 files of t2. Each process will create 2 tar files, one for t1 containing 4 files and another for t2 containing 2 files.

This is one of several possible approaches. If we're doing a
plain-format backup in parallel, we can just write each file where it
needs to go and call it good. But, with a tar-format backup, what
should we do? I can see three options:

1. Error! Tar format parallel backups are not supported.

2. Write multiple tar files. The user might reasonably expect that
they're going to end up with the same files at the end of the backup
regardless of whether they do it in parallel. A user with this
expectation will be disappointed.

3. Write one tar file. In this design, the workers have to take turns
writing to the tar file, so you need some synchronization around that.
Perhaps you'd have N threads that read and buffer a file, and N+1
buffers.  Then you have one additional thread that reads the complete
files from the buffers and writes them to the tar file. There's
obviously some possibility that the writer won't be able to keep up
and writing the backup will therefore be slower than it would be with
approach (2).

There's probably also a possibility that approach (2) would thrash the
disk head back and forth between multiple files that are all being
written at the same time, and approach (3) will therefore win by not
thrashing the disk head. But, since spinning media are becoming less
and less popular and are likely to have multiple disk heads under the
hood when they are used, this is probably not too likely.

I think your choice to go with approach (2) is probably reasonable,
but I'm not sure whether everyone will agree.

Yes, for tar format support, approach (2) is what I had in mind. Currently I'm working on the implementation and will share the patch in a couple of days.


--
Asif Rehman
Highgo Software (Canada/China/Pakistan) 
URL : www.highgo.ca 

Re: WIP/PoC for parallel backup

Jeevan Chalke
Hi Asif,

I was looking at the patch and tried compiling it. However, I got a few errors and warnings.

Fixed those in the attached patch.

On Fri, Sep 27, 2019 at 9:30 PM Asif Rehman <[hidden email]> wrote:
Hi Robert,

Thanks for the feedback. Please see the comments below:

On Tue, Sep 24, 2019 at 10:53 PM Robert Haas <[hidden email]> wrote:
On Wed, Aug 21, 2019 at 9:53 AM Asif Rehman <[hidden email]> wrote:
> - BASE_BACKUP [PARALLEL] - returns a list of files in PGDATA
> If the parallel option is there, then it will only do pg_start_backup, scans PGDATA and sends a list of file names.

So IIUC, this would mean that BASE_BACKUP without PARALLEL returns
tarfiles, and BASE_BACKUP with PARALLEL returns a result set with a
list of file names. I don't think that's a good approach. It's too
confusing to have one replication command that returns totally
different things depending on whether some option is given.

Sure. I will add a separate command (START_BACKUP)  for parallel.


> - SEND_FILES_CONTENTS (file1, file2,...) - returns the files in given list.
> pg_basebackup will then send back a list of filenames in this command. This commands will be send by each worker and that worker will be getting the said files.

Seems reasonable, but I think you should just pass one file name and
use the command multiple times, once per file.

I considered this approach initially,  however, I adopted the current strategy to avoid multiple round trips between the server and clients and save on query processing time by issuing a single command rather than multiple ones. Further fetching multiple files at once will also aid in supporting the tar format by utilising the existing ReceiveTarFile() function and will be able to create a tarball for per tablespace per worker.
  

> - STOP_BACKUP
> when all workers finish then, pg_basebackup will send STOP_BACKUP command.

This also seems reasonable, but surely the matching command should
then be called START_BACKUP, not BASEBACKUP PARALLEL.

> I have done a basic proof of concenpt (POC), which is also attached. I would appreciate some input on this. So far, I am simply dividing the list equally and assigning them to worker processes. I intend to fine tune this by taking into consideration file sizes. Further to add tar format support, I am considering that each worker process, processes all files belonging to a tablespace in its list (i.e. creates and copies tar file), before it processes the next tablespace. As a result, this will create tar files that are disjointed with respect tablespace data. For example:

Instead of doing this, I suggest that you should just maintain a list
of all the files that need to be fetched and have each worker pull a
file from the head of the list and fetch it when it finishes receiving
the previous file.  That way, if some connections go faster or slower
than others, the distribution of work ends up fairly even.  If you
instead pre-distribute the work, you're guessing what's going to
happen in the future instead of just waiting to see what actually does
happen. Guessing isn't intrinsically bad, but guessing when you could
be sure of doing the right thing *is* bad.

If you want to be really fancy, you could start by sorting the files
in descending order of size, so that big files are fetched before
small ones.  Since the largest possible file is 1GB and any database
where this feature is important is probably hundreds or thousands of
GB, this may not be very important. I suggest not worrying about it
for v1.

Ideally, I would like to support the tar format as well, which would be much easier to implement when fetching multiple files at once since that would enable using the existent functionality to be used without much change.

Your idea of sorting the files in descending order of size seems very appealing. I think we can do this and have the file divided among the workers one by one i.e. the first file in the list goes to worker 1, the second to process 2, and so on and so forth.
 

> Say, tablespace t1 has 20 files and we have 5 worker processes and tablespace t2 has 10. Ignoring all other factors for the sake of this example, each worker process will get a group of 4 files of t1 and 2 files of t2. Each process will create 2 tar files, one for t1 containing 4 files and another for t2 containing 2 files.

This is one of several possible approaches. If we're doing a
plain-format backup in parallel, we can just write each file where it
needs to go and call it good. But, with a tar-format backup, what
should we do? I can see three options:

1. Error! Tar format parallel backups are not supported.

2. Write multiple tar files. The user might reasonably expect that
they're going to end up with the same files at the end of the backup
regardless of whether they do it in parallel. A user with this
expectation will be disappointed.

3. Write one tar file. In this design, the workers have to take turns
writing to the tar file, so you need some synchronization around that.
Perhaps you'd have N threads that read and buffer a file, and N+1
buffers.  Then you have one additional thread that reads the complete
files from the buffers and writes them to the tar file. There's
obviously some possibility that the writer won't be able to keep up
and writing the backup will therefore be slower than it would be with
approach (2).

There's probably also a possibility that approach (2) would thrash the
disk head back and forth between multiple files that are all being
written at the same time, and approach (3) will therefore win by not
thrashing the disk head. But, since spinning media are becoming less
and less popular and are likely to have multiple disk heads under the
hood when they are used, this is probably not too likely.

I think your choice to go with approach (2) is probably reasonable,
but I'm not sure whether everyone will agree.

Yes for the tar format support, approach (2) is what I had in mind. Currently I'm working on the implementation and will share the patch in a couple of days.


--
Asif Rehman
Highgo Software (Canada/China/Pakistan) 
URL : www.highgo.ca 


--
Jeevan Chalke
Associate Database Architect & Team Lead, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company


Attachment: 0001-Initial-POC-on-parallel-backup_fix_errors_warnings_delta.patch (4K)

Re: WIP/PoC for parallel backup

Robert Haas
In reply to this post by Asif Rehman
On Fri, Sep 27, 2019 at 12:00 PM Asif Rehman <[hidden email]> wrote:
>> > - SEND_FILES_CONTENTS (file1, file2,...) - returns the files in given list.
>> > pg_basebackup will then send back a list of filenames in this command. This commands will be send by each worker and that worker will be getting the said files.
>>
>> Seems reasonable, but I think you should just pass one file name and
>> use the command multiple times, once per file.
>
> I considered this approach initially,  however, I adopted the current strategy to avoid multiple round trips between the server and clients and save on query processing time by issuing a single command rather than multiple ones. Further fetching multiple files at once will also aid in supporting the tar format by utilising the existing ReceiveTarFile() function and will be able to create a tarball for per tablespace per worker.

I think that sending multiple filenames on a line could save some time
when there are lots of very small files, because then the round-trip
overhead could be significant.

However, if you've got mostly big files, I think this is going to be a
loser. It'll be fine if you're able to divide the work exactly evenly,
but that's pretty hard to do, because some workers may succeed in
copying the data faster than others for a variety of reasons: some
data is in memory, some data has to be read from disk, different data
may need to be read from different disks that run at different speeds,
not all the network connections may run at the same speed. Remember
that the backup's not done until the last worker finishes, and so
there may well be a significant advantage in terms of overall speed in
putting some energy into making sure that they finish as close to each
other in time as possible.

To put that another way, the first time all the workers except one get
done while the last one still has 10GB of data to copy, somebody's
going to be unhappy.

> Ideally, I would like to support the tar format as well, which would be much easier to implement when fetching multiple files at once since that would enable using the existent functionality to be used without much change.

I think we should just have the client generate the tarfile. It'll
require duplicating some code, but it's not actually that much code or
that complicated from what I can see.


--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: WIP/PoC for parallel backup

Asif Rehman


On Thu, Oct 3, 2019 at 6:40 PM Robert Haas <[hidden email]> wrote:
On Fri, Sep 27, 2019 at 12:00 PM Asif Rehman <[hidden email]> wrote:
>> > - SEND_FILES_CONTENTS (file1, file2,...) - returns the files in given list.
>> > pg_basebackup will then send back a list of filenames in this command. This commands will be send by each worker and that worker will be getting the said files.
>>
>> Seems reasonable, but I think you should just pass one file name and
>> use the command multiple times, once per file.
>
> I considered this approach initially,  however, I adopted the current strategy to avoid multiple round trips between the server and clients and save on query processing time by issuing a single command rather than multiple ones. Further fetching multiple files at once will also aid in supporting the tar format by utilising the existing ReceiveTarFile() function and will be able to create a tarball for per tablespace per worker.

I think that sending multiple filenames on a line could save some time
when there are lots of very small files, because then the round-trip
overhead could be significant.

However, if you've got mostly big files, I think this is going to be a
loser. It'll be fine if you're able to divide the work exactly evenly,
but that's pretty hard to do, because some workers may succeed in
copying the data faster than others for a variety of reasons: some
data is in memory, some data has to be read from disk, different data
may need to be read from different disks that run at different speeds,
not all the network connections may run at the same speed. Remember
that the backup's not done until the last worker finishes, and so
there may well be a significant advantage in terms of overall speed in
putting some energy into making sure that they finish as close to each
other in time as possible.

To put that another way, the first time all the workers except one get
done while the last one still has 10GB of data to copy, somebody's
going to be unhappy.

I have updated the patch (attached) to include tablespace support, tar format support, and all other base backup options, so that they work in parallel mode as well. As previously suggested, I have removed BASE_BACKUP [PARALLEL] and added START_BACKUP instead to start the backup. The tar format will write multiple tar files depending upon the number of workers specified. I also made all commands (START_BACKUP/SEND_FILES_CONTENT/STOP_BACKUP) accept the base_backup_opt_list, so that the command-line options can also be provided to these commands. Since the command-line options don't change once the backup initiates, I went this way instead of storing them in shared state.

The START_BACKUP command will now return a list of files sorted in descending order by size. This way, the larger files will be at the top of the list and will be assigned to workers one by one, so the larger files will be copied before the others.

Based on my understanding, your main concern is that the files won't be distributed fairly, i.e. one worker might get a big file and take more time while others finish early with smaller files. In this approach I have created a list of files in descending order based on their sizes, so all the big files come at the top. The maximum file size in PG is 1GB, so if we have four workers picking up files from the list one by one, the worst-case scenario is that one worker gets a 1GB file to process while the others get smaller files. However, with this approach of sorting files by descending size and handing them out to workers one by one, there is a very high likelihood of the workers getting work evenly. Does this address your concerns?

Furthermore, the patch also includes a regression test. Since the t/010_pg_basebackup.pl test case tests base backup comprehensively, I have duplicated it as "t/040_pg_basebackup_parallel.pl" and added the parallel option to all of its tests, to make sure parallel mode works as expected. The one thing that differs from base backup is the file checksum reporting. In parallel mode, the total number of checksum failures is not reported correctly; however, the backup will be aborted whenever a checksum failure occurs. This is because the processes do not maintain any shared state. I assume that reporting the total number of failures is less important than noticing the failure and aborting.


--
Asif Rehman
Highgo Software (Canada/China/Pakistan)
URL : www.highgo.ca


Attachment: 0001-parallel-backup.patch (102K)

Re: WIP/PoC for parallel backup

Robert Haas
On Fri, Oct 4, 2019 at 7:02 AM Asif Rehman <[hidden email]> wrote:
> Based on my understanding your main concern is that the files won't be distributed fairly i.e one worker might get a big file and take more time while others get done early with smaller files? In this approach I have created a list of files in descending order based on there sizes so all the big size files will come at the top. The maximum file size in PG is 1GB so if we have four workers who are picking up file from the list one by one, the worst case scenario is that one worker gets a file of 1GB to process while others get files of smaller size. However with this approach of descending files based on size and handing it out to workers one by one, there is a very high likelihood of workers getting work evenly. does this address your concerns?

Somewhat, but I'm not sure it's good enough. There are lots of reasons
why two processes that are started at the same time with the same
amount of work might not finish at the same time.

I'm also not particularly excited about having the server do the
sorting based on file size.  Seems like that ought to be the client's
job, if the client needs the sorting.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: WIP/PoC for parallel backup

Rushabh Lathia
In reply to this post by Asif Rehman
Thanks Asif for the patch.  I am opting to review this.  The patch is a
bit big, so here are some very initial comments to make the review
process easier.

1) The patch seems to do a lot of code shuffling. I think it would be
easier to review if you could break the cleanup out as a separate patch.

Example:
a: setup_throttle
b: include_wal_files

2) As far as I can see, this patch basically has three major phases:

a) Introducing new commands like START_BACKUP, SEND_FILES_CONTENT and
STOP_BACKUP.
b) Implementation of the actual parallel backup.
c) Test cases.

I would suggest breaking these out into three separate patches; that
would make reviewing the patch easier.

3) In your patch you are preparing the backup manifest (a file giving
information about the data files). Robert Haas submitted the backup
manifests patch on another thread [1], and I think we should use that
patch to get the backup manifests for parallel backup.

Further, I will continue to review the patch, but meanwhile, if you can
break up the patches, the review process will be easier.


Thanks,

On Fri, Oct 4, 2019 at 4:32 PM Asif Rehman <[hidden email]> wrote:


On Thu, Oct 3, 2019 at 6:40 PM Robert Haas <[hidden email]> wrote:
On Fri, Sep 27, 2019 at 12:00 PM Asif Rehman <[hidden email]> wrote:
>> > - SEND_FILES_CONTENTS (file1, file2,...) - returns the files in given list.
>> > pg_basebackup will then send back a list of filenames in this command. This commands will be send by each worker and that worker will be getting the said files.
>>
>> Seems reasonable, but I think you should just pass one file name and
>> use the command multiple times, once per file.
>
> I considered this approach initially,  however, I adopted the current strategy to avoid multiple round trips between the server and clients and save on query processing time by issuing a single command rather than multiple ones. Further fetching multiple files at once will also aid in supporting the tar format by utilising the existing ReceiveTarFile() function and will be able to create a tarball for per tablespace per worker.

I think that sending multiple filenames on a line could save some time
when there are lots of very small files, because then the round-trip
overhead could be significant.

However, if you've got mostly big files, I think this is going to be a
loser. It'll be fine if you're able to divide the work exactly evenly,
but that's pretty hard to do, because some workers may succeed in
copying the data faster than others for a variety of reasons: some
data is in memory, some data has to be read from disk, different data
may need to be read from different disks that run at different speeds,
not all the network connections may run at the same speed. Remember
that the backup's not done until the last worker finishes, and so
there may well be a significant advantage in terms of overall speed in
putting some energy into making sure that they finish as close to each
other in time as possible.

To put that another way, the first time all the workers except one get
done while the last one still has 10GB of data to copy, somebody's
going to be unhappy.

I have updated the patch (see the attached patch) to include tablespace support, tar format support and all other backup base backup options to work in parallel mode as well. As previously suggested, I have removed BASE_BACKUP [PARALLEL] and have added START_BACKUP instead to start the backup. The tar format will write multiple tar files depending upon the number of workers specified. Also made all commands (START_BACKUP/SEND_FILES_CONTENT/STOP_BACKUP) to accept the base_backup_opt_list. This way the command-line options can also be provided to these commands. Since the command-line options don't change once the backup initiates, I went this way instead of storing them in shared state.

The START_BACKUP command will now return a sorted list of files in descending order based on file sizes. This way, the larger files will be on top of the list. hence these files will be assigned to workers one by one, making it so that the larger files will be copied before other files.

Based on my understanding your main concern is that the files won't be distributed fairly i.e one worker might get a big file and take more time while others get done early with smaller files? In this approach I have created a list of files in descending order based on there sizes so all the big size files will come at the top. The maximum file size in PG is 1GB so if we have four workers who are picking up file from the list one by one, the worst case scenario is that one worker gets a file of 1GB to process while others get files of smaller size. However with this approach of descending files based on size and handing it out to workers one by one, there is a very high likelihood of workers getting work evenly. does this address your concerns?

Furthermore the patch also includes the regression test. As t/010_pg_basebackup.pl test-case is testing base backup comprehensively, so I have duplicated it to "t/040_pg_basebackup_parallel.pl" and added parallel option in all of its tests, to make sure parallel mode works expectantly. The one thing that differs from base backup is the file checksum reporting. In parallel mode, the total number of checksum failures are not reported correctly however it will abort the backup whenever a checksum failure occurs. This is because processes are not maintaining any shared state. I assume that it's not much important to report total number of failures vs noticing the failure and aborting.


--
Asif Rehman
Highgo Software (Canada/China/Pakistan)
URL : www.highgo.ca



--
Rushabh Lathia

Re: WIP/PoC for parallel backup

Asif Rehman


On Mon, Oct 7, 2019 at 1:52 PM Rushabh Lathia <[hidden email]> wrote:
Thanks Asif for the patch.  I am opting this for a review.  Patch is
bit big, so here are very initial comments to make the review process
easier.

Thanks Rushabh for reviewing the patch.


1) Patch seems doing lot of code shuffling, I think it would be easy
to review if you can break the clean up patch separately.

Example:
a: setup_throttle
b: include_wal_files

2) As I can see this patch basically have three major phase.

a) Introducing new commands like START_BACKUP, SEND_FILES_CONTENT and
STOP_BACKUP.
b) Implementation of actual parallel backup.
c) Testcase

I would suggest, if you can break out in three as a separate patch that
would be nice.  It will benefit in reviewing the patch.

Sure, why not. I will break them into multiple patches.
 

3) In your patch you are preparing the backup manifest (file which
giving the information about the data files). Robert Haas, submitted
the backup manifests patch on another thread [1], and I think we
should use that patch to get the backup manifests for parallel backup.

Sure. Though the backup manifest patch calculates and includes the checksum of backup files, and that is done
while a file is being transferred to the frontend. The manifest file itself is copied at the
very end of the backup. In parallel backup, I need the list of filenames before file contents are transferred, in
order to divide them among multiple workers. For that, the manifest file has to be available when START_BACKUP
is called.

That means the backup manifest should support being created without checksums during START_BACKUP().
I also need the directory information as well, for two reasons:

- In plain format, the base path has to exist before we can write a file. We could extract the base path from each
file name, but doing that for all files does not seem like a good idea.
- The base backup does not include the contents of some directories, but those directories, although empty, are still
expected in PGDATA.

I can make these changes part of the parallel backup patch (which would be on top of the backup manifest patch), or
these changes can be done as part of the manifest patch and then parallel backup can use them.

Robert what do you suggest?


-- 
Asif Rehman
Highgo Software (Canada/China/Pakistan)
URL : www.highgo.ca


Re: WIP/PoC for parallel backup

Robert Haas
On Mon, Oct 7, 2019 at 8:48 AM Asif Rehman <[hidden email]> wrote:

> Sure. Though the backup manifest patch calculates and includes the checksum of backup files and is done
> while the file is being transferred to the frontend-end. The manifest file itself is copied at the
> very end of the backup. In parallel backup, I need the list of filenames before file contents are transferred, in
> order to divide them into multiple workers. For that, the manifest file has to be available when START_BACKUP
>  is called.
>
> That means, backup manifest should support its creation while excluding the checksum during START_BACKUP().
> I also need the directory information as well for two reasons:
>
> - In plain format, base path has to exist before we can write the file. we can extract the base path from the file
> but doing that for all files does not seem a good idea.
> - base backup does not include the content of some directories but those directories although empty, are still
> expected in PGDATA.
>
> I can make these changes part of parallel backup (which would be on top of backup manifest patch) or
> these changes can be done as part of manifest patch and then parallel can use them.
>
> Robert what do you suggest?

I think we should probably not use backup manifests here, actually. I
initially thought that would be a good idea, but after further thought
it seems like it just complicates the code to no real benefit.  I
suggest that the START_BACKUP command just return a result set, like a
query, with perhaps four columns: file name, file type ('d' for
directory or 'f' for file), file size, file mtime. pg_basebackup will
ignore the mtime, but some other tools might find that useful
information.
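For what it's worth, consuming such a result set from libpq would be straightforward; a sketch (the command name and column order follow the suggestion above, everything else is illustrative):

#include <stdio.h>
#include <stdlib.h>
#include "libpq-fe.h"

/* Fetch and print the proposed file list: name, type ('d'/'f'), size, mtime. */
static void
print_file_list(PGconn *conn)
{
    PGresult   *res = PQexec(conn, "START_BACKUP");
    int         i;

    if (PQresultStatus(res) != PGRES_TUPLES_OK)
    {
        fprintf(stderr, "START_BACKUP failed: %s", PQerrorMessage(conn));
        PQclear(res);
        return;
    }

    for (i = 0; i < PQntuples(res); i++)
    {
        const char *name = PQgetvalue(res, i, 0);
        const char *type = PQgetvalue(res, i, 1);   /* "d" or "f" */
        long        size = atol(PQgetvalue(res, i, 2));

        /* pg_basebackup would ignore column 3 (mtime) */
        printf("%s %10ld %s\n", type, size, name);
    }

    PQclear(res);
}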

I wonder if we should also split START_BACKUP (which should enter
non-exclusive backup mode) from GET_FILE_LIST, in case some other
client program wants to use one of those but not the other.  I think
that's probably a good idea, but not sure.

I still think that the files should be requested one at a time, not a
huge long list in a single command.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: WIP/PoC for parallel backup

Asif Rehman


On Mon, Oct 7, 2019 at 6:05 PM Robert Haas <[hidden email]> wrote:
On Mon, Oct 7, 2019 at 8:48 AM Asif Rehman <[hidden email]> wrote:
> Sure. Though the backup manifest patch calculates and includes the checksum of backup files and is done
> while the file is being transferred to the frontend-end. The manifest file itself is copied at the
> very end of the backup. In parallel backup, I need the list of filenames before file contents are transferred, in
> order to divide them into multiple workers. For that, the manifest file has to be available when START_BACKUP
>  is called.
>
> That means, backup manifest should support its creation while excluding the checksum during START_BACKUP().
> I also need the directory information as well for two reasons:
>
> - In plain format, base path has to exist before we can write the file. we can extract the base path from the file
> but doing that for all files does not seem a good idea.
> - base backup does not include the content of some directories but those directories although empty, are still
> expected in PGDATA.
>
> I can make these changes part of parallel backup (which would be on top of backup manifest patch) or
> these changes can be done as part of manifest patch and then parallel can use them.
>
> Robert what do you suggest?

I think we should probably not use backup manifests here, actually. I
initially thought that would be a good idea, but after further thought
it seems like it just complicates the code to no real benefit.

Okay.
 
  I
suggest that the START_BACKUP command just return a result set, like a
query, with perhaps four columns: file name, file type ('d' for
directory or 'f' for file), file size, file mtime. pg_basebackup will
ignore the mtime, but some other tools might find that useful
information.
Yes, the current patch already returns the result set. I will add the additional information.


I wonder if we should also split START_BACKUP (which should enter
non-exclusive backup mode) from GET_FILE_LIST, in case some other
client program wants to use one of those but not the other.  I think
that's probably a good idea, but not sure.

Currently pg_basebackup does not enter exclusive backup mode, and other tools have to
use the pg_start_backup() and pg_stop_backup() functions to achieve that. Since we are breaking the
backup into multiple commands, I believe it would be a good idea to have this option. I will include
it in the next revision of this patch.
 

I still think that the files should be requested one at a time, not a
huge long list in a single command.
Sure, I will make that change.


--
Asif Rehman
Highgo Software (Canada/China/Pakistan)
URL : www.highgo.ca


Re: WIP/PoC for parallel backup

Ibrar Ahmed-4
In reply to this post by Robert Haas


On Mon, Oct 7, 2019 at 6:06 PM Robert Haas <[hidden email]> wrote:
On Mon, Oct 7, 2019 at 8:48 AM Asif Rehman <[hidden email]> wrote:
> Sure. Though the backup manifest patch calculates and includes the checksum of backup files and is done
> while the file is being transferred to the frontend-end. The manifest file itself is copied at the
> very end of the backup. In parallel backup, I need the list of filenames before file contents are transferred, in
> order to divide them into multiple workers. For that, the manifest file has to be available when START_BACKUP
>  is called.
>
> That means, backup manifest should support its creation while excluding the checksum during START_BACKUP().
> I also need the directory information as well for two reasons:
>
> - In plain format, base path has to exist before we can write the file. we can extract the base path from the file
> but doing that for all files does not seem a good idea.
> - base backup does not include the content of some directories but those directories although empty, are still
> expected in PGDATA.
>
> I can make these changes part of parallel backup (which would be on top of backup manifest patch) or
> these changes can be done as part of manifest patch and then parallel can use them.
>
> Robert what do you suggest?

I think we should probably not use backup manifests here, actually. I
initially thought that would be a good idea, but after further thought
it seems like it just complicates the code to no real benefit.  I
suggest that the START_BACKUP command just return a result set, like a
query, with perhaps four columns: file name, file type ('d' for
directory or 'f' for file), file size, file mtime. pg_basebackup will
ignore the mtime, but some other tools might find that useful
information.

I wonder if we should also split START_BACKUP (which should enter
non-exclusive backup mode) from GET_FILE_LIST, in case some other
client program wants to use one of those but not the other.  I think
that's probably a good idea, but not sure.

I still think that the files should be requested one at a time, not a
huge long list in a single command.

What about having an API to get either a single file or a list of files? We would use a single file in
our application, and other tools could benefit from the list of files.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




--
Ibrar Ahmed

Re: WIP/PoC for parallel backup

Robert Haas
On Mon, Oct 7, 2019 at 9:43 AM Ibrar Ahmed <[hidden email]> wrote:
> What about have an API to get the single file or list of files? We will use a single file in
> our application and other tools can get the benefit of list of files.

That sounds a bit speculative to me. Who is to say that anyone will
find that useful? I mean, I think it's fine and good to build the
functionality that we need in a way that maximizes the likelihood that
other tools can reuse that functionality, and I think we should do
that. But I don't think it's smart to build functionality that we
don't really need in the hope that somebody else will find it useful
unless we're pretty sure that they actually will. I don't see that as
being the case here; YMMV.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

