block-level incremental backup


block-level incremental backup

Robert Haas
Hi,

Several companies, including EnterpriseDB, NTT, and Postgres Pro, have
developed technology that permits a block-level incremental backup to
be taken from a PostgreSQL server.  I believe the idea in all of those
cases is that non-relation files should be backed up in their
entirety, but for relation files, only those blocks that have been
changed need to be backed up.  I would like to propose that we should
have a solution for this problem in core, rather than leaving it to
each individual PostgreSQL company to develop and maintain their own
solution. Generally my idea is:

1. There should be a way to tell pg_basebackup to request from the
server only those blocks where LSN >= threshold_value.  There are
several possible ways for the server to implement this, the simplest
of which is to just scan all the blocks and send only the ones that
satisfy that criterion.  That might sound dumb, but it does still save
network bandwidth, and it works even without any prior setup. It will
probably be more efficient in many cases to instead scan all the WAL
generated since that LSN and extract block references from it, but
that is only possible if the server has all of that WAL available or
can somehow get it from the archive.  We could also, as several people
have proposed previously, have some kind of additional relation fork
that stores either a single is-modified bit -- which only helps if the
reference LSN for the is-modified bit is older than the requested LSN
but not too much older -- or the highest LSN for each range of K
blocks, or something like that.  I am at the moment not too concerned
with the exact strategy we use here. I believe we may want to
eventually support more than one, since they have different
trade-offs.
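
To illustrate the simplest of those strategies, here is a rough standalone
sketch of the "just scan all the blocks" approach.  It assumes 8kB pages, a
raw relation segment file, the same byte order as the server that wrote it,
and a threshold LSN passed as a 64-bit integer; the real implementation
would of course live in the server and use its own page and LSN machinery:

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <string.h>

#define BLCKSZ 8192        /* assumed page size */

/* The first 8 bytes of every page are its LSN (pd_lsn), stored as two
 * 32-bit halves in the host byte order of the server that wrote it. */
typedef struct
{
    uint32_t xlogid;       /* high half of the LSN */
    uint32_t xrecoff;      /* low half of the LSN */
} PageLSN;

int
main(int argc, char **argv)
{
    FILE     *f;
    uint64_t  threshold;
    char      page[BLCKSZ];
    long      blkno = 0;

    if (argc != 3)
    {
        fprintf(stderr, "usage: %s SEGMENT_FILE THRESHOLD_LSN\n", argv[0]);
        return 1;
    }
    f = fopen(argv[1], "rb");
    if (f == NULL)
    {
        perror("fopen");
        return 1;
    }
    threshold = strtoull(argv[2], NULL, 0);

    /* Scan every block and report only the ones modified at or after
     * the threshold; unchanged blocks are simply skipped. */
    while (fread(page, 1, BLCKSZ, f) == BLCKSZ)
    {
        PageLSN   pd_lsn;
        uint64_t  lsn;

        memcpy(&pd_lsn, page, sizeof(pd_lsn));
        lsn = ((uint64_t) pd_lsn.xlogid << 32) | pd_lsn.xrecoff;
        if (lsn >= threshold)
            printf("block %ld changed (LSN %X/%X)\n", blkno,
                   (unsigned) pd_lsn.xlogid, (unsigned) pd_lsn.xrecoff);
        blkno++;
    }
    fclose(f);
    return 0;
}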

2. When you use pg_basebackup in this way, each relation file that is
not sent in its entirety is replaced by a file with a different name.
For example, instead of base/16384/16417, you might get
base/16384/partial.16417 or however we decide to name them.  Each such
file will store near the beginning of the file a list of all the
blocks contained in that file, and the blocks themselves will follow
at offsets that can be predicted from the metadata at the beginning of
the file.  The idea is that you shouldn't have to read the whole file
to figure out which blocks it contains, and if you know specifically
what blocks you want, you should be able to reasonably efficiently
read just those blocks.  A backup taken in this manner should also
probably create some kind of metadata file in the root directory that
stops the server from starting and lists other salient details of the
backup.  In particular, you need the threshold LSN for the backup
(i.e. contains blocks newer than this) and the start LSN for the
backup (i.e. the LSN that would have been returned from
pg_start_backup).
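
To make the file layout a bit more concrete, here is one possible shape for
such a partial file, sketched as a C header; the struct name, fields, and
the choice to repeat the threshold LSN per file are all hypothetical,
invented just for illustration rather than a proposed on-disk format:

#include <stdint.h>

/*
 * Hypothetical header at the start of a partial file such as
 * base/16384/partial.16417.  After the header comes an array of
 * nblocks block numbers (uint32_t), followed by the 8kB images of
 * those blocks in the same order, so block i's image starts at
 *   sizeof(PartialFileHeader) + nblocks * sizeof(uint32_t) + i * 8192
 * and a reader can seek straight to any block it needs.
 */
typedef struct PartialFileHeader
{
    uint32_t    magic;          /* identifies the file format */
    uint32_t    version;        /* format version number */
    uint64_t    threshold_lsn;  /* only blocks with LSN >= this are present */
    uint32_t    nblocks;        /* number of block images that follow */
} PartialFileHeader;

With the block-number array up front, a reader only needs the header and
that array to know which blocks the file supplies; the backup-wide
threshold and start LSNs would mainly live in the root metadata file
described above.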

3. There should be a new tool that knows how to merge a full backup
with any number of incremental backups and produce a complete data
directory with no remaining partial files.  The tool should check that
the threshold LSN for each incremental backup is less than or equal to
the start LSN of the previous backup; if not, there may be changes
that happened in between which would be lost, so combining the backups
is unsafe.  Running this tool can be thought of either as restoring
the backup or as producing a new synthetic backup from any number of
incremental backups.  This would allow for a strategy of unending
incremental backups.  For instance, on day 1, you take a full backup.
On every subsequent day, you take an incremental backup.  On day 9,
you run pg_combinebackup day1 day2 -o full; rm -rf day1 day2; mv full
day2.  On each subsequent day you do something similar.  Now you can
always roll back to any of the last seven days by combining the oldest
backup you have (which is always a synthetic full backup) with as many
newer incrementals as you want, up to the point where you want to
stop.
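
The core of the reconstruction is simply "take each block from the newest
backup that has it".  A toy, self-contained sketch of that decision loop
(operating on in-memory arrays with made-up data rather than real partial
files) might look like this:

#include <stdio.h>
#include <stdbool.h>

#define NBLOCKS  8   /* toy relation of 8 blocks */
#define NBACKUPS 3   /* one full backup plus two incrementals */

int
main(void)
{
    /*
     * has_block[b][i]: does backup b contain an image of block i?
     * Backup 0 is the full backup (it has everything); higher indexes
     * are progressively newer incrementals.
     */
    bool has_block[NBACKUPS][NBLOCKS] = {
        {true, true, true, true, true, true, true, true},
        {false, true, false, false, true, false, false, false},
        {false, false, false, true, true, false, false, false},
    };

    /* Reconstruct the file: each block comes from the newest backup
     * that contains it, falling back to the full backup. */
    for (int blk = 0; blk < NBLOCKS; blk++)
    {
        for (int b = NBACKUPS - 1; b >= 0; b--)
        {
            if (has_block[b][blk])
            {
                printf("block %d taken from backup %d\n", blk, b);
                break;
            }
        }
    }
    return 0;
}

In the real tool, the has_block information would come from the block lists
in the partial-file headers, and the threshold/start-LSN sanity check
described above would happen before any blocks are copied.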

Other random points:
- If the server has multiple ways of finding blocks with an LSN
greater than or equal to the threshold LSN, it could make a cost-based
decision between those methods, or it could allow the client to
specify the method to be used.
- I imagine that the server would offer this functionality through a
new replication command or a syntax extension to an existing command,
so it could also be used by tools other than pg_basebackup if they
wished.
- Combining backups could also be done destructively rather than, as
proposed above, non-destructively, but you have to be careful about
what happens in case of a failure.
- The pg_combinebackup tool (or whatever we call it) should probably
have an option to exploit hard links to save disk space; this could in
particular make construction of a new synthetic full backup much
cheaper.  However you'd better be careful not to use this option when
actually trying to restore, because if you start the server and run
recovery, you don't want to change the copies of those same files that
are in your backup directory.  I guess the server could be taught to
complain about st_nlink > 1 but I'm not sure we want to go there.
- It would also be possible to collapse multiple incremental backups
into a single incremental backup, without combining with a full
backup.  In the worst case, size(i1+i2) = size(i1) + size(i2), but if
the same data is modified repeatedly collapsing backups would save
lots of space.  This doesn't seem like a must-have for v1, though.
- If you have a SAN and are taking backups using filesystem snapshots,
then you don't need this, because your SAN probably already uses
copy-on-write magic for those snapshots, and so you are already
getting all of the same benefits in terms of saving storage space that
you would get from something like this.  But not everybody has a SAN.
- I know that there have been several previous efforts in this area,
but none of them have gotten to the point of being committed.  I
intend no disrespect to those efforts.  I believe I'm taking a
slightly different view of the problem here than what has been done
previously, trying to focus on the user experience rather than, e.g.,
the technology that is used to decide which blocks need to be sent.
However it's possible I've missed a promising patch that takes an
approach very similar to what I'm outlining here, and if so, I don't
mind a bit having that pointed out to me.
- This is just a design proposal at this point; there is no code.  If
this proposal, or some modified version of it, seems likely to be
acceptable, I and/or my colleagues might try to implement it.
- It would also be nice to support *parallel* backup, both for full
backups as we can do them today and for incremental backups.  But that
sounds like a separate effort.  pg_combinebackup could potentially
support parallel operation as well, although that might be too
ambitious for v1.
- It would also be nice if pg_basebackup could write backups to places
other than the local disk, like an object store, a tape drive, etc.
But that also sounds like a separate effort.

Thoughts?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: block-level incremental backup

a.zakirov
Hello,

On 09.04.2019 18:48, Robert Haas wrote:
> - It would also be nice if pg_basebackup could write backups to places
> other than the local disk, like an object store, a tape drive, etc.
> But that also sounds like a separate effort.
>
> Thoughts?

(Just thinking out loud) It might also be useful to have a remote restore
facility (i.e. if pg_combinebackup could write to non-local storage), so
that you don't need to restore the instance into a local place and then
copy/move it to the remote machine. But it seems to me that this is the
most nontrivial feature and requires much more effort than the other points.

In pg_probackup we have remote restore via SSH in a beta state. But
SSH isn't an option for an in-core approach, I think.

--
Arthur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company



Re: block-level incremental backup

Andres Freund
In reply to this post by Robert Haas
Hi,

On 2019-04-09 11:48:38 -0400, Robert Haas wrote:
> 2. When you use pg_basebackup in this way, each relation file that is
> not sent in its entirety is replaced by a file with a different name.
> For example, instead of base/16384/16417, you might get
> base/16384/partial.16417 or however we decide to name them.

Hm. But that means that files that are shipped nearly in their entirety
need to be fully rewritten. I wonder if it's better to ship them as files
with holes, and have the metadata in a separate file. That'd then allow us
to just fill in the holes with data from the older version.  I'd assume
that there are a lot of workloads where some significantly sized relations
will get updated in nearly their entirety between backups.


> Each such file will store near the beginning of the file a list of all the
> blocks contained in that file, and the blocks themselves will follow
> at offsets that can be predicted from the metadata at the beginning of
> the file.  The idea is that you shouldn't have to read the whole file
> to figure out which blocks it contains, and if you know specifically
> what blocks you want, you should be able to reasonably efficiently
> read just those blocks.  A backup taken in this manner should also
> probably create some kind of metadata file in the root directory that
> stops the server from starting and lists other salient details of the
> backup.  In particular, you need the threshold LSN for the backup
> (i.e. contains blocks newer than this) and the start LSN for the
> backup (i.e. the LSN that would have been returned from
> pg_start_backup).

I wonder if we shouldn't just integrate that into pg_control or such. So
that:

> 3. There should be a new tool that knows how to merge a full backup
> with any number of incremental backups and produce a complete data
> directory with no remaining partial files.

Could just be part of server startup?


> - I imagine that the server would offer this functionality through a
> new replication command or a syntax extension to an existing command,
> so it could also be used by tools other than pg_basebackup if they
> wished.

Would this logic somehow be usable from tools that don't want to copy
the data directory via pg_basebackup (e.g. for parallelism, to directly
send to some backup service / SAN / whatnot)?


> - It would also be nice if pg_basebackup could write backups to places
> other than the local disk, like an object store, a tape drive, etc.
> But that also sounds like a separate effort.

Indeed seems separate. But worthwhile.

Greetings,

Andres Freund



Re: block-level incremental backup

Gary M
Having worked in the data storage industry since the '80s, I think backup is an important capability. Having said that, the ideas should be expanded to an overall data management strategy combining local and remote storage, including cloud.

From my experience, record and transaction consistency is critical to any replication action, including backup.  The approach commonly includes a starting baseline, a snapshot if you prefer, and a set of incremental changes to the snapshot.  I always used the transaction logs for both backup and remote replication to other DBMSs. In standard ECMA-208 (1994), you will note a file object with a transaction property. Although the language specifies files, a file may be any set of records.

SAN-based snapshots usually occur on the SAN storage device, meaning that cached data (not yet written to disk) will not be snapshotted, or will be inconsistently referenced, and will likely result in a corrupted database on restore.

Snapshots are point-in-time states of storage objects. Between snapshot periods, any number of changes may occur.  If a record of "all changes" is required, snapshot methods must be augmented with a historical record: the transaction log.

Delta-block methods for backups have been in practice for many years; ZFS adopted the practice for block management. The usefulness of incremental backups, whether based on blocks, transactions or other methods, depends on prior data. Like primary storage, backup media can fail, become lost or be inadvertently corrupted. If incremental backup data is lost, the restored data after the point of loss is likely corrupted.

cheers, 
garym




Re: block-level incremental backup

Robert Haas
In reply to this post by Andres Freund
On Tue, Apr 9, 2019 at 12:35 PM Andres Freund <[hidden email]> wrote:
> Hm. But that means that files that are shipped nearly in their entirety
> need to be fully rewritten. I wonder if it's better to ship them as files
> with holes, and have the metadata in a separate file. That'd then allow us
> to just fill in the holes with data from the older version.  I'd assume
> that there are a lot of workloads where some significantly sized relations
> will get updated in nearly their entirety between backups.

I don't want to rely on holes at the FS level.  I don't want to have
to worry about what Windows does and what every Linux filesystem does
and what NetBSD and FreeBSD and Dragonfly BSD and MacOS do.  And I
don't want to have to write documentation for the fine manual
explaining to people that they need to use a hole-preserving tool when
they copy an incremental backup around.  And I don't want to have to
listen to complaints from $USER that their backup tool, $THING, is not
hole-aware.  Just - no.

But what we could do is have some threshold (as git does), beyond
which you just send the whole file.  For example if >90% of the blocks
have changed, or >80% or whatever, then you just send everything.
That way, if you have a database where you have lots and lots of 1GB
segments with low churn (so that you can't just use full backups) and
lots and lots of 1GB segments with high churn (to create the problem
you're describing) you'll still be OK.
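
In code, that cutoff is almost trivial; a hypothetical sketch (the 90%
figure and the function name are placeholders, not a proposal for the
actual knob):

#include <stdio.h>
#include <stdbool.h>

/* Hypothetical cutoff: ship the whole segment once "most" of it changed. */
static bool
send_whole_file(unsigned changed_blocks, unsigned total_blocks)
{
    return total_blocks > 0 &&
           (double) changed_blocks / total_blocks > 0.90;
}

int
main(void)
{
    /* A 1GB segment has 131072 8kB blocks. */
    printf("20000/131072 changed -> %s\n",
           send_whole_file(20000, 131072) ? "whole file" : "partial file");
    printf("125000/131072 changed -> %s\n",
           send_whole_file(125000, 131072) ? "whole file" : "partial file");
    return 0;
}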

> > 3. There should be a new tool that knows how to merge a full backup
> > with any number of incremental backups and produce a complete data
> > directory with no remaining partial files.
>
> Could just be part of server startup?

Yes, but I think that sucks.  You might not want to start the server
but rather just create a new synthetic backup.  And realistically,
it's hard to imagine the server doing anything but synthesizing the
backup first and then proceeding as normal.  In theory there's no
reason why it couldn't be smart enough to construct the files it needs
"on demand" in the background, but that sounds really hard and I don't
think there's enough value to justify that level of effort.  YMMV, of
course.

> > - I imagine that the server would offer this functionality through a
> > new replication command or a syntax extension to an existing command,
> > so it could also be used by tools other than pg_basebackup if they
> > wished.
>
> Would this logic somehow be usable from tools that don't want to copy
> the data directory via pg_basebackup (e.g. for parallelism, to directly
> send to some backup service / SAN / whatnot)?

Well, I'm imagining it as a piece of server-side functionality that
can figure out what has changed using one of several possible methods,
and then send that stuff to you.  So I think if you don't have a
server connection you are out of luck.  If you have a server
connection but just want to be told what has changed rather than
actually being given that data, that might be something that could be
worked into the design.  I'm not sure whether that's a real need,
though, or just extra work.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: block-level incremental backup

Robert Haas
In reply to this post by a.zakirov
On Tue, Apr 9, 2019 at 12:32 PM Arthur Zakirov <[hidden email]> wrote:
> In pg_probackup we have remote restore via SSH in the beta state. But
> SSH isn't an option for in-core approach I think.

That's a little off-topic for this thread, but I think we should have
some kind of extensible mechanism for pg_basebackup and maybe other
tools, so that you can teach it to send backups to AWS or your
teletype or etch them on stone tablets or whatever without having to
modify core code.  But let's not design that mechanism on this thread,
'cuz that will distract from what I want to talk about here.  Feel
free to start a new thread for it, though, and I'll jump in.  :-)

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: block-level incremental backup

Peter Eisentraut-6
In reply to this post by Robert Haas
On 2019-04-09 17:48, Robert Haas wrote:
> It will
> probably be more efficient in many cases to instead scan all the WAL
> generated since that LSN and extract block references from it, but
> that is only possible if the server has all of that WAL available or
> can somehow get it from the archive.

This could be a variant of a replication slot that preserves WAL between
incremental backup runs.

> 3. There should be a new tool that knows how to merge a full backup
> with any number of incremental backups and produce a complete data
> directory with no remaining partial files.

Are there by any chance standard file formats and tools that describe a
binary difference between directories?  That would be really useful here.

--
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: block-level incremental backup

Alvaro Herrera-9
On 2019-Apr-09, Peter Eisentraut wrote:

> On 2019-04-09 17:48, Robert Haas wrote:

> > 3. There should be a new tool that knows how to merge a full backup
> > with any number of incremental backups and produce a complete data
> > directory with no remaining partial files.
>
> Are there by any chance standard file formats and tools that describe a
> binary difference between directories?  That would be really useful here.

VCDIFF? https://tools.ietf.org/html/rfc3284

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: block-level incremental backup

Andrey Borodin-2
In reply to this post by Robert Haas
Hi!

> On 9 Apr 2019, at 20:48, Robert Haas <[hidden email]> wrote:
>
> Thoughts?
Thanks for this long and thoughtful post!

At Yandex, we have been using incremental backups for some years now. Initially we used a patched pgbarman; then we implemented this functionality in WAL-G, and there are many things still to be done. We have more than 1 PB of clusters backed up with this technology.
Most of the time we use this technology as part of the HA setup in our managed PostgreSQL service, so for us the main goals are to operate backups cheaply and to restore a new node quickly. Here's what I see from our perspective.

1. Yes, this feature is important.

2. This importance does not come from reduced disk storage; magnetic disks and object storage are very cheap.

3. Incremental backups save a lot of network bandwidth. It is non-trivial for the storage system to ingest hundreds of TB daily.

4. Incremental backups are a redundant form of WAL, intended for parallel application. An incremental backup applied sequentially is not very useful; in many cases it will not be much faster than simple WAL replay.

5. As long as increments duplicate WAL functionality, it is not worth pursuing trade-offs purely to reduce storage utilization. We scan WAL during archiving, extract the numbers of changed blocks, and store a change map for a group of WAL segments in the archive.

6. These change maps can also be used for incremental backup of the visibility map (if I recall correctly). But you cannot compare LSNs on a visibility map page: some operations do not bump them.

7. We use change maps during backups and during WAL replay: we know which blocks will change far in advance and prefetch them into the page cache, like pg_prefaulter does.

8. There is similar functionality in RMAN for one well-known database. They used to store 8 sets of change maps. That database also has a cool "increment for catchup" feature.

9. We call an incremental backup a "delta backup". This wording describes the purpose more precisely: it is not the "next version of the DB", it is the "difference between two DB states". But the choice of wording does not matter much.


Here are slides from my talk at PgConf.APAC[0]. I've proposed a talk on this matter to PgCon, but it was not accepted. I will try next year :)

> On 9 Apr 2019, at 20:48, Robert Haas <[hidden email]> wrote:
> - This is just a design proposal at this point; there is no code.  If
> this proposal, or some modified version of it, seems likely to be
> acceptable, I and/or my colleagues might try to implement it.

I'll be happy to help with code, discussion and patch review.

Best regards, Andrey Borodin.

[0] https://yadi.sk/i/Y_S1iqNN5WxS6A


Re: block-level incremental backup

konstantin knizhnik
In reply to this post by Robert Haas


On 09.04.2019 18:48, Robert Haas wrote:
> 1. There should be a way to tell pg_basebackup to request from the
> server only those blocks where LSN >= threshold_value.

Some time ago I implemented an alternative version of the ptrack utility
(not the one used in pg_probackup), which detects updated blocks at the
file level. It is very simple and maybe it can eventually be integrated
into master.
I have attached a patch against vanilla PostgreSQL to this mail.
Right now it contains just two GUCs:

ptrack_map_size: size of the ptrack map (number of elements) used for
incremental backup; 0 = disabled.
ptrack_block_log: logarithm of the ptrack block size (in pages)

and one function:

pg_ptrack_get_changeset(startlsn pg_lsn) returns
{relid,relfilenode,reltablespace,forknum,blocknum,segsize,updlsn,path}

The idea is very simple: it creates a hash map of fixed size
(ptrack_map_size) and stores the LSN of written pages in this map.
Since the default Postgres page size seems to be too small for a ptrack
block (requiring either a very large hash map or an increased number of
conflicts, as well as more random reads), it is possible to configure a
ptrack block to consist of multiple pages (a power of 2).

This patch uses a memory-mapping mechanism. Unfortunately there is no
portable wrapper for it in Postgres, so I had to provide my own
implementations for Unix/Windows. Certainly this is not ideal and should
be rewritten.
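
The update path is roughly the following (a simplified, self-contained
sketch with invented names; the real patch hooks into the server's write
path and keeps the map in the memory-mapped file described above):

#include <stdio.h>
#include <stdint.h>

#define PTRACK_MAP_SIZE  1000003  /* number of map entries (a prime) */
#define PTRACK_BLOCK_LOG 2        /* ptrack block = 2^2 = 4 pages */

static uint64_t ptrack_map[PTRACK_MAP_SIZE];  /* newest LSN per slot */

/* Record that a page of some relation was written at the given LSN. */
static void
ptrack_mark_page(uint32_t relfilenode, uint32_t blkno, uint64_t lsn)
{
    /* Group pages into ptrack blocks, then map into the fixed-size table. */
    uint64_t key  = ((uint64_t) relfilenode << 32) | (blkno >> PTRACK_BLOCK_LOG);
    uint64_t slot = key % PTRACK_MAP_SIZE;

    if (ptrack_map[slot] < lsn)
        ptrack_map[slot] = lsn;   /* keep only the newest LSN for this slot */
}

int
main(void)
{
    uint64_t key = ((uint64_t) 16396 << 32) | (1640u >> PTRACK_BLOCK_LOG);

    ptrack_mark_page(16396, 1640, 0x224FD88);
    ptrack_mark_page(16396, 1641, 0x2258680);  /* same ptrack block as 1640 */

    printf("newest LSN recorded for that ptrack block: %llX\n",
           (unsigned long long) ptrack_map[key % PTRACK_MAP_SIZE]);
    return 0;
}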

How to use?

1. Define ptrack_map_size in postgres.conf, for example (use a prime
number for more uniform hashing):

ptrack_map_size = 1000003

2.  Remember current lsn.

psql postgres -c "select pg_current_wal_lsn()"
  pg_current_wal_lsn
--------------------
  0/224A268
(1 row)

3. Do some updates.

$ pgbench -T 10 postgres

4. Select changed blocks.

  select * from pg_ptrack_get_changeset('0/224A268');
 relid | relfilenode | reltablespace | forknum | blocknum | segsize |  updlsn   |       path
-------+-------------+---------------+---------+----------+---------+-----------+------------------
 16390 |       16396 |          1663 |       0 |     1640 |       1 | 0/224FD88 | base/12710/16396
 16390 |       16396 |          1663 |       0 |     1641 |       1 | 0/2258680 | base/12710/16396
 16390 |       16396 |          1663 |       0 |     1642 |       1 | 0/22615A0 | base/12710/16396
...

Certainly ptrack should be used as part of some backup tool (such as
pg_basebackup or pg_probackup).


--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


[Attachment: ptrack-1.patch (15K)]

Re: block-level incremental backup

Jehan-Guillaume de Rorthais
In reply to this post by Robert Haas
Hi,

On Tue, 9 Apr 2019 11:48:38 -0400
Robert Haas <[hidden email]> wrote:

> Several companies, including EnterpriseDB, NTT, and Postgres Pro, have
> developed technology that permits a block-level incremental backup to
> be taken from a PostgreSQL server.  I believe the idea in all of those
> cases is that non-relation files should be backed up in their
> entirety, but for relation files, only those blocks that have been
> changed need to be backed up.  I would like to propose that we should
> have a solution for this problem in core, rather than leaving it to
> each individual PostgreSQL company to develop and maintain their own
> solution. Generally my idea is:
>
> 1. There should be a way to tell pg_basebackup to request from the
> server only those blocks where LSN >= threshold_value.  There are
> several possible ways for the server to implement this, the simplest
> of which is to just scan all the blocks and send only the ones that
> satisfy that criterion.  That might sound dumb, but it does still save
> network bandwidth, and it works even without any prior setup.

+1, this is a simple design and probably an easy first step that already
brings a lot of benefits.

> It will probably be more efficient in many cases to instead scan all the WAL
> generated since that LSN and extract block references from it, but
> that is only possible if the server has all of that WAL available or
> can somehow get it from the archive.

I'll seize the opportunity to discuss this on the fly.

I've been playing with the idea of producing incremental backups from
archives for many years, but I only started PoC'ing it this year.

My idea would be to create a new tool working on archived WAL, with no
burden on the server side. The basic concept is:

* parse the archives
* record the latest relevant FPW for the incremental backup
* write new WAL with the recorded FPWs, removing/rewriting duplicated WAL records.

It's just a PoC and I haven't finished the WAL-writing part... not even talking
about the replay part. I'm not even sure this project is a good idea, but it is
a good educational exercise for me in the meantime.

Anyway, using real life OLTP production archives, my stats were:

  # WAL   xlogrec kept     Size WAL kept
    127            39%               50%
    383            22%               38%
    639            20%               29%

Based on these stats, I expect this would save a lot of time during recovery as
a first step. If it gets mature, it might even save a lot of archive space or
extend the retention period with degraded granularity. It would even help with
taking full backups at a lower frequency.

Any thoughts about this design would be much appreciated. I suppose this should
go off-list or to a new thread to avoid polluting this thread, as this is a
slightly different subject.

Regards,


PS: I was surprised to still find some existing pieces of code related to
pglesslog in core. That project has been discontinued and the WAL format has
changed in the meantime.



Re: block-level incremental backup

Robert Haas
In reply to this post by Alvaro Herrera-9
On Tue, Apr 9, 2019 at 5:28 PM Alvaro Herrera <[hidden email]> wrote:

> On 2019-Apr-09, Peter Eisentraut wrote:
> > On 2019-04-09 17:48, Robert Haas wrote:
> > > 3. There should be a new tool that knows how to merge a full backup
> > > with any number of incremental backups and produce a complete data
> > > directory with no remaining partial files.
> >
> > Are there by any chance standard file formats and tools that describe a
> > binary difference between directories?  That would be really useful here.
>
> VCDIFF? https://tools.ietf.org/html/rfc3284

I don't understand VCDIFF very well, but I see some potential problems
with going in this direction.

First, suppose we take a full backup on Monday.  Then, on Tuesday, we
want to take an incremental backup.  In my proposal, the backup server
only needs to provide the database with one piece of information: the
start-LSN of the previous backup.  The server determines which blocks
are recently modified and sends them to the client, which stores them.
The end.  On the other hand, storing a maximally compact VCDIFF seems
to require that, for each block modified in the Tuesday backup, we go
read the corresponding block as it existed on Monday.  Assuming that
the server is using some efficient method of locating modified blocks,
this will approximately double the amount of read I/O required to
complete the backup: either the server or the client must now read not
only the current version of the block but the previous versions.  If
the previous backup is an incremental backup that does not contain
full block images but only VCDIFF content, whoever is performing the
VCDIFF calculation will need to walk the entire backup chain and
reconstruct the previous contents of the previous block so that it can
compute the newest VCDIFF.  A customer who does an incremental backup
every day and maintains a synthetic full backup from 1 week prior will
see a roughly eightfold increase in read I/O compared to the design I
proposed.

The same problem exists at restore time.  In my design, the total read
I/O required is equal to the size of the database, plus however much
metadata needs to be read from older delta files -- and that should be
fairly small compared to the actual data being read, at least in
normal, non-extreme cases.  But if we are going to proceed by applying
a series of delta files, we're going to need to read every older
backup in its entirety.  If the turnover percentage is significant,
say 20%/day, and if the backup chain is say 7 backups long to get back
to a full backup, this is a huge difference.  Instead of having to
read ~100% of the database size, as in my proposal, we'll need to read
100% + (6 * 20%) = 220% of the database size.

Since VCDIFF uses an add-copy-run language to describe differences,
we could try to work around the problem that I just described by
describing each changed data block as an 8192-byte add, and unchanged
blocks as an 8192-byte copy.  If we did that, then I think that the
problem at backup time goes away: we can write out a VCDIFF-format
file for the changed blocks based just on knowing that those are the
blocks that have changed, without needing to access the older file. Of
course, if we do it this way, the file will be larger than it would be
if we actually compared the old and new block contents and wrote out a
minimal VCDIFF, but it does make taking a backup a lot simpler.  Even
with this proposal, though, I think we still have trouble with restore
time.  I proposed putting the metadata about which blocks are included
in a delta file at the beginning of the file, which allows a restore
of a new incremental backup to relatively efficiently flip through
older backups to find just the blocks that it needs, without having to
read the whole file.  But I think (although I am not quite sure) that
in the VCDIFF format, the payload for an ADD instruction is stored
near the instruction itself.  The result would be that you'd have to basically
read the whole file at restore time to figure out which blocks were
available from that file and which ones needed to be retrieved from an
older backup.  So while this approach would fix the backup-time
problem, I believe that it would still require significantly more read
I/O at restore time than my proposal.

Furthermore, if, at backup time, we have to do anything that requires
access to the old data, either the client or the server needs to have
access to that data.  Notwithstanding the costs of reading it, that
doesn't seem very desirable.  The server is quite unlikely to have
access to the backups, because most users want to back up to a
different server in order to guard against a hardware failure.  The
client is more likely to be running on a machine where it has access
to the data, because many users back up to the same machine every day,
so the machine that is taking the current backup probably has the
older one.  However, accessing that old backup might not be cheap.  It
could be located in an object store in the cloud someplace, or it
could have been written out to a tape drive and the tape removed from
the drive.  In the design I'm proposing, that stuff doesn't matter,
but if you want to run diffs, then it does.  Even if the client has
efficient access to the data and even if it has so much read I/O
bandwidth that the costs of reading that old data to run diffs doesn't
matter, it's still pretty awkward for a tar-format backup.  The client
would have to take the tar archive sent by the server apart and form a
new one.

Another advantage of storing whole blocks in the incremental backup is
that there's no tight coupling between the full backup and the
incremental backup.  Suppose you take a full backup A on S1, and then
another full backup B, and then an incremental backup C based on A,
and then an incremental backup D based on B.  If backup B is destroyed
beyond retrieval, you can restore the chain A-C-D and get back to the
same place that restoring B-D would have gotten you.  Backup D doesn't
really know or care that it happens to be based on B.  It just knows
that it can only give you those blocks that have LSN >= LSN_B.  You
can get those blocks from anywhere that you like.  If D instead stored
deltas between the blocks as they exist in backup B, then those deltas
would have to be applied specifically to backup B, not some
possibly-later version.

I think the way to think about this problem, or at least the way I
think about this problem, is that we need to decide whether we want
file-level incremental backup, block-level incremental backup, or
byte-level incremental backup.  pgbackrest implements file-level
incremental backup: if the file has changed, copy the whole thing.
That has an appealing simplicity but risks copying 1GB of data for a
1-byte change. What I'm proposing here is block-level incremental
backup, which is more complicated and still risks copying 8kB of data
for a 1-byte change.  Using VCDIFF would, I think, give us byte-level
incremental backup.  That would probably do an excellent job of making
incremental backups as small as they can possibly be, because we would
not need to include in the backup image even a single byte of
unmodified data.  It also seems like it does some other compression
tricks which could shrink incremental backups further.  However, my
intuition is that we won't gain enough in terms of backup size to make
up for the downsides listed above.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: block-level incremental backup

Robert Haas
In reply to this post by Jehan-Guillaume de Rorthais
On Wed, Apr 10, 2019 at 10:57 AM Jehan-Guillaume de Rorthais
<[hidden email]> wrote:

> My idea would be to create a new tool working on archived WAL, with no
> burden on the server side. The basic concept is:
>
> * parse the archives
> * record the latest relevant FPW for the incremental backup
> * write new WAL with the recorded FPWs, removing/rewriting duplicated WAL records.
>
> It's just a PoC and I haven't finished the WAL-writing part... not even talking
> about the replay part. I'm not even sure this project is a good idea, but it is
> a good educational exercise for me in the meantime.
>
> Anyway, using real life OLTP production archives, my stats were:
>
>   # WAL   xlogrec kept     Size WAL kept
>     127            39%               50%
>     383            22%               38%
>     639            20%               29%
>
> Based on these stats, I expect this would save a lot of time during recovery as
> a first step. If it gets mature, it might even save a lot of archive space or
> extend the retention period with degraded granularity. It would even help with
> taking full backups at a lower frequency.
>
> Any thoughts about this design would be much appreciated. I suppose this should
> go off-list or to a new thread to avoid polluting this thread, as this is a
> slightly different subject.

Interesting idea, but I don't see how it can work if you only deal
with the FPWs and not the other records.  For instance, suppose that
you take a full backup at time T0, and then at time T1 there are two
modifications to a certain block in quick succession.  That block is
then never touched again.  Since no checkpoint intervenes between the
modifications, the first one emits an FPI and the second does not.
Capturing the FPI is fine as far as it goes, but unless you also do
something with the non-FPI change, you lose that second modification.
You could fix that by having your tool replicate the effects of WAL
apply outside the server, but that sounds like a ton of work and a ton
of possible bugs.

I have a related idea, though.  Suppose that, as Peter says upthread,
you have a replication slot that prevents old WAL from being removed.
You also have a background worker that is connected to that slot.  It
decodes WAL and produces summary files containing all block-references
extracted from those WAL records and the associated LSN (or maybe some
approximation of the LSN instead of the exact value, to allow for
compression and combining of nearby references).  Then you hold onto
those summary files after the actual WAL is removed.  Now, when
somebody asks the server for all blocks changed since a certain LSN,
it can use those summary files to figure out which blocks to send
without having to read all the pages in the database.  Although I
believe that a simple system that finds modified blocks by reading
them all is good enough for a first version of this feature and useful
in its own right, a more efficient system will be a lot more useful,
and something like this seems to me to be probably the best way to
implement it.
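
Just to make that a bit more concrete, the summary files could be little
more than a sorted array of fixed-size entries along these lines (a sketch
only; the names and exact fields are invented here, not a proposed format):

#include <stdint.h>

/*
 * Hypothetical entry in a WAL summary file: one record per distinct
 * block reference extracted from a range of WAL.  A summary file would
 * cover an LSN range (recorded in its name or header), and keeping the
 * entries sorted by (tablespace, relation, fork, block) would make them
 * compress well and make lookups at backup time cheap.
 */
typedef struct BlockRefSummaryEntry
{
    uint32_t    spcoid;        /* tablespace */
    uint32_t    relfilenode;   /* relation */
    uint32_t    forknum;       /* main fork, FSM, visibility map, ... */
    uint32_t    blocknum;      /* block number within the relation */
    uint64_t    last_lsn;      /* newest (possibly approximate) LSN seen */
} BlockRefSummaryEntry;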

The reason why I think this is likely to be superior to other possible
approaches, such as the ptrack approach Konstantin suggests elsewhere
on this thread, is because it pushes the work of figuring out which
blocks have been modified into the background.  With a ptrack-type
approach, the server has to do some non-zero amount of extra work in
the foreground every time it modifies a block.  With an approach based
on WAL-scanning, the work is done in the background and nobody has to
wait for it.  It's possible that there are other considerations which
aren't occurring to me right now, though.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: block-level incremental backup

Robert Haas
In reply to this post by konstantin knizhnik
On Wed, Apr 10, 2019 at 10:22 AM Konstantin Knizhnik
<[hidden email]> wrote:
> Some time ago I implemented an alternative version of the ptrack utility
> (not the one used in pg_probackup), which detects updated blocks at the
> file level. It is very simple and maybe it can eventually be integrated
> into master.

I don't think this is completely crash-safe.  It looks like it
arranges to msync() the ptrack file at appropriate times (although I
haven't exhaustively verified the logic), but it uses MS_ASYNC, so
it's possible that the ptrack file could get updated on disk either
before or after the relation file itself.  I think before is probably
OK -- it just risks having some blocks look modified when they aren't
really -- but after seems like it is very much not OK.  And changing
this to use MS_SYNC would probably be really expensive.  Likely a
better approach would be to hook into the new fsync queue machinery
that Thomas Munro added to PostgreSQL 12.

It looks like your system maps all the blocks in the system into a
fixed-size map using hashing.  If the number of modified blocks
between the full backup and the incremental backup is large compared
to the size of the ptrack map, you'll start to get a lot of
false-positives.  It will look as if much of the database needs to be
backed up.  For example, in your sample configuration, you have
ptrack_map_size = 1000003. If you've got a 100GB database with 20%
daily turnover, that's about 2.6 million blocks.  If you bump a
random entry ~2.6 million times in a map with 1000003 entries, on the
average ~92% of the entries end up getting bumped, so you will get
very little benefit from incremental backup.  This problem drops off
pretty fast if you raise the size of the map, but it's pretty critical
that your map is large enough for the database you've got, or you may
as well not bother.
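
(For reference, the ~92% figure follows from the usual occupancy
approximation, assuming the hashing is roughly uniform:

    P(slot untouched) = (1 - 1/N)^k \approx e^{-k/N} = e^{-2.6} \approx 0.074

with N = 1,000,003 and k ~ 2.6 million, so roughly 92.6% of the entries get
bumped at least once.)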

It also appears that your system can't really handle resizing of the
map in any friendly way.  So if your data size grows, you may be faced
with either letting the map become progressively less effective, or
throwing it out and losing all the data you have.

None of that is to say that what you're presenting here has no value,
but I think it's possible to do better (and I think we should try).

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: block-level incremental backup

Ashwin Agrawal
In reply to this post by Robert Haas

On Wed, Apr 10, 2019 at 9:21 AM Robert Haas <[hidden email]> wrote:
I have a related idea, though.  Suppose that, as Peter says upthread,
you have a replication slot that prevents old WAL from being removed.
You also have a background worker that is connected to that slot.  It
decodes WAL and produces summary files containing all block-references
extracted from those WAL records and the associated LSN (or maybe some
approximation of the LSN instead of the exact value, to allow for
compression and combining of nearby references).  Then you hold onto
those summary files after the actual WAL is removed.  Now, when
somebody asks the server for all blocks changed since a certain LSN,
it can use those summary files to figure out which blocks to send
without having to read all the pages in the database.  Although I
believe that a simple system that finds modified blocks by reading
them all is good enough for a first version of this feature and useful
in its own right, a more efficient system will be a lot more useful,
and something like this seems to me to be probably the best way to
implement it.

Not to fork the conversation away from incremental backups, but a similar approach is what we have been thinking about for pg_rewind. Currently, pg_rewind requires all the WAL to be present on the source side from the point of divergence in order to rewind. Instead, just parse the WAL and keep the changed blocks around on the source. Then there is no need to retain the WAL, but you can still rewind using the changed-block map. So, rewind becomes very similar to the incremental backup proposed here, after performing the rewind activity using target-side WAL only.

Re: block-level incremental backup

Robert Haas
In reply to this post by Andrey Borodin-2
On Wed, Apr 10, 2019 at 7:51 AM Andrey Borodin <[hidden email]> wrote:
> > On 9 Apr 2019, at 20:48, Robert Haas <[hidden email]> wrote:
> > - This is just a design proposal at this point; there is no code.  If
> > this proposal, or some modified version of it, seems likely to be
> > acceptable, I and/or my colleagues might try to implement it.
>
> I'll be happy to help with code, discussion and patch review.

That would be great!

We should probably give this discussion some more time before we
plunge into the implementation phase, but I'd love to have some help
with that, whether it's with coding or review or whatever.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: block-level incremental backup

Robert Haas
In reply to this post by Ashwin Agrawal
On Wed, Apr 10, 2019 at 12:56 PM Ashwin Agrawal <[hidden email]> wrote:
> Not to fork the conversation away from incremental backups, but a similar approach is what we have been thinking about for pg_rewind. Currently, pg_rewind requires all the WAL to be present on the source side from the point of divergence in order to rewind. Instead, just parse the WAL and keep the changed blocks around on the source. Then there is no need to retain the WAL, but you can still rewind using the changed-block map. So, rewind becomes very similar to the incremental backup proposed here, after performing the rewind activity using target-side WAL only.

Interesting.  So if we build a system like this for incremental
backup, or for pg_rewind, the other one can use the same
infrastructure.  That sounds excellent.  I'll start a new thread to
talk about that, and hopefully you and Heikki and others will chime in
with thoughts.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: block-level incremental backup

Jehan-Guillaume de Rorthais
In reply to this post by Robert Haas
Hi,

First, thank you for your answer!

On Wed, 10 Apr 2019 12:21:03 -0400
Robert Haas <[hidden email]> wrote:

> On Wed, Apr 10, 2019 at 10:57 AM Jehan-Guillaume de Rorthais
> <[hidden email]> wrote:
> > My idea would be to create a new tool working on archived WAL, with no
> > burden on the server side. The basic concept is:
> >
> > * parse the archives
> > * record the latest relevant FPW for the incremental backup
> > * write new WAL with the recorded FPWs, removing/rewriting duplicated
> > WAL records.
> >
> > It's just a PoC and I haven't finished the WAL-writing part... not even
> > talking about the replay part. I'm not even sure this project is a good
> > idea, but it is a good educational exercise for me in the meantime.
> >
> > Anyway, using real life OLTP production archives, my stats were:
> >
> >   # WAL   xlogrec kept     Size WAL kept
> >     127            39%               50%
> >     383            22%               38%
> >     639            20%               29%
> >
> > Based on these stats, I expect this would save a lot of time during recovery
> > as a first step. If it gets mature, it might even save a lot of archive
> > space or extend the retention period with degraded granularity. It would
> > even help with taking full backups at a lower frequency.
> >
> > Any thoughts about this design would be much appreciated. I suppose this
> > should go off-list or to a new thread to avoid polluting this thread, as
> > this is a slightly different subject.
>
> Interesting idea, but I don't see how it can work if you only deal
> with the FPWs and not the other records.  For instance, suppose that
> you take a full backup at time T0, and then at time T1 there are two
> modifications to a certain block in quick succession.  That block is
> then never touched again.  Since no checkpoint intervenes between the
> modifications, the first one emits an FPI and the second does not.
> Capturing the FPI is fine as far as it goes, but unless you also do
> something with the non-FPI change, you lose that second modification.
> You could fix that by having your tool replicate the effects of WAL
> apply outside the server, but that sounds like a ton of work and a ton
> of possible bugs.

In my current design, the scan is done backward from end to start and I keep all
the records appearing after the last occurrence of their respective FPI.

The next challenge is to deal with multi-block records where some blocks
need to be removed and others are FPIs to keep (e.g. UPDATE).

> I have a related idea, though.  Suppose that, as Peter says upthread,
> you have a replication slot that prevents old WAL from being removed.
> You also have a background worker that is connected to that slot.  It
> decodes WAL and produces summary files containing all block-references
> extracted from those WAL records and the associated LSN (or maybe some
> approximation of the LSN instead of the exact value, to allow for
> compression and combining of nearby references).  Then you hold onto
> those summary files after the actual WAL is removed.  Now, when
> somebody asks the server for all blocks changed since a certain LSN,
> it can use those summary files to figure out which blocks to send
> without having to read all the pages in the database.  Although I
> believe that a simple system that finds modified blocks by reading
> them all is good enough for a first version of this feature and useful
> in its own right, a more efficient system will be a lot more useful,
> and something like this seems to me to be probably the best way to
> implement it.

Summary files look like what Andrey Borodin described as delta-files and
change maps.

> With an approach based
> on WAL-scanning, the work is done in the background and nobody has to
> wait for it.

Agree with this.



Re: block-level incremental backup

Robert Haas
On Wed, Apr 10, 2019 at 2:21 PM Jehan-Guillaume de Rorthais
<[hidden email]> wrote:
> In my current design, the scan is done backward from end to start and I keep all
> the records appearing after the last occurrence of their respective FPI.

Oh, interesting.  That seems like it would require pretty major
surgery on the WAL stream.

> Summary files look like what Andrey Borodin described as delta-files and
> change maps.

Yep.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: block-level incremental backup

Andres Freund
Hi,

On 2019-04-10 14:38:43 -0400, Robert Haas wrote:
> On Wed, Apr 10, 2019 at 2:21 PM Jehan-Guillaume de Rorthais
> <[hidden email]> wrote:
> > In my current design, the scan is done backward from end to start and I keep all
> > the records appearing after the last occurrence of their respective FPI.
>
> Oh, interesting.  That seems like it would require pretty major
> surgery on the WAL stream.

Can't you just read each segment forward, and then reverse? That's not
that much memory? And sure, there are some inefficient cases where records
span many segments, but that's rare enough that reading a few segments
several times doesn't strike me as particularly bad?

Greetings,

Andres Freund

