Concurrency issue in pg_rewind

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
11 messages Options
Reply | Threaded
Open this post in threaded view
|

Concurrency issue in pg_rewind

Alexander Kukushkin
Hello,

Today I bumped into an issue with pg_rewind which is not 100% clear
where should be better fixed.

The first call of pg_rewind failed with the following message:
servers diverged at WAL location A76/39E55338 on timeline 132
could not open file
"/home/postgres/pgdata/pgroot/data/pg_wal/0000008400000A760000001E":
No such file or directory

could not find previous WAL record at A76/1EFFE620
Failure, exiting

In order to avoid rebuilding the replica from scratch, we restored the
missing file by calling restore_command (wal-g) and repeated the call
of pg_rewind.

The second time pg_rewind also failed, but the error looked differently:
servers diverged at WAL location A76/39E55338 on timeline 132
rewinding from last common checkpoint at A76/1EF254B8 on timeline 132

could not remove file
"/home/postgres/pgdata/pgroot/data/pg_wal/.wal-g/prefetch/running/0000008400000A7600000024":
No such file or directory
Failure, exiting

The second call left PGDATA in an inconsistent state (empty pg_control).

A few words about where the pg_wal/.wal-g/prefetch/running/ is coming
from: wal-g by default when fetching the WAL file is also trying to do
a prefetch of a few next WAL files. For that it forks and the child
process doing prefetch while the parent process exits. In order to
avoid multiple parallel prefetches of the same file, wal-g keeps its
state in the pg_wal/.wal-g directory. It also keeps prefetched files
there.

What in fact happened: pg_rewind is building a list of files in the
target directory which don't match with the source directory and
therefore must be changed (copied/removed/truncated/etc). When the
list was built, the wal-g prefetch was still running, but when
pg_rewind tried to remove files that should not be there because they
don't exist in the source directory it failed with the fatal error.

The issue occurred on 10.14, but I believe very similar might happen
with postgres 13 when pg_rewind is called with --restore-target-wal
option.

One might argue that the issue should be fixed in wal-g (it should not
mess up with pg_wal), and I personally 99% agree with that, but so far
this behavior was safe, because postgres itself never looks inside
unexpected directories in pg_wal. Also from the usability perspective
it is very convenient to keep prefetched files in the pg_wal/.wal-g,
because it guarantees 100% that they will be located on the same
filesystem as pg_wal and therefore the next time when the
restore_command is called it is enough just to rename the file.

That made me think about how it could be improved in the pg_rewind.
The thing is, that we want to have a specific file to be removed, and
it is already not there. Should it be a fatal error?
traverse_datadir()/recurse_dir() already ignoring all failed lstat()
calls with errno == ENOENT.

Basically I have to options:
1. Remove modify remove_target_file(const char *path, bool missing_ok)
function and remove the missing_ok option, that would be consistent
with recurse_dir()
2. Change the logic of remove_target_file(), so it doesn't exit with
the fatal error if the file is missing, but shows only a warning.

In addition to the aforementioned options the remove_target_dir() also
should be improved, i.e. it should check errno and behave similarly to
the remove_target_file() if the errno == ENOENT

What do you think?

Regards,
--
Alexander Kukushkin


Reply | Threaded
Open this post in threaded view
|

Re: Concurrency issue in pg_rewind

Oleksandr Shulgin
On Wed, Sep 16, 2020 at 2:55 PM Alexander Kukushkin <[hidden email]> wrote:

The second time pg_rewind also failed, but the error looked differently:
servers diverged at WAL location A76/39E55338 on timeline 132
rewinding from last common checkpoint at A76/1EF254B8 on timeline 132

could not remove file
"/home/postgres/pgdata/pgroot/data/pg_wal/.wal-g/prefetch/running/0000008400000A7600000024":

Ouch.  I think pg_rewind shouldn't try to remove any random files in pg_wal that it doesn't know about.
What if the administrator made a backup of some WAL segments there?

--
Alex

Reply | Threaded
Open this post in threaded view
|

Re: Concurrency issue in pg_rewind

Alexey Kondratov-2
In reply to this post by Alexander Kukushkin
On 2020-09-16 15:55, Alexander Kukushkin wrote:

> Hello,
>
> Today I bumped into an issue with pg_rewind which is not 100% clear
> where should be better fixed.
>
> The first call of pg_rewind failed with the following message:
> servers diverged at WAL location A76/39E55338 on timeline 132
> could not open file
> "/home/postgres/pgdata/pgroot/data/pg_wal/0000008400000A760000001E":
> No such file or directory
>
> could not find previous WAL record at A76/1EFFE620
> Failure, exiting
>
> In order to avoid rebuilding the replica from scratch, we restored the
> missing file by calling restore_command (wal-g) and repeated the call
> of pg_rewind.
>
> The second time pg_rewind also failed, but the error looked
> differently:
> servers diverged at WAL location A76/39E55338 on timeline 132
> rewinding from last common checkpoint at A76/1EF254B8 on timeline 132
>
> could not remove file
> "/home/postgres/pgdata/pgroot/data/pg_wal/.wal-g/prefetch/running/0000008400000A7600000024":
> No such file or directory
> Failure, exiting
>
> The second call left PGDATA in an inconsistent state (empty
> pg_control).
>
> A few words about where the pg_wal/.wal-g/prefetch/running/ is coming
> from: wal-g by default when fetching the WAL file is also trying to do
> a prefetch of a few next WAL files. For that it forks and the child
> process doing prefetch while the parent process exits. In order to
> avoid multiple parallel prefetches of the same file, wal-g keeps its
> state in the pg_wal/.wal-g directory. It also keeps prefetched files
> there.
>

Hm, I cannot understand why wal-g (or any other tool) is trying to run
pg_rewind, while WAL copying (and prefetch) is still in progress? Why do
not just wait until it is finished?

It is also not clear for me why it does not put prefetched WAL files
directly into the pg_wal?

>
> The issue occurred on 10.14, but I believe very similar might happen
> with postgres 13 when pg_rewind is called with --restore-target-wal
> option.
>

With --restore-target-wal pg_rewind is trying to call restore_command on
its own and it can happen at two stages:

1) When pg_rewind is trying to find the last checkpoint preceding a
divergence point. In that case file map is not even yet initialized.
Thus, all fetched WAL segments at this stage will be present in the file
map created later.

2) When it creates a data pages map. It should traverse WAL from the
last common checkpoint till the final shutdown point in order to find
all modified pages on the target. At this stage pg_rewind only updates
info about data segments in the file map. That way, I see a minor
problem that WAL segments fetched at this stage would not be deleted,
since they are absent in the file map.

Anyway, pg_rewind does not delete neither WAL segments, not any other
files in the middle of the file map creation, so I cannot imagine, how
it can get into the same trouble on its own.

>
> That made me think about how it could be improved in the pg_rewind.
> The thing is, that we want to have a specific file to be removed, and
> it is already not there. Should it be a fatal error?
> traverse_datadir()/recurse_dir() already ignoring all failed lstat()
> calls with errno == ENOENT.
>

recurse_dir() is also called on the source, which should be running when
--source-server is used, so it can remove files. There is actually a
TODO there:

/*
  * File doesn't exist anymore. This is ok, if the new master
  * is running and the file was just removed. If it was a data
  * file, there should be a WAL record of the removal. If it
  * was something else, it couldn't have been anyway.
  *
  * TODO: But complain if we're processing the target dir!
  */

>
> Basically I have to options:
> 1. Remove modify remove_target_file(const char *path, bool missing_ok)
> function and remove the missing_ok option, that would be consistent
> with recurse_dir()
> 2. Change the logic of remove_target_file(), so it doesn't exit with
> the fatal error if the file is missing, but shows only a warning.
>
> In addition to the aforementioned options the remove_target_dir() also
> should be improved, i.e. it should check errno and behave similarly to
> the remove_target_file() if the errno == ENOENT
>
> What do you think?
>

Although keeping arbitrary files inside PGDATA does not look like a good
idea for me, I do not see anything criminal in skipping non-existing
file, when executing a file map by pg_rewind.


Regards
--
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres Company


Reply | Threaded
Open this post in threaded view
|

Re: Concurrency issue in pg_rewind

Alexander Kukushkin
On Thu, 17 Sep 2020 at 14:04, Alexey Kondratov
<[hidden email]> wrote:
>
> Hm, I cannot understand why wal-g (or any other tool) is trying to run
> pg_rewind, while WAL copying (and prefetch) is still in progress? Why do
> not just wait until it is finished?

wal-g doesn't try to call pg_rewind.
First, we called wal-g, it fetched the file we requested and exited.
But, before exiting, wal-g forks, and the child process does prefetch
of a few next WALs.
We don't really know when the child process exits and can't wait for it.

>
> It is also not clear for me why it does not put prefetched WAL files
> directly into the pg_wal?

Because this is how postgres works. It doesn't matter whether the
specific WAL segment is there, postgres will call the restore_command
anyway.
The restore command also doesn't know if the file in pg_wal is OK,
therefore keeping the prefetched file in some other place and moving
it seems to be a good approach.

> With --restore-target-wal pg_rewind is trying to call restore_command on
> its own and it can happen at two stages:
>
> 1) When pg_rewind is trying to find the last checkpoint preceding a
> divergence point. In that case file map is not even yet initialized.
> Thus, all fetched WAL segments at this stage will be present in the file
> map created later.

Nope, it will fetch files you requested, and in addition to that it
will leave a child process running in the background which is doing
the prefetch (manipulating with pg_wal/.wal-g/...)

>
> 2) When it creates a data pages map. It should traverse WAL from the
> last common checkpoint till the final shutdown point in order to find
> all modified pages on the target. At this stage pg_rewind only updates
> info about data segments in the file map. That way, I see a minor
> problem that WAL segments fetched at this stage would not be deleted,
> since they are absent in the file map.
>
> Anyway, pg_rewind does not delete neither WAL segments, not any other
> files in the middle of the file map creation, so I cannot imagine, how
> it can get into the same trouble on its own.

When pg_rewind was creating the map, some temporary files where there,
because the forked child process of wal-g was still running.
When the wal-g child process exits, it removes some of these files.
Specifically, it was trying to prefetch 0000008400000A7600000024 into
the pg_wal/.wal-g/prefetch/running/0000008400000A7600000024, but
apparently the file wasn't available on S3 and prefetch failed,
therefore the empty file was removed.


> Although keeping arbitrary files inside PGDATA does not look like a good
> idea for me, I do not see anything criminal in skipping non-existing
> file, when executing a file map by pg_rewind.

Good, I will prepare a patch then.

Regards,
--
Alexander Kukushkin


Reply | Threaded
Open this post in threaded view
|

Re: Concurrency issue in pg_rewind

Alexey Kondratov-2
On 2020-09-17 15:27, Alexander Kukushkin wrote:

> On Thu, 17 Sep 2020 at 14:04, Alexey Kondratov
> <[hidden email]> wrote:
>
>> With --restore-target-wal pg_rewind is trying to call restore_command
>> on
>> its own and it can happen at two stages:
>>
>> 1) When pg_rewind is trying to find the last checkpoint preceding a
>> divergence point. In that case file map is not even yet initialized.
>> Thus, all fetched WAL segments at this stage will be present in the
>> file
>> map created later.
>
> Nope, it will fetch files you requested, and in addition to that it
> will leave a child process running in the background which is doing
> the prefetch (manipulating with pg_wal/.wal-g/...)
>
>>
>> 2) When it creates a data pages map. It should traverse WAL from the
>> last common checkpoint till the final shutdown point in order to find
>> all modified pages on the target. At this stage pg_rewind only updates
>> info about data segments in the file map. That way, I see a minor
>> problem that WAL segments fetched at this stage would not be deleted,
>> since they are absent in the file map.
>>
>> Anyway, pg_rewind does not delete neither WAL segments, not any other
>> files in the middle of the file map creation, so I cannot imagine, how
>> it can get into the same trouble on its own.
>
> When pg_rewind was creating the map, some temporary files where there,
> because the forked child process of wal-g was still running.
> When the wal-g child process exits, it removes some of these files.
> Specifically, it was trying to prefetch 0000008400000A7600000024 into
> the pg_wal/.wal-g/prefetch/running/0000008400000A7600000024, but
> apparently the file wasn't available on S3 and prefetch failed,
> therefore the empty file was removed.
>

I do understand how you got into this problem with wal-g. This part of
my answer was about bare postgres and pg_rewind. And my point was that
from my perspective pg_rewind with --restore-target-wal cannot get into
the same trouble on its own, without 'help' of some side tools like
wal-g.


Regards
--
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres Company


Reply | Threaded
Open this post in threaded view
|

Re: Concurrency issue in pg_rewind

Michael Paquier-2
In reply to this post by Oleksandr Shulgin
On Thu, Sep 17, 2020 at 10:20:16AM +0200, Oleksandr Shulgin wrote:
> Ouch.  I think pg_rewind shouldn't try to remove any random files in pg_wal
> that it doesn't know about.
> What if the administrator made a backup of some WAL segments there?

IMO, this would be a rather bad strategy anyway, so just don't do
that, because that could also mean that this is on the same partition
as pg_wal/ which would crash the server if the partition has the idea
to get full even if max_wal_size is set correctly.  If you think about
that, this is rather similar to putting tablespaces in your root data
folder: it may look fancy, but you just make the task of the backend
more complicated.  Note that this practice is not wise when Postgres
decides to apply actions that loop across the entire tree, just to
name two of them: the full fsync of PGDATA at the beginning of crash
recovery or just a base backup.
--
Michael

signature.asc (849 bytes) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Concurrency issue in pg_rewind

Andrey M. Borodin


> 18 сент. 2020 г., в 11:10, Michael Paquier <[hidden email]> написал(а):
>
> On Thu, Sep 17, 2020 at 10:20:16AM +0200, Oleksandr Shulgin wrote:
>> Ouch.  I think pg_rewind shouldn't try to remove any random files in pg_wal
>> that it doesn't know about.
>> What if the administrator made a backup of some WAL segments there?
>
> IMO, this would be a rather bad strategy anyway, so just don't do
> that, because that could also mean that this is on the same partition
> as pg_wal/

This is whole point of having prefetch. restore_command just links file from the same partition.
In WAL-G you strictly control number of cached WALs, so if you configured max_wal_size - you can configure WALG_DOWNLOAD_CONCURRENCY too.

Best regards, Andrey Borodin.

Reply | Threaded
Open this post in threaded view
|

Re: Concurrency issue in pg_rewind

Michael Paquier-2
On Fri, Sep 18, 2020 at 11:31:26AM +0500, Andrey M. Borodin wrote:
> This is whole point of having prefetch. restore_command just links
> file from the same partition.

If this stuff is willing to do so, you may have your reasons, but even
if you wish to locate both pg_wal/ and the prefetch path in the same
partition, I don't get why it is necessary to have the prefetch path
included directly in pg_wal?  You could just use different paths for
both.  Say, with a base partition at /my/path/, you can just have
/my/path/pg_wal/ that the Postgres backend links to, and
/my/path/wal-g/prefetch/ for the secondary path.
--
Michael

signature.asc (849 bytes) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Concurrency issue in pg_rewind

Alexander Kukushkin
Hi,

On Fri, 18 Sep 2020 at 08:59, Michael Paquier <[hidden email]> wrote:

> If this stuff is willing to do so, you may have your reasons, but even
> if you wish to locate both pg_wal/ and the prefetch path in the same
> partition, I don't get why it is necessary to have the prefetch path
> included directly in pg_wal?  You could just use different paths for
> both.  Say, with a base partition at /my/path/, you can just have
> /my/path/pg_wal/ that the Postgres backend links to, and
> /my/path/wal-g/prefetch/ for the secondary path.

Well, I agree and disagree at the same time.
Yes, it would be nice not to write anything unexpected to PGDATA and
pg_wal, but this is also a usability issue.
Creating a separate directory and configuring wal-e/wal-g to use it
(now it is not even possible to configure it), requires some effort
from the administrator, while using something inside pg_wal just
works.

At the same time, pg_rewind due to such "fatal" error leaves PGDATA in
an inconsistent state with empty pg_control file, this is totally bad
and easily fixable. We want the specific file to be absent and it is
already absent, why should it be a fatal error and not warning?

Regards,
--
Alexander Kukushkin


Reply | Threaded
Open this post in threaded view
|

Re: Concurrency issue in pg_rewind

Oleksandr Shulgin
In reply to this post by Michael Paquier-2
On Fri, Sep 18, 2020 at 8:10 AM Michael Paquier <[hidden email]> wrote:
On Thu, Sep 17, 2020 at 10:20:16AM +0200, Oleksandr Shulgin wrote:
> Ouch.  I think pg_rewind shouldn't try to remove any random files in pg_wal
> that it doesn't know about.
> What if the administrator made a backup of some WAL segments there?

IMO, this would be a rather bad strategy anyway, so just don't do
that, because that could also mean that this is on the same partition
as pg_wal/ which would crash the server if the partition has the idea
to get full even if max_wal_size is set correctly.

To clarify my point, I don't mean to backup WAL segments in the background when the server is running, but precisely when the server is down and you need to intervene, such as running pg_rewind.  You might want to "stash" some of the latest segments in case you need to start over (name it pg_wal/0000008400000A760000001E.backup, or pg_wal/backup/0000008400000A760000001E).  It is surprising that pg_rewind might want to decide to remove those.

--
Alex

Reply | Threaded
Open this post in threaded view
|

Re: Concurrency issue in pg_rewind

Andrey M. Borodin
In reply to this post by Michael Paquier-2


> 18 сент. 2020 г., в 11:59, Michael Paquier <[hidden email]> написал(а):
>
> On Fri, Sep 18, 2020 at 11:31:26AM +0500, Andrey M. Borodin wrote:
>> This is whole point of having prefetch. restore_command just links
>> file from the same partition.
>
> If this stuff is willing to do so, you may have your reasons, but even
> if you wish to locate both pg_wal/ and the prefetch path in the same
> partition, I don't get why it is necessary to have the prefetch path
> included directly in pg_wal?  You could just use different paths for
> both.  Say, with a base partition at /my/path/, you can just have
> /my/path/pg_wal/ that the Postgres backend links to, and
> /my/path/wal-g/prefetch/ for the secondary path.

This complexity doesn't seem necessary to me. What we gain? Prefetched WAL is WAL per se. Makes sense to keep it in pg_wal tree by default.

I will implement possibility to move cache out of pg_wal (similar functionality is implemented in pgBackRest). But it seems useless to me: user can configure WAL prefetch to be less performant, without any benefits.

Best regards, Andrey Borodin.