Duplicate history file?

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
36 messages Options
12
Reply | Threaded
Open this post in threaded view
|

Duplicate history file?

Kyotaro Horiguchi-4
So, this is the new new thread.

This thread should have been started here:

https://www.postgresql.org/message-id/20210531.165825.921389284096975508.horikyota.ntt%40gmail.com

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center


Reply | Threaded
Open this post in threaded view
|

Re: Duplicate history file?

Stephen Frost
Greetings,

* Kyotaro Horiguchi ([hidden email]) wrote:
> So, this is the new new thread.

This is definitely not the way I would recommend starting up a new
thread as you didn't include the actual text of the prior discussion for
people to be able to read and respond to, instead making them go hunt
for the prior discussion on the old thread and negating the point of
starting a new thread..

Still, I went and found the other email-

* Kyotaro Horiguchi ([hidden email]) wrote:

> At Mon, 31 May 2021 11:52:05 +0900, Tatsuro Yamada <[hidden email]> wrote in
> > Since the above behavior is different from the behavior of the
> > test command in the following example in postgresql.conf, I think
> > we should write a note about this example.
> >
> > # e.g. 'test ! -f /mnt/server/archivedir/%f && cp %p
> > # /mnt/server/archivedir/%f'
> >
> > Let me describe the problem we faced.
> > - When archive_mode=always, archive_command is (sometimes) executed
> >   in a situation where the history file already exists on the standby
> >   side.
> >
> > - In this case, if "test ! -f" is written in the archive_command of
> >   postgresql.conf on the standby side, the command will keep failing.
> >
> >   Note that this problem does not occur when archive_mode=on.
> >
> > So, what should we do for the user? I think we should put some notes
> > in postgresql.conf or in the documentation. For example, something
> > like this:
First off, we should tell them to not use test or cp in their actual
archive command because they don't do things like make sure that the WAL
that's been archived has actually been fsync'd.  Multiple people have
tried to make improvements in this area but the long and short of it is
that trying to provide a simple archive command in the documentation
that actually *works* isn't enough- you need a real tool.  Maybe someone
will write one some day that's part of core, but it's not happened yet
and instead there's external solutions which actually do the correct
things.

The existing documentation should be taken as purely "this is how the
variables which are passed in get expanded" not as "this is what you
should do", because it's very much not the latter in any form.

Thanks,

Stephen

signature.asc (836 bytes) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Duplicate history file?

Kyotaro Horiguchi-4
(Mmm. thunderbird or gmail connects this thread to the previous one..)

At Mon, 7 Jun 2021 14:20:38 -0400, Stephen Frost <[hidden email]> wrote in

> Greetings,
>
> * Kyotaro Horiguchi ([hidden email]) wrote:
> > So, this is the new new thread.
>
> This is definitely not the way I would recommend starting up a new
> thread as you didn't include the actual text of the prior discussion for
> people to be able to read and respond to, instead making them go hunt
> for the prior discussion on the old thread and negating the point of
> starting a new thread..

Sorry for that. I'll do that next time.

> Still, I went and found the other email-

Thanks!

> * Kyotaro Horiguchi ([hidden email]) wrote:
> > At Mon, 31 May 2021 11:52:05 +0900, Tatsuro Yamada <[hidden email]> wrote in
> > > Since the above behavior is different from the behavior of the
> > > test command in the following example in postgresql.conf, I think
> > > we should write a note about this example.
> > >
> > > # e.g. 'test ! -f /mnt/server/archivedir/%f && cp %p
> > > # /mnt/server/archivedir/%f'
> > >
> > > Let me describe the problem we faced.
> > > - When archive_mode=always, archive_command is (sometimes) executed
> > >   in a situation where the history file already exists on the standby
> > >   side.
> > >
> > > - In this case, if "test ! -f" is written in the archive_command of
> > >   postgresql.conf on the standby side, the command will keep failing.
> > >
> > >   Note that this problem does not occur when archive_mode=on.
> > >
> > > So, what should we do for the user? I think we should put some notes
> > > in postgresql.conf or in the documentation. For example, something
> > > like this:
>
> First off, we should tell them to not use test or cp in their actual
> archive command because they don't do things like make sure that the WAL
> that's been archived has actually been fsync'd.  Multiple people have
> tried to make improvements in this area but the long and short of it is
> that trying to provide a simple archive command in the documentation
> that actually *works* isn't enough- you need a real tool.  Maybe someone
> will write one some day that's part of core, but it's not happened yet
> and instead there's external solutions which actually do the correct
> things.

Ideally I agree that it is definitely right. But the documentation
doesn't say a bit of "don't use the simple copy command in any case
(or at least the cases where more than a certain level of durability
and integrity guarantee is required).".

Actually many people are satisfied with just "cp/copy" and I think
they know that the command doesn't guarantee on the integrity of
archived files on , say, some disastrous situation like a sudden power
cut.

However, the use of "test !  -f..." is in a bit different kind of
suggestion.

https://www.postgresql.org/docs/13/continuous-archiving.html
| The archive command should generally be designed to refuse to
| overwrite any pre-existing archive file. This is an important safety
| feature to preserve the integrity of your archive in case of
| administrator error (such as sending the output of two different
| servers to the same archive directory)

This implies that no WAL segment are archived more than once at least
under any valid operation. Some people are following this suggestion
to prevent archive from breaking by some *wrong* operations.

> The existing documentation should be taken as purely "this is how the
> variables which are passed in get expanded" not as "this is what you
> should do", because it's very much not the latter in any form.

It describes "how archive_command should be like" and showing examples
among the description implies that the example conforms the
should-be's.

Nevertheless, the issue here is that there's a case where archiving
stalls when following the suggestion above under a certain condition.
Even if it is written premising "set .. archive_mode to on", I don't
believe that people can surmise that the same archive_command might
fail when setting archive_mode to always, because the description
implies


So I think we need to revise the documentation, or need to *fix* the
revealed problem that is breaking the assumption of the documentation.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center


Reply | Threaded
Open this post in threaded view
|

Re: Duplicate history file?

Kyotaro Horiguchi-4
Yeah, it's hot these days...

At Tue, 08 Jun 2021 12:04:43 +0900 (JST), Kyotaro Horiguchi <[hidden email]> wrote in

> (Mmm. thunderbird or gmail connects this thread to the previous one..)
>
> At Mon, 7 Jun 2021 14:20:38 -0400, Stephen Frost <[hidden email]> wrote in
> > Greetings,
> >
> > * Kyotaro Horiguchi ([hidden email]) wrote:
> > > So, this is the new new thread.
> >
> > This is definitely not the way I would recommend starting up a new
> > thread as you didn't include the actual text of the prior discussion for
> > people to be able to read and respond to, instead making them go hunt
> > for the prior discussion on the old thread and negating the point of
> > starting a new thread..
>
> Sorry for that. I'll do that next time.
>
> > Still, I went and found the other email-
>
> Thanks!
>
> > * Kyotaro Horiguchi ([hidden email]) wrote:
> > > At Mon, 31 May 2021 11:52:05 +0900, Tatsuro Yamada <[hidden email]> wrote in
> > > > Since the above behavior is different from the behavior of the
> > > > test command in the following example in postgresql.conf, I think
> > > > we should write a note about this example.
> > > >
> > > > # e.g. 'test ! -f /mnt/server/archivedir/%f && cp %p
> > > > # /mnt/server/archivedir/%f'
> > > >
> > > > Let me describe the problem we faced.
> > > > - When archive_mode=always, archive_command is (sometimes) executed
> > > >   in a situation where the history file already exists on the standby
> > > >   side.
> > > >
> > > > - In this case, if "test ! -f" is written in the archive_command of
> > > >   postgresql.conf on the standby side, the command will keep failing.
> > > >
> > > >   Note that this problem does not occur when archive_mode=on.
> > > >
> > > > So, what should we do for the user? I think we should put some notes
> > > > in postgresql.conf or in the documentation. For example, something
> > > > like this:
> >
> > First off, we should tell them to not use test or cp in their actual
> > archive command because they don't do things like make sure that the WAL
> > that's been archived has actually been fsync'd.  Multiple people have
> > tried to make improvements in this area but the long and short of it is
> > that trying to provide a simple archive command in the documentation
> > that actually *works* isn't enough- you need a real tool.  Maybe someone
> > will write one some day that's part of core, but it's not happened yet
> > and instead there's external solutions which actually do the correct
> > things.
>
> Ideally I agree that it is definitely right. But the documentation
> doesn't say a bit of "don't use the simple copy command in any case
> (or at least the cases where more than a certain level of durability
> and integrity guarantee is required).".
>
> Actually many people are satisfied with just "cp/copy" and I think
> they know that the command doesn't guarantee on the integrity of
> archived files on , say, some disastrous situation like a sudden power
> cut.
>
> However, the use of "test !  -f..." is in a bit different kind of
> suggestion.
>
> https://www.postgresql.org/docs/13/continuous-archiving.html
> | The archive command should generally be designed to refuse to
> | overwrite any pre-existing archive file. This is an important safety
> | feature to preserve the integrity of your archive in case of
> | administrator error (such as sending the output of two different
> | servers to the same archive directory)
>
> This implies that no WAL segment are archived more than once at least
> under any valid operation. Some people are following this suggestion
> to prevent archive from breaking by some *wrong* operations.
>
> > The existing documentation should be taken as purely "this is how the
> > variables which are passed in get expanded" not as "this is what you
> > should do", because it's very much not the latter in any form.
>

- It describes "how archive_command should be like" and showing examples
+ It describes "what archive_command should be like" and showing examples

> among the description implies that the example conforms the
> should-be's.
>
> Nevertheless, the issue here is that there's a case where archiving
> stalls when following the suggestion above under a certain condition.
> Even if it is written premising "set .. archive_mode to on", I don't
> believe that people can surmise that the same archive_command might
- fail when setting archive_mode to always, because the description
- implies
+ fail when setting archive_mode to always.

>
> So I think we need to revise the documentation, or need to *fix* the
> revealed problem that is breaking the assumption of the documentation.

--
Kyotaro Horiguchi
NTT Open Source Software Center


Reply | Threaded
Open this post in threaded view
|

Re: Duplicate history file?

Tatsuro Yamada-2
In reply to this post by Kyotaro Horiguchi-4
Hi Horiguchi-san,

> This thread should have been started here:
>
> https://www.postgresql.org/message-id/20210531.165825.921389284096975508.horikyota.ntt%40gmail.com
>>
>> (To recap: In a replication set using archive, startup tries to
>> restore WAL files from archive before checking pg_wal directory for
>> the desired file.  The behavior itself is intentionally designed and
>> reasonable. However, the restore code notifies of a restored file
>> regardless of whether it has been already archived or not.  If
>> archive_command is written so as to return error for overwriting as we
>> suggest in the documentation, that behavior causes archive failure.)
>>
>> After playing with this, I see the problem just by restarting a
>> standby even in a simple archive-replication set after making
>> not-special prerequisites.  So I think this is worth fixing.
>>
>> With this patch, KeepFileRestoredFromArchive compares the contents of
>> just-restored file and the existing file for the same segment only
>> when:
>>
>>       - archive_mode = always
>>   and - the file to restore already exists in pgwal
>>   and - it has a .done and/or .ready status file.
>>
>> which doesn't happen usually.  Then the function skips archive
>> notification if the contents are identical.  The included TAP test is
>> working both on Linux and Windows.
>
>
> Thank you for the analysis and the patch.
> I'll try the patch tomorrow.
>
> I just noticed that this thread is still tied to another thread
> (it's not an independent thread). To fix that, it may be better to
> create a new thread again.


I've tried your patch. Unfortunately, it didn't seem to have any good
effect on the script I sent to reproduce the problem.

I understand that, as Stefan says, the test and cp commands have
problems and should not be used for archive commands. Maybe this is not
a big problem for the community.
Nevertheless, even if we do not improve the feature, I think it is a
good idea to explicitly state in the documentation that archiving may
fail under certain conditions for new users.

I'd like to hear the opinions of experts on the archive command.

P.S.
My customer's problem has already been solved, so it's ok. I've
emailed -hackers with the aim of preventing users from encountering
the same problem.


Regards,
Tatsuro Yamada




Reply | Threaded
Open this post in threaded view
|

Re: Duplicate history file?

Kyotaro Horiguchi-4

On 2021/06/08 18:19, Tatsuro Yamada wrote:
> I've tried your patch. Unfortunately, it didn't seem to have any good
> effect on the script I sent to reproduce the problem.

Oops! The patch forgot about history files.

I checked the attached with your repro script and it works fine.

> I understand that, as Stefan says, the test and cp commands have
> problems and should not be used for archive commands. Maybe this is not
> a big problem for the community.
> Nevertheless, even if we do not improve the feature, I think it is a
> good idea to explicitly state in the documentation that archiving may
> fail under certain conditions for new users.
>
> I'd like to hear the opinions of experts on the archive command.
>
> P.S.
> My customer's problem has already been solved, so it's ok. I've
> emailed -hackers with the aim of preventing users from encountering
> the same problem.
>
I understand that.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center


avoid_duplicate_archiving_PoC3.patch.txt (4K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Duplicate history file?

Tatsuro Yamada-2
Hi Horiguchi-san,

On 2021/06/09 11:47, Kyotaro Horiguchi wrote:
> On 2021/06/08 18:19, Tatsuro Yamada wrote:
>> I've tried your patch. Unfortunately, it didn't seem to have any good
>> effect on the script I sent to reproduce the problem.
>
> Oops! The patch forgot about history files.
>
> I checked the attached with your repro script and it works fine.


Thank you for fixing the patch.
The new patch works well in my environment. :-D


Regards,
Tatsuro Yamada



Reply | Threaded
Open this post in threaded view
|

Re: Duplicate history file?

Tatsuro Yamada-2
Hi

> Thank you for fixing the patch.
> The new patch works well in my environment. :-D

This may not be important at this time since it is a
PoC patch, but I would like to inform you that there
was a line that contained multiple spaces instead of tabs.

$ git diff --check
src/backend/access/transam/xlogarchive.c:465: trailing whitespace.
+

Regards,
Tatsuro Yamada



Reply | Threaded
Open this post in threaded view
|

Re: Duplicate history file?

Fujii Masao-4


On 2021/06/09 15:58, Tatsuro Yamada wrote:

> Hi
>
>> Thank you for fixing the patch.
>> The new patch works well in my environment. :-D
>
> This may not be important at this time since it is a
> PoC patch, but I would like to inform you that there
> was a line that contained multiple spaces instead of tabs.
>
> $ git diff --check
> src/backend/access/transam/xlogarchive.c:465: trailing whitespace.
> +

Even with the patch, if "test ! -f ..." is used in archive_command,
you may still *easily* get the trouble that WAL archiving keeps failing?

Instead, we should consider and document "better" command for
archive_command, or implement something like pg_archivecopy command
into the core (as far as I remember, there was the discussion about
this feature before...)?

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION


Reply | Threaded
Open this post in threaded view
|

Re: Duplicate history file?

Tatsuro Yamada-2
Hi,

On 2021/06/09 16:23, Fujii Masao wrote:

> On 2021/06/09 15:58, Tatsuro Yamada wrote:
>> This may not be important at this time since it is a
>> PoC patch, but I would like to inform you that there
>> was a line that contained multiple spaces instead of tabs.
>>
>> $ git diff --check
>> src/backend/access/transam/xlogarchive.c:465: trailing whitespace.
>> +
>
> Even with the patch, if "test ! -f ..." is used in archive_command,
> you may still *easily* get the trouble that WAL archiving keeps failing?

Thanks for your comment.

Yes, it may solve the error when using the test command, but it is
dangerous to continue using the cp command, which is listed as an
example of an archive command.

 
> Instead, we should consider and document "better" command for
> archive_command, or implement something like pg_archivecopy command
> into the core (as far as I remember, there was the discussion about
> this feature before...)?


I agree with that idea.
Since archiving is important for all users, I think there should be
either a better and safer command in the documentation, or an archive
command (pg_archivecopy?) that we provide as a community, as you said.
I am curious about the conclusions of past discussions. :)


Regards,
Tatsuro Yamada




Reply | Threaded
Open this post in threaded view
|

Re: Duplicate history file?

Kyotaro Horiguchi-4
At Wed, 09 Jun 2021 16:56:14 +0900, Tatsuro Yamada <[hidden email]> wrote in

> Hi,
>
> On 2021/06/09 16:23, Fujii Masao wrote:
> > On 2021/06/09 15:58, Tatsuro Yamada wrote:
> >> This may not be important at this time since it is a
> >> PoC patch, but I would like to inform you that there
> >> was a line that contained multiple spaces instead of tabs.
> >>
> >> $ git diff --check
> >> src/backend/access/transam/xlogarchive.c:465: trailing whitespace.
> >> +
> > Even with the patch, if "test ! -f ..." is used in archive_command,
> > you may still *easily* get the trouble that WAL archiving keeps
> > failing?

I'm not sure, but in regard to the the cause that the patch treats, if
an already-archived file is recycled or deleted then the same file is
restored from archive, that could happen. But the WAL segment that
contains the latest checkpoint won't be deleted. The same can be said
on history files.

> Thanks for your comment.
>
> Yes, it may solve the error when using the test command, but it is
> dangerous to continue using the cp command, which is listed as an
> example of an archive command.

"test" command?

At first I thought that the archive command needs to compare the whole
file content *always*, but that happens with the same frequency with
the patch runs a whole-file comparison.

> > Instead, we should consider and document "better" command for
> > archive_command, or implement something like pg_archivecopy command
> > into the core (as far as I remember, there was the discussion about
> > this feature before...)?
>
>
> I agree with that idea.
> Since archiving is important for all users, I think there should be
> either a better and safer command in the documentation, or an archive
> command (pg_archivecopy?) that we provide as a community, as you said.
> I am curious about the conclusions of past discussions. :)

How perfect the officially-provided script or command need to be?  The
reason that the script in the documentation is so simple is, I guess,
we don't/can't offer a steps sufficiently solid for all-directions.

Since we didn't noticed that the "test ! -f" harms so it has been
there but finally we need to remove it.  Instead, we need to write
doen the known significant requirements by words. I'm afraid that the
concrete script would be a bit complex for the documentation..

So what we can do that is:

 - Remove the "test ! -f" from the sample command (for *nixen).

 - Rewrite at least the following portion in the documentation. [1]

   > The archive command should generally be designed to refuse to
   > overwrite any pre-existing archive file. This is an important
   > safety feature to preserve the integrity of your archive in case
   > of administrator error (such as sending the output of two
   > different servers to the same archive directory).
   >
   > It is advisable to test your proposed archive command to ensure
   > that it indeed does not overwrite an existing file, and that it
   > returns nonzero status in this case. The example command above
   > for Unix ensures this by including a separate test step. On some
   > Unix platforms, cp has switches such as -i that can be used to do
   > the same thing less verbosely, but you should not rely on these
   > without verifying that the right exit status is returned. (In
   > particular, GNU cp will return status zero when -i is used and
   > the target file already exists, which is not the desired
   > behavior.)

The replacement would be something like:

"There is a case where WAL file and timeline history files is archived
more than once.  The archive command should generally be designed to
refuse to replace any pre-existing archive file with a file with
different content but to return zero if the file to be archived is
identical with the preexisting file."

But I'm not sure how it looks like.. (even ignoring the broken
phrasing..)
 

1: https://www.postgresql.org/docs/11/continuous-archiving.html

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center


Reply | Threaded
Open this post in threaded view
|

Re: Duplicate history file?

Stephen Frost
Greetings,

* Kyotaro Horiguchi ([hidden email]) wrote:

> At Wed, 09 Jun 2021 16:56:14 +0900, Tatsuro Yamada <[hidden email]> wrote in
> > On 2021/06/09 16:23, Fujii Masao wrote:
> > > Instead, we should consider and document "better" command for
> > > archive_command, or implement something like pg_archivecopy command
> > > into the core (as far as I remember, there was the discussion about
> > > this feature before...)?
> >
> > I agree with that idea.
> > Since archiving is important for all users, I think there should be
> > either a better and safer command in the documentation, or an archive
> > command (pg_archivecopy?) that we provide as a community, as you said.
> > I am curious about the conclusions of past discussions. :)
>
> How perfect the officially-provided script or command need to be?  The
> reason that the script in the documentation is so simple is, I guess,
> we don't/can't offer a steps sufficiently solid for all-directions.
>
> Since we didn't noticed that the "test ! -f" harms so it has been
> there but finally we need to remove it.  Instead, we need to write
> doen the known significant requirements by words. I'm afraid that the
> concrete script would be a bit complex for the documentation..
We don't have any 'officially-provided' tool for archive command.

> So what we can do that is:
>
>  - Remove the "test ! -f" from the sample command (for *nixen).

... or just remove the example entirely.  It really doesn't do anything
good for us, in my view.

>  - Rewrite at least the following portion in the documentation. [1]
>
>    > The archive command should generally be designed to refuse to
>    > overwrite any pre-existing archive file. This is an important
>    > safety feature to preserve the integrity of your archive in case
>    > of administrator error (such as sending the output of two
>    > different servers to the same archive directory).
>    >
>    > It is advisable to test your proposed archive command to ensure
>    > that it indeed does not overwrite an existing file, and that it
>    > returns nonzero status in this case. The example command above
>    > for Unix ensures this by including a separate test step. On some
>    > Unix platforms, cp has switches such as -i that can be used to do
>    > the same thing less verbosely, but you should not rely on these
>    > without verifying that the right exit status is returned. (In
>    > particular, GNU cp will return status zero when -i is used and
>    > the target file already exists, which is not the desired
>    > behavior.)
>
> The replacement would be something like:
>
> "There is a case where WAL file and timeline history files is archived
> more than once.  The archive command should generally be designed to
> refuse to replace any pre-existing archive file with a file with
> different content but to return zero if the file to be archived is
> identical with the preexisting file."
>
> But I'm not sure how it looks like.. (even ignoring the broken
> phrasing..)
There is so much more that we should be including here, like "you should
make sure your archive command will reliably sync the WAL file to disk
before returning success to PG, since PG will feel free to immediately
remove the WAL file once archive command has returned successfully", and
"the archive command should check that there exists a .history file for
any timeline after timeline 1 in the repo for the WAL file that's being
archived" and "the archive command should allow the exist, binary
identical, WAL file to be archived multiple times without error, but
should error if a new WAL file is archived which would overwrite a
binary distinct WAL file in the repo", and "the archive command should
check the WAL header to make sure that the WAL file matches the cluster
in the corresponding backup repo", and "whatever is expiring the WAL
files after they've been archived should make sure to not expire out any
WAL that is needed for any of the backups that remain", and "oh, by the
way, depending on the exit code of the command, PG may consider the
failure to be something which can be retried, or not", and other things
that I can't think of off the top of my head right now.

I have to say that it gets to a point where it feels like we're trying
to document everything about writing a C extension to PG using the
hooks which we make available.  We've generally agreed that folks should
be looking at the source code if they're writing a serious C extension
and it's certainly the case that, in writing a serious archive command
and backup tool, getting into the PG source code has been routinely
necessary.

Thanks,

Stephen

signature.asc (836 bytes) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Duplicate history file?

Kyotaro Horiguchi-4
At Thu, 10 Jun 2021 10:00:21 -0400, Stephen Frost <[hidden email]> wrote in

> Greetings,
>
> * Kyotaro Horiguchi ([hidden email]) wrote:
> > At Wed, 09 Jun 2021 16:56:14 +0900, Tatsuro Yamada <[hidden email]> wrote in
> > > On 2021/06/09 16:23, Fujii Masao wrote:
> > > > Instead, we should consider and document "better" command for
> > > > archive_command, or implement something like pg_archivecopy command
> > > > into the core (as far as I remember, there was the discussion about
> > > > this feature before...)?
> > >
> > > I agree with that idea.
> > > Since archiving is important for all users, I think there should be
> > > either a better and safer command in the documentation, or an archive
> > > command (pg_archivecopy?) that we provide as a community, as you said.
> > > I am curious about the conclusions of past discussions. :)
> >
> > How perfect the officially-provided script or command need to be?  The
> > reason that the script in the documentation is so simple is, I guess,
> > we don't/can't offer a steps sufficiently solid for all-directions.
> >
> > Since we didn't noticed that the "test ! -f" harms so it has been
> > there but finally we need to remove it.  Instead, we need to write
> > doen the known significant requirements by words. I'm afraid that the
> > concrete script would be a bit complex for the documentation..
>
> We don't have any 'officially-provided' tool for archive command.

I meant the "test ! -f .." by the "officially-provided script".  The
fact we show it in the documentation (without a caveart) means that
the script at least doesn't break the server behavior that is running
normally including promotion.

> > So what we can do that is:
> >
> >  - Remove the "test ! -f" from the sample command (for *nixen).
>
> ... or just remove the example entirely.  It really doesn't do anything
> good for us, in my view.

Yeah. I feel like so. But that also means the usage instruction of the
replacements disappears from our documentation. The least problematic
example in the regards above is just "cp .." without "test" as the
instruction.

> > The replacement would be something like:
> >
> > "There is a case where WAL file and timeline history files is archived
> > more than once.  The archive command should generally be designed to
> > refuse to replace any pre-existing archive file with a file with
> > different content but to return zero if the file to be archived is
> > identical with the preexisting file."
> >
> > But I'm not sure how it looks like.. (even ignoring the broken
> > phrasing..)
>
> There is so much more that we should be including here, like "you should

Mmm. Yeah, I can understand your sentiment maybe completely.

> make sure your archive command will reliably sync the WAL file to disk
> before returning success to PG, since PG will feel free to immediately
> remove the WAL file once archive command has returned successfully", and
> "the archive command should check that there exists a .history file for
> any timeline after timeline 1 in the repo for the WAL file that's being
> archived" and "the archive command should allow the exist, binary
> identical, WAL file to be archived multiple times without error, but
> should error if a new WAL file is archived which would overwrite a
> binary distinct WAL file in the repo", and "the archive command should
> check the WAL header to make sure that the WAL file matches the cluster
> in the corresponding backup repo", and "whatever is expiring the WAL
> files after they've been archived should make sure to not expire out any
> WAL that is needed for any of the backups that remain", and "oh, by the
> way, depending on the exit code of the command, PG may consider the
> failure to be something which can be retried, or not", and other things
> that I can't think of off the top of my head right now.
> I have to say that it gets to a point where it feels like we're trying
> to document everything about writing a C extension to PG using the
> hooks which we make available.  We've generally agreed that folks should
> be looking at the source code if they're writing a serious C extension
> and it's certainly the case that, in writing a serious archive command
> and backup tool, getting into the PG source code has been routinely
> necessary.

Nevertheless I agree to it, still don't we need a minimum workable
setup as the first step? Something like below.

===
The following is an example of the minimal archive_command.

Example: cp %p /blah/%f

However, it is far from perfect. The following is the discussion about
what is needed for archive_command to be more reliable.

<the long list of the requirements>
====

Anyway it doesn't seem to be the time to do that, but as now that we
know that there's a case where the current example doesn't prevent PG
from working correctly, we cannot use the "test ! -f" example and
cannot suggest "do not overwrite existing archived files" without a
caveat. At least don't we need to *fix* that parts for now?

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center


Reply | Threaded
Open this post in threaded view
|

Re: Duplicate history file?

Kyotaro Horiguchi-4
Recently my brain is always twisted..

At Fri, 11 Jun 2021 11:25:51 +0900 (JST), Kyotaro Horiguchi <[hidden email]> wrote in
> Anyway it doesn't seem to be the time to do that, but as now that we
- know that there's a case where the current example doesn't prevent PG
+ know that there's a case where the current example prevents PG
> from working correctly, we cannot use the "test ! -f" example and
> cannot suggest "do not overwrite existing archived files" without a
> caveat. At least don't we need to *fix* that parts for now?

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center


Reply | Threaded
Open this post in threaded view
|

Re: Duplicate history file?

Julien Rouhaud
In reply to this post by Kyotaro Horiguchi-4
On Fri, Jun 11, 2021 at 11:25:51AM +0900, Kyotaro Horiguchi wrote:

>
> Nevertheless I agree to it, still don't we need a minimum workable
> setup as the first step? Something like below.
>
> ===
> The following is an example of the minimal archive_command.
>
> Example: cp %p /blah/%f
>
> However, it is far from perfect. The following is the discussion about
> what is needed for archive_command to be more reliable.
>
> <the long list of the requirements>
> ====

"far from perfect" is a strong understatement for "appears to work but will
randomly and silently breaks everything without a simple way to detect it".

We should document a minimum workable setup, but cp isn't an example of that,
and I don't think that there will be a simple example unless we provide a
dedicated utility.

It could however be something along those lines:

Example: archive_program %p /path/to/%d

archive_program being a script ensuring that all those requirements are met:
<the long list of the requirements>


Reply | Threaded
Open this post in threaded view
|

Re: Duplicate history file?

Kyotaro Horiguchi-4
At Fri, 11 Jun 2021 10:48:32 +0800, Julien Rouhaud <[hidden email]> wrote in

> On Fri, Jun 11, 2021 at 11:25:51AM +0900, Kyotaro Horiguchi wrote:
> >
> > Nevertheless I agree to it, still don't we need a minimum workable
> > setup as the first step? Something like below.
> >
> > ===
> > The following is an example of the minimal archive_command.
> >
> > Example: cp %p /blah/%f
> >
> > However, it is far from perfect. The following is the discussion about
> > what is needed for archive_command to be more reliable.
> >
> > <the long list of the requirements>
> > ====
>
> "far from perfect" is a strong understatement for "appears to work but will
> randomly and silently breaks everything without a simple way to detect it".

I think it's overstating. It sounds like a story of a mission critical
world.  How perfect archive_command should be depends on the
requirements of every system.  Simple cp is actualy sufficient in
certain log? range of usages, maybe.

> We should document a minimum workable setup, but cp isn't an example of that,
> and I don't think that there will be a simple example unless we provide a
> dedicated utility.

It looks somewhat strange like "Well, you need a special track to
drive your car, however, we don't have one. It's your responsibility
to construct a track that protects it from accidents perfectly.".

"Yeah, I'm not going to push it so hard and don't care it gets some
small scratches, couldn't I drive it on a public road?"

(Sorry for the bad analogy).

I think cp can be an example as far as we explain the limitations. (On
the other hand "test !-f" cannot since it actually prevents server
from working correctly.)

> It could however be something along those lines:
>
> Example: archive_program %p /path/to/%d
>
> archive_program being a script ensuring that all those requirements are met:
> <the long list of the requirements>

Isn't it almost saying that anything less than pgBackRest isn't
qualified as archive_program?


regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center


Reply | Threaded
Open this post in threaded view
|

Re: Duplicate history file?

Michael Paquier-2
On Fri, Jun 11, 2021 at 03:32:28PM +0900, Kyotaro Horiguchi wrote:
> At Fri, 11 Jun 2021 10:48:32 +0800, Julien Rouhaud <[hidden email]> wrote in
>> "far from perfect" is a strong understatement for "appears to work but will
>> randomly and silently breaks everything without a simple way to detect it".

Yeah.  Users like unplugging their hosts, because that's *fast* and
easy to do.

> I think it's overstating. It sounds like a story of a mission critical
> world.  How perfect archive_command should be depends on the
> requirements of every system.  Simple cp is actually sufficient in
> certain log? range of usages, maybe.
>
>> We should document a minimum workable setup, but cp isn't an example of that,
>> and I don't think that there will be a simple example unless we provide a
>> dedicated utility.
>
> I think cp can be an example as far as we explain the limitations. (On
> the other hand "test !-f" cannot since it actually prevents server
> from working correctly.)
Disagreed.  I think that we should not try to change this area until
we can document a reliable solution, and a simple "cp" is not that.
Hmm.  A simple command that could be used as reference is for example
"dd" that flushes the file by itself, or we could just revisit the
discussions about having a pg_copy command, or we could document a
small utility in perl that does the job.
--
Michael

signature.asc (849 bytes) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Duplicate history file?

Julien Rouhaud
In reply to this post by Kyotaro Horiguchi-4
On Fri, Jun 11, 2021 at 03:32:28PM +0900, Kyotaro Horiguchi wrote:
> At Fri, 11 Jun 2021 10:48:32 +0800, Julien Rouhaud <[hidden email]> wrote in
> >
> > "far from perfect" is a strong understatement for "appears to work but will
> > randomly and silently breaks everything without a simple way to detect it".
>
> I think it's overstating. It sounds like a story of a mission critical
> world.  How perfect archive_command should be depends on the
> requirements of every system.  Simple cp is actualy sufficient in
> certain log? range of usages, maybe.

I disagree, cp is probably the worst command that can be used for this purpose.
On top on the previous problems already mentioned, you also have the fact that
the copy isn't atomic.  It means that any concurrent restore_command (or
anything that would consume the archived files) will happily process a half
copied WAL file, and in case of any error during the copy you end up with a
file for which you don't know if it contains valid data or not.  I don't see
any case where you would actually want to use that, unless maybe if you want to
benchmark how long it takes before you lose some data.

> > We should document a minimum workable setup, but cp isn't an example of that,
> > and I don't think that there will be a simple example unless we provide a
> > dedicated utility.
>
> It looks somewhat strange like "Well, you need a special track to
> drive your car, however, we don't have one. It's your responsibility
> to construct a track that protects it from accidents perfectly.".
>
> "Yeah, I'm not going to push it so hard and don't care it gets some
> small scratches, couldn't I drive it on a public road?"
>
> (Sorry for the bad analogy).

I think that a better analogy would be "I don't need working brakes on my car
since I only drive on highway and there aren't any red light there".

> Isn't it almost saying that anything less than pgBackRest isn't
> qualified as archive_program?

I don't know, I'm assuming that barman also provides one, such as wal-e and
wal-g (assuming that the distant providers do their part of the job correctly).
Maybe there are other tools too.  But as long as we don't document what exactly
are the requirements, it's not really a surprise that most people don't
implement them.


Reply | Threaded
Open this post in threaded view
|

Re: Duplicate history file?

Kyotaro Horiguchi-4
At Fri, 11 Jun 2021 15:18:03 +0800, Julien Rouhaud <[hidden email]> wrote in
> On Fri, Jun 11, 2021 at 03:32:28PM +0900, Kyotaro Horiguchi wrote:
> I disagree, cp is probably the worst command that can be used for this purpose.
> On top on the previous problems already mentioned, you also have the fact that
> the copy isn't atomic.  It means that any concurrent restore_command (or
> anything that would consume the archived files) will happily process a half
> copied WAL file, and in case of any error during the copy you end up with a
> file for which you don't know if it contains valid data or not.  I don't see
> any case where you would actually want to use that, unless maybe if you want to
> benchmark how long it takes before you lose some data.

Actually there's large room for losing data with cp. Yes, we would
need additional redundancy of storage and periodical integrity
inspection of the storage and archives on maybe need copies at the
other sites on the other side of the Earth. But they are too-much for
some kind of users.  They have the right and responsibility to decide
how durable/reliable their archive needs to be.  (Putting aside some
hardware/geological requirements :p) If we mandate some
characteristics on the archive_command, we should take them into core.
I remember I saw some discussions on archive command on this line but
I think it had ended at the point something like that "we cannot
design one-fits-all interface comforming the requirements" or
something (sorry, I don't remember in its detail..)

> I don't know, I'm assuming that barman also provides one, such as wal-e and
> wal-g (assuming that the distant providers do their part of the job correctly).

Well. rman used rsync/ssh in its documentation in the past and now
looks like providing barman-wal-archive so it seems that you're right
in that point.  So, do we recommend them in our documentation? (I'm
not sure they are actually comform the requirement, though..)

> Maybe there are other tools too.  But as long as we don't document what exactly
> are the requirements, it's not really a surprise that most people don't
> implement them.

I strongly agree to describe the requirements.

My point is that if all of them are really mandatory, it is mandatory
for us to officially provide or at least recommend the minimal
implement(s) that covers all of them.  If we recommended some external
tools, that would mean that we ensure that the tools qualify the
requirements.

If we write an example with a pseudo tool name, requiring some
characteristics on the tool, then not telling about the minimal tools,
I think that it is equivalent to that we are inhibiting certain users
from using archive_command even if they really don't want such level
of durability.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center


Reply | Threaded
Open this post in threaded view
|

Re: Duplicate history file?

Kyotaro Horiguchi-4
In reply to this post by Michael Paquier-2
At Fri, 11 Jun 2021 16:08:33 +0900, Michael Paquier <[hidden email]> wrote in
> On Fri, Jun 11, 2021 at 03:32:28PM +0900, Kyotaro Horiguchi wrote:
> > I think cp can be an example as far as we explain the limitations. (On
> > the other hand "test !-f" cannot since it actually prevents server
> > from working correctly.)
>
> Disagreed.  I think that we should not try to change this area until
> we can document a reliable solution, and a simple "cp" is not that.

Isn't removing cp from the documentation a change in this area? I
basically agree to not to change anything but the current example
"test ! -f <fn> && cp .." and relevant description has been known to
be problematic in a certain situation.

- Do we leave it alone igonring the possible problem?

- Just add a description about "the problem"?

- Just remove "test ! -f" and the relevant description?

- Remove "test ! -f" and rewrite the relevant description?

(- or not remove "test ! -f" and rewrite the relevant description?)

- Write the full (known) requirements and use a pseudo tool-name in
 the example?

 - provide a minimal implement of the command?

 - recommend some external tools (that we can guarantee that they
   comform the requriements)?

 - not recommend any tools?

> Hmm.  A simple command that could be used as reference is for example
> "dd" that flushes the file by itself, or we could just revisit the
> discussions about having a pg_copy command, or we could document a
> small utility in perl that does the job.

I think we should do that if pg_copy comforms the mandatory
requirements but maybe it's in the future. Showing the minimal
implement in perl looks good.

However, my main point here is "what should we do for now?".  Not
about an ideal solution.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center


12