BUG #16032: pg_basebackup when running on Windows doesn't clean up on failure correctly

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
4 messages Options
Reply | Threaded
Open this post in threaded view
|

BUG #16032: pg_basebackup when running on Windows doesn't clean up on failure correctly

PG Bug reporting form
The following bug has been logged on the website:

Bug reference:      16032
Logged by:          Rob Emery
Email address:      [hidden email]
PostgreSQL version: 11.5
Operating system:   Windows
Description:        

Hello,

When running a pg_basebackup in tar mode on windows against a PG 9.5 box we
are finding that the backup doesn't get deleted successfully when it
fails.
Based on
https://github.com/postgres/postgres/commit/9083353b15c3cf8e7bbac104a81ad42281178cdf#diff-473d306e41ee616e92fb58ac128a9dcc
we would expect it to?

To reproduce start running pg_basebackup like so:

PS C:\Users\developer> & "$PgBinPath\pg_basebackup.exe" --host="PGSERVER"
--pgdata="D:\Backups\rofl" --format=tar --wal-method=fetch
--username="backup_user"

from a psql on the box run:
`SELECT pg_terminate_backend(pid) FROM pg_stat_replication WHERE state =
'backup';`
this kills the backup.

If you run pg_basebackup on a linux box it nicely deletes the base.tar and
the directory, whereas on windows we get the following error:
```
pg_basebackup: could not read COPY data: SSL SYSCALL error: EOF detected
pg_basebackup: removing data directory "D:\Backups\rofl"
could not remove file or directory "D:\Backups\rofl/base.tar": Permission
denied
could not remove file or directory "D:\Backups\rofl": Directory not empty
pg_basebackup: failed to remove data directory
```
Once pg_basebackup has exited, the file lock seems to be gone and we can
delete it ourselves,
it looks like pg_basebackup is contending with itself but only on Windows.

We've reproduced this on 2 different Windows machines so we don't think it's
something like AntiVirus getting in the way or similiar.

Thanks,
Rob

Reply | Threaded
Open this post in threaded view
|

[PATCH] Re: BUG #16032: pg_basebackup when running on Windows doesn't clean up on failure correctly

Rob-3
Hello,

I've already posted this to pg_hackers, but I'm not sure if it's more
normal to
post proposed fixes against the bug itself; so see attached if needed.

Thanks,
Rob



On 2019-10-01 17:06, PG Bug reporting form wrote:

> The following bug has been logged on the website:
>
> Bug reference:      16032
> Logged by:          Rob Emery
> Email address:      [hidden email]
> PostgreSQL version: 11.5
> Operating system:   Windows
> Description:
>
> Hello,
>
> When running a pg_basebackup in tar mode on windows against a PG 9.5
> box we
> are finding that the backup doesn't get deleted successfully when it
> fails.
> Based on
> https://github.com/postgres/postgres/commit/9083353b15c3cf8e7bbac104a81ad42281178cdf#diff-473d306e41ee616e92fb58ac128a9dcc
> we would expect it to?
>
> To reproduce start running pg_basebackup like so:
>
> PS C:\Users\developer> & "$PgBinPath\pg_basebackup.exe"
> --host="PGSERVER"
> --pgdata="D:\Backups\rofl" --format=tar --wal-method=fetch
> --username="backup_user"
>
> from a psql on the box run:
> `SELECT pg_terminate_backend(pid) FROM pg_stat_replication WHERE state
> =
> 'backup';`
> this kills the backup.
>
> If you run pg_basebackup on a linux box it nicely deletes the base.tar
> and
> the directory, whereas on windows we get the following error:
> ```
> pg_basebackup: could not read COPY data: SSL SYSCALL error: EOF
> detected
> pg_basebackup: removing data directory "D:\Backups\rofl"
> could not remove file or directory "D:\Backups\rofl/base.tar":
> Permission
> denied
> could not remove file or directory "D:\Backups\rofl": Directory not
> empty
> pg_basebackup: failed to remove data directory
> ```
> Once pg_basebackup has exited, the file lock seems to be gone and we
> can
> delete it ourselves,
> it looks like pg_basebackup is contending with itself but only on
> Windows.
>
> We've reproduced this on 2 different Windows machines so we don't think
> it's
> something like AntiVirus getting in the way or similiar.
>
> Thanks,
> Rob

0001-Fix-bug16032-under-windows-when-the-backup-is-aborte.patch (2K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: [PATCH] Re: BUG #16032: pg_basebackup when running on Windows doesn't clean up on failure correctly

Michael Paquier-2
On Sun, Oct 06, 2019 at 08:15:10PM +0100, Rob Emery wrote:
> Hello,
>
> I've already posted this to pg_hackers, but I'm not sure if it's
> more normal to post proposed fixes against the bug itself; so see
> attached if needed.

Discussing patches on -bugs threads is perfectly fine :)

Regarding your patch, anything living in the middle of processing
(basically calling disconnect_and_exit() in ~11 and exit() for 12~)
would fail into this category.  Wouldn't it make sense here to use
atexit() to ensure that the cleanup always happens?  I am not sure
that it is a good idea to hope that anything processing the base
backup COPY chunks will remember to clean up those handles in the
event of an error.  I am ready to bet that any future code will forget
to add that so we would keep falling into the same trap.
--
Michael

signature.asc (849 bytes) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: [PATCH] Re: BUG #16032: pg_basebackup when running on Windows doesn't clean up on failure correctly

Rob-3
Hiya Michael,

> Regarding your patch, anything living in the middle of processing
> (basically calling disconnect_and_exit() in ~11 and exit() for 12~)
> would fail into this category.  Wouldn't it make sense here to use
> atexit() to ensure that the cleanup always happens?  I am not sure
> that it is a good idea to hope that anything processing the base
> backup COPY chunks will remember to clean up those handles in the
> event of an error.  I am ready to bet that any future code will forget
> to add that so we would keep falling into the same trap.

Yes and no, the problem is that under Windows the file cleanup
occurs before the process itself actually ends, as the streaming of
the backup from the COPY is performed in a thread, not a fork.

So, if we attempt to use atexit to cleanup the filehandles,
pg_basebackup will still be attempting to delete the partial backup
before the file handles are cleaned up by the atexit callback.

I hope that makes sense; I agree with your sentiment that there must
be a better way of structuring this.

-Rob