PostgreSQL 10 not archiving some WAL files

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
4 messages Options
Reply | Threaded
Open this post in threaded view
|

PostgreSQL 10 not archiving some WAL files

Norberto Dellê

Hi

I have a strange case in some PostgreSQL instances that I monitor. First the software versions and setup of a specific one:

  • Windows Server 2012 Foundation
  • PostgreSQL 10.10 x64
  • archive_command = 'copy% p "D: \\ backup \\ wal \\% f" / y'

This setup worked very well for a long time, but since we upgraded PostgreSQL to newer versions (10.x), sometimes it just skips archiving some wal files. This seems to happen mainly when the server is shut down. There's no messages about errors in the logs, and also no corresponding .done file in archive_status, it just goes to the next wal file.

I would like to know if any of you had or has a problem like this, and if you had, how you solved it.

Reply | Threaded
Open this post in threaded view
|

Re: PostgreSQL 10 not archiving some WAL files

Michael Paquier-2
On Wed, Mar 18, 2020 at 10:57:22AM -0300, Norberto Dellê wrote:
> This setup worked very well for a long time, but since we upgraded
> PostgreSQL to newer versions (10.x), sometimes it just skips archiving some
> wal files. This seems to happen mainly when the server is shut down.
> There's no messages about errors in the logs, and also no corresponding
> .done file in archive_status, it just goes to the next wal file.
>
> I would like to know if any of you had or has a problem like this, and if
> you had, how you solved it.

Hmm.  I don't recall seeing any of that.  This would mean that we are
either missing the generation of some .ready file, or that some .done
file gets generated when they should not in archive_status/.  What
kind of server shutdown are you doing?  Immediate so as recovery
happens at the follow-up startup.  Or is that a clean service
shutdown?
--
Michael

signature.asc (849 bytes) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: PostgreSQL 10 not archiving some WAL files

Norberto Dellê
>On Wed, Mar 18, 2020 at 10:57:22AM -0300, Norberto Dellê wrote:

>> This setup worked very well for a long time, but since we upgraded
>> PostgreSQL to newer versions (10.x), sometimes it just skips archiving some
>> wal files. This seems to happen mainly when the server is shut down.
>> There's no messages about errors in the logs, and also no corresponding
>> .done file in archive_status, it just goes to the next wal file.
>>
>> I would like to know if any of you had or has a problem like this, and if
>> you had, how you solved it.
>
>Hmm.  I don't recall seeing any of that.  This would mean that we are
>either missing the generation of some .ready file, or that some .done
>file gets generated when they should not in archive_status/.  What
>kind of server shutdown are you doing?  Immediate so as recovery
>happens at the follow-up startup.  Or is that a clean service
>shutdown?
>--
>Michael
>...

Hi Michael

Almost daily the server is shutdown, because my client wants to save energy.
In the logs, it appears that the service was not shut down properly. Here's the last lines of one log file:

2020-03-16 18:10:03.130 -04 [1964] LOG:  received fast shutdown request
2020-03-16 18:10:03.132 -04 [2948] ERROR:  canceling statement due to user request
2020-03-16 18:10:03.192 -04 [1964] LOG:  aborting any active transactions
2020-03-16 18:10:03.194 -04 [1964] LOG:  worker process: logical replication launcher (PID 2948) exited with exit code 1
2020-03-16 18:10:03.210 -04 [2188] FATAL:  terminating connection due to administrator command
2020-03-16 18:10:03.210 -04 [6688] FATAL:  terminating connection due to administrator command
2020-03-16 18:10:03.210 -04 [3348] FATAL:  terminating connection due to administrator command
2020-03-16 18:10:03.210 -04 [6356] FATAL:  terminating connection due to administrator command
2020-03-16 18:10:03.210 -04 [5736] FATAL:  terminating connection due to administrator command
2020-03-16 18:10:03.210 -04 [4028] FATAL:  terminating connection due to administrator command
2020-03-16 18:10:03.211 -04 [2552] FATAL:  terminating connection due to administrator command
2020-03-16 18:10:03.212 -04 [4280] FATAL:  terminating connection due to administrator command
2020-03-16 18:10:03.210 -04 [2336] FATAL:  terminating connection due to administrator command
2020-03-16 18:10:03.227 -04 [6564] FATAL:  terminating connection due to administrator command
2020-03-16 18:10:03.255 -04 [2884] LOG:  shutting down

When the service is started the next day, this is what Postgres logs:

2020-03-17 07:47:26.721 -04 [2272] LOG:  database system was interrupted; last known up at 2020-03-16 17:56:20 -04
2020-03-17 07:47:34.974 -04 [2884] FATAL:  the database system is starting up
2020-03-17 07:47:39.378 -04 [2272] LOG:  database system was not properly shut down; automatic recovery in progress
2020-03-17 07:47:39.631 -04 [2272] LOG:  redo starts at 38/F30F5108
2020-03-17 07:47:40.038 -04 [3052] FATAL:  the database system is starting up
2020-03-17 07:47:40.126 -04 [2272] LOG:  redo done at 38/F4001738
2020-03-17 07:47:40.207 -04 [2272] LOG:  last completed transaction was at log time 2020-03-16 18:06:53.269798-04
2020-03-17 07:47:42.590 -04 [2080] LOG:  database system is ready to accept connections

In this shutdown/startup process, this PostgreSQL instance skipped archiving a wal file.
Another thing that I perceived today is that in another instance, days after (about a week) skipping a wal file, the file was finally archived.
Both of these instances that I talked about also have streaming replication configured. But I don't think it is related because there other instances skipping archiving wal files that don't have streaming replication configured.

Norberto
Reply | Threaded
Open this post in threaded view
|

Re: PostgreSQL 10 not archiving some WAL files

Norberto Dellê
In reply to this post by Michael Paquier-2
>>On Wed, Mar 18, 2020 at 10:57:22AM -0300, Norberto Dellê wrote:

>> This setup worked very well for a long time, but since we upgraded
>> PostgreSQL to newer versions (10.x), sometimes it just skips archiving some
>> wal files. This seems to happen mainly when the server is shut down.
>> There's no messages about errors in the logs, and also no corresponding
>> .done file in archive_status, it just goes to the next wal file.
>>
>> I would like to know if any of you had or has a problem like this, and if
>> you had, how you solved it.
>
>Hmm.  I don't recall seeing any of that.  This would mean that we are
>either missing the generation of some .ready file, or that some .done
>file gets generated when they should not in archive_status/.  What
>kind of server shutdown are you doing?  Immediate so as recovery
>happens at the follow-up startup.  Or is that a clean service
>shutdown?
>--
>Michael
>... 
 
Hi Michael

Almost daily the server is shutdown, because my client wants to save energy.
In the logs, it appears that the service was not shut down properly. Here's the last lines of one log file:

2020-03-16 18:10:03.130 -04 [1964] LOG:  received fast shutdown request
2020-03-16 18:10:03.132 -04 [2948] ERROR:  canceling statement due to user request
2020-03-16 18:10:03.192 -04 [1964] LOG:  aborting any active transactions
2020-03-16 18:10:03.194 -04 [1964] LOG:  worker process: logical replication launcher (PID 2948) exited with exit code 1
2020-03-16 18:10:03.210 -04 [2188] FATAL:  terminating connection due to administrator command
2020-03-16 18:10:03.210 -04 [6688] FATAL:  terminating connection due to administrator command
2020-03-16 18:10:03.210 -04 [3348] FATAL:  terminating connection due to administrator command
2020-03-16 18:10:03.210 -04 [6356] FATAL:  terminating connection due to administrator command
2020-03-16 18:10:03.210 -04 [5736] FATAL:  terminating connection due to administrator command
2020-03-16 18:10:03.210 -04 [4028] FATAL:  terminating connection due to administrator command
2020-03-16 18:10:03.211 -04 [2552] FATAL:  terminating connection due to administrator command
2020-03-16 18:10:03.212 -04 [4280] FATAL:  terminating connection due to administrator command
2020-03-16 18:10:03.210 -04 [2336] FATAL:  terminating connection due to administrator command
2020-03-16 18:10:03.227 -04 [6564] FATAL:  terminating connection due to administrator command
2020-03-16 18:10:03.255 -04 [2884] LOG:  shutting down

When the service is started the next day, this is what Postgres logs:

2020-03-17 07:47:26.721 -04 [2272] LOG:  database system was interrupted; last known up at 2020-03-16 17:56:20 -04
2020-03-17 07:47:34.974 -04 [2884] FATAL:  the database system is starting up
2020-03-17 07:47:39.378 -04 [2272] LOG:  database system was not properly shut down; automatic recovery in progress
2020-03-17 07:47:39.631 -04 [2272] LOG:  redo starts at 38/F30F5108
2020-03-17 07:47:40.038 -04 [3052] FATAL:  the database system is starting up
2020-03-17 07:47:40.126 -04 [2272] LOG:  redo done at 38/F4001738
2020-03-17 07:47:40.207 -04 [2272] LOG:  last completed transaction was at log time 2020-03-16 18:06:53.269798-04
2020-03-17 07:47:42.590 -04 [2080] LOG:  database system is ready to accept connections

In this shutdown/startup process, this PostgreSQL instance skipped archiving a wal file.
Another thing that I perceived today is that in another instance, days after (about a week) skipping a wal file, the file was finally archived.
Both of these instances that I talked about also have streaming replication configured. But I don't think it is related because there other instances skipping archiving wal files that don't have streaming replication configured.

Norberto