Standby recovery conflicts: add information when the cancellation occurs

Drouvot, Bertrand
Hi hackers,

As suggested by Masao, I am starting a new thread to follow up about
standby recovery conflicts.

The initial patch proposed in [1] has been split into three parts:

- Add block information in error context of WAL REDO apply: committed
(9d0bd95fa90a7243047a74e29f265296a9fc556d)
- Add information when the startup process is waiting for recovery
conflicts: committed (0650ff23038bc3eb8d8fd851744db837d921e285)
- Add information when the cancellation occurs: subject of this new thread

As you can see, the initial idea was also to dump information about the
blocking backends (should they reach the cancellation stage).

The main idea is to log information like:

2020-06-15 06:48:54.778 UTC [7037] LOG: about to interrupt pid: 7037,
backend_type: client backend, state: active, wait_event_type: Timeout,
wait_event: PgSleep, query_start: 2020-06-15 06:48:13.008427+00
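
To make the proposed format concrete, here is a minimal Python sketch that
splits such a line into its fields. It assumes the message is written on a
single line in the server log (it is only wrapped above by the mail client)
and that the field names are exactly those shown; parse_cancel_line is a
hypothetical helper for illustration, not part of any patch.

```python
import re

# Sample taken from the message above, joined back into a single line.
SAMPLE = (
    "2020-06-15 06:48:54.778 UTC [7037] LOG: about to interrupt pid: 7037, "
    "backend_type: client backend, state: active, wait_event_type: Timeout, "
    "wait_event: PgSleep, query_start: 2020-06-15 06:48:13.008427+00"
)

# Prefix: timestamp, time zone, [pid], severity; the rest is "key: value" pairs.
# Assumption: the log_line_prefix matches the one used in the sample.
LINE_RE = re.compile(
    r"^(?P<log_time>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+) (?P<tz>\S+) "
    r"\[(?P<log_pid>\d+)\] LOG: about to interrupt (?P<fields>.*)$"
)


def parse_cancel_line(line):
    """Return a dict of fields, or None if the line is not a cancel report."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    record = {"log_time": match.group("log_time"), "tz": match.group("tz")}
    # Pairs are separated by ", "; values may contain spaces ("client backend").
    for pair in match.group("fields").split(", "):
        key, _, value = pair.partition(": ")
        record[key] = value
    return record


record = parse_cancel_line(SAMPLE)
print(record["state"], record["wait_event"], record["query_start"])
```

Splitting on ", " is enough here because none of the sample values contain
a comma; a field holding free text (for instance the query itself) would
need a stricter format.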

Some examples of how this could be useful:

     - If the query being canceled usually runs in 1 second, seeing that
it started 1 minute earlier (at cancellation time) could indicate a plan
change.
     - If a lot of queries have been canceled and all of them were
waiting on “DataFileRead”, that could indicate bad I/O response time at
that moment.
     - Seeing the state as “idle in transaction” could indicate
unexpected application behavior (say the application issues BEGIN; SET
TRANSACTION ISOLATION LEVEL REPEATABLE READ; then a SELECT, and then
stays idle in transaction, which could lead to a recovery conflict).
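
The first two checks could be scripted roughly as follows. This is only a
sketch in Python: it assumes the fields have already been extracted from
the log, that the log timestamp is in UTC as in the sample line, and that
query_start ends with a numeric offset such as "+00". running_seconds is
a hypothetical helper name.

```python
import re
from collections import Counter
from datetime import datetime, timezone


def running_seconds(log_time, query_start):
    """Seconds between query_start and the log timestamp."""
    # Assumption: the log timestamp is UTC, as in the sample line.
    logged = datetime.strptime(log_time, "%Y-%m-%d %H:%M:%S.%f")
    logged = logged.replace(tzinfo=timezone.utc)
    # An hour-only offset ("+00") is padded to "+0000" so that %z accepts it.
    if re.search(r"[+-]\d{2}$", query_start):
        query_start += "00"
    started = datetime.strptime(query_start, "%Y-%m-%d %H:%M:%S.%f%z")
    return (logged - started).total_seconds()


# Fields as they appear in the sample line; a real script would fill this
# list from the server log.
records = [
    {
        "log_time": "2020-06-15 06:48:54.778",
        "wait_event": "PgSleep",
        "query_start": "2020-06-15 06:48:13.008427+00",
    },
]

# How long each canceled query had been running when it was reported.
elapsed = [running_seconds(r["log_time"], r["query_start"]) for r in records]
# How many canceled queries were waiting on each wait event.
by_wait_event = Counter(r["wait_event"] for r in records)
print(elapsed, by_wait_event.most_common())
```

For the sample line this gives a running time of about 42 seconds. Over
many cancellations, a wait event such as “DataFileRead” dominating the
counter would be the signal described in the second example.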

The main purpose is to dump this information just before the cancellation
occurs, to get some clues about what was going on and some data to work
with (to avoid future conflicts and cancellations).

If you think this information would be useful, I can submit a patch in
this area.

Bertrand

[1]:
https://www.postgresql.org/message-id/9a60178c-a853-1440-2cdc-c3af916cff59%40amazon.com