replication_slots usability issue

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
16 messages Options
Reply | Threaded
Open this post in threaded view
|

replication_slots usability issue

Joshua D. Drake

-Hackers,


Working on 9.6 today (unsure if fixed in newer versions). Had an issue where the wal was 280G despite max_wal_size being 8G. Found out there were stale replication slots from a recent base backup. I went to drop the replication slots and found that since the wal_level was set to minimal vs replica or higher, I couldn't drop the replication slot. Clearly that makes sense for creating a replication slot but it seems like an artificial limitation for dropping them.


JD

-- 
Command Prompt, Inc. || http://the.postgres.company/ || @cmdpromptinc
***  A fault and talent of mine is to tell it exactly how it is.  ***
PostgreSQL centered full stack support, consulting and development. 
Advocate: @amplifypostgres || Learn: https://postgresconf.org
*****     Unless otherwise stated, opinions are my own.   *****
Reply | Threaded
Open this post in threaded view
|

Re: replication_slots usability issue

Andres Freund


On October 29, 2018 1:31:56 PM EDT, "Joshua D. Drake" <[hidden email]> wrote:

>-Hackers,
>
>
>Working on 9.6 today (unsure if fixed in newer versions). Had an issue
>where the wal was 280G despite max_wal_size being 8G. Found out there
>were stale replication slots from a recent base backup. I went to drop
>the replication slots and found that since the wal_level was set to
>minimal vs replica or higher, I couldn't drop the replication slot.
>Clearly that makes sense for creating a replication slot but it seems
>like an artificial limitation for dropping them.

Uh, huh? How did you manage to start a server with existing slots with that configuration? It should have errored out at start...

Andres

Andres

--
Sent from my Android device with K-9 Mail. Please excuse my brevity.

Reply | Threaded
Open this post in threaded view
|

Re: replication_slots usability issue

Alvaro Herrera-9
In reply to this post by Joshua D. Drake
On 2018-Oct-29, Joshua D. Drake wrote:

> -Hackers,
>
>
> Working on 9.6 today (unsure if fixed in newer versions). Had an issue where
> the wal was 280G despite max_wal_size being 8G. Found out there were stale
> replication slots from a recent base backup. I went to drop the replication
> slots and found that since the wal_level was set to minimal vs replica or
> higher, I couldn't drop the replication slot. Clearly that makes sense for
> creating a replication slot but it seems like an artificial limitation for
> dropping them.

This sounds closely related to
https://www.postgresql.org/message-id/20180508143725.mn3ivlyvgpul6ovr%40alvherre.pgsql
(commit a1f680d962ff) wherein we made it possible to drop a slot in
single-user mode.

Seems worth fixing.  Send a patch?

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Reply | Threaded
Open this post in threaded view
|

Re: replication_slots usability issue

Andres Freund
On 2018-10-29 16:02:18 -0300, Alvaro Herrera wrote:

> On 2018-Oct-29, Joshua D. Drake wrote:
>
> > -Hackers,
> >
> >
> > Working on 9.6 today (unsure if fixed in newer versions). Had an issue where
> > the wal was 280G despite max_wal_size being 8G. Found out there were stale
> > replication slots from a recent base backup. I went to drop the replication
> > slots and found that since the wal_level was set to minimal vs replica or
> > higher, I couldn't drop the replication slot. Clearly that makes sense for
> > creating a replication slot but it seems like an artificial limitation for
> > dropping them.
>
> This sounds closely related to
> https://www.postgresql.org/message-id/20180508143725.mn3ivlyvgpul6ovr%40alvherre.pgsql
> (commit a1f680d962ff) wherein we made it possible to drop a slot in
> single-user mode.
>
> Seems worth fixing.  Send a patch?

I don't think this quite is the problem. ISTM the issue is rather that
StartupReplicationSlots() *needs* to check whether wal_level > minimal,
and doesn't. So you can create a slot, shutdown, change wal_level,
startup. A slot exists but won't work correctly.

Greetings,

Andres Freund

Reply | Threaded
Open this post in threaded view
|

Re: replication_slots usability issue

Joshua D. Drake
In reply to this post by Andres Freund
On 10/29/18 11:26 AM, Andres Freund wrote:

>
> On October 29, 2018 1:31:56 PM EDT, "Joshua D. Drake" <[hidden email]> wrote:
>> -Hackers,
>>
>>
>> Working on 9.6 today (unsure if fixed in newer versions). Had an issue
>> where the wal was 280G despite max_wal_size being 8G. Found out there
>> were stale replication slots from a recent base backup. I went to drop
>> the replication slots and found that since the wal_level was set to
>> minimal vs replica or higher, I couldn't drop the replication slot.
>> Clearly that makes sense for creating a replication slot but it seems
>> like an artificial limitation for dropping them.
> Uh, huh? How did you manage to start a server with existing slots with that configuration? It should have errored out at start...

Well, this is the recovery.conf:

standby_mode = 'on'
recovery_target = 'immediate'
primary_slot_name = 'testing_db01'
primary_conninfo = 'user=replication
passfile=/var/lib/postgresql/.pgpass host=db01 port=5432 sslmode=prefer
sslcompression=1 krbsrvname=postgres target_session_attrs=any'

recovery_target_action = 'promote'

The machine came up clean and the only reason I noticed the problem is
that it ran out of disk space. I cleared enough disk space to get it to
come up again and noticed that there were replication slots that were
identical to the primary.


JD


--
Command Prompt, Inc. || http://the.postgres.company/ || @cmdpromptinc
***  A fault and talent of mine is to tell it exactly how it is.  ***
PostgreSQL centered full stack support, consulting and development.
Advocate: @amplifypostgres || Learn: https://postgresconf.org
*****     Unless otherwise stated, opinions are my own.   *****


Reply | Threaded
Open this post in threaded view
|

Re: replication_slots usability issue

Michael Paquier-2
In reply to this post by Andres Freund
On Mon, Oct 29, 2018 at 12:13:04PM -0700, Andres Freund wrote:
> I don't think this quite is the problem. ISTM the issue is rather that
> StartupReplicationSlots() *needs* to check whether wal_level > minimal,
> and doesn't. So you can create a slot, shutdown, change wal_level,
> startup. A slot exists but won't work correctly.

It seems to me that what we are looking for is just to complain at
startup if we find any slot data and if trying to start up with
wal_level = minimal.

Er...  At the same time, shouldn't RestoreSlotFromDisk() *not* use PANIC
if more slots are found in pg_replslot than max_replication_slots can
handle.  A FATAL is fine at startup, PANIC blows up a core file, which
is clearly overdoing it if the goal is to give a recommendation at the
end.
--
Michael

signature.asc (849 bytes) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: replication_slots usability issue

Andres Freund
On 2018-10-30 11:51:09 +0900, Michael Paquier wrote:
> On Mon, Oct 29, 2018 at 12:13:04PM -0700, Andres Freund wrote:
> > I don't think this quite is the problem. ISTM the issue is rather that
> > StartupReplicationSlots() *needs* to check whether wal_level > minimal,
> > and doesn't. So you can create a slot, shutdown, change wal_level,
> > startup. A slot exists but won't work correctly.
>
> It seems to me that what we are looking for is just to complain at
> startup if we find any slot data and if trying to start up with
> wal_level = minimal.

Right, we really should just call CheckSlotRequirements() before doing
so. I'll make it so, once I'm actually awake and had some coffee.


> Er...  At the same time, shouldn't RestoreSlotFromDisk() *not* use PANIC
> if more slots are found in pg_replslot than max_replication_slots can
> handle.  A FATAL is fine at startup, PANIC blows up a core file, which
> is clearly overdoing it if the goal is to give a recommendation at the
> end.

I can't get particularly excited about this. I guess we can change it,
but I'd only do so in master.


Greetings,

Andres Freund

Reply | Threaded
Open this post in threaded view
|

Re: replication_slots usability issue

Joshua D. Drake
On 10/30/18 10:52 AM, Andres Freund wrote:

> On 2018-10-30 11:51:09 +0900, Michael Paquier wrote:
>> On Mon, Oct 29, 2018 at 12:13:04PM -0700, Andres Freund wrote:
>>> I don't think this quite is the problem. ISTM the issue is rather that
>>> StartupReplicationSlots() *needs* to check whether wal_level > minimal,
>>> and doesn't. So you can create a slot, shutdown, change wal_level,
>>> startup. A slot exists but won't work correctly.
>> It seems to me that what we are looking for is just to complain at
>> startup if we find any slot data and if trying to start up with
>> wal_level = minimal.
> Right, we really should just call CheckSlotRequirements() before doing
> so. I'll make it so, once I'm actually awake and had some coffee.

Why not just disable the slot and report an INFO: line?


JD


>
>
>> Er...  At the same time, shouldn't RestoreSlotFromDisk() *not* use PANIC
>> if more slots are found in pg_replslot than max_replication_slots can
>> handle.  A FATAL is fine at startup, PANIC blows up a core file, which
>> is clearly overdoing it if the goal is to give a recommendation at the
>> end.
> I can't get particularly excited about this. I guess we can change it,
> but I'd only do so in master.
>
>
> Greetings,
>
> Andres Freund
>

--
Command Prompt, Inc. || http://the.postgres.company/ || @cmdpromptinc
***  A fault and talent of mine is to tell it exactly how it is.  ***
PostgreSQL centered full stack support, consulting and development.
Advocate: @amplifypostgres || Learn: https://postgresconf.org
*****     Unless otherwise stated, opinions are my own.   *****


Reply | Threaded
Open this post in threaded view
|

Re: replication_slots usability issue

Andres Freund
On 2018-10-30 11:02:04 -0700, Joshua D. Drake wrote:

> On 10/30/18 10:52 AM, Andres Freund wrote:
> > On 2018-10-30 11:51:09 +0900, Michael Paquier wrote:
> > > On Mon, Oct 29, 2018 at 12:13:04PM -0700, Andres Freund wrote:
> > > > I don't think this quite is the problem. ISTM the issue is rather that
> > > > StartupReplicationSlots() *needs* to check whether wal_level > minimal,
> > > > and doesn't. So you can create a slot, shutdown, change wal_level,
> > > > startup. A slot exists but won't work correctly.
> > > It seems to me that what we are looking for is just to complain at
> > > startup if we find any slot data and if trying to start up with
> > > wal_level = minimal.
> > Right, we really should just call CheckSlotRequirements() before doing
> > so. I'll make it so, once I'm actually awake and had some coffee.
>
> Why not just disable the slot and report an INFO: line?

Because afterwards there'd be a slot with corrupted contents. Especially
bad for logical slots. Which might look ok, but crash in weird ways.

- Andres

Reply | Threaded
Open this post in threaded view
|

Re: replication_slots usability issue

Michael Paquier-2
In reply to this post by Andres Freund
On Tue, Oct 30, 2018 at 10:52:54AM -0700, Andres Freund wrote:
> On 2018-10-30 11:51:09 +0900, Michael Paquier wrote:
>> Er...  At the same time, shouldn't RestoreSlotFromDisk() *not* use PANIC
>> if more slots are found in pg_replslot than max_replication_slots can
>> handle.  A FATAL is fine at startup, PANIC blows up a core file, which
>> is clearly overdoing it if the goal is to give a recommendation at the
>> end.
>
> I can't get particularly excited about this.

This gets only used by the startup process to restore slot information,
so for example if you have an instance with two slots, shut it down,
reduce max_replication_slots to 1, and restart then you generate a core
dump, which is a bit overdoing it to simply let the user know that he
just needs to bump up one parameter and to give him a recommendation to
do so :)

For example:
2018-10-31 11:04:08 JST [9014]: [11-1] db=,user=,app=,client= DEBUG:
starting up replication slots
2018-10-31 11:04:08 JST [9014]: [12-1] db=,user=,app=,client= DEBUG:
restoring replication slot from "pg_replslot/popo3/state"
2018-10-31 11:04:08 JST [9014]: [13-1] db=,user=,app=,client= DEBUG:
restoring replication slot from "pg_replslot/popo2/state"
2018-10-31 11:04:08 JST [9014]: [14-1] db=,user=,app=,client= PANIC:
too many replication slots active before shutdown
2018-10-31 11:04:08 JST [9014]: [15-1] db=,user=,app=,client= HINT:
Increase max_replication_slots and try again.

> I guess we can change it, but I'd only do so in master.

I won't fight hard for a back-patch, now I think that we should
back-patch a fix.  Generating a core file if a state is unexpected is
fine because you expect the system to react so, being able to trigger
one based on a parameter switch just to give a recommendation is not in
my opinion.

For example with the attached the system reacts as follows:
2018-10-31 11:07:23 JST [10063]: [14-1] db=,user=,app=,client= FATAL:
too many replication slots active before shutdown
2018-10-31 11:07:23 JST [10063]: [15-1] db=,user=,app=,client= HINT:
Increase max_replication_slots and try again.
2018-10-31 11:07:23 JST [10061]: [6-1] db=,user=,app=,client= LOG:
startup process (PID 10063) exited with exit code 1
2018-10-31 11:07:23 JST [10061]: [7-1] db=,user=,app=,client= LOG:
aborting startup due to startup process failure
--
Michael

slot-restore-fatal.patch (457 bytes) Download Attachment
signature.asc (849 bytes) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: replication_slots usability issue

Andres Freund
In reply to this post by Andres Freund
On 2018-10-30 10:52:54 -0700, Andres Freund wrote:

> On 2018-10-30 11:51:09 +0900, Michael Paquier wrote:
> > On Mon, Oct 29, 2018 at 12:13:04PM -0700, Andres Freund wrote:
> > > I don't think this quite is the problem. ISTM the issue is rather that
> > > StartupReplicationSlots() *needs* to check whether wal_level > minimal,
> > > and doesn't. So you can create a slot, shutdown, change wal_level,
> > > startup. A slot exists but won't work correctly.
> >
> > It seems to me that what we are looking for is just to complain at
> > startup if we find any slot data and if trying to start up with
> > wal_level = minimal.
>
> Right, we really should just call CheckSlotRequirements() before doing
> so. I'll make it so, once I'm actually awake and had some coffee.

And done.  Thanks for the report JD.

- Andres

Reply | Threaded
Open this post in threaded view
|

Re: replication_slots usability issue

Michael Paquier-2
HI Andres,

On Wed, Oct 31, 2018 at 03:48:02PM -0700, Andres Freund wrote:
> And done.  Thanks for the report JD.

Shouldn't we also switch the PANIC to a FATAL in RestoreSlotFromDisk()?
I don't mind doing so myself if you agree with the change, only on
HEAD as you seemed to disagree about changing that on back-branches.

Also, from 691d79a which you just committed:
+       ereport(FATAL,
+               (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+                errmsg("logical replication slots \"%s\" exists, but wal_level < logical",
+                       NameStr(cp.slotdata.name)),
I can see one grammar mistake here, as you refer to only one slot here.
The error messages should read:
"logical replication slot \"%s\" exists, but wal_level < logical"
and:
"physical replication slot \"%s\" exists, but wal_level < replica"

Thanks,
--
Michael

signature.asc (849 bytes) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: replication_slots usability issue

Andres Freund
On 2018-11-01 09:34:05 +0900, Michael Paquier wrote:
> HI Andres,
>
> On Wed, Oct 31, 2018 at 03:48:02PM -0700, Andres Freund wrote:
> > And done.  Thanks for the report JD.
>
> Shouldn't we also switch the PANIC to a FATAL in
> RestoreSlotFromDisk()?

That has absolutely nothing to do with the issue at hand though, so I
don't think it'd have made much sense to do it at the same time. Nor do
I think it's particularly important.


> I don't mind doing so myself if you agree with the change, only on
> HEAD as you seemed to disagree about changing that on back-branches.

Cool. And yes, I don't think a cosmetic log level adjustment that could
affect people's scripts should be backpatched without need. Even if not
particularly likely to break something.


> Also, from 691d79a which you just committed:
> +       ereport(FATAL,
> +               (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
> +                errmsg("logical replication slots \"%s\" exists, but wal_level < logical",
> +                       NameStr(cp.slotdata.name)),
> I can see one grammar mistake here, as you refer to only one slot here.
> The error messages should read:
> "logical replication slot \"%s\" exists, but wal_level < logical"
> and:
> "physical replication slot \"%s\" exists, but wal_level < replica"

Darnit. Fixed. Thanks.

Greetings,

Andres Freund

Reply | Threaded
Open this post in threaded view
|

Re: replication_slots usability issue

Michael Paquier-2
On Thu, Nov 01, 2018 at 10:54:23AM -0700, Andres Freund wrote:
> On 2018-11-01 09:34:05 +0900, Michael Paquier wrote:
> That has absolutely nothing to do with the issue at hand though, so I
> don't think it'd have made much sense to do it at the same time. Nor do
> I think it's particularly important.

Thanks for the confirmation.

>> I don't mind doing so myself if you agree with the change, only on
>> HEAD as you seemed to disagree about changing that on back-branches.
>
> Cool. And yes, I don't think a cosmetic log level adjustment that could
> affect people's scripts should be backpatched without need. Even if not
> particularly likely to break something.

No issues with your arguments.  I did the change this way.

>> Also, from 691d79a which you just committed:
>> +       ereport(FATAL,
>> +               (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
>> +                errmsg("logical replication slots \"%s\" exists, but wal_level < logical",
>> +                       NameStr(cp.slotdata.name)),
>> I can see one grammar mistake here, as you refer to only one slot here.
>> The error messages should read:
>> "logical replication slot \"%s\" exists, but wal_level < logical"
>> and:
>> "physical replication slot \"%s\" exists, but wal_level < replica"
>
> Darnit. Fixed. Thanks.
And thanks for fixing that.
--
Michael

signature.asc (849 bytes) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: replication_slots usability issue

Petr Jelinek-4
In reply to this post by Andres Freund
On 01/11/2018 18:54, Andres Freund wrote:>

>> Also, from 691d79a which you just committed:
>> +       ereport(FATAL,
>> +               (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
>> +                errmsg("logical replication slots \"%s\" exists, but wal_level < logical",
>> +                       NameStr(cp.slotdata.name)),
>> I can see one grammar mistake here, as you refer to only one slot here.
>> The error messages should read:
>> "logical replication slot \"%s\" exists, but wal_level < logical"
>> and:
>> "physical replication slot \"%s\" exists, but wal_level < replica"
>
> Darnit. Fixed. Thanks.
>

Since we are fixing this message, shouldn't the hint for logical slot
say "Change wal_level to be logical or higher" rather than "replica or
higher" :)

--
  Petr Jelinek                  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services

Reply | Threaded
Open this post in threaded view
|

Re: replication_slots usability issue

Andres Freund
On 2018-11-02 15:51:34 +0100, Petr Jelinek wrote:

> On 01/11/2018 18:54, Andres Freund wrote:>
> >> Also, from 691d79a which you just committed:
> >> +       ereport(FATAL,
> >> +               (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
> >> +                errmsg("logical replication slots \"%s\" exists, but wal_level < logical",
> >> +                       NameStr(cp.slotdata.name)),
> >> I can see one grammar mistake here, as you refer to only one slot here.
> >> The error messages should read:
> >> "logical replication slot \"%s\" exists, but wal_level < logical"
> >> and:
> >> "physical replication slot \"%s\" exists, but wal_level < replica"
> >
> > Darnit. Fixed. Thanks.
> >
>
> Since we are fixing this message, shouldn't the hint for logical slot
> say "Change wal_level to be logical or higher" rather than "replica or
> higher" :)

Gah. I really wasn't firing on all cylinders here. Darned jetlag (let's
just assume that's what it was).

Greetings,

Andres Freund