Get stuck when dropping a subscription during synchronizing table

Get stuck when dropping a subscription during synchronizing table

Masahiko Sawada
Hi,

I encountered a situation where DROP SUBSCRIPTION gets stuck while the
initial table sync is in progress. In my environment, I created
several tables with some data on the publisher, then created a
subscription on the subscriber and dropped it immediately afterwards.
It doesn't always happen, but I hit it often in my environment.

ps -x command shows the following.

 96796 ?        Ss     0:00 postgres: masahiko postgres [local] DROP SUBSCRIPTION
 96801 ?        Ts     0:00 postgres: bgworker: logical replication worker for subscription 40993    waiting
 96805 ?        Ss     0:07 postgres: bgworker: logical replication worker for subscription 40993 sync 16418
 96806 ?        Ss     0:01 postgres: wal sender process masahiko [local] idle
 96807 ?        Ss     0:00 postgres: bgworker: logical replication worker for subscription 40993 sync 16421
 96808 ?        Ss     0:00 postgres: wal sender process masahiko [local] idle

The DROP SUBSCRIPTION process (pid 96796) is waiting for the apply
worker process (pid 96801) to stop while holding a lock on
pg_subscription_rel. Meanwhile, the apply worker is waiting to
acquire a tuple lock on pg_subscription_rel, which it needs for
heap_update. The table sync workers (pid 96805 and 96807) are in turn
waiting for the apply worker process to change their status.
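
The waits described above can be pictured as a small wait-for graph. The following is only an illustrative sketch, not PostgreSQL code: the process labels are made up from the pids in the ps output, and `find_cycle` is a hypothetical helper. It shows that DROP SUBSCRIPTION and the apply worker wait on each other, while the sync workers merely queue behind the apply worker. One of the edges is a wait for a process to stop rather than a wait for a lock, which may be why the hang is not broken up automatically.

```python
# Illustrative model only; names are hypothetical and mirror the ps output.
from typing import Dict, List, Optional


def find_cycle(waits_for: Dict[str, str]) -> Optional[List[str]]:
    """Return the processes that form a wait cycle, or None if there is none."""
    for start in waits_for:
        path = [start]
        node = waits_for.get(start)
        # Follow the chain of waits until it ends or revisits a process.
        while node is not None and node not in path:
            path.append(node)
            node = waits_for.get(node)
        if node is not None:
            return path[path.index(node):]
    return None


# Each key waits for its value.
waits_for = {
    "drop_subscription_96796": "apply_worker_96801",  # waits for worker to stop
    "apply_worker_96801": "drop_subscription_96796",  # waits for a lock
    "sync_worker_96805": "apply_worker_96801",        # waits for status change
    "sync_worker_96807": "apply_worker_96801",        # waits for status change
}

cycle = find_cycle(waits_for)
print(cycle)
```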

Also, even when DROP SUBSCRIPTION completes successfully, a table sync
worker can be orphaned, I guess because the apply worker can exit
before changing the status of the table sync worker.

I'm using 1f30295.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: Get stuck when dropping a subscription during synchronizing table

Erik Rijkers
On 2017-05-08 11:27, Masahiko Sawada wrote:

> Hi,
>
> I encountered a situation where DROP SUBSCRIPTION got stuck when
> initial table sync is in progress. In my environment, I created
> several tables with some data on publisher. I created subscription on
> subscriber and drop subscription immediately after that. It doesn't
> always happen but I often encountered it on my environment.
>
> ps -x command shows the following.
>
>  96796 ?        Ss     0:00 postgres: masahiko postgres [local] DROP
> SUBSCRIPTION
>  96801 ?        Ts     0:00 postgres: bgworker: logical replication
> worker for subscription 40993    waiting
>  96805 ?        Ss     0:07 postgres: bgworker: logical replication
> worker for subscription 40993 sync 16418
>  96806 ?        Ss     0:01 postgres: wal sender process masahiko
> [local] idle
>  96807 ?        Ss     0:00 postgres: bgworker: logical replication
> worker for subscription 40993 sync 16421
>  96808 ?        Ss     0:00 postgres: wal sender process masahiko
> [local] idle
>

FWIW, running

> 0001-WIP-Fix-off-by-one-around-GetLastImportantRecPtr.patch+
> 0002-WIP-Possibly-more-robust-snapbuild-approach.patch     +
> fix-statistics-reporting-in-logical-replication-work.patch
     (on top of 44c528810)

I have encountered the same condition as well in the last few days, a
few times (I think 2 or 3 times).

Erik Rijkers



Re: Get stuck when dropping a subscription during synchronizing table

Masahiko Sawada
On Mon, May 8, 2017 at 7:14 PM, Erik Rijkers <[hidden email]> wrote:

> On 2017-05-08 11:27, Masahiko Sawada wrote:
>>
>> Hi,
>>
>> I encountered a situation where DROP SUBSCRIPTION got stuck when
>> initial table sync is in progress. In my environment, I created
>> several tables with some data on publisher. I created subscription on
>> subscriber and drop subscription immediately after that. It doesn't
>> always happen but I often encountered it on my environment.
>>
>> ps -x command shows the following.
>>
>>  96796 ?        Ss     0:00 postgres: masahiko postgres [local] DROP
>> SUBSCRIPTION
>>  96801 ?        Ts     0:00 postgres: bgworker: logical replication
>> worker for subscription 40993    waiting
>>  96805 ?        Ss     0:07 postgres: bgworker: logical replication
>> worker for subscription 40993 sync 16418
>>  96806 ?        Ss     0:01 postgres: wal sender process masahiko [local]
>> idle
>>  96807 ?        Ss     0:00 postgres: bgworker: logical replication
>> worker for subscription 40993 sync 16421
>>  96808 ?        Ss     0:00 postgres: wal sender process masahiko [local]
>> idle
>>
>
> FWIW, running
>
>> 0001-WIP-Fix-off-by-one-around-GetLastImportantRecPtr.patch+
>> 0002-WIP-Possibly-more-robust-snapbuild-approach.patch     +
>> fix-statistics-reporting-in-logical-replication-work.patch
>
>     (on top of 44c528810)

Thanks, which thread are these patches attached to?

>
> I have encountered the same condition as well in the last few days, a few
> times (I think 2 or 3 times).

IIUC there are two issues: one is that a deadlock can happen between
DROP SUBSCRIPTION and the apply worker process; the other is that a
table sync worker can be orphaned if the apply worker exits before
changing its status. The latter might be related to another issue
reported by Jeff[1].

[1] https://www.postgresql.org/message-id/CAMkU%3D1xUJKs%3D2etq2K7bmbY51Q7g853HLxJ7qEB2Snog9oRvDw%40mail.gmail.com

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: Get stuck when dropping a subscription during synchronizing table

Petr Jelinek-4
In reply to this post by Masahiko Sawada
On 08/05/17 11:27, Masahiko Sawada wrote:

> Hi,
>
> I encountered a situation where DROP SUBSCRIPTION got stuck when
> initial table sync is in progress. In my environment, I created
> several tables with some data on publisher. I created subscription on
> subscriber and drop subscription immediately after that. It doesn't
> always happen but I often encountered it on my environment.
>
> ps -x command shows the following.
>
>  96796 ?        Ss     0:00 postgres: masahiko postgres [local] DROP
> SUBSCRIPTION
>  96801 ?        Ts     0:00 postgres: bgworker: logical replication
> worker for subscription 40993    waiting
>  96805 ?        Ss     0:07 postgres: bgworker: logical replication
> worker for subscription 40993 sync 16418
>  96806 ?        Ss     0:01 postgres: wal sender process masahiko [local] idle
>  96807 ?        Ss     0:00 postgres: bgworker: logical replication
> worker for subscription 40993 sync 16421
>  96808 ?        Ss     0:00 postgres: wal sender process masahiko [local] idle
>
> The DROP SUBSCRIPTION process (pid 96796) is waiting for the apply
> worker process (pid 96801) to stop while holding a lock on
> pg_subscription_rel. On the other hand the apply worker is waiting for
> acquiring a tuple lock on pg_subscription_rel needed for heap_update.
> Also table sync workers (pid 96805 and 96807) are waiting for the
> apply worker process to change their status.
>

Looks like we should kill apply before dropping dependencies.

> Also, even when DROP SUBSCRIPTION is done successfully, the table sync
> worker can be orphaned because I guess that the apply worker can exit
> before change status of table sync worker.

Well the tablesync worker should stop itself if the subscription got
removed, but of course again the dependencies are an issue, so we should
probably kill those explicitly as well.

--
  Petr Jelinek                  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services



Re: Get stuck when dropping a subscription during synchronizing table

Erik Rijkers
In reply to this post by Masahiko Sawada
On 2017-05-08 13:13, Masahiko Sawada wrote:
> On Mon, May 8, 2017 at 7:14 PM, Erik Rijkers <[hidden email]> wrote:
>> On 2017-05-08 11:27, Masahiko Sawada wrote:
>>>

>>
>> FWIW, running
>>
>>> 0001-WIP-Fix-off-by-one-around-GetLastImportantRecPtr.patch+
>>> 0002-WIP-Possibly-more-robust-snapbuild-approach.patch     +
>>> fix-statistics-reporting-in-logical-replication-work.patch
>>
>>     (on top of 44c528810)
>
> Thanks, which thread are these patches attached on?
>

The first two patches are here:
https://www.postgresql.org/message-id/20170505004237.edtahvrwb3uwd5rs%40alap3.anarazel.de

and last one:
https://www.postgresql.org/message-id/22cc402c-88eb-fa35-217f-0060db2c72f0%402ndquadrant.com

( I have to include that last one or my tests fail within minutes. )


Erik Rijkers



Re: Get stuck when dropping a subscription during synchronizing table

Masahiko Sawada
On Mon, May 8, 2017 at 8:53 PM, Erik Rijkers <[hidden email]> wrote:

> On 2017-05-08 13:13, Masahiko Sawada wrote:
>>
>> On Mon, May 8, 2017 at 7:14 PM, Erik Rijkers <[hidden email]> wrote:
>>>
>>> On 2017-05-08 11:27, Masahiko Sawada wrote:
>>>>
>>>>
>
>>>
>>> FWIW, running
>>>
>>>> 0001-WIP-Fix-off-by-one-around-GetLastImportantRecPtr.patch+
>>>> 0002-WIP-Possibly-more-robust-snapbuild-approach.patch     +
>>>> fix-statistics-reporting-in-logical-replication-work.patch
>>>
>>>
>>>     (on top of 44c528810)
>>
>>
>> Thanks, which thread are these patches attached on?
>>
>
> The first two patches are here:
> https://www.postgresql.org/message-id/20170505004237.edtahvrwb3uwd5rs%40alap3.anarazel.de
>
> and last one:
> https://www.postgresql.org/message-id/22cc402c-88eb-fa35-217f-0060db2c72f0%402ndquadrant.com
>
> ( I have to include that last one or my tests fail within minutes. )

Thank you! I will look at these patches.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: Get stuck when dropping a subscription during synchronizing table

Masahiko Sawada
In reply to this post by Petr Jelinek-4
On Mon, May 8, 2017 at 8:42 PM, Petr Jelinek
<[hidden email]> wrote:

> On 08/05/17 11:27, Masahiko Sawada wrote:
>> Hi,
>>
>> I encountered a situation where DROP SUBSCRIPTION got stuck when
>> initial table sync is in progress. In my environment, I created
>> several tables with some data on publisher. I created subscription on
>> subscriber and drop subscription immediately after that. It doesn't
>> always happen but I often encountered it on my environment.
>>
>> ps -x command shows the following.
>>
>>  96796 ?        Ss     0:00 postgres: masahiko postgres [local] DROP
>> SUBSCRIPTION
>>  96801 ?        Ts     0:00 postgres: bgworker: logical replication
>> worker for subscription 40993    waiting
>>  96805 ?        Ss     0:07 postgres: bgworker: logical replication
>> worker for subscription 40993 sync 16418
>>  96806 ?        Ss     0:01 postgres: wal sender process masahiko [local] idle
>>  96807 ?        Ss     0:00 postgres: bgworker: logical replication
>> worker for subscription 40993 sync 16421
>>  96808 ?        Ss     0:00 postgres: wal sender process masahiko [local] idle
>>
>> The DROP SUBSCRIPTION process (pid 96796) is waiting for the apply
>> worker process (pid 96801) to stop while holding a lock on
>> pg_subscription_rel. On the other hand the apply worker is waiting for
>> acquiring a tuple lock on pg_subscription_rel needed for heap_update.
>> Also table sync workers (pid 96805 and 96807) are waiting for the
>> apply worker process to change their status.
>>
>
> Looks like we should kill apply before dropping dependencies.

Sorry, after investigating I found that the DROP SUBSCRIPTION process
is holding AccessExclusiveLock on pg_subscription (not
pg_subscription_rel) and the apply worker is waiting to acquire a lock
on it. So I guess that dropping dependencies is not relevant to
this. It seems to me that the main cause is that DROP SUBSCRIPTION
waits for the apply worker to finish while still holding
AccessExclusiveLock on pg_subscription. Perhaps we need to find a
way to reduce the lock level somehow.

>
>> Also, even when DROP SUBSCRIPTION is done successfully, the table sync
>> worker can be orphaned because I guess that the apply worker can exit
>> before change status of table sync worker.
>
> Well the tablesync worker should stop itself if the subscription got
> removed, but of course again the dependencies are an issue, so we should
> probably kill those explicitly as well.

Yeah, I think that we should ensure that the apply worker exits only
after all involved table sync workers have been killed.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: Get stuck when dropping a subscription during synchronizing table

Masahiko Sawada
On Wed, May 10, 2017 at 2:46 AM, Masahiko Sawada <[hidden email]> wrote:

> On Mon, May 8, 2017 at 8:42 PM, Petr Jelinek
> <[hidden email]> wrote:
>> On 08/05/17 11:27, Masahiko Sawada wrote:
>>> Hi,
>>>
>>> I encountered a situation where DROP SUBSCRIPTION got stuck when
>>> initial table sync is in progress. In my environment, I created
>>> several tables with some data on publisher. I created subscription on
>>> subscriber and drop subscription immediately after that. It doesn't
>>> always happen but I often encountered it on my environment.
>>>
>>> ps -x command shows the following.
>>>
>>>  96796 ?        Ss     0:00 postgres: masahiko postgres [local] DROP
>>> SUBSCRIPTION
>>>  96801 ?        Ts     0:00 postgres: bgworker: logical replication
>>> worker for subscription 40993    waiting
>>>  96805 ?        Ss     0:07 postgres: bgworker: logical replication
>>> worker for subscription 40993 sync 16418
>>>  96806 ?        Ss     0:01 postgres: wal sender process masahiko [local] idle
>>>  96807 ?        Ss     0:00 postgres: bgworker: logical replication
>>> worker for subscription 40993 sync 16421
>>>  96808 ?        Ss     0:00 postgres: wal sender process masahiko [local] idle
>>>
>>> The DROP SUBSCRIPTION process (pid 96796) is waiting for the apply
>>> worker process (pid 96801) to stop while holding a lock on
>>> pg_subscription_rel. On the other hand the apply worker is waiting for
>>> acquiring a tuple lock on pg_subscription_rel needed for heap_update.
>>> Also table sync workers (pid 96805 and 96807) are waiting for the
>>> apply worker process to change their status.
>>>
>>
>> Looks like we should kill apply before dropping dependencies.
>
> Sorry, after investigated I found out that DROP SUBSCRIPTION process
> is holding AccessExclusiveLock on pg_subscription (, not
> pg_subscription_rel) and apply worker is waiting for acquiring a lock
> on it.

Hmm, it seems there are two cases. One is that the apply worker waits
to acquire AccessShareLock on pg_subscription, but DropSubscription has
already acquired AccessExclusiveLock on it and waits for the apply
worker to finish. The other is that the apply worker waits to
acquire a tuple lock on pg_subscription_rel, but DropSubscription (maybe
while dropping dependencies) has already acquired it.

> So I guess that the dropping dependencies are not relevant with
> this.  It seems to me that the main cause is that DROP SUBSCRIPTION
> waits for apply worker to finish while keeping to hold
> AccessExclusiveLock on pg_subscription. Perhaps we need to contrive
> ways to reduce lock level somehow.
>
>>
>>> Also, even when DROP SUBSCRIPTION is done successfully, the table sync
>>> worker can be orphaned because I guess that the apply worker can exit
>>> before change status of table sync worker.
>>
>> Well the tablesync worker should stop itself if the subscription got
>> removed, but of course again the dependencies are an issue, so we should
>> probably kill those explicitly as well.
>
> Yeah, I think that we should ensure that the apply worker exits after
> killed all involved table sync workers.
>

Barring any objections, I'll add these two issues to the open items list.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: Get stuck when dropping a subscription during synchronizing table

Michael Paquier
On Wed, May 10, 2017 at 11:57 AM, Masahiko Sawada <[hidden email]> wrote:
> Barring any objections, I'll add these two issues to open item.

It seems to me that those open items have not been added yet to the
list. If I am following correctly, they could be defined as follows:
- Dropping subscription may get stuck if done during tablesync.
-- Analyze deadlock issues with DROP SUBSCRIPTION and apply worker process.
-- Avoid orphaned tablesync worker if apply worker exits before
changing its status.

I am playing with the code to look at both of them... But feel free to
update this thread if I don't show up. There are no test cases, but
some well-placed pg_usleep calls should make both issues easily
reproducible. I have the gut feeling that other things are hidden
behind though.
--
Michael



Re: Get stuck when dropping a subscription during synchronizing table

Masahiko Sawada
On Thu, May 11, 2017 at 4:06 PM, Michael Paquier
<[hidden email]> wrote:
> On Wed, May 10, 2017 at 11:57 AM, Masahiko Sawada <[hidden email]> wrote:
>> Barring any objections, I'll add these two issues to open item.
>
> It seems to me that those open items have not been added yet to the
> list. If I am following correctly, they could be defined as follows:
> - Dropping subscription may stuck if done during tablesync.
> -- Analyze deadlock issues with DROP SUBSCRIPTION and apply worker process.
> -- Avoid orphaned tablesync worker if apply worker exits before
> changing its status.

Thanks, I think that is correct. I added it to the open items.

>
> I am playing with the code to look at both of them... But feel free to
> update this thread if I don't show up. There are no test cases, but
> some well-placed pg_usleep calls should make both issues easily
> reproducible. I have the gut feeling that other things are hidden
> behind though.

I'm also working on this, so I will update the thread if I find anything.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: Get stuck when dropping a subscription during synchronizing table

Petr Jelinek-4
On 11/05/17 10:10, Masahiko Sawada wrote:
> On Thu, May 11, 2017 at 4:06 PM, Michael Paquier
> <[hidden email]> wrote:
>> On Wed, May 10, 2017 at 11:57 AM, Masahiko Sawada <[hidden email]> wrote:
>>> Barring any objections, I'll add these two issues to open item.
>>
>> It seems to me that those open items have not been added yet to the
>> list. If I am following correctly, they could be defined as follows:
>> - Dropping subscription may stuck if done during tablesync.
>> -- Analyze deadlock issues with DROP SUBSCRIPTION and apply worker process.

I think the solution to this is to reintroduce the LWLock that was
removed and replaced with the exclusive lock on the catalog [1]. I am
afraid that the correct way of handling this is to take both the LWLock
and the catalog lock (first the LWLock, under which we kill the workers,
and then the catalog lock, so that something that prevents the launcher
from restarting them is held until the end of the transaction).

>> -- Avoid orphaned tablesync worker if apply worker exits before
>> changing its status.
>

The behavior question I have about this is if sync workers should die
when apply worker dies (ie they are tied to apply worker) or if they
should be tied to the subscription.

I guess taking down all the sync workers when apply worker has exited is
easier to solve. Of course it means that if apply worker restarts in
middle of table synchronization, the table synchronization will have to
start from scratch. That being said, in normal operation apply worker
should only exit/restart if subscription has changed or has been
dropped/disabled and I think sync workers want to exit/restart in that
situation as well.

So, for example, having a shmem detach hook for the apply worker (or
reusing the existing one) that searches for all the other workers for
the same subscription and shuts them down as well sounds like a
solution to this.
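
To make the idea concrete, here is a rough Python model of it. It is not PostgreSQL code, and every name in it (`Worker`, `Registry`, `stop_sync_workers`) is hypothetical. The model assumes a shared registry of workers in which an apply worker carries relid 0 and a sync worker carries the relid of its table. Exiting the apply worker runs a hook that shuts down the sync workers for the same subscription only.

```python
# Hypothetical model of an exit hook; not the actual PostgreSQL mechanism.
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Worker:
    subid: int
    relid: int  # 0 means "apply worker" in this model
    running: bool = True
    on_exit: List[Callable[[], None]] = field(default_factory=list)

    def exit(self) -> None:
        if not self.running:
            return
        self.running = False
        # Run the registered exit hooks, like a detach callback would.
        for hook in self.on_exit:
            hook()


class Registry:
    def __init__(self) -> None:
        self.workers: List[Worker] = []

    def launch(self, subid: int, relid: int = 0) -> Worker:
        worker = Worker(subid, relid)
        self.workers.append(worker)
        if relid == 0:
            # Only the apply worker takes its sync workers down with it.
            worker.on_exit.append(lambda: self.stop_sync_workers(subid))
        return worker

    def stop_sync_workers(self, subid: int) -> None:
        for worker in self.workers:
            if worker.subid == subid and worker.relid != 0:
                worker.exit()


registry = Registry()
apply_worker = registry.launch(40993)
sync_a = registry.launch(40993, 16418)
sync_b = registry.launch(40993, 16421)
unrelated = registry.launch(50000, 16500)

apply_worker.exit()
print(sync_a.running, sync_b.running, unrelated.running)
```

In this model a sync worker for a different subscription is left alone, which matches tying sync workers to their apply worker rather than to the launcher.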

[1]
https://www.postgresql.org/message-id/CAHGQGwHPi8ky-yANFfe0sgmhKtsYcQLTnKx07bW9S7-Rn1746w@...

--
  Petr Jelinek                  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services



Re: Get stuck when dropping a subscription during synchronizing table

Masahiko Sawada
On Thu, May 11, 2017 at 6:16 PM, Petr Jelinek
<[hidden email]> wrote:

> On 11/05/17 10:10, Masahiko Sawada wrote:
>> On Thu, May 11, 2017 at 4:06 PM, Michael Paquier
>> <[hidden email]> wrote:
>>> On Wed, May 10, 2017 at 11:57 AM, Masahiko Sawada <[hidden email]> wrote:
>>>> Barring any objections, I'll add these two issues to open item.
>>>
>>> It seems to me that those open items have not been added yet to the
>>> list. If I am following correctly, they could be defined as follows:
>>> - Dropping subscription may stuck if done during tablesync.
>>> -- Analyze deadlock issues with DROP SUBSCRIPTION and apply worker process.
>
> I think the solution to this is to reintroduce the LWLock that was
> removed and replaced with the exclusive lock on catalog [1]. I am afraid
> that correct way of handling this is to do both LWLock and catalog lock
> (first LWLock under which we kill the workers and then catalog lock so
> that something that prevents launcher from restarting them is held till
> the end of transaction).

I agree with reintroducing the LWLock and stopping the logical rep
worker first, then modifying the catalog. That way we can reduce the
catalog lock level (maybe to RowExclusiveLock) so that the apply worker
can still see it. I also think we need to do more to avoid holding the
LWLock until the end of the transaction, because that is not a good
idea and could itself cause a deadlock. So, for example, we can commit
the transaction in DropSubscription after cleaning up the
pg_subscription record and all its dependencies, and then start a new
transaction for the remaining work. Of course, we would then also need
to disallow DROP SUBSCRIPTION from being executed in a user
transaction.
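
As a small illustration of why the lower lock level would help, the sketch below encodes the documented table-level lock conflicts for the three modes mentioned in this thread. The `blocks` helper is hypothetical and only models the conflict check; it says nothing about tuple locks or LWLocks.

```python
# Conflict sets for three table-level lock modes, keyed by requested mode.
ALL_MODES = {
    "AccessShareLock", "RowShareLock", "RowExclusiveLock",
    "ShareUpdateExclusiveLock", "ShareLock", "ShareRowExclusiveLock",
    "ExclusiveLock", "AccessExclusiveLock",
}

CONFLICTS = {
    "AccessShareLock": {"AccessExclusiveLock"},
    "RowExclusiveLock": {
        "ShareLock", "ShareRowExclusiveLock",
        "ExclusiveLock", "AccessExclusiveLock",
    },
    "AccessExclusiveLock": set(ALL_MODES),
}


def blocks(held: str, requested: str) -> bool:
    """True if a lock already held in mode `held` makes `requested` wait."""
    return held in CONFLICTS[requested]


# DROP SUBSCRIPTION holding AccessExclusiveLock blocks the apply worker's read.
print(blocks(held="AccessExclusiveLock", requested="AccessShareLock"))
# With RowExclusiveLock instead, the apply worker's AccessShareLock is granted.
print(blocks(held="RowExclusiveLock", requested="AccessShareLock"))
```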

>
>>> -- Avoid orphaned tablesync worker if apply worker exits before
>>> changing its status.
>>
>
> The behavior question I have about this is if sync workers should die
> when apply worker dies (ie they are tied to apply worker) or if they
> should be tied to the subscription.
>
> I guess taking down all the sync workers when apply worker has exited is
> easier to solve. Of course it means that if apply worker restarts in
> middle of table synchronization, the table synchronization will have to
> start from scratch. That being said, in normal operation apply worker
> should only exit/restart if subscription has changed or has been
> dropped/disabled and I think sync workers want to exit/restart in that
> situation as well.

I agree that sync workers should be tied to the apply worker.

>
> So for example having shmem detach hook for an apply worker (or reusing
> the existing one) that searches for all the other workers for same
> subscription and shuts them down as well sounds like solution to this.

That seems like a reasonable solution.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: Get stuck when dropping a subscription during synchronizing table

Masahiko Sawada
On Fri, May 12, 2017 at 11:24 AM, Masahiko Sawada <[hidden email]> wrote:

> On Thu, May 11, 2017 at 6:16 PM, Petr Jelinek
> <[hidden email]> wrote:
>> On 11/05/17 10:10, Masahiko Sawada wrote:
>>> On Thu, May 11, 2017 at 4:06 PM, Michael Paquier
>>> <[hidden email]> wrote:
>>>> On Wed, May 10, 2017 at 11:57 AM, Masahiko Sawada <[hidden email]> wrote:
>>>>> Barring any objections, I'll add these two issues to open item.
>>>>
>>>> It seems to me that those open items have not been added yet to the
>>>> list. If I am following correctly, they could be defined as follows:
>>>> - Dropping subscription may stuck if done during tablesync.
>>>> -- Analyze deadlock issues with DROP SUBSCRIPTION and apply worker process.
>>
>> I think the solution to this is to reintroduce the LWLock that was
>> removed and replaced with the exclusive lock on catalog [1]. I am afraid
>> that correct way of handling this is to do both LWLock and catalog lock
>> (first LWLock under which we kill the workers and then catalog lock so
>> that something that prevents launcher from restarting them is held till
>> the end of transaction).
>
> I agree to reintroduce LWLock and to stop logical rep worker first and
> then modify catalog. That way we can reduce catalog lock level (maybe
> to RowExclusiveLock) so that apply worker can see it. Also I think
> that we need to do more things like in order to prevent that we keep
> to hold LWLock until end of transaction, because holding LWLock until
> end of transaction is not good idea and could be cause of deadlock. So
> for example we can commit the transaction in DropSubscription after
> cleaned pg_subscription record and all its dependencies and then start
> new transaction for the remaining work. Of course we also need to
> disallow DROP SUBSCRIPTION being executed in a user transaction
> though.
Attached are two draft patches to solve these issues.

The attached 0001 patch reintroduces LogicalRepLauncherLock and makes
DROP SUBSCRIPTION hold it until commit. To avoid a possible deadlock,
I disallowed DROP SUBSCRIPTION from being called in a transaction
block. There might be a more sensible solution for this, though;
please give me feedback.

>
>>
>>>> -- Avoid orphaned tablesync worker if apply worker exits before
>>>> changing its status.
>>>
>>
>> The behavior question I have about this is if sync workers should die
>> when apply worker dies (ie they are tied to apply worker) or if they
>> should be tied to the subscription.
>>
>> I guess taking down all the sync workers when apply worker has exited is
>> easier to solve. Of course it means that if apply worker restarts in
>> middle of table synchronization, the table synchronization will have to
>> start from scratch. That being said, in normal operation apply worker
>> should only exit/restart if subscription has changed or has been
>> dropped/disabled and I think sync workers want to exit/restart in that
>> situation as well.
>
> I agree that sync workers are tied to the apply worker.
>
>>
>> So for example having shmem detach hook for an apply worker (or reusing
>> the existing one) that searches for all the other workers for same
>> subscription and shuts them down as well sounds like solution to this.
>
> Seems reasonable solution.
>
Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Attachments:
0001-Fix-a-deadlock-bug-between-DROP-SUBSCRIPTION-and-app.patch (8K)
0002-Wait-for-table-sync-worker-to-finish-when-apply-work.patch (4K)

Re: Get stuck when dropping a subscription during synchronizing table

Noah Misch-2
In reply to this post by Masahiko Sawada
On Mon, May 08, 2017 at 06:27:30PM +0900, Masahiko Sawada wrote:

> I encountered a situation where DROP SUBSCRIPTION got stuck when
> initial table sync is in progress. In my environment, I created
> several tables with some data on publisher. I created subscription on
> subscriber and drop subscription immediately after that. It doesn't
> always happen but I often encountered it on my environment.
>
> ps -x command shows the following.
>
>  96796 ?        Ss     0:00 postgres: masahiko postgres [local] DROP
> SUBSCRIPTION
>  96801 ?        Ts     0:00 postgres: bgworker: logical replication
> worker for subscription 40993    waiting
>  96805 ?        Ss     0:07 postgres: bgworker: logical replication
> worker for subscription 40993 sync 16418
>  96806 ?        Ss     0:01 postgres: wal sender process masahiko [local] idle
>  96807 ?        Ss     0:00 postgres: bgworker: logical replication
> worker for subscription 40993 sync 16421
>  96808 ?        Ss     0:00 postgres: wal sender process masahiko [local] idle
>
> The DROP SUBSCRIPTION process (pid 96796) is waiting for the apply
> worker process (pid 96801) to stop while holding a lock on
> pg_subscription_rel. On the other hand the apply worker is waiting for
> acquiring a tuple lock on pg_subscription_rel needed for heap_update.
> Also table sync workers (pid 96805 and 96807) are waiting for the
> apply worker process to change their status.
>
> Also, even when DROP SUBSCRIPTION is done successfully, the table sync
> worker can be orphaned because I guess that the apply worker can exit
> before change status of table sync worker.
>
> I'm using 1f30295.

[Action required within three days.  This is a generic notification.]

The above-described topic is currently a PostgreSQL 10 open item.  Peter,
since you committed the patch believed to have created it, you own this open
item.  If some other commit is more relevant or if this does not belong as a
v10 open item, please let us know.  Otherwise, please observe the policy on
open item ownership[1] and send a status update within three calendar days of
this message.  Include a date for your subsequent status update.  Testers may
discover new open items at any time, and I want to plan to get them all fixed
well in advance of shipping v10.  Consequently, I will appreciate your efforts
toward speedy resolution.  Thanks.

[1] https://www.postgresql.org/message-id/20170404140717.GA2675809%40tornado.leadboat.com



Re: Get stuck when dropping a subscription during synchronizing table

Kyotaro HORIGUCHI-2
In reply to this post by Masahiko Sawada
Hello,

At Fri, 12 May 2017 17:24:07 +0900, Masahiko Sawada <[hidden email]> wrote in <[hidden email]>

> On Fri, May 12, 2017 at 11:24 AM, Masahiko Sawada <[hidden email]> wrote:
> > On Thu, May 11, 2017 at 6:16 PM, Petr Jelinek
> > <[hidden email]> wrote:
> >> On 11/05/17 10:10, Masahiko Sawada wrote:
> >>> On Thu, May 11, 2017 at 4:06 PM, Michael Paquier
> >>> <[hidden email]> wrote:
> >>>> On Wed, May 10, 2017 at 11:57 AM, Masahiko Sawada <[hidden email]> wrote:
> >>>>> Barring any objections, I'll add these two issues to open item.
> >>>>
> >>>> It seems to me that those open items have not been added yet to the
> >>>> list. If I am following correctly, they could be defined as follows:
> >>>> - Dropping subscription may stuck if done during tablesync.
> >>>> -- Analyze deadlock issues with DROP SUBSCRIPTION and apply worker process.
> >>
> >> I think the solution to this is to reintroduce the LWLock that was
> >> removed and replaced with the exclusive lock on catalog [1]. I am afraid
> >> that correct way of handling this is to do both LWLock and catalog lock
> >> (first LWLock under which we kill the workers and then catalog lock so
> >> that something that prevents launcher from restarting them is held till
> >> the end of transaction).
> >
> > I agree to reintroduce LWLock and to stop logical rep worker first and
> > then modify catalog. That way we can reduce catalog lock level (maybe
> > to RowExclusiveLock) so that apply worker can see it. Also I think
> > that we need to do more things like in order to prevent that we keep
> > to hold LWLock until end of transaction, because holding LWLock until
> > end of transaction is not good idea and could be cause of deadlock. So
> > for example we can commit the transaction in DropSubscription after
> > cleaned pg_subscription record and all its dependencies and then start
> > new transaction for the remaining work. Of course we also need to
> > disallow DROP SUBSCRIPTION being executed in a user transaction
> > though.
>
> Attached two draft patches to solve these issues.
>
> Attached 0001 patch reintroduces LogicalRepLauncherLock and makes DROP
> SUBSCRIPTION keep holding it until commit. To prevent from deadlock
> possibility, I disallowed DROP SUBSCRIPTION being called in a
> transaction block. But there might be more sensible solution for this.
> please give me feedback.

+ * Protect against launcher restarting the worker. This lock will
+ * be released at commit.

This is wrong. COMMIT doesn't release left-over LWLocks; only
ABORT does (more precisely, that cleanup seems intended to fire on
ERRORs). So with this patch, a second DROP SUBSCRIPTION gets stuck on
the LWLock acquired by the first one. And as Petr said, holding an
LWLock for such a duration seems bad.

The cause seems to be that workers ignore SIGTERM under certain
conditions. One of the choke points is GetSubscription, the other
is get_subscription_list. I think we can handle both cases
without LWLocks.

The attached patch does that.

- heap_close + UnlockRelationOid in get_subscription_list() is
  equivalent to a single heap_close or relation_close, but I kept
  them separate for symmetry with the lock acquisition.

- The 0.5 second sleep in ApplyWorkerMain is quite
  arbitrary. NAPTIME_PER_CYCLE * 1000 could be used instead.

- A NULL MySubscription without SIGTERM might not need to be an
  ERROR.
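
As a rough illustration of the approach in the attached patch, the polling pattern (try the lock, nap, then check for a stop request) can be sketched in standalone C. This is a toy model, not PostgreSQL code: `catalog_locked`, `stop_requested`, `release_catalog_lock` and `acquire_or_give_up` are hypothetical names standing in for the catalog lock, the got_SIGTERM flag and the loop around ConditionalLockRelationOid. The `max_tries` bound is also an assumption added so that the sketch always terminates; the patch itself loops until the lock is obtained or SIGTERM arrives.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <time.h>

/* Hypothetical stand-in for the worker's got_SIGTERM flag. */
static volatile sig_atomic_t stop_requested = 0;

/* Hypothetical stand-in for the lock on the catalog; set means "held". */
static atomic_flag catalog_locked = ATOMIC_FLAG_INIT;

static void
release_catalog_lock(void)
{
    atomic_flag_clear(&catalog_locked);
}

/*
 * Poll for the lock instead of blocking on it, so that a stop request is
 * noticed between attempts.  Returns true with the lock held, or false
 * (lock not held) if a stop was requested or max_tries ran out.
 */
static bool
acquire_or_give_up(int max_tries, long nap_ms)
{
    struct timespec nap;

    assert(nap_ms >= 0);
    nap.tv_sec = nap_ms / 1000;
    nap.tv_nsec = (nap_ms % 1000) * 1000000L;

    for (int i = 0; i < max_tries; i++)
    {
        /* test_and_set returns the previous value: false means we got it. */
        if (!atomic_flag_test_and_set(&catalog_locked))
            return true;

        nanosleep(&nap, NULL);

        /* Mirrors the patch: check for termination after each nap. */
        if (stop_requested)
            return false;
    }
    return false;
}
```

A caller that gets false back simply exits without touching the catalog, which is the behavior the patch aims for when the subscription is being dropped.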

Any more thoughts?


FYI, I reproduced the situation with the following steps. They
reproduced it reliably for me without inserting any artificial
delay.

# Creating 5 tables with 100000 rows on the publisher
create table t1 (a int);
...
create table t5 (a int);
insert into t1 (select * from generate_series(0, 99999) a);
...
insert into t5 (select * from generate_series(0, 99999) a);
create publication p1 for table t1, t2, t3, t4, t5;


# Subscribe to them, wait 1 sec, then unsubscribe.
create table t1 (a int);
...
create table t5 (a int);
truncate t1, t2, t3, t4, t5; create subscription s1 CONNECTION 'host=/tmp port=5432 dbname=postgres' publication p1; select pg_sleep(1); drop subscription s1;

The test can be repeated by re-entering the last line.

> >>>> -- Avoid orphaned tablesync worker if apply worker exits before
> >>>> changing its status.
> >>>
> >>
> >> The behavior question I have about this is if sync workers should die
> >> when apply worker dies (ie they are tied to apply worker) or if they
> >> should be tied to the subscription.
> >>
> >> I guess taking down all the sync workers when apply worker has exited is
> >> easier to solve. Of course it means that if apply worker restarts in
> >> middle of table synchronization, the table synchronization will have to
> >> start from scratch. That being said, in normal operation apply worker
> >> should only exit/restart if subscription has changed or has been
> >> dropped/disabled and I think sync workers want to exit/restart in that
> >> situation as well.
> >
> > I agree that sync workers are tied to the apply worker.
> >
> >>
> >> So for example having shmem detach hook for an apply worker (or reusing
> >> the existing one) that searches for all the other workers for same
> >> subscription and shuts them down as well sounds like solution to this.
> >
> > Seems reasonable solution.
regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

*** a/src/backend/replication/logical/launcher.c
--- b/src/backend/replication/logical/launcher.c
***************
*** 42,47 ****
--- 42,48 ----
  #include "replication/worker_internal.h"
 
  #include "storage/ipc.h"
+ #include "storage/lmgr.h"
  #include "storage/proc.h"
  #include "storage/procarray.h"
  #include "storage/procsignal.h"
***************
*** 116,122 **** get_subscription_list(void)
  StartTransactionCommand();
  (void) GetTransactionSnapshot();
 
! rel = heap_open(SubscriptionRelationId, AccessShareLock);
  scan = heap_beginscan_catalog(rel, 0, NULL);
 
  while (HeapTupleIsValid(tup = heap_getnext(scan, ForwardScanDirection)))
--- 117,131 ----
  StartTransactionCommand();
  (void) GetTransactionSnapshot();
 
! /*
! * This lock cannot be acquired while subscription commands are updating the
! * relation. We can safely skip the scan in that case.
! */
! if (!ConditionalLockRelationOid(SubscriptionRelationId, AccessShareLock))
! return NIL;
!
! rel = heap_open(SubscriptionRelationId, NoLock);
!
  scan = heap_beginscan_catalog(rel, 0, NULL);
 
  while (HeapTupleIsValid(tup = heap_getnext(scan, ForwardScanDirection)))
***************
*** 146,152 **** get_subscription_list(void)
  }
 
  heap_endscan(scan);
! heap_close(rel, AccessShareLock);
 
  CommitTransactionCommand();
 
--- 155,162 ----
  }
 
  heap_endscan(scan);
! heap_close(rel, NoLock);
! UnlockRelationOid(SubscriptionRelationId, AccessShareLock);
 
  CommitTransactionCommand();
 
***************
*** 403,410 **** retry:
  }
 
  /*
   * Stop the logical replication worker and wait until it detaches from the
!  * slot.
   */
  void
  logicalrep_worker_stop(Oid subid, Oid relid)
--- 413,465 ----
  }
 
  /*
+  * Stop all table sync workers associated with given subid.
+  *
+  * This function is called by the apply worker. Since table sync
+  * workers associated with a subscription are launched only by
+  * the apply worker, we don't need to acquire
+  * LogicalRepLauncherLock here.
+  */
+ void
+ logicalrep_sync_workers_stop(Oid subid)
+ {
+ List *relid_list = NIL;
+ ListCell *cell;
+ int i;
+
+ LWLockAcquire(LogicalRepWorkerLock, LW_SHARED);
+
+ /*
+ * Walk the workers array and collect the relids of workers that match
+ * the given subscription id.
+ */
+ for (i = 0; i < max_logical_replication_workers; i++)
+ {
+ LogicalRepWorker *w = &LogicalRepCtx->workers[i];
+
+ if (w->in_use && w->subid == subid &&
+ OidIsValid(w->relid))
+ relid_list = lappend_oid(relid_list, w->relid);
+ }
+
+ LWLockRelease(LogicalRepWorkerLock);
+
+ /* Return if there is no table sync worker associated with myself */
+ if (relid_list == NIL)
+ return;
+
+ foreach (cell, relid_list)
+ {
+ Oid relid = lfirst_oid(cell);
+
+ logicalrep_worker_stop(subid, relid);
+ }
+ }
+
+ /*
   * Stop the logical replication worker and wait until it detaches from the
!  * slot. This function can be called by both logical replication launcher
!  * and apply worker to stop apply worker and table sync worker.
   */
  void
  logicalrep_worker_stop(Oid subid, Oid relid)
***************
*** 570,575 **** logicalrep_worker_attach(int slot)
--- 625,634 ----
  static void
  logicalrep_worker_detach(void)
  {
+ /* If we are the apply worker, stop all associated sync workers */
+ if (!am_tablesync_worker())
+ logicalrep_sync_workers_stop(MyLogicalRepWorker->subid);
+
  /* Block concurrent access. */
  LWLockAcquire(LogicalRepWorkerLock, LW_EXCLUSIVE);
 
*** a/src/backend/replication/logical/worker.c
--- b/src/backend/replication/logical/worker.c
***************
*** 1455,1468 **** ApplyWorkerMain(Datum main_arg)
  char   *myslotname;
  WalRcvStreamOptions options;
 
- /* Attach to slot */
- logicalrep_worker_attach(worker_slot);
-
  /* Setup signal handling */
  pqsignal(SIGHUP, logicalrep_worker_sighup);
  pqsignal(SIGTERM, logicalrep_worker_sigterm);
  BackgroundWorkerUnblockSignals();
 
  /* Initialise stats to a sanish value */
  MyLogicalRepWorker->last_send_time = MyLogicalRepWorker->last_recv_time =
  MyLogicalRepWorker->reply_time = GetCurrentTimestamp();
--- 1455,1471 ----
  char   *myslotname;
  WalRcvStreamOptions options;
 
  /* Setup signal handling */
  pqsignal(SIGHUP, logicalrep_worker_sighup);
  pqsignal(SIGTERM, logicalrep_worker_sigterm);
  BackgroundWorkerUnblockSignals();
 
+ /*
+ * Attach to slot. This should be after signal handling setup since
+ * signals may come as soon as attached.
+ */
+ logicalrep_worker_attach(worker_slot);
+
  /* Initialise stats to a sanish value */
  MyLogicalRepWorker->last_send_time = MyLogicalRepWorker->last_recv_time =
  MyLogicalRepWorker->reply_time = GetCurrentTimestamp();
***************
*** 1492,1498 **** ApplyWorkerMain(Datum main_arg)
   ALLOCSET_DEFAULT_SIZES);
  StartTransactionCommand();
  oldctx = MemoryContextSwitchTo(ApplyContext);
! MySubscription = GetSubscription(MyLogicalRepWorker->subid, false);
  MySubscriptionValid = true;
  MemoryContextSwitchTo(oldctx);
 
--- 1495,1532 ----
   ALLOCSET_DEFAULT_SIZES);
  StartTransactionCommand();
  oldctx = MemoryContextSwitchTo(ApplyContext);
!
! /*
! * Wait until the catalog is available. The subscription for this worker
! * might already have been dropped.  We should receive SIGTERM in that
! * case, so obey it.
! */
! while (!ConditionalLockRelationOid(SubscriptionRelationId, AccessShareLock))
! {
! pg_usleep(500 * 1000L); /* 0.5s */
!
! /* We are apparently killed, exit silently. */
! if (got_SIGTERM)
! proc_exit(0);
! }
!
! MySubscription = GetSubscription(MyLogicalRepWorker->subid, true);
! UnlockRelationOid(SubscriptionRelationId, AccessShareLock);
!
! /* There's a race condition here. Check if MySubscription is valid. */
! if (MySubscription == NULL)
! {
! /* If we got SIGTERM, we are explicitly killed */
! if (got_SIGTERM)
! proc_exit(0);
!
! /* Otherwise something uncertain happened */
! ereport(ERROR,
! (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
! errmsg("subscription %u for this worker not found",
! MyLogicalRepWorker->subid)));
! }
!
  MySubscriptionValid = true;
  MemoryContextSwitchTo(oldctx);
 
*** a/src/include/replication/worker_internal.h
--- b/src/include/replication/worker_internal.h
***************
*** 78,83 **** extern void logicalrep_worker_launch(Oid dbid, Oid subid, const char *subname,
--- 78,84 ----
  extern void logicalrep_worker_stop(Oid subid, Oid relid);
  extern void logicalrep_worker_wakeup(Oid subid, Oid relid);
  extern void logicalrep_worker_wakeup_ptr(LogicalRepWorker *worker);
+ extern void logicalrep_sync_workers_stop(Oid subid);
 
  extern int logicalrep_sync_worker_count(Oid subid);
 



Re: Get stuck when dropping a subscription during synchronizing table

Masahiko Sawada
On Mon, May 15, 2017 at 8:02 PM, Kyotaro HORIGUCHI
<[hidden email]> wrote:

> Hello,
>
> At Fri, 12 May 2017 17:24:07 +0900, Masahiko Sawada <[hidden email]> wrote in <[hidden email]>
>> On Fri, May 12, 2017 at 11:24 AM, Masahiko Sawada <[hidden email]> wrote:
>> > On Thu, May 11, 2017 at 6:16 PM, Petr Jelinek
>> > <[hidden email]> wrote:
>> >> On 11/05/17 10:10, Masahiko Sawada wrote:
>> >>> On Thu, May 11, 2017 at 4:06 PM, Michael Paquier
>> >>> <[hidden email]> wrote:
>> >>>> On Wed, May 10, 2017 at 11:57 AM, Masahiko Sawada <[hidden email]> wrote:
>> >>>>> Barring any objections, I'll add these two issues to open item.
>> >>>>
>> >>>> It seems to me that those open items have not been added yet to the
>> >>>> list. If I am following correctly, they could be defined as follows:
>> >>>> - Dropping subscription may stuck if done during tablesync.
>> >>>> -- Analyze deadlock issues with DROP SUBSCRIPTION and apply worker process.
>> >>
>> >> I think the solution to this is to reintroduce the LWLock that was
>> >> removed and replaced with the exclusive lock on catalog [1]. I am afraid
>> >> that correct way of handling this is to do both LWLock and catalog lock
>> >> (first LWLock under which we kill the workers and then catalog lock so
>> >> that something that prevents launcher from restarting them is held till
>> >> the end of transaction).
>> >
>> > I agree to reintroduce LWLock and to stop logical rep worker first and
>> > then modify catalog. That way we can reduce catalog lock level (maybe
>> > to RowExclusiveLock) so that apply worker can see it. Also I think
>> > that we need to do more things like in order to prevent that we keep
>> > to hold LWLock until end of transaction, because holding LWLock until
>> > end of transaction is not good idea and could be cause of deadlock. So
>> > for example we can commit the transaction in DropSubscription after
>> > cleaned pg_subscription record and all its dependencies and then start
>> > new transaction for the remaining work. Of course we also need to
>> > disallow DROP SUBSCRIPTION being executed in a user transaction
>> > though.
>>
>> Attached two draft patches to solve these issues.
>>
>> Attached 0001 patch reintroduces LogicalRepLauncherLock and makes DROP
>> SUBSCRIPTION keep holding it until commit. To prevent from deadlock
>> possibility, I disallowed DROP SUBSCRIPTION being called in a
>> transaction block. But there might be more sensible solution for this.
>> please give me feedback.
>
> +        * Protect against launcher restarting the worker. This lock will
> +        * be released at commit.
>
> This is wrong. COMMIT doesn't release left-over LWLocks, only
> ABORT does (precisely, it seems intended to fire on ERRORs). So
> with this patch, the second DROP SUBSCRIPTION is stuck on the
> LWLock acquired at the first time. And as Petr said, LWLock with
> such a duration seems bad.

Oh, I understand. Thank you for pointing that out.

>
> The cause seems to be that workers ignore sigterm on certain
> conditions. One of the choke points is GetSubscription, the other
> is get_subscription_list. I think we can treat the both cases
> without LWLocks.
>
> The attached patch does that.
>
> - heap_close + UnlockRelationOid in get_subscription_list() is
>   equivalent to one heap_close or relation_close but I took seeming
>   symmetricity.
>
> - 0.5 seconds for the sleep in ApplyWorkerMain is quite
>   arbitrary. NAPTIME_PER_CYCLE * 1000 could be used instead.
>
> - NULL MySubscription without SIGTERM might not need to be an
>   ERROR.
>
> Any more thoughts?

I think the above changes can solve this issue, but it seems to me that
holding AccessExclusiveLock on pg_subscription in DROP SUBSCRIPTION
until commit could lead to another deadlock problem in the future. So I'd
like to find a way to reduce the lock level if possible. For
example, if we change the apply launcher so that it re-reads the
subscription list only when pg_subscription is invalidated, the apply
launcher cannot try to launch the apply worker that is being stopped. We
invalidate pg_subscription at commit of DROP SUBSCRIPTION, and the
apply launcher then gets a new subscription list which doesn't include the
entry we removed. That way we can reduce the lock level to
ShareUpdateExclusiveLock and solve this issue.

Also, with your patch we still need to change DROP SUBSCRIPTION to
resolve another case I encountered, where DROP SUBSCRIPTION waits for the
apply worker while holding a tuple lock on pg_subscription_rel and the
apply worker waits for the same tuple in pg_subscription_rel in
SetSubscriptionRelState().
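
The invalidation-driven launcher idea described above can be sketched as a small standalone C model. It is only an illustration under stated assumptions: the `catalog` array, the `cached` list and the functions `invalidate_subscription_cache`, `launcher_get_subscriptions` and `drop_subscription_commit` are hypothetical names, not PostgreSQL APIs, and the model ignores locking entirely. It shows only that the launcher rescans the list after an invalidation and otherwise reuses its cached copy.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_SUBS 8

/* Hypothetical stand-in for the contents of pg_subscription. */
static unsigned catalog[MAX_SUBS];
static size_t   catalog_len = 0;

/* Launcher-private cache of the subscription list. */
static unsigned cached[MAX_SUBS];
static size_t   cached_len = 0;
static bool     cache_valid = false;
static int      catalog_reads = 0;  /* how often the catalog was scanned */

/* Called when a committed change to the catalog is signalled. */
static void
invalidate_subscription_cache(void)
{
    cache_valid = false;
}

/* Return the list, rescanning the catalog only after an invalidation. */
static const unsigned *
launcher_get_subscriptions(size_t *len)
{
    assert(len != NULL);

    if (!cache_valid)
    {
        for (size_t i = 0; i < catalog_len; i++)
            cached[i] = catalog[i];
        cached_len = catalog_len;
        cache_valid = true;
        catalog_reads++;
    }
    *len = cached_len;
    return cached;
}

/* Models the commit of DROP SUBSCRIPTION: remove the entry, then invalidate. */
static bool
drop_subscription_commit(unsigned subid)
{
    for (size_t i = 0; i < catalog_len; i++)
    {
        if (catalog[i] == subid)
        {
            catalog[i] = catalog[catalog_len - 1];
            catalog_len--;
            invalidate_subscription_cache();
            return true;
        }
    }
    return false;
}
```

In this model the launcher never scans the catalog between invalidations, so it has no reason to contend with a DROP SUBSCRIPTION that is still in progress.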

>
>
> FYI, I reproduced the situation by the following steps. This
> effectively reproduced the situation without delay insertion for
> me.
>
> # Creating 5 tables with 100000 rows on the publisher
> create table t1 (a int);
> ...
> create table t5 (a int);
> insert into t1 (select * from generate_series(0, 99999) a);
> ...
> insert into t5 (select * from generate_series(0, 99999) a);
> create publication p1 for table t1, t2, t3, t4, t5;
>
>
> # Subscribe them, wait 1sec, then unsbscribe.
> create table t1 (a int);
> ...
> create table t5 (a int);
> truncate t1, t2, t3, t4, t5; create subscription s1 CONNECTION 'host=/tmp port=5432 dbname=postgres' publication p1; select pg_sleep(1); drop subscription s1;
>
> Repeated test can be performed by repeatedly enter the last line.
>
>> >>>> -- Avoid orphaned tablesync worker if apply worker exits before
>> >>>> changing its status.
>> >>>
>> >>
>> >> The behavior question I have about this is if sync workers should die
>> >> when apply worker dies (ie they are tied to apply worker) or if they
>> >> should be tied to the subscription.
>> >>
>> >> I guess taking down all the sync workers when apply worker has exited is
>> >> easier to solve. Of course it means that if apply worker restarts in
>> >> middle of table synchronization, the table synchronization will have to
>> >> start from scratch. That being said, in normal operation apply worker
>> >> should only exit/restart if subscription has changed or has been
>> >> dropped/disabled and I think sync workers want to exit/restart in that
>> >> situation as well.
>> >
>> > I agree that sync workers are tied to the apply worker.
>> >
>> >>
>> >> So for example having shmem detach hook for an apply worker (or reusing
>> >> the existing one) that searches for all the other workers for same
>> >> subscription and shuts them down as well sounds like solution to this.
>> >
>> > Seems reasonable solution.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: Get stuck when dropping a subscription during synchronizing table

Robert Haas
On Wed, May 17, 2017 at 6:58 AM, Masahiko Sawada <[hidden email]> wrote:

> I think the above changes can solve this issue but It seems to me that
> holding AccessExclusiveLock on pg_subscription by DROP SUBSCRIPTION
> until commit could lead another deadlock problem in the future. So I'd
> to contrive ways to reduce lock level somehow if possible. For
> example, if we change the apply launcher so that it gets the
> subscription list only when pg_subscription gets invalid, apply
> launcher cannot try to launch the apply worker being stopped. We
> invalidate pg_subscription at commit of DROP SUBSCRIPTION and the
> apply launcher can get new subscription list which doesn't include the
> entry we removed. That way we can reduce lock level to
> ShareUpdateExclusiveLock and solve this issue.
> Also in your patch, we need to change DROP SUBSCRIPTION as well to
> resolve another case I encountered, where DROP SUBSCRIPTION waits for
> apply worker while holding a tuple lock on pg_subscription_rel and the
> apply worker waits for same tuple on pg_subscription_rel in
> SetSubscriptionRelState().

I don't really understand the issue being discussed here in any
detail, but as a general point I'd say that it might be more
productive to make the locks finer-grained rather than struggling to
reduce the lock level.  For example, instead of locking all of
pg_subscription, use LockSharedObject() to lock the individual
subscription, still with AccessExclusiveLock.  That means that other
accesses to that subscription also need to take a lock so that you
actually get a conflict when there should be one, but that should be
doable.  I expect that trying to manage locking conflicts using only
catalog-wide locks is a doomed strategy.
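
To make the difference between a catalog-wide lock and a per-object lock concrete, here is a minimal standalone C sketch. It does not call LockSharedObject() or any other PostgreSQL function; `ObjLock`, `lock_table`, `try_lock_object` and `unlock_object` are hypothetical names for a toy exclusive lock table keyed by (catalog id, object id), and the session ids are made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_LOCKS 16

/* Hypothetical lock entry: which catalog, which object, and who holds it. */
typedef struct ObjLock
{
    unsigned classid;   /* catalog the object lives in */
    unsigned objid;     /* the individual object, e.g. one subscription */
    int      owner;     /* hypothetical session id of the holder */
    bool     in_use;
} ObjLock;

static ObjLock lock_table[MAX_LOCKS];

/* Try to take an exclusive lock on one object; fail if another session holds it. */
static bool
try_lock_object(unsigned classid, unsigned objid, int owner)
{
    ObjLock *free_slot = NULL;

    assert(owner > 0);

    for (size_t i = 0; i < MAX_LOCKS; i++)
    {
        ObjLock *l = &lock_table[i];

        if (l->in_use && l->classid == classid && l->objid == objid)
            return l->owner == owner;   /* re-entrant for the same owner */
        if (!l->in_use && free_slot == NULL)
            free_slot = l;
    }

    if (free_slot == NULL)
        return false;                   /* table full */

    free_slot->classid = classid;
    free_slot->objid = objid;
    free_slot->owner = owner;
    free_slot->in_use = true;
    return true;
}

/* Release the lock if this owner holds it; otherwise do nothing. */
static void
unlock_object(unsigned classid, unsigned objid, int owner)
{
    for (size_t i = 0; i < MAX_LOCKS; i++)
    {
        ObjLock *l = &lock_table[i];

        if (l->in_use && l->classid == classid &&
            l->objid == objid && l->owner == owner)
            l->in_use = false;
    }
}
```

With this shape, a session dropping subscription 40993 conflicts only with other sessions that lock the same subscription, while work on any other subscription proceeds. As noted above, that only helps if every access to the subscription takes the same per-object lock.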

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Get stuck when dropping a subscription during synchronizing table

Noah Misch-2
In reply to this post by Noah Misch-2
On Mon, May 15, 2017 at 03:28:14AM +0000, Noah Misch wrote:

> On Mon, May 08, 2017 at 06:27:30PM +0900, Masahiko Sawada wrote:
> > I encountered a situation where DROP SUBSCRIPTION got stuck when
> > initial table sync is in progress. In my environment, I created
> > several tables with some data on publisher. I created subscription on
> > subscriber and drop subscription immediately after that. It doesn't
> > always happen but I often encountered it on my environment.
> >
> > ps -x command shows the following.
> >
> >  96796 ?        Ss     0:00 postgres: masahiko postgres [local] DROP
> > SUBSCRIPTION
> >  96801 ?        Ts     0:00 postgres: bgworker: logical replication
> > worker for subscription 40993    waiting
> >  96805 ?        Ss     0:07 postgres: bgworker: logical replication
> > worker for subscription 40993 sync 16418
> >  96806 ?        Ss     0:01 postgres: wal sender process masahiko [local] idle
> >  96807 ?        Ss     0:00 postgres: bgworker: logical replication
> > worker for subscription 40993 sync 16421
> >  96808 ?        Ss     0:00 postgres: wal sender process masahiko [local] idle
> >
> > The DROP SUBSCRIPTION process (pid 96796) is waiting for the apply
> > worker process (pid 96801) to stop while holding a lock on
> > pg_subscription_rel. On the other hand the apply worker is waiting for
> > acquiring a tuple lock on pg_subscription_rel needed for heap_update.
> > Also table sync workers (pid 96805 and 96807) are waiting for the
> > apply worker process to change their status.
> >
> > Also, even when DROP SUBSCRIPTION is done successfully, the table sync
> > worker can be orphaned because I guess that the apply worker can exit
> > before change status of table sync worker.
> >
> > I'm using 1f30295.
>
> [Action required within three days.  This is a generic notification.]
>
> The above-described topic is currently a PostgreSQL 10 open item.  Peter,
> since you committed the patch believed to have created it, you own this open
> item.  If some other commit is more relevant or if this does not belong as a
> v10 open item, please let us know.  Otherwise, please observe the policy on
> open item ownership[1] and send a status update within three calendar days of
> this message.  Include a date for your subsequent status update.  Testers may
> discover new open items at any time, and I want to plan to get them all fixed
> well in advance of shipping v10.  Consequently, I will appreciate your efforts
> toward speedy resolution.  Thanks.
>
> [1] https://www.postgresql.org/message-id/20170404140717.GA2675809%40tornado.leadboat.com

IMMEDIATE ATTENTION REQUIRED.  This PostgreSQL 10 open item is past due for
your status update.  Please reacquaint yourself with the policy on open item
ownership[1] and then reply immediately.  If I do not hear from you by
2017-05-19 16:00 UTC, I will transfer this item to release management team
ownership without further notice.

[1] https://www.postgresql.org/message-id/20170404140717.GA2675809%40tornado.leadboat.com



Re: Get stuck when dropping a subscription during synchronizing table

Peter Eisentraut-6
On 5/18/17 11:11, Noah Misch wrote:
> IMMEDIATE ATTENTION REQUIRED.  This PostgreSQL 10 open item is past due for
> your status update.  Please reacquaint yourself with the policy on open item
> ownership[1] and then reply immediately.  If I do not hear from you by
> 2017-05-19 16:00 UTC, I will transfer this item to release management team
> ownership without further notice.

There is no progress on this issue at the moment.  I will report again
next Wednesday.

--
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Get stuck when dropping a subscription during synchronizing table

Kyotaro HORIGUCHI-2
In reply to this post by Robert Haas
Hello,

At Thu, 18 May 2017 10:16:35 -0400, Robert Haas <[hidden email]> wrote in <[hidden email]>

> On Wed, May 17, 2017 at 6:58 AM, Masahiko Sawada <[hidden email]> wrote:
> > I think the above changes can solve this issue but It seems to me that
> > holding AccessExclusiveLock on pg_subscription by DROP SUBSCRIPTION
> > until commit could lead another deadlock problem in the future. So I'd
> > to contrive ways to reduce lock level somehow if possible. For
> > example, if we change the apply launcher so that it gets the
> > subscription list only when pg_subscription gets invalid, apply
> > launcher cannot try to launch the apply worker being stopped. We
> > invalidate pg_subscription at commit of DROP SUBSCRIPTION and the
> > apply launcher can get new subscription list which doesn't include the
> > entry we removed. That way we can reduce lock level to
> > ShareUpdateExclusiveLock and solve this issue.
> > Also in your patch, we need to change DROP SUBSCRIPTION as well to
> > resolve another case I encountered, where DROP SUBSCRIPTION waits for
> > apply worker while holding a tuple lock on pg_subscription_rel and the
> > apply worker waits for same tuple on pg_subscription_rel in
> > SetSubscriptionRelState().

Sorry, I don't have enough time to consider this in depth
right now. I will perhaps return to it later.

> I don't really understand the issue being discussed here in any
> detail, but as a general point I'd say that it might be more
> productive to make the locks finer-grained rather than struggling to
> reduce the lock level.  For example, instead of locking all of
> pg_subscription, use LockSharedObject() to lock the individual
> subscription, still with AccessExclusiveLock.  That means that other
> accesses to that subscription also need to take a lock so that you
> actually get a conflict when there should be one, but that should be
> doable.  I expect that trying to manage locking conflicts using only
> catalog-wide locks is a doomed strategy.

Thank you for the suggestion. I think the situation is a bit different
from that. The problem here is that a replication worker may try to
read exactly the tuple for the subscription being deleted just
before responding to a received termination request. So a
finer-grained lock doesn't help.

The focus of resolving this is to prevent workers from being blocked
by DROP SUBSCRIPTION. So Sawada-san's patch immediately
releases the lock on pg_subscription and uses another lock for
exclusion. My patch just gives up reading the catalog when it is not
available.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center


