Quantcast

tablesync patch broke the assumption that logical rep depends on?

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
24 messages Options
12
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

tablesync patch broke the assumption that logical rep depends on?

Fujii Masao-2
Hi,

         src/backend/replication/logical/launcher.c
         * Worker started and attached to our shmem. This check is safe
         * because only launcher ever starts the workers, so nobody can steal
         * the worker slot.

The tablesync patch enabled even worker to start another worker.
So the above assumption is not valid for now.

This issue seems to cause the corner case where the launcher picks up
the same worker slot that previously-started worker has already picked
up to start another worker.

Regards,

--
Fujii Masao


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: tablesync patch broke the assumption that logical rep depends on?

Noah Misch-2
On Tue, Apr 11, 2017 at 02:28:44AM +0900, Fujii Masao wrote:

>          src/backend/replication/logical/launcher.c
>          * Worker started and attached to our shmem. This check is safe
>          * because only launcher ever starts the workers, so nobody can steal
>          * the worker slot.
>
> The tablesync patch enabled even worker to start another worker.
> So the above assumption is not valid for now.
>
> This issue seems to cause the corner case where the launcher picks up
> the same worker slot that previously-started worker has already picked
> up to start another worker.

[Action required within three days.  This is a generic notification.]

The above-described topic is currently a PostgreSQL 10 open item.  Peter,
since you committed the patch believed to have created it, you own this open
item.  If some other commit is more relevant or if this does not belong as a
v10 open item, please let us know.  Otherwise, please observe the policy on
open item ownership[1] and send a status update within three calendar days of
this message.  Include a date for your subsequent status update.  Testers may
discover new open items at any time, and I want to plan to get them all fixed
well in advance of shipping v10.  Consequently, I will appreciate your efforts
toward speedy resolution.  Thanks.

[1] https://www.postgresql.org/message-id/20170404140717.GA2675809%40tornado.leadboat.com


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: tablesync patch broke the assumption that logical rep depends on?

Peter Eisentraut-6
In reply to this post by Fujii Masao-2
On 4/10/17 13:28, Fujii Masao wrote:

>          src/backend/replication/logical/launcher.c
>          * Worker started and attached to our shmem. This check is safe
>          * because only launcher ever starts the workers, so nobody can steal
>          * the worker slot.
>
> The tablesync patch enabled even worker to start another worker.
> So the above assumption is not valid for now.
>
> This issue seems to cause the corner case where the launcher picks up
> the same worker slot that previously-started worker has already picked
> up to start another worker.

I think what the comment should rather say is that workers are always
started through logicalrep_worker_launch() and worker slots are always
handed out while holding LogicalRepWorkerLock exclusively, so nobody can
steal the worker slot.

Does that make sense?

--
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: tablesync patch broke the assumption that logical rep depends on?

Fujii Masao-2
On Fri, Apr 14, 2017 at 1:28 AM, Peter Eisentraut
<[hidden email]> wrote:

> On 4/10/17 13:28, Fujii Masao wrote:
>>          src/backend/replication/logical/launcher.c
>>          * Worker started and attached to our shmem. This check is safe
>>          * because only launcher ever starts the workers, so nobody can steal
>>          * the worker slot.
>>
>> The tablesync patch enabled even worker to start another worker.
>> So the above assumption is not valid for now.
>>
>> This issue seems to cause the corner case where the launcher picks up
>> the same worker slot that previously-started worker has already picked
>> up to start another worker.
>
> I think what the comment should rather say is that workers are always
> started through logicalrep_worker_launch() and worker slots are always
> handed out while holding LogicalRepWorkerLock exclusively, so nobody can
> steal the worker slot.
>
> Does that make sense?

No unless I'm missing something.

logicalrep_worker_launch() picks up unused worker slot (slot's proc == NULL)
while holding LogicalRepWorkerLock. But it releases the lock before the slot
is marked as used (i.e., slot is set to non-NULL). Then newly-launched worker
calls logicalrep_worker_attach() and marks the slot as used.

So if another logicalrep_worker_launch() starts after LogicalRepWorkerLock
is released before the slot is marked as used, it can pick up the same slot
because that slot looks unused.

Regards,

--
Fujii Masao


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: tablesync patch broke the assumption that logical rep depends on?

Petr Jelinek-4
On 13/04/17 19:31, Fujii Masao wrote:

> On Fri, Apr 14, 2017 at 1:28 AM, Peter Eisentraut
> <[hidden email]> wrote:
>> On 4/10/17 13:28, Fujii Masao wrote:
>>>          src/backend/replication/logical/launcher.c
>>>          * Worker started and attached to our shmem. This check is safe
>>>          * because only launcher ever starts the workers, so nobody can steal
>>>          * the worker slot.
>>>
>>> The tablesync patch enabled even worker to start another worker.
>>> So the above assumption is not valid for now.
>>>
>>> This issue seems to cause the corner case where the launcher picks up
>>> the same worker slot that previously-started worker has already picked
>>> up to start another worker.
>>
>> I think what the comment should rather say is that workers are always
>> started through logicalrep_worker_launch() and worker slots are always
>> handed out while holding LogicalRepWorkerLock exclusively, so nobody can
>> steal the worker slot.
>>
>> Does that make sense?
>
> No unless I'm missing something.
>
> logicalrep_worker_launch() picks up unused worker slot (slot's proc == NULL)
> while holding LogicalRepWorkerLock. But it releases the lock before the slot
> is marked as used (i.e., slot is set to non-NULL). Then newly-launched worker
> calls logicalrep_worker_attach() and marks the slot as used.
>
> So if another logicalrep_worker_launch() starts after LogicalRepWorkerLock
> is released before the slot is marked as used, it can pick up the same slot
> because that slot looks unused.
>

Yeah I think it's less of a problem of that comment than the fact that
logicalrep_worker_launch isn't concurrency safe. We need in_use marker
for the workers and update it as needed instead of relying on pgproc.
I'll write up something over the weekend.

--
  Petr Jelinek                  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: tablesync patch broke the assumption that logical rep depends on?

Steve Singer-2
In reply to this post by Fujii Masao-2
On 04/10/2017 01:28 PM, Fujii Masao wrote:

> Hi,
>
>           src/backend/replication/logical/launcher.c
>           * Worker started and attached to our shmem. This check is safe
>           * because only launcher ever starts the workers, so nobody can steal
>           * the worker slot.
>
> The tablesync patch enabled even worker to start another worker.
> So the above assumption is not valid for now.
>
> This issue seems to cause the corner case where the launcher picks up
> the same worker slot that previously-started worker has already picked
> up to start another worker.
>
> Regards,
>

I'm not sure if what I am seeing is related to this or another issue.

If I create a subscription for a publication that doesn't exist (yet) I
see 'publication mypub does not exist" in the subscribers log

If I then create the publication on the publisher the error doesn't go
away, the worker process keeps spinning with the same error.

If I then drop the subscription and recreate it it works.




--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: tablesync patch broke the assumption that logical rep depends on?

Petr Jelinek-4
On 16/04/17 21:27, Steve Singer wrote:

> On 04/10/2017 01:28 PM, Fujii Masao wrote:
>> Hi,
>>
>>           src/backend/replication/logical/launcher.c
>>           * Worker started and attached to our shmem. This check is safe
>>           * because only launcher ever starts the workers, so nobody
>> can steal
>>           * the worker slot.
>>
>> The tablesync patch enabled even worker to start another worker.
>> So the above assumption is not valid for now.
>>
>> This issue seems to cause the corner case where the launcher picks up
>> the same worker slot that previously-started worker has already picked
>> up to start another worker.
>>
>> Regards,
>>
>
> I'm not sure if what I am seeing is related to this or another issue.
>
> If I create a subscription for a publication that doesn't exist (yet) I
> see 'publication mypub does not exist" in the subscribers log
>
> If I then create the publication on the publisher the error doesn't go
> away, the worker process keeps spinning with the same error.
>
> If I then drop the subscription and recreate it it works.
>

No that's a result of how logical decoding/slots work. We see catalogs
as what they looked like at the specific point in time. If you create
slot (by creating subscription) it will not see publication that was
created after until decoding on that slot reaches point in time when it
was actually created.

--
  Petr Jelinek                  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: tablesync patch broke the assumption that logical rep depends on?

Noah Misch-2
In reply to this post by Noah Misch-2
On Thu, Apr 13, 2017 at 04:56:05AM +0000, Noah Misch wrote:

> On Tue, Apr 11, 2017 at 02:28:44AM +0900, Fujii Masao wrote:
> >          src/backend/replication/logical/launcher.c
> >          * Worker started and attached to our shmem. This check is safe
> >          * because only launcher ever starts the workers, so nobody can steal
> >          * the worker slot.
> >
> > The tablesync patch enabled even worker to start another worker.
> > So the above assumption is not valid for now.
> >
> > This issue seems to cause the corner case where the launcher picks up
> > the same worker slot that previously-started worker has already picked
> > up to start another worker.
>
> [Action required within three days.  This is a generic notification.]
>
> The above-described topic is currently a PostgreSQL 10 open item.  Peter,
> since you committed the patch believed to have created it, you own this open
> item.  If some other commit is more relevant or if this does not belong as a
> v10 open item, please let us know.  Otherwise, please observe the policy on
> open item ownership[1] and send a status update within three calendar days of
> this message.  Include a date for your subsequent status update.  Testers may
> discover new open items at any time, and I want to plan to get them all fixed
> well in advance of shipping v10.  Consequently, I will appreciate your efforts
> toward speedy resolution.  Thanks.
>
> [1] https://www.postgresql.org/message-id/20170404140717.GA2675809%40tornado.leadboat.com

This PostgreSQL 10 open item is past due for your status update.  Kindly send
a status update within 24 hours, and include a date for your subsequent status
update.  Refer to the policy on open item ownership:
https://www.postgresql.org/message-id/20170404140717.GA2675809%40tornado.leadboat.com


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: tablesync patch broke the assumption that logical rep depends on?

Peter Eisentraut-6
On 4/16/17 22:01, Noah Misch wrote:
> This PostgreSQL 10 open item is past due for your status update.  Kindly send
> a status update within 24 hours, and include a date for your subsequent status
> update.  Refer to the policy on open item ownership:
> https://www.postgresql.org/message-id/20170404140717.GA2675809%40tornado.leadboat.com

I think we're not really sure yet what to do about this.  Discussion is
ongoing.  I'll report back on Wednesday.

--
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: tablesync patch broke the assumption that logical rep depends on?

Petr Jelinek-4
On 17/04/17 20:09, Peter Eisentraut wrote:
> On 4/16/17 22:01, Noah Misch wrote:
>> This PostgreSQL 10 open item is past due for your status update.  Kindly send
>> a status update within 24 hours, and include a date for your subsequent status
>> update.  Refer to the policy on open item ownership:
>> https://www.postgresql.org/message-id/20170404140717.GA2675809%40tornado.leadboat.com
>
> I think we're not really sure yet what to do about this.  Discussion is
> ongoing.  I'll report back on Wednesday.
>

So my idea was to add some kind of inuse flag. This turned out to be bit
more complicated in terms of how to clean it than I would have hoped.
This is due to the fact that there is no way to reliably tell if worker
has failed to start if the parent worker crashed while waiting.

My solution to that is to use similar logic to autovacuum where we use
timeout for worker to attach to shmem. We do this only if there is no
free slot found when launch of replication worker was requested.

While working on this patch I also noticed other subtle concurrency
issues and fixed them as well - stopping worker that hasn't finished
starting yet wasn't completely concurrency safe and limiting sync
workers per subscription theoretically wasn't either (although I don't
think it could happen in practice).

I do wonder now though if it's such a good idea to have the
BackgroundWorkerHandle private to the bgworker.c given that this is the
3rd time (twice before was outside of postgres core) I had to write
similar generation mechanism that it uses for unique bgworker
authentication inside the process which started it. It would have been
much easier if I could just save the BackgroundWorkerHandle itself to
shmem so it could be used across processes instead of having to reinvent
it every time.

--
  Petr Jelinek                  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

0001-Fix-various-concurrency-issues-in-logical-replicatio.patch (11K) Download Attachment
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: tablesync patch broke the assumption that logical rep depends on?

Peter Eisentraut-6
On 4/18/17 22:13, Petr Jelinek wrote:
> So my idea was to add some kind of inuse flag. This turned out to be bit
> more complicated in terms of how to clean it than I would have hoped.
> This is due to the fact that there is no way to reliably tell if worker
> has failed to start if the parent worker crashed while waiting.
>
> My solution to that is to use similar logic to autovacuum where we use
> timeout for worker to attach to shmem. We do this only if there is no
> free slot found when launch of replication worker was requested.

It looks like launch_time is never set the current time in your patch.

--
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: tablesync patch broke the assumption that logical rep depends on?

Petr Jelinek-4
On 20/04/17 18:58, Peter Eisentraut wrote:

> On 4/18/17 22:13, Petr Jelinek wrote:
>> So my idea was to add some kind of inuse flag. This turned out to be bit
>> more complicated in terms of how to clean it than I would have hoped.
>> This is due to the fact that there is no way to reliably tell if worker
>> has failed to start if the parent worker crashed while waiting.
>>
>> My solution to that is to use similar logic to autovacuum where we use
>> timeout for worker to attach to shmem. We do this only if there is no
>> free slot found when launch of replication worker was requested.
>
> It looks like launch_time is never set the current time in your patch.
>
Oops, fixed.

--
  Petr Jelinek                  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

0001-Fix-various-concurrency-issues-in-logical-replicatiov2.patch (14K) Download Attachment
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: tablesync patch broke the assumption that logical rep depends on?

Noah Misch-2
In reply to this post by Peter Eisentraut-6
On Mon, Apr 17, 2017 at 02:09:54PM -0400, Peter Eisentraut wrote:
> I think we're not really sure yet what to do about this.  Discussion is
> ongoing.  I'll report back on Wednesday.

This PostgreSQL 10 open item is past due for your status update, and this is
overall the seventh time you have you allowed one of your open items to go out
of compliance.  Kindly start complying with the policy on open item ownership:
https://www.postgresql.org/message-id/20170404140717.GA2675809%40tornado.leadboat.com

At a bare minimum, send a status update within 24 hours, and include a date
for your subsequent status update.


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: tablesync patch broke the assumption that logical rep depends on?

Peter Eisentraut-6
In reply to this post by Petr Jelinek-4
On 4/20/17 14:29, Petr Jelinek wrote:
> + /* Find unused worker slot. */
> + if (!w->in_use)
>   {
> - worker = &LogicalRepCtx->workers[slot];
> - break;
> + worker = w;
> + slot = i;
> + }

Doesn't this still need a break?  Otherwise it always picks the last slot.

--
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: tablesync patch broke the assumption that logical rep depends on?

Peter Eisentraut-6
In reply to this post by Noah Misch-2
On 4/20/17 22:24, Noah Misch wrote:
> On Mon, Apr 17, 2017 at 02:09:54PM -0400, Peter Eisentraut wrote:
>> I think we're not really sure yet what to do about this.  Discussion is
>> ongoing.  I'll report back on Wednesday.
>
> This PostgreSQL 10 open item is past due for your status update, and this is
> overall the seventh time you have you allowed one of your open items to go out
> of compliance.  Kindly start complying with the policy on open item ownership:
> https://www.postgresql.org/message-id/20170404140717.GA2675809%40tornado.leadboat.com

We are actively working on it.  It should be resolved within a few days.
 Next check-in Monday.

--
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: tablesync patch broke the assumption that logical rep depends on?

Petr Jelinek-4
In reply to this post by Peter Eisentraut-6
On 21/04/17 16:09, Peter Eisentraut wrote:

> On 4/20/17 14:29, Petr Jelinek wrote:
>> + /* Find unused worker slot. */
>> + if (!w->in_use)
>>   {
>> - worker = &LogicalRepCtx->workers[slot];
>> - break;
>> + worker = w;
>> + slot = i;
>> + }
>
> Doesn't this still need a break?  Otherwise it always picks the last slot.
>

Yes it will pick the last slot, does that matter though, is the first
one better somehow?

We can't break because we also need to continue the counter (I think the
issue that the counter solves is probably just theoretical, but still).

Hmm actually, maybe the if (!w->in_use) should be if (worker == NULL &&
!w->in_use)?

--
  Petr Jelinek                  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: tablesync patch broke the assumption that logical rep depends on?

Peter Eisentraut-6
On 4/21/17 10:11, Petr Jelinek wrote:

> On 21/04/17 16:09, Peter Eisentraut wrote:
>> On 4/20/17 14:29, Petr Jelinek wrote:
>>> + /* Find unused worker slot. */
>>> + if (!w->in_use)
>>>   {
>>> - worker = &LogicalRepCtx->workers[slot];
>>> - break;
>>> + worker = w;
>>> + slot = i;
>>> + }
>>
>> Doesn't this still need a break?  Otherwise it always picks the last slot.
>>
>
> Yes it will pick the last slot, does that matter though, is the first
> one better somehow?
>
> We can't break because we also need to continue the counter (I think the
> issue that the counter solves is probably just theoretical, but still).

I see.  I think the code would be less confusing if we break the loop
like before and call logicalrep_sync_worker_count() separately.

> Hmm actually, maybe the if (!w->in_use) should be if (worker == NULL &&
> !w->in_use)?

That would also do it.  But it's getting a bit fiddly.

--
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: tablesync patch broke the assumption that logical rep depends on?

Petr Jelinek-4
On 21/04/17 16:23, Peter Eisentraut wrote:

> On 4/21/17 10:11, Petr Jelinek wrote:
>> On 21/04/17 16:09, Peter Eisentraut wrote:
>>> On 4/20/17 14:29, Petr Jelinek wrote:
>>>> + /* Find unused worker slot. */
>>>> + if (!w->in_use)
>>>>   {
>>>> - worker = &LogicalRepCtx->workers[slot];
>>>> - break;
>>>> + worker = w;
>>>> + slot = i;
>>>> + }
>>>
>>> Doesn't this still need a break?  Otherwise it always picks the last slot.
>>>
>>
>> Yes it will pick the last slot, does that matter though, is the first
>> one better somehow?
>>
>> We can't break because we also need to continue the counter (I think the
>> issue that the counter solves is probably just theoretical, but still).
>
> I see.  I think the code would be less confusing if we break the loop
> like before and call logicalrep_sync_worker_count() separately.
>
>> Hmm actually, maybe the if (!w->in_use) should be if (worker == NULL &&
>> !w->in_use)?
>
> That would also do it.  But it's getting a bit fiddly.
>

I just wanted to avoid looping twice, especially since the garbage
collecting code has to also do the loop. I guess I'll go with my
original coding for this then which was to put retry label above the
loop first, then try finding worker slot, if found call the
logicalrep_sync_worker_count and if not found do the garbage collection
and if we cleaned up something then goto retry.

--
  Petr Jelinek                  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: tablesync patch broke the assumption that logical rep depends on?

Petr Jelinek-4
On 21/04/17 16:31, Petr Jelinek wrote:

> On 21/04/17 16:23, Peter Eisentraut wrote:
>> On 4/21/17 10:11, Petr Jelinek wrote:
>>> On 21/04/17 16:09, Peter Eisentraut wrote:
>>>> On 4/20/17 14:29, Petr Jelinek wrote:
>>>>> + /* Find unused worker slot. */
>>>>> + if (!w->in_use)
>>>>>   {
>>>>> - worker = &LogicalRepCtx->workers[slot];
>>>>> - break;
>>>>> + worker = w;
>>>>> + slot = i;
>>>>> + }
>>>>
>>>> Doesn't this still need a break?  Otherwise it always picks the last slot.
>>>>
>>>
>>> Yes it will pick the last slot, does that matter though, is the first
>>> one better somehow?
>>>
>>> We can't break because we also need to continue the counter (I think the
>>> issue that the counter solves is probably just theoretical, but still).
>>
>> I see.  I think the code would be less confusing if we break the loop
>> like before and call logicalrep_sync_worker_count() separately.
>>
>>> Hmm actually, maybe the if (!w->in_use) should be if (worker == NULL &&
>>> !w->in_use)?
>>
>> That would also do it.  But it's getting a bit fiddly.
>>
>
> I just wanted to avoid looping twice, especially since the garbage
> collecting code has to also do the loop. I guess I'll go with my
> original coding for this then which was to put retry label above the
> loop first, then try finding worker slot, if found call the
> logicalrep_sync_worker_count and if not found do the garbage collection
> and if we cleaned up something then goto retry.
>
Here is the patch doing just that.

--
  Petr Jelinek                  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Fix-various-concurrency-issues-in-logical-replicatiov3.patch (14K) Download Attachment
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: tablesync patch broke the assumption that logical rep depends on?

Petr Jelinek-4
On 22/04/17 22:09, Petr Jelinek wrote:

> On 21/04/17 16:31, Petr Jelinek wrote:
>> On 21/04/17 16:23, Peter Eisentraut wrote:
>>> On 4/21/17 10:11, Petr Jelinek wrote:
>>>> On 21/04/17 16:09, Peter Eisentraut wrote:
>>>>> On 4/20/17 14:29, Petr Jelinek wrote:
>>>>>> + /* Find unused worker slot. */
>>>>>> + if (!w->in_use)
>>>>>>   {
>>>>>> - worker = &LogicalRepCtx->workers[slot];
>>>>>> - break;
>>>>>> + worker = w;
>>>>>> + slot = i;
>>>>>> + }
>>>>>
>>>>> Doesn't this still need a break?  Otherwise it always picks the last slot.
>>>>>
>>>>
>>>> Yes it will pick the last slot, does that matter though, is the first
>>>> one better somehow?
>>>>
>>>> We can't break because we also need to continue the counter (I think the
>>>> issue that the counter solves is probably just theoretical, but still).
>>>
>>> I see.  I think the code would be less confusing if we break the loop
>>> like before and call logicalrep_sync_worker_count() separately.
>>>
>>>> Hmm actually, maybe the if (!w->in_use) should be if (worker == NULL &&
>>>> !w->in_use)?
>>>
>>> That would also do it.  But it's getting a bit fiddly.
>>>
>>
>> I just wanted to avoid looping twice, especially since the garbage
>> collecting code has to also do the loop. I guess I'll go with my
>> original coding for this then which was to put retry label above the
>> loop first, then try finding worker slot, if found call the
>> logicalrep_sync_worker_count and if not found do the garbage collection
>> and if we cleaned up something then goto retry.
>>
>
> Here is the patch doing just that.
>
And one more revision which also checks in_use when attaching shared
memory. This is mainly to improve the user visible behavior in
theoretical corner case when the worker is supposed to be cleaned up but
actually still manages to start (it would still fail even without this
change but the error message would be more obscure).

--
  Petr Jelinek                  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Fix-various-concurrency-issues-in-logical-replicatiov4.patch (11K) Download Attachment
12
Previous Thread Next Thread
Loading...