Timing-sensitive case in src/test/recovery TAP tests

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
16 messages Options
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Timing-sensitive case in src/test/recovery TAP tests

Tom Lane-2
I've been experimenting with a change to pg_ctl, which I'll post
separately, to reduce its reaction time so that it reports success
more quickly after a wait for postmaster start/stop.  I found one
case in "make check-world" that got a failure when I reduced the
reaction time to ~1ms.  That's the very last test in 001_stream_rep.pl,
"cascaded slot xmin reset after startup with hs feedback reset", and
the cause appears to be that it's not allowing any delay time for a
replication slot's state to update after a postmaster restart.

This seems worth fixing independently of any possible code changes,
because it shows that this test could fail on a slow or overloaded
machine.  I couldn't find any instances of such a failure in the
buildfarm archives, but that may have a lot to do with the fact that
owners of slow buildfarm animals are (mostly?) not running this test.

Some experimentation says that the minimum delay needed to make it
work reliably on my workstation is about 100ms, so a simple patch
along the lines of the attached might be good enough.  I find this
approach conceptually dissatisfying, though, since it's still
potentially vulnerable to the failure under sufficient load.
I wonder if there is an easy way to improve that ... maybe convert
to something involving poll_query_until?

                        regards, tom lane


diff --git a/src/test/recovery/t/001_stream_rep.pl b/src/test/recovery/t/001_stream_rep.pl
index 266d27c..8d6edd2 100644
*** a/src/test/recovery/t/001_stream_rep.pl
--- b/src/test/recovery/t/001_stream_rep.pl
*************** isnt($xmin, '', 'cascaded slot xmin non-
*** 265,270 ****
--- 265,272 ----
  # Xmin from a previous run should be cleared on startup.
  $node_standby_2->start;
 
+ sleep(1);  # need some delay before interrogating slot xmin
+
  ($xmin, $catalog_xmin) = get_slot_xmins($node_standby_1, $slotname_2);
  is($xmin, '',
  'cascaded slot xmin reset after startup with hs feedback reset');


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Timing-sensitive case in src/test/recovery TAP tests

Craig Ringer-3
On 26 June 2017 at 05:10, Tom Lane <[hidden email]> wrote:

> I've been experimenting with a change to pg_ctl, which I'll post
> separately, to reduce its reaction time so that it reports success
> more quickly after a wait for postmaster start/stop.  I found one
> case in "make check-world" that got a failure when I reduced the
> reaction time to ~1ms.  That's the very last test in 001_stream_rep.pl,
> "cascaded slot xmin reset after startup with hs feedback reset", and
> the cause appears to be that it's not allowing any delay time for a
> replication slot's state to update after a postmaster restart.
>
> This seems worth fixing independently of any possible code changes,
> because it shows that this test could fail on a slow or overloaded
> machine.  I couldn't find any instances of such a failure in the
> buildfarm archives, but that may have a lot to do with the fact that
> owners of slow buildfarm animals are (mostly?) not running this test.
>
> Some experimentation says that the minimum delay needed to make it
> work reliably on my workstation is about 100ms, so a simple patch
> along the lines of the attached might be good enough.  I find this
> approach conceptually dissatisfying, though, since it's still
> potentially vulnerable to the failure under sufficient load.
> I wonder if there is an easy way to improve that ... maybe convert
> to something involving poll_query_until?

This should do the trick:

$node_standby_1->poll_query_until('postgres', q[SELECT xmin IS NULL
from pg_replication_slots WHERE slot_name = '] . $slotname_2 . q[']);




--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Timing-sensitive case in src/test/recovery TAP tests

Michael Paquier
On Mon, Jun 26, 2017 at 10:48 AM, Craig Ringer <[hidden email]> wrote:
> $node_standby_1->poll_query_until('postgres', q[SELECT xmin IS NULL
> from pg_replication_slots WHERE slot_name = '] . $slotname_2 . q[']);

+1 for avoiding a sleep call if it is not necessary. Fast platforms
would always pay a cost on that, and slow platforms would wait 1s (or
more!) when polling for the result.

Could it be possible to remove as well the second sleep(2) call in
this test please? This could be replaced by a similar poll using the
query Craig has just given as this needs to wait until the xmon is
cleared. Other tests expect a value so this poll cannot happen in
get_slot_xmins.
--
Michael


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Timing-sensitive case in src/test/recovery TAP tests

Tom Lane-2
Michael Paquier <[hidden email]> writes:
> On Mon, Jun 26, 2017 at 10:48 AM, Craig Ringer <[hidden email]> wrote:
>> $node_standby_1->poll_query_until('postgres', q[SELECT xmin IS NULL
>> from pg_replication_slots WHERE slot_name = '] . $slotname_2 . q[']);

> +1 for avoiding a sleep call if it is not necessary. Fast platforms
> would always pay a cost on that, and slow platforms would wait 1s (or
> more!) when polling for the result.

> Could it be possible to remove as well the second sleep(2) call in
> this test please?

Yes, I'd like to see those fixed sleeps go away too.  Want to work
on a concrete patch?

                        regards, tom lane


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Timing-sensitive case in src/test/recovery TAP tests

Craig Ringer-3
On 26 June 2017 at 10:09, Tom Lane <[hidden email]> wrote:

> Michael Paquier <[hidden email]> writes:
>> On Mon, Jun 26, 2017 at 10:48 AM, Craig Ringer <[hidden email]> wrote:
>>> $node_standby_1->poll_query_until('postgres', q[SELECT xmin IS NULL
>>> from pg_replication_slots WHERE slot_name = '] . $slotname_2 . q[']);
>
>> +1 for avoiding a sleep call if it is not necessary. Fast platforms
>> would always pay a cost on that, and slow platforms would wait 1s (or
>> more!) when polling for the result.
>
>> Could it be possible to remove as well the second sleep(2) call in
>> this test please?
>
> Yes, I'd like to see those fixed sleeps go away too.  Want to work
> on a concrete patch?

Attached.


--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

fix-timing-in-tap-001-stream-rep.pl (2K) Download Attachment
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Timing-sensitive case in src/test/recovery TAP tests

Michael Paquier
On Mon, Jun 26, 2017 at 11:44 AM, Craig Ringer <[hidden email]> wrote:

> On 26 June 2017 at 10:09, Tom Lane <[hidden email]> wrote:
>> Michael Paquier <[hidden email]> writes:
>>> On Mon, Jun 26, 2017 at 10:48 AM, Craig Ringer <[hidden email]> wrote:
>>>> $node_standby_1->poll_query_until('postgres', q[SELECT xmin IS NULL
>>>> from pg_replication_slots WHERE slot_name = '] . $slotname_2 . q[']);
>>
>>> +1 for avoiding a sleep call if it is not necessary. Fast platforms
>>> would always pay a cost on that, and slow platforms would wait 1s (or
>>> more!) when polling for the result.
>>
>>> Could it be possible to remove as well the second sleep(2) call in
>>> this test please?
>>
>> Yes, I'd like to see those fixed sleeps go away too.  Want to work
>> on a concrete patch?
>
>
> Attached.

Thanks for the patch.

As long as we are on it, there is this code block in the test:
my ($xmin, $catalog_xmin) = get_slot_xmins($node_master, $slotname_1);
is($xmin,         '', 'non-cascaded slot xmin null with no hs_feedback');
is($catalog_xmin, '', 'non-cascaded slot xmin null with no hs_feedback');

($xmin, $catalog_xmin) = get_slot_xmins($node_standby_1, $slotname_2);
is($xmin,         '', 'cascaded slot xmin null with no hs_feedback');
is($catalog_xmin, '', 'cascaded slot xmin null with no hs_feedback');

This should be more verbose as the 2nd and 4th test should say
"catalog xmin" instead of xmin.

Also, wouldn't it be better to poll as well node_standby_1's
pg_replication_slot on slotname_2? It would really seem better to make
the nullness check conditional in get_slot_xmins instead. Sorry for
changing opinion here.
--
Michael


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Timing-sensitive case in src/test/recovery TAP tests

Craig Ringer-3
On 26 June 2017 at 11:06, Michael Paquier <[hidden email]> wrote:

> As long as we are on it, there is this code block in the test:
> my ($xmin, $catalog_xmin) = get_slot_xmins($node_master, $slotname_1);
> is($xmin,         '', 'non-cascaded slot xmin null with no hs_feedback');
> is($catalog_xmin, '', 'non-cascaded slot xmin null with no hs_feedback');
>
> ($xmin, $catalog_xmin) = get_slot_xmins($node_standby_1, $slotname_2);
> is($xmin,         '', 'cascaded slot xmin null with no hs_feedback');
> is($catalog_xmin, '', 'cascaded slot xmin null with no hs_feedback');
>
> This should be more verbose as the 2nd and 4th test should say
> "catalog xmin" instead of xmin.

Agree, should. Mind posting a revision?

> Also, wouldn't it be better to poll as well node_standby_1's
> pg_replication_slot on slotname_2? It would really seem better to make
> the nullness check conditional in get_slot_xmins instead. Sorry for
> changing opinion here.

I'm not sure I understand this.

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Timing-sensitive case in src/test/recovery TAP tests

Michael Paquier
On Mon, Jun 26, 2017 at 12:12 PM, Craig Ringer <[hidden email]> wrote:

> On 26 June 2017 at 11:06, Michael Paquier <[hidden email]> wrote:
>
>> As long as we are on it, there is this code block in the test:
>> my ($xmin, $catalog_xmin) = get_slot_xmins($node_master, $slotname_1);
>> is($xmin,         '', 'non-cascaded slot xmin null with no hs_feedback');
>> is($catalog_xmin, '', 'non-cascaded slot xmin null with no hs_feedback');
>>
>> ($xmin, $catalog_xmin) = get_slot_xmins($node_standby_1, $slotname_2);
>> is($xmin,         '', 'cascaded slot xmin null with no hs_feedback');
>> is($catalog_xmin, '', 'cascaded slot xmin null with no hs_feedback');
>>
>> This should be more verbose as the 2nd and 4th test should say
>> "catalog xmin" instead of xmin.
>
> Agree, should. Mind posting a revision?
Sure.

>> Also, wouldn't it be better to poll as well node_standby_1's
>> pg_replication_slot on slotname_2? It would really seem better to make
>> the nullness check conditional in get_slot_xmins instead. Sorry for
>> changing opinion here.
>
> I'm not sure I understand this.

The patch attached may explain that better. Your patch added 3 poll
phases. I think that a 4th is needed. At the same time I have found
cleaner to put the poll calls into a small wrapper.
--
Michael


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

fix-timing-in-tap-001-stream-rep-v2.pl (7K) Download Attachment
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Timing-sensitive case in src/test/recovery TAP tests

Tom Lane-2
Michael Paquier <[hidden email]> writes:
> On Mon, Jun 26, 2017 at 12:12 PM, Craig Ringer <[hidden email]> wrote:
>> I'm not sure I understand this.

> The patch attached may explain that better. Your patch added 3 poll
> phases. I think that a 4th is needed. At the same time I have found
> cleaner to put the poll calls into a small wrapper.

Looks good as far as it goes, but I wonder whether any of the other
get_slot_xmins calls need polling too.  Don't feel a need to add such
calls until someone exhibits a failure there, but I won't be very
surprised if someone does.

Anyway, pushed this patch with minor editing.  Thanks!

                        regards, tom lane


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Timing-sensitive case in src/test/recovery TAP tests

Michael Paquier
On Tue, Jun 27, 2017 at 3:44 AM, Tom Lane <[hidden email]> wrote:
> Looks good as far as it goes, but I wonder whether any of the other
> get_slot_xmins calls need polling too.  Don't feel a need to add such
> calls until someone exhibits a failure there, but I won't be very
> surprised if someone does.

I got the same thought, wondering as well if get_slot_xmins should be
renamed check_slot_xmins with the is() tests moved inside it as well.
Not sure if that's worth the API ugliness though.

> Anyway, pushed this patch with minor editing.  Thanks!

Thanks.
--
Michael


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Timing-sensitive case in src/test/recovery TAP tests

Tom Lane-2
Michael Paquier <[hidden email]> writes:
> On Tue, Jun 27, 2017 at 3:44 AM, Tom Lane <[hidden email]> wrote:
>> Looks good as far as it goes, but I wonder whether any of the other
>> get_slot_xmins calls need polling too.  Don't feel a need to add such
>> calls until someone exhibits a failure there, but I won't be very
>> surprised if someone does.

And behold, we have here
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=nightjar&dt=2017-08-08%2020%3A54%3A09

#   Failed test 'xmin of cascaded slot non-null with hs feedback'
#   at t/001_stream_rep.pl line 224.
#          got: ''
#     expected: anything else

That's one of only four calls lacking a preceding wait_slot_xmins call.
The ones at lines 173 and 177 seem safe because nothing has happened yet.
I think the one at line 300 should be safe because the standby_2 server is
shut down at that point, but is there any way that the standby_1's view
hasn't updated by the time that happens?

> I got the same thought, wondering as well if get_slot_xmins should be
> renamed check_slot_xmins with the is() tests moved inside it as well.
> Not sure if that's worth the API ugliness though.

Mmm, doesn't seem like that's worth doing, but I'm half tempted to merge
wait_slot_xmins into get_slot_xmins so you can't skip it ...

                        regards, tom lane


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Timing-sensitive case in src/test/recovery TAP tests

Michael Paquier
On Tue, Aug 8, 2017 at 11:33 PM, Tom Lane <[hidden email]> wrote:
> Michael Paquier <[hidden email]> writes:
>> I got the same thought, wondering as well if get_slot_xmins should be
>> renamed check_slot_xmins with the is() tests moved inside it as well.
>> Not sure if that's worth the API ugliness though.
>
> Mmm, doesn't seem like that's worth doing, but I'm half tempted to merge
> wait_slot_xmins into get_slot_xmins so you can't skip it ...

Let's do that please. Merging both was my first feeling when
refactoring this test upthread. Should I send a patch?
--
Michael


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Timing-sensitive case in src/test/recovery TAP tests

Tom Lane-2
Michael Paquier <[hidden email]> writes:
> On Tue, Aug 8, 2017 at 11:33 PM, Tom Lane <[hidden email]> wrote:
>> Michael Paquier <[hidden email]> writes:
>>> I got the same thought, wondering as well if get_slot_xmins should be
>>> renamed check_slot_xmins with the is() tests moved inside it as well.
>>> Not sure if that's worth the API ugliness though.

>> Mmm, doesn't seem like that's worth doing, but I'm half tempted to merge
>> wait_slot_xmins into get_slot_xmins so you can't skip it ...

> Let's do that please. Merging both was my first feeling when
> refactoring this test upthread. Should I send a patch?

Sure, have at it.

                        regards, tom lane


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Timing-sensitive case in src/test/recovery TAP tests

Michael Paquier
On Wed, Aug 9, 2017 at 3:39 PM, Tom Lane <[hidden email]> wrote:

> Michael Paquier <[hidden email]> writes:
>> On Tue, Aug 8, 2017 at 11:33 PM, Tom Lane <[hidden email]> wrote:
>>> Michael Paquier <[hidden email]> writes:
>>>> I got the same thought, wondering as well if get_slot_xmins should be
>>>> renamed check_slot_xmins with the is() tests moved inside it as well.
>>>> Not sure if that's worth the API ugliness though.
>
>>> Mmm, doesn't seem like that's worth doing, but I'm half tempted to merge
>>> wait_slot_xmins into get_slot_xmins so you can't skip it ...
>
>> Let's do that please. Merging both was my first feeling when
>> refactoring this test upthread. Should I send a patch?
>
> Sure, have at it.
And here you go.
--
Michael


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

tap-simplify-repslot.patch (8K) Download Attachment
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Timing-sensitive case in src/test/recovery TAP tests

Tom Lane-2
Michael Paquier <[hidden email]> writes:
> On Wed, Aug 9, 2017 at 3:39 PM, Tom Lane <[hidden email]> wrote:
>> Michael Paquier <[hidden email]> writes:
>>> Let's do that please. Merging both was my first feeling when
>>> refactoring this test upthread. Should I send a patch?

>> Sure, have at it.

> And here you go.

Pushed with a bit of work on the comments.

                        regards, tom lane


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Timing-sensitive case in src/test/recovery TAP tests

Michael Paquier
On Sun, Aug 13, 2017 at 1:09 AM, Tom Lane <[hidden email]> wrote:

> Michael Paquier <[hidden email]> writes:
>> On Wed, Aug 9, 2017 at 3:39 PM, Tom Lane <[hidden email]> wrote:
>>> Michael Paquier <[hidden email]> writes:
>>>> Let's do that please. Merging both was my first feeling when
>>>> refactoring this test upthread. Should I send a patch?
>
>>> Sure, have at it.
>
>> And here you go.
>
> Pushed with a bit of work on the comments.

Thanks.
--
Michael


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Previous Thread Next Thread
Loading...