Simplify backend terminate and wait logic in postgres_fdw test

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
20 messages Options
Reply | Threaded
Open this post in threaded view
|

Simplify backend terminate and wait logic in postgres_fdw test

Bharath Rupireddy
Hi,

With the recent commit aaf0432572 which introduced a waiting/timeout
capability for pg_teriminate_backend function, I would like to do
$subject. Attaching a patch, please have a look.

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com

v1-0001-Simplify-backend-terminate-and-wait-logic-in-post.patch (8K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Simplify backend terminate and wait logic in postgres_fdw test

Michael Paquier-2
On Thu, Apr 08, 2021 at 04:55:22PM +0530, Bharath Rupireddy wrote:
> With the recent commit aaf0432572 which introduced a waiting/timeout
> capability for pg_teriminate_backend function, I would like to do
> $subject. Attaching a patch, please have a look.

+-- Terminate the remote backend having the specified application_name and wait
+-- for the termination to complete. 10 seconds timeout here is chosen randomly,
+-- we will see a warning if the process doesn't go away within that time.
+SELECT pg_terminate_backend(pid, 10000) FROM pg_stat_activity
+    WHERE application_name = 'fdw_retry_check';

I think that you are making the tests less stable by doing that.  A
couple of buildfarm machines are very slow, and 10 seconds would not
be enough.  So it seems to me that this patch is trading what is a
stable solution for a solution that may finish by randomly bite.
--
Michael

signature.asc (849 bytes) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Simplify backend terminate and wait logic in postgres_fdw test

Bharath Rupireddy
On Thu, Apr 8, 2021 at 5:28 PM Michael Paquier <[hidden email]> wrote:

>
> On Thu, Apr 08, 2021 at 04:55:22PM +0530, Bharath Rupireddy wrote:
> > With the recent commit aaf0432572 which introduced a waiting/timeout
> > capability for pg_teriminate_backend function, I would like to do
> > $subject. Attaching a patch, please have a look.
>
> +-- Terminate the remote backend having the specified application_name and wait
> +-- for the termination to complete. 10 seconds timeout here is chosen randomly,
> +-- we will see a warning if the process doesn't go away within that time.
> +SELECT pg_terminate_backend(pid, 10000) FROM pg_stat_activity
> +    WHERE application_name = 'fdw_retry_check';
>
> I think that you are making the tests less stable by doing that.  A
> couple of buildfarm machines are very slow, and 10 seconds would not
> be enough.  So it seems to me that this patch is trading what is a
> stable solution for a solution that may finish by randomly bite.
Agree. Please see the attached patch, I removed a fixed waiting time.
Instead of relying on pg_stat_activity, pg_sleep and
pg_stat_clear_snapshot, now it depends on pg_terminate_backend and
pg_wait_for_backend_termination. This way we could reduce the
functions that the procedure terminate_backend_and_wait uses and also
the new functions pg_terminate_backend and
pg_wait_for_backend_termination gets a test coverage.

Thoughts?

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com

v2-0001-Simplify-backend-terminate-and-wait-logic-in-post.patch (3K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Simplify backend terminate and wait logic in postgres_fdw test

Michael Paquier-2
On Thu, Apr 08, 2021 at 06:27:56PM +0530, Bharath Rupireddy wrote:
> Agree. Please see the attached patch, I removed a fixed waiting time.
> Instead of relying on pg_stat_activity, pg_sleep and
> pg_stat_clear_snapshot, now it depends on pg_terminate_backend and
> pg_wait_for_backend_termination. This way we could reduce the
> functions that the procedure terminate_backend_and_wait uses and also
> the new functions pg_terminate_backend and
> pg_wait_for_backend_termination gets a test coverage.

+       EXIT WHEN is_terminated;
+       SELECT * INTO is_terminated FROM pg_wait_for_backend_termination(pid_v, 1000);
This is still a regression if the termination takes more than 1s,
no?  In such a case terminate_backend_and_wait() would issue a WARNING
and pollute the regression test output.  I can see the point of what
you are achieving here, and that's a good idea, but from the point of
view of the buildfarm the WARNING introduced by aaf0432 is a no-go.  I
honestly don't quite get the benefit in issuing a WARNING in this case
anyway, as the code already returns false on timeout so as caller
would know the status of the operation.
--
Michael

signature.asc (849 bytes) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Simplify backend terminate and wait logic in postgres_fdw test

Bharath Rupireddy
On Fri, Apr 9, 2021 at 5:51 AM Michael Paquier <[hidden email]> wrote:

>
> On Thu, Apr 08, 2021 at 06:27:56PM +0530, Bharath Rupireddy wrote:
> > Agree. Please see the attached patch, I removed a fixed waiting time.
> > Instead of relying on pg_stat_activity, pg_sleep and
> > pg_stat_clear_snapshot, now it depends on pg_terminate_backend and
> > pg_wait_for_backend_termination. This way we could reduce the
> > functions that the procedure terminate_backend_and_wait uses and also
> > the new functions pg_terminate_backend and
> > pg_wait_for_backend_termination gets a test coverage.
>
> +       EXIT WHEN is_terminated;
> +       SELECT * INTO is_terminated FROM pg_wait_for_backend_termination(pid_v, 1000);
> This is still a regression if the termination takes more than 1s,
> no?  In such a case terminate_backend_and_wait() would issue a WARNING
> and pollute the regression test output.  I can see the point of what
> you are achieving here, and that's a good idea, but from the point of
> view of the buildfarm the WARNING introduced by aaf0432 is a no-go.

I didn't think of the warning cases, my bad. How about using SET
client_min_messages = 'ERROR'; before we call
pg_wait_for_backend_termination? We can only depend on the return
value of pg_wait_for_backend_termination, when true we  can exit. This
way the buildfarm will not see warnings. Thoughts?

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com


Reply | Threaded
Open this post in threaded view
|

Re: Simplify backend terminate and wait logic in postgres_fdw test

Michael Paquier-2
On Fri, Apr 09, 2021 at 06:53:21AM +0530, Bharath Rupireddy wrote:
> I didn't think of the warning cases, my bad. How about using SET
> client_min_messages = 'ERROR'; before we call
> pg_wait_for_backend_termination? We can only depend on the return
> value of pg_wait_for_backend_termination, when true we  can exit. This
> way the buildfarm will not see warnings. Thoughts?

You could do that, but I would also bet that this is going to get
forgotten in the future if this gets extended in more SQL tests that
are output-sensitive, in or out of core.  Honestly, I can get behind a
warning in pg_wait_for_backend_termination() to inform that the
process poked at is not a PostgreSQL one, because it offers new and
useful information to the user.  But, and my apologies for sounding a
bit noisy, I really don't get why pg_wait_until_termination() has any
need to do that.  From what I can see, it provides the following
information:
- A PID, that we already know from the caller or just from
pg_stat_activity.
- A timeout, already known as well.
- The fact that the process did not terminate, information given by
the "false" status, only used in this case.

So there is no new information here to the user, only a duplicate of
what's already known to the caller of this function.  I see more
advantages in removing this WARNING rather than keeping it.
--
Michael

signature.asc (849 bytes) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Simplify backend terminate and wait logic in postgres_fdw test

Kyotaro Horiguchi-4
At Fri, 9 Apr 2021 10:59:44 +0900, Michael Paquier <[hidden email]> wrote in

> On Fri, Apr 09, 2021 at 06:53:21AM +0530, Bharath Rupireddy wrote:
> > I didn't think of the warning cases, my bad. How about using SET
> > client_min_messages = 'ERROR'; before we call
> > pg_wait_for_backend_termination? We can only depend on the return
> > value of pg_wait_for_backend_termination, when true we  can exit. This
> > way the buildfarm will not see warnings. Thoughts?
>
> You could do that, but I would also bet that this is going to get
> forgotten in the future if this gets extended in more SQL tests that
> are output-sensitive, in or out of core.  Honestly, I can get behind a
> warning in pg_wait_for_backend_termination() to inform that the
> process poked at is not a PostgreSQL one, because it offers new and
> useful information to the user.  But, and my apologies for sounding a
> bit noisy, I really don't get why pg_wait_until_termination() has any
> need to do that.  From what I can see, it provides the following
> information:
> - A PID, that we already know from the caller or just from
> pg_stat_activity.
> - A timeout, already known as well.
> - The fact that the process did not terminate, information given by
> the "false" status, only used in this case.
>
> So there is no new information here to the user, only a duplicate of
> what's already known to the caller of this function.  I see more
> advantages in removing this WARNING rather than keeping it.

FWIW I agree to Michael. I faintly remember that I thought the same
while reviewing but it seems that I forgot to write a comment like
that. It's a work of the caller, concretely the existing callers and
any possible script that calls the function.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center


Reply | Threaded
Open this post in threaded view
|

Re: Simplify backend terminate and wait logic in postgres_fdw test

Bharath Rupireddy
In reply to this post by Michael Paquier-2
On Fri, Apr 9, 2021 at 7:29 AM Michael Paquier <[hidden email]> wrote:

>
> On Fri, Apr 09, 2021 at 06:53:21AM +0530, Bharath Rupireddy wrote:
> > I didn't think of the warning cases, my bad. How about using SET
> > client_min_messages = 'ERROR'; before we call
> > pg_wait_for_backend_termination? We can only depend on the return
> > value of pg_wait_for_backend_termination, when true we  can exit. This
> > way the buildfarm will not see warnings. Thoughts?
>
> You could do that, but I would also bet that this is going to get
> forgotten in the future if this gets extended in more SQL tests that
> are output-sensitive, in or out of core.  Honestly, I can get behind a
> warning in pg_wait_for_backend_termination() to inform that the
> process poked at is not a PostgreSQL one, because it offers new and
> useful information to the user.  But, and my apologies for sounding a
> bit noisy, I really don't get why pg_wait_until_termination() has any
> need to do that.  From what I can see, it provides the following
> information:
> - A PID, that we already know from the caller or just from
> pg_stat_activity.
> - A timeout, already known as well.
> - The fact that the process did not terminate, information given by
> the "false" status, only used in this case.
>
> So there is no new information here to the user, only a duplicate of
> what's already known to the caller of this function.  I see more
> advantages in removing this WARNING rather than keeping it.

IMO it does make sense to provide a warning for a bool returning
function, if there are multiple situations in which the function
returns false. This will give clear information as to why the false is
returned.

pg_terminate_backend: false is returned 1) when the process with given
pid is not a backend(warning "PID %d is not a PostgreSQL server
process") 2) if the kill() fails(warning "could not send signal to
process %d: %m") 3) if the timeout is specified and the backend is not
terminated within it(warning "backend with PID %d did not terminate
within %lld milliseconds").
pg_cancel_backend: false is returned 1) when the process with the
given pid is not a backend 2) if the kill() fails.
pg_wait_for_backend_termination: false is returned 1) when the process
with a given pid is not a backend 2) the backend is not terminated
within the timeout.

If we ensure that all the above functions return false in only one
situation and error in all other situations, then removing warnings
makes sense.

Having said above, there seems to be a reason for issuing a warning
and returning false instead of error, that is the callers can just
call these functions in a loop until they return true. See the below
comments:
        /*
         * This is just a warning so a loop-through-resultset will not abort
         * if one backend terminated on its own during the run.
         */
        /* Again, just a warning to allow loops */

I would like to keep the behaviour of these functions as-is.

> You could do that, but I would also bet that this is going to get
> forgotten in the future if this gets extended in more SQL tests that
> are output-sensitive, in or out of core

On the above point of hackers (wanting to use these functions in more
SQL tests) forgetting that the functions pg_terminate_backend,
pg_cancel_backend, pg_wait_for_backend_termination issue a warning in
some cases which might pollute the tests if used without suppressing
these warnings, I feel it is best left to the patch implementers and
the reviewers. On our part, we mentioned that the functions
pg_terminate_backend and pg_wait_for_backend_termination would emit a
warning on timeout "On timeout a warning is emitted and
<literal>false</literal> is returned"

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com


Reply | Threaded
Open this post in threaded view
|

Re: Simplify backend terminate and wait logic in postgres_fdw test

Bharath Rupireddy
In reply to this post by Bharath Rupireddy
On Thu, Apr 8, 2021 at 6:27 PM Bharath Rupireddy
<[hidden email]> wrote:

>
> On Thu, Apr 8, 2021 at 5:28 PM Michael Paquier <[hidden email]> wrote:
> >
> > On Thu, Apr 08, 2021 at 04:55:22PM +0530, Bharath Rupireddy wrote:
> > > With the recent commit aaf0432572 which introduced a waiting/timeout
> > > capability for pg_teriminate_backend function, I would like to do
> > > $subject. Attaching a patch, please have a look.
> >
> > +-- Terminate the remote backend having the specified application_name and wait
> > +-- for the termination to complete. 10 seconds timeout here is chosen randomly,
> > +-- we will see a warning if the process doesn't go away within that time.
> > +SELECT pg_terminate_backend(pid, 10000) FROM pg_stat_activity
> > +    WHERE application_name = 'fdw_retry_check';
> >
> > I think that you are making the tests less stable by doing that.  A
> > couple of buildfarm machines are very slow, and 10 seconds would not
> > be enough.  So it seems to me that this patch is trading what is a
> > stable solution for a solution that may finish by randomly bite.
>
> Agree. Please see the attached patch, I removed a fixed waiting time.
> Instead of relying on pg_stat_activity, pg_sleep and
> pg_stat_clear_snapshot, now it depends on pg_terminate_backend and
> pg_wait_for_backend_termination. This way we could reduce the
> functions that the procedure terminate_backend_and_wait uses and also
> the new functions pg_terminate_backend and
> pg_wait_for_backend_termination gets a test coverage.
>
> Thoughts?
I realized that the usage like below in v2 patch is completely wrong,
because pg_terminate_backend without timeout will return true and the
loop exits without calling pg_wait_for_backend_terminatioin. Sorry for
not realizing this earlier.
    SELECT * INTO is_terminated FROM pg_terminate_backend(pid_v);
    LOOP
        EXIT WHEN is_terminated;
        SELECT * INTO is_terminated FROM
pg_wait_for_backend_termination(pid_v, 1000);
    END LOOP;

I feel that we can provide a high timeout value (It can be 1hr on the
similar lines of using pg_sleep(3600) for crash tests in
013_crash_restart.pl with the assumption that the backend running that
command will get killed with SIGQUIT) and make pg_terminate_backend
wait. This unreasonably high timeout looks okay because of the
assumption that the servers in the build farm will not take that much
time to remove the backend from the system processes, so the function
will return much earlier than that. If at all there's a server(which
is impractical IMO) that doesn't remove the backend process even
within 1hr, then that is anyways will fail with the warning.

With the attached patch, we could remove the procedure
terminate_backend_and_wait altogether. Thoughts?

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com

v3-0001-Simplify-backend-terminate-and-wait-logic-in-post.patch (8K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Simplify backend terminate and wait logic in postgres_fdw test

Michael Paquier-2
On Fri, Apr 09, 2021 at 04:53:01PM +0530, Bharath Rupireddy wrote:

> I feel that we can provide a high timeout value (It can be 1hr on the
> similar lines of using pg_sleep(3600) for crash tests in
> 013_crash_restart.pl with the assumption that the backend running that
> command will get killed with SIGQUIT) and make pg_terminate_backend
> wait. This unreasonably high timeout looks okay because of the
> assumption that the servers in the build farm will not take that much
> time to remove the backend from the system processes, so the function
> will return much earlier than that. If at all there's a server(which
> is impractical IMO) that doesn't remove the backend process even
> within 1hr, then that is anyways will fail with the warning.
You may not need a value as large as 1h for that :)
 
Looking at the TAP tests, some areas have been living with timeouts of
up to 180s.  It is a matter of balance here, a timeout too long would
be annoying as it would make the detection of a problem longer for
machines that are stuck, and a too short value generates false
positives.  5 minutes gives some balance, but there is really no
perfect value.

> With the attached patch, we could remove the procedure
> terminate_backend_and_wait altogether. Thoughts?

That's clearly better, and logically it would work.  As those tests
are new in 14, it may be a good idea to cleanup all that so as all the
branches have the same set of tests.  Would people object to that?
--
Michael

signature.asc (849 bytes) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Simplify backend terminate and wait logic in postgres_fdw test

Bharath Rupireddy
On Mon, Apr 12, 2021 at 11:18 AM Michael Paquier <[hidden email]> wrote:

>
> On Fri, Apr 09, 2021 at 04:53:01PM +0530, Bharath Rupireddy wrote:
> > I feel that we can provide a high timeout value (It can be 1hr on the
> > similar lines of using pg_sleep(3600) for crash tests in
> > 013_crash_restart.pl with the assumption that the backend running that
> > command will get killed with SIGQUIT) and make pg_terminate_backend
> > wait. This unreasonably high timeout looks okay because of the
> > assumption that the servers in the build farm will not take that much
> > time to remove the backend from the system processes, so the function
> > will return much earlier than that. If at all there's a server(which
> > is impractical IMO) that doesn't remove the backend process even
> > within 1hr, then that is anyways will fail with the warning.
>
> You may not need a value as large as 1h for that :)
>
> Looking at the TAP tests, some areas have been living with timeouts of
> up to 180s.  It is a matter of balance here, a timeout too long would
> be annoying as it would make the detection of a problem longer for
> machines that are stuck, and a too short value generates false
> positives.  5 minutes gives some balance, but there is really no
> perfect value.
I changed to 5min. If at all there's any server that would take more
than 5min to remove a process from the system processes list, then it
would see a warning on timeout.

> > With the attached patch, we could remove the procedure
> > terminate_backend_and_wait altogether. Thoughts?
>
> That's clearly better, and logically it would work.  As those tests
> are new in 14, it may be a good idea to cleanup all that so as all the
> branches have the same set of tests.  Would people object to that?

Yes, these tests are introduced in v14, +1 to clean them with this
patch on v14 as well along with master.

Attaching v4, please review further.

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com

v4-0001-Simplify-backend-terminate-and-wait-logic-in-post.patch (8K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Simplify backend terminate and wait logic in postgres_fdw test

Michael Paquier-2
On Mon, Apr 12, 2021 at 11:29:28AM +0530, Bharath Rupireddy wrote:
> I changed to 5min. If at all there's any server that would take more
> than 5min to remove a process from the system processes list, then it
> would see a warning on timeout.

Looks fine to me.  Let's wait a bit first to see if Fujii-san has any
objections to this cleanup as that's his code originally, from
32a9c0bd.
--
Michael

signature.asc (849 bytes) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Simplify backend terminate and wait logic in postgres_fdw test

Michael Paquier-2
On Tue, Apr 13, 2021 at 04:39:58PM +0900, Michael Paquier wrote:
> Looks fine to me.  Let's wait a bit first to see if Fujii-san has any
> objections to this cleanup as that's his code originally, from
> 32a9c0bd.

And hearing nothing, done.  The tests of postgres_fdw are getting much
faster for me now, from basically 6s to 4s (actually that's roughly
1.8s of gain as pg_wait_until_termination waits at least 100ms,
twice), so that's a nice gain.
--
Michael

signature.asc (849 bytes) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Simplify backend terminate and wait logic in postgres_fdw test

Tom Lane-2
Michael Paquier <[hidden email]> writes:
> On Tue, Apr 13, 2021 at 04:39:58PM +0900, Michael Paquier wrote:
>> Looks fine to me.  Let's wait a bit first to see if Fujii-san has any
>> objections to this cleanup as that's his code originally, from
>> 32a9c0bd.

> And hearing nothing, done.  The tests of postgres_fdw are getting much
> faster for me now, from basically 6s to 4s (actually that's roughly
> 1.8s of gain as pg_wait_until_termination waits at least 100ms,
> twice), so that's a nice gain.

The buildfarm is showing that one of these test queries is not stable
under CLOBBER_CACHE_ALWAYS:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hyrax&dt=2021-05-01%2007%3A44%3A47

of which the relevant part is:

diff -U3 /home/buildfarm/buildroot/HEAD/pgsql.build/contrib/postgres_fdw/expected/postgres_fdw.out /home/buildfarm/buildroot/HEAD/pgsql.build/contrib/postgres_fdw/results/postgres_fdw.out
--- /home/buildfarm/buildroot/HEAD/pgsql.build/contrib/postgres_fdw/expected/postgres_fdw.out 2021-05-01 03:44:45.022300613 -0400
+++ /home/buildfarm/buildroot/HEAD/pgsql.build/contrib/postgres_fdw/results/postgres_fdw.out 2021-05-03 09:11:24.051379288 -0400
@@ -9215,8 +9215,7 @@
  WHERE application_name = 'fdw_retry_check';
  pg_terminate_backend
 ----------------------
- t
-(1 row)
+(0 rows)
 
 -- This query should detect the broken connection when starting new remote
 -- transaction, reestablish new connection, and then succeed.

I can reproduce that locally by setting

alter system set debug_invalidate_system_caches_always = 1;

and running "make installcheck" in contrib/postgres_fdw.
(It takes a good long time to run the whole test script
though, so you might want to see if running just these few
queries is enough.)

There's no evidence of distress in the postmaster log,
so I suspect this might just be a timing instability,
e.g. remote process already gone before local process
looks.  If so, it's probably hopeless to make this
test stable as-is.  Perhaps we should just take it out.

                        regards, tom lane


Reply | Threaded
Open this post in threaded view
|

Re: Simplify backend terminate and wait logic in postgres_fdw test

Bharath Rupireddy
On Tue, May 4, 2021 at 4:12 AM Tom Lane <[hidden email]> wrote:

> The buildfarm is showing that one of these test queries is not stable
> under CLOBBER_CACHE_ALWAYS:
>
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hyrax&dt=2021-05-01%2007%3A44%3A47
>
> of which the relevant part is:
>
> diff -U3 /home/buildfarm/buildroot/HEAD/pgsql.build/contrib/postgres_fdw/expected/postgres_fdw.out /home/buildfarm/buildroot/HEAD/pgsql.build/contrib/postgres_fdw/results/postgres_fdw.out
> --- /home/buildfarm/buildroot/HEAD/pgsql.build/contrib/postgres_fdw/expected/postgres_fdw.out   2021-05-01 03:44:45.022300613 -0400
> +++ /home/buildfarm/buildroot/HEAD/pgsql.build/contrib/postgres_fdw/results/postgres_fdw.out    2021-05-03 09:11:24.051379288 -0400
> @@ -9215,8 +9215,7 @@
>         WHERE application_name = 'fdw_retry_check';
>   pg_terminate_backend
>  ----------------------
> - t
> -(1 row)
> +(0 rows)
>
>  -- This query should detect the broken connection when starting new remote
>  -- transaction, reestablish new connection, and then succeed.
Thanks for the report.

> I can reproduce that locally by setting
>
> alter system set debug_invalidate_system_caches_always = 1;
>
> and running "make installcheck" in contrib/postgres_fdw.
> (It takes a good long time to run the whole test script
> though, so you might want to see if running just these few
> queries is enough.)

I can reproduce the issue with the failing case. Issue is that the
backend pid will be null in the pg_stat_activity because of the cache
invalidation that happens at the beginning of the query and hence
pg_terminate_backend returns null on null input.

> There's no evidence of distress in the postmaster log,
> so I suspect this might just be a timing instability,
> e.g. remote process already gone before local process
> looks.  If so, it's probably hopeless to make this
> test stable as-is.  Perhaps we should just take it out.

Actually, that test case covers retry code, so removing it worries me.
Instead, I can do as attached i.e. ignore the pg_terminate_backend
output using PERFORM, as the function signals the backend if the given
pid is a valid backend pid and returns on success. If at all, the
function is to return false, it emits a warning, so it will be caught
in the tests.

And having a retry test case with clobber cache enabled doesn't make
sense because all the cache entries are removed/invalidated for each
query, but the test case covers the code on non-clobber cache
platforms, so I would like to keep it.

Please see the attached, it passes with "alter system set
debug_invalidate_system_caches_always = 1;" on my system.

Thoughts?

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com

v1-0001-Stabilize-a-test-case-in-postgres_fdw.patch (7K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Simplify backend terminate and wait logic in postgres_fdw test

Michael Paquier-2
On Tue, May 04, 2021 at 12:43:53PM +0530, Bharath Rupireddy wrote:
> And having a retry test case with clobber cache enabled doesn't make
> sense because all the cache entries are removed/invalidated for each
> query, but the test case covers the code on non-clobber cache
> platforms, so I would like to keep it.

Yeah, I'd rather keep this test around as it is specific to connection
caches, and it is not time-consuming on fast machines in its new shape
either.  Another trick we could use here could be an aggregate
checking for the number of rows returned, say:
SELECT count(pg_terminate_backend(pid, 180000)) >= 0
  FROM pg_stat_activity
  WHERE application_name = 'fdw_retry_check';

But using CALL as you are suggesting is much cleaner.

(Worth noting that I am out this week for Golden Week, so if this can
wait until Monday, that would be nice.  I am not willing to take my
chances with the buildfarm now :p)
--
Michael

signature.asc (849 bytes) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Simplify backend terminate and wait logic in postgres_fdw test

Tom Lane-2
Michael Paquier <[hidden email]> writes:
> (Worth noting that I am out this week for Golden Week, so if this can
> wait until Monday, that would be nice.  I am not willing to take my
> chances with the buildfarm now :p)

I will see to it.  I think it's important to get a fix in in the next
couple of days, because hyrax has not had a clean run in six weeks.
That animal takes almost a week per test cycle, so the next HEAD run
it starts (two or three days from now) is about our last chance to
get it to go green before beta1 wrap.  I feel it's fairly urgent to
try to do that, because who knows if any other cache-clobber issues
snuck in just before feature freeze.

                        regards, tom lane


Reply | Threaded
Open this post in threaded view
|

Re: Simplify backend terminate and wait logic in postgres_fdw test

Bharath Rupireddy
On Tue, May 4, 2021 at 7:05 PM Tom Lane <[hidden email]> wrote:

>
> Michael Paquier <[hidden email]> writes:
> > (Worth noting that I am out this week for Golden Week, so if this can
> > wait until Monday, that would be nice.  I am not willing to take my
> > chances with the buildfarm now :p)
>
> I will see to it.  I think it's important to get a fix in in the next
> couple of days, because hyrax has not had a clean run in six weeks.
> That animal takes almost a week per test cycle, so the next HEAD run
> it starts (two or three days from now) is about our last chance to
> get it to go green before beta1 wrap.  I feel it's fairly urgent to
> try to do that, because who knows if any other cache-clobber issues
> snuck in just before feature freeze.

Thanks! Can we then take the patch proposed at [1]?

[1] - https://www.postgresql.org/message-id/CALj2ACWqh2nHzPyzP-bAY%2BCaAAbtQRO55AQ_4ppGiU_w8iOvTg%40mail.gmail.com

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com


Reply | Threaded
Open this post in threaded view
|

Re: Simplify backend terminate and wait logic in postgres_fdw test

Tom Lane-2
In reply to this post by Bharath Rupireddy
Bharath Rupireddy <[hidden email]> writes:
> On Tue, May 4, 2021 at 4:12 AM Tom Lane <[hidden email]> wrote:
>> The buildfarm is showing that one of these test queries is not stable
>> under CLOBBER_CACHE_ALWAYS:

> I can reproduce the issue with the failing case. Issue is that the
> backend pid will be null in the pg_stat_activity because of the cache
> invalidation that happens at the beginning of the query and hence
> pg_terminate_backend returns null on null input.

No, that's nonsense: if it were happening that way, the query would
return one row with a NULL result, but actually it's returning no
rows.  What's actually happening, it seems, is that because
pgfdw_inval_callback is constantly getting called due to cache
flushes, we invariably drop remote connections immediately during
transaction termination (cf pgfdw_xact_callback).  Thus, by the time
we inspect pg_stat_activity, there is no remote session to terminate.

I don't like your patch because what it effectively does is mask
whether termination happened or not; if there were a bug there
causing that not to happen, the test would still appear to pass.

I think the most expedient fix, if we want to keep this test, is
just to transiently disable debug_invalidate_system_caches_always.
(That option wasn't available before v14, but fortunately we
don't need a fix for the back branches.)

I believe the attached will do the trick, but I'm running the test
with debug_invalidate_system_caches_always turned on to verify
that.  Should be done in an hour or so...

                        regards, tom lane


diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index 8e1cc69508..6f533c745d 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -9204,6 +9204,12 @@ WARNING:  there is no transaction in progress
 -- Change application_name of remote connection to special one
 -- so that we can easily terminate the connection later.
 ALTER SERVER loopback OPTIONS (application_name 'fdw_retry_check');
+-- If debug_invalidate_system_caches_always is active, it results in
+-- dropping remote connections after every transaction, making it
+-- impossible to test termination meaningfully.  So turn that off
+-- for this test.
+SET debug_invalidate_system_caches_always = 0;
+-- Make sure we have a remote connection.
 SELECT 1 FROM ft1 LIMIT 1;
  ?column?
 ----------
@@ -9227,9 +9233,8 @@ SELECT 1 FROM ft1 LIMIT 1;
         1
 (1 row)
 
--- If the query detects the broken connection when starting new remote
--- subtransaction, it doesn't reestablish new connection and should fail.
--- The text of the error might vary across platforms, so don't show it.
+-- If we detect the broken connection when starting a new remote
+-- subtransaction, we should fail instead of establishing a new connection.
 -- Terminate the remote connection and wait for the termination to complete.
 SELECT pg_terminate_backend(pid, 180000) FROM pg_stat_activity
  WHERE application_name = 'fdw_retry_check';
@@ -9239,11 +9244,13 @@ SELECT pg_terminate_backend(pid, 180000) FROM pg_stat_activity
 (1 row)
 
 SAVEPOINT s;
+-- The text of the error might vary across platforms, so only show SQLSTATE.
 \set VERBOSITY sqlstate
 SELECT 1 FROM ft1 LIMIT 1;    -- should fail
 ERROR:  08006
 \set VERBOSITY default
 COMMIT;
+RESET debug_invalidate_system_caches_always;
 -- =============================================================================
 -- test connection invalidation cases and postgres_fdw_get_connections function
 -- =============================================================================
diff --git a/contrib/postgres_fdw/sql/postgres_fdw.sql b/contrib/postgres_fdw/sql/postgres_fdw.sql
index dcd36a9753..000e2534fc 100644
--- a/contrib/postgres_fdw/sql/postgres_fdw.sql
+++ b/contrib/postgres_fdw/sql/postgres_fdw.sql
@@ -2795,6 +2795,14 @@ ROLLBACK;
 -- Change application_name of remote connection to special one
 -- so that we can easily terminate the connection later.
 ALTER SERVER loopback OPTIONS (application_name 'fdw_retry_check');
+
+-- If debug_invalidate_system_caches_always is active, it results in
+-- dropping remote connections after every transaction, making it
+-- impossible to test termination meaningfully.  So turn that off
+-- for this test.
+SET debug_invalidate_system_caches_always = 0;
+
+-- Make sure we have a remote connection.
 SELECT 1 FROM ft1 LIMIT 1;
 
 -- Terminate the remote connection and wait for the termination to complete.
@@ -2806,18 +2814,20 @@ SELECT pg_terminate_backend(pid, 180000) FROM pg_stat_activity
 BEGIN;
 SELECT 1 FROM ft1 LIMIT 1;
 
--- If the query detects the broken connection when starting new remote
--- subtransaction, it doesn't reestablish new connection and should fail.
--- The text of the error might vary across platforms, so don't show it.
+-- If we detect the broken connection when starting a new remote
+-- subtransaction, we should fail instead of establishing a new connection.
 -- Terminate the remote connection and wait for the termination to complete.
 SELECT pg_terminate_backend(pid, 180000) FROM pg_stat_activity
  WHERE application_name = 'fdw_retry_check';
 SAVEPOINT s;
+-- The text of the error might vary across platforms, so only show SQLSTATE.
 \set VERBOSITY sqlstate
 SELECT 1 FROM ft1 LIMIT 1;    -- should fail
 \set VERBOSITY default
 COMMIT;
 
+RESET debug_invalidate_system_caches_always;
+
 -- =============================================================================
 -- test connection invalidation cases and postgres_fdw_get_connections function
 -- =============================================================================
Reply | Threaded
Open this post in threaded view
|

Re: Simplify backend terminate and wait logic in postgres_fdw test

Bharath Rupireddy
On Tue, May 4, 2021 at 9:08 PM Tom Lane <[hidden email]> wrote:

>
> Bharath Rupireddy <[hidden email]> writes:
> > On Tue, May 4, 2021 at 4:12 AM Tom Lane <[hidden email]> wrote:
> >> The buildfarm is showing that one of these test queries is not stable
> >> under CLOBBER_CACHE_ALWAYS:
>
> > I can reproduce the issue with the failing case. Issue is that the
> > backend pid will be null in the pg_stat_activity because of the cache
> > invalidation that happens at the beginning of the query and hence
> > pg_terminate_backend returns null on null input.
>
> No, that's nonsense: if it were happening that way, the query would
> return one row with a NULL result, but actually it's returning no
> rows.  What's actually happening, it seems, is that because
> pgfdw_inval_callback is constantly getting called due to cache
> flushes, we invariably drop remote connections immediately during
> transaction termination (cf pgfdw_xact_callback).  Thus, by the time
> we inspect pg_stat_activity, there is no remote session to terminate.
>
> I don't like your patch because what it effectively does is mask
> whether termination happened or not; if there were a bug there
> causing that not to happen, the test would still appear to pass.
>
> I think the most expedient fix, if we want to keep this test, is
> just to transiently disable debug_invalidate_system_caches_always.
> (That option wasn't available before v14, but fortunately we
> don't need a fix for the back branches.)
>
> I believe the attached will do the trick, but I'm running the test
> with debug_invalidate_system_caches_always turned on to verify
> that.  Should be done in an hour or so...

Thanks for pushing this change.

If debug_invalidate_system_caches_always is allowed to be used for
cache sensitive test cases, I see an opportunity to make the tests,
that are adjusted by commit f77717b29, more meaningful as they were
before the commit. That commit changed the way below functions show up
output in the tests:
SELECT 1 FROM postgres_fdw_disconnect_all();
SELECT server_name FROM postgres_fdw_get_connections() ORDER BY 1;

If okay, I can work on it (not for PG14 of course). It can be
discussed in a separate thread though.

Thoughts?

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com