Retry Cached Remote Connections for postgres_fdw in case remote backend gets killed/goes away

Retry Cached Remote Connections for postgres_fdw in case remote backend gets killed/goes away

Bharath Rupireddy
Hi,

Currently, with postgres_fdw remote connections cached in the local
backend, a query that uses a cached connection does not check whether
the remote backend has been killed or has gone away; it simply tries
to submit the query and fails if the remote backend is no longer there.

This problem was found during the discussions made in [1].

One way we could solve the above problem is this: when a new foreign
query is fired from the local backend over the cached connection
(assuming the remote backend behind that cached connection got killed
for some reason), instead of failing the query in the local backend
upon detecting the error from the remote backend, we could simply
delete the stale cached entry, try to get another connection to the
remote backend, cache it, and proceed to submit the query. This has to
happen only at the beginning of the remote xact.

This way, instead of failing, the query succeeds as long as the local
backend is able to get a new remote backend connection.

Attaching the patch that implements the above retry mechanism.
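
To make the idea concrete before diving into the patch, here is a rough,
simplified sketch of what such a retry path inside GetConnection() could
look like. This is an illustration only, not the attached patch:
begin_remote_xact(), connect_pg_server(), disconnect_pg_server() and the
ConnCacheEntry fields are the existing pieces of contrib/postgres_fdw's
connection.c, while the lookup helper, the exact retry condition and the
error-state handling here are assumptions.

    /* Illustrative sketch only -- not the actual patch. */
    PGconn *
    GetConnection(UserMapping *user, bool will_prep_stmt)
    {
        /* lookup_conn_cache_entry() is a hypothetical stand-in for the
         * existing hash-table lookup/creation code in connection.c. */
        ConnCacheEntry *entry = lookup_conn_cache_entry(user);
        ForeignServer  *server = GetForeignServer(user->serverid);

        PG_TRY();
        {
            /* Start a new transaction or subtransaction if needed. */
            begin_remote_xact(entry);
        }
        PG_CATCH();
        {
            /*
             * Retry only at the very beginning of the remote xact, and only
             * when the connection itself has gone bad; otherwise rethrow.
             */
            if (entry->xact_depth <= 0 &&
                PQstatus(entry->conn) == CONNECTION_BAD)
            {
                FlushErrorState();              /* discard the caught error */
                disconnect_pg_server(entry);    /* drop the stale connection */
                entry->conn = connect_pg_server(server, user); /* reconnect */
                begin_remote_xact(entry);       /* and try again, once */
            }
            else
                PG_RE_THROW();
        }
        PG_END_TRY();

        entry->have_prep_stmt |= will_prep_stmt;
        return entry->conn;
    }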

The way I tested the patch:
1. select * from foreign_tbl; /* from the local backend - this results in a
remote connection being cached in the postgres_fdw connection cache
and a remote backend being opened. */
2. Kill the remote backend (intentionally), just to simulate the scenario.
3. select * from foreign_tbl; /* from the local backend - without the patch
this throws "ERROR: server closed the connection unexpectedly".
With the patch, we try to use the cached connection at the beginning of
the remote xact; upon receiving an error from the remote postgres server,
instead of aborting the query we delete the cached entry and try to get a
new connection; if we get one, we cache it and use it to execute the
query, and the query succeeds. */

I couldn't think of a way to add a test case to the existing
postgres_fdw regression test suite that automates the scenario of the
remote backend getting killed.

I would like to thank Ashutosh Bapat ([hidden email])
for suggesting this fix and for reviewing my initial patch attached
in [2]. I tried to address the review comments provided on that
initial patch in [3].

Regarding one of Ashutosh's review comments from [3], "the fact that the
same connection may be used by multiple plan nodes": I tried a few use
cases where a single query joins two foreign tables on the same remote
server, so the same connection is used for multiple plan nodes. In this
case we avoid retrying for the second GetConnection() request for the
second foreign table, thanks to the check entry->xact_depth <= 0: after
the first GetConnection() and the first remote query, xact_depth becomes
1, so we never hit the retry logic, and we seem to be safe here (see the
small sketch below). Please let me know if I'm missing something here.
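
As a small illustration of that guard (a sketch only; the field and libpq
names are the existing ones, the surrounding control flow is assumed):

    /*
     * Once the first GetConnection() has successfully started the remote
     * transaction, entry->xact_depth is 1, so a second GetConnection() for
     * another foreign table on the same server never enters the retry path.
     */
    if (entry->xact_depth <= 0 && PQstatus(entry->conn) == CONNECTION_BAD)
    {
        /* beginning of the remote xact and the connection is gone: retry */
    }
    else
    {
        /* mid-transaction, or some other error: just rethrow */
        PG_RE_THROW();
    }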

I request the community to consider the patch for further review if the
overall idea seems beneficial.

[1]  https://www.postgresql.org/message-id/CAExHW5t21B_XPQy_hownm1Qq%3DhMrgOhX%2B8gDj3YEKFvpk%3DVRgw%40mail.gmail.com
[2]  https://www.postgresql.org/message-id/CALj2ACXp6DQ3iLGx5g%2BLgVtGwC4F6K9WzKQJpyR4FfdydQzC_g%40mail.gmail.com
[3]  https://www.postgresql.org/message-id/CAExHW5u3Gyv6Q1BEr6zMg0t%2B59e8c4KMfKVrV3Z%3D4UKKjJ19nQ%40mail.gmail.com

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com

Attachment: v1-Retry-Cached-Remote-Connections-For-postgres_fdw.patch (9K)

Re: Retry Cached Remote Connections for postgres_fdw in case remote backend gets killed/goes away

Kasahara Tatsuhito
Hi,

On Wed, Jul 8, 2020 at 9:40 PM Bharath Rupireddy
<[hidden email]> wrote:
> One way, we could solve the above problem is that, upon firing the new
> foreign query from local backend using the cached connection,
> (assuming the remote backend that was cached in the local backed got
> killed for some reasons), instead of failing the query in the local
> backend, upon detecting error from the remote backend, we could just
> delete the cached old entry and try getting another connection to
> remote backend, cache it and proceed to submit the query. This has to
> happen only at the beginning of remote xact.
+1.

I think this is a very useful feature.
In an environment with connection pooling on the local side, if the
remote server goes through a failover or switchover, this feature would
prevent unexpected errors in local queries after the remote server
recovers.

I haven't looked at the code in detail yet; some comments below.

1. To keep the previous behavior (and performance), how about allowing
the user to specify whether or not to retry, as a GUC parameter or as a
FOREIGN SERVER OPTION?

2. How about logging a LOG message when a retry succeeds, to let us know
that the retry feature worked, or how often the retries worked?

> I couldn't think of adding a test case to the existing postgres_fdw
> regression test suite with an automated scenario of the remote backend
> getting killed.

Couldn't you confirm this by adding a test case like the following?
===================================================
BEGIN;
-- Generate a connection to the remote server
SELECT * FROM ft1 LIMIT 1;

-- Retrieve the pid of the postgres_fdw backend and kill it.
-- We could use another unique identifier (not postgres_fdw but
-- fdw_retry_check, etc.) for the application name.
SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE
backend_type = 'client backend' AND application_name = 'postgres_fdw';

-- COMMIT, so the next query should succeed if connection retry works
COMMIT;
SELECT * FROM ft1 LIMIT 1;
===================================================

Best regards,

--
Tatsuhito Kasahara
kasahara.tatsuhito _at_ gmail.com



Re: Retry Cached Remote Connections for postgres_fdw in case remote backend gets killed/goes away

Bharath Rupireddy
Thanks for the comments. Attaching the v2 patch.

>
> > One way, we could solve the above problem is that, upon firing the new
> > foreign query from local backend using the cached connection,
> > (assuming the remote backend that was cached in the local backed got
> > killed for some reasons), instead of failing the query in the local
> > backend, upon detecting error from the remote backend, we could just
> > delete the cached old entry and try getting another connection to
> > remote backend, cache it and proceed to submit the query. This has to
> > happen only at the beginning of remote xact.
> +1.
>
> I think this is a very useful feature.
> In an environment with connection pooling for local, if a remote
> server has a failover or switchover,
> this feature would prevent unexpected errors of local queries after
> recovery of the remote server.

Thanks for backing this feature.

>
> I haven't looked at the code in detail yet, some comments here.
>

Thanks for the comments. Please feel free to review more of the
attached v2 patch.

>
> 1. To keep the previous behavior (and performance), how about allowing
> the user to specify
>    whether or not to retry as a GUC parameter or in the FOREIGN SERVER OPTION?
>

Do we actually need this? We don't incur much performance overhead with
this connection retry, as we do it only at the beginning of the remote
xact, i.e. at most once per remote session; if we are able to establish
the connection, well and good, otherwise the query is bound to fail anyway.

If at all we need one (if there exists a strong reason to have the
option), then the question is: GUC or SERVER OPTION?

There's a similar discussion going on about a core-level GUC vs. a
SERVER OPTION for postgres_fdw in [1].

>
> 2. How about logging a LOG message when retry was success to let us know
>    the retry feature worked or how often the retries worked ?
>

In the v1 patch I had already added logging messages; in the v2 patch
the message "postgres_fdw connection retry is successful" is added as
well. Please note that all the new logs are emitted at level "DEBUG3",
as all the existing logs are also at that level.
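
For example, something like the following (a sketch; the message text is
the one quoted above, and elog() at DEBUG3 matches the level of the
existing messages in connection.c):

    elog(DEBUG3, "postgres_fdw connection retry is successful");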

>
> > I couldn't think of adding a test case to the existing postgres_fdw
> > regression test suite with an automated scenario of the remote backend
> > getting killed.
>
> Couldn't you confirm this by adding a test case like the following?
> ===================================================
> BEGIN;
> -- Generate a connection to remote
> SELECT * FROM ft1 LIMIT 1;
>
> -- retrieve pid of postgres_fdw and kill it
> -- could use the other unique identifier (not postgres_fdw but
> fdw_retry_check, etc ) for application name
> SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE
> backend_type = 'client backend' AND application_name = 'postgres_fdw'
>
> -- COMMIT, so next query will should success if connection-retry works
> COMMIT;
> SELECT * FROM ft1 LIMIT 1;
> ===================================================
>

Yes, it works this way. Thanks for the suggestion. I added the test
case to the postgres_fdw regression test suite; the v2 patch has these
changes as well.

[1] - https://www.postgresql.org/message-id/CALj2ACVvrp5%3DAVp2PupEm%2BnAC8S4buqR3fJMmaCoc7ftT0aD2A%40mail.gmail.com
With Regards,
Bharath Rupireddy.


Attachment: v2-Retry-Cached-Remote-Connections-For-postgres_fdw.patch (14K)

Re: Retry Cached Remote Connections for postgres_fdw in case remote backend gets killed/goes away

Ashutosh Bapat
In reply to this post by Bharath Rupireddy
On Wed, Jul 8, 2020 at 6:10 PM Bharath Rupireddy
<[hidden email]> wrote:
>
> I couldn't think of adding a test case to the existing postgres_fdw
> regression test suite with an automated scenario of the remote backend
> getting killed.

You could get the remote backend's PID using PQbackendPID() and then
kill it by calling pg_terminate_backend(), to automate the scenario of
the remote backend being killed.

>
> I would like to thank Ashutosh Bapat ([hidden email])
> for the suggestion to fix this and the review of my initial patch
> attached in [2]. I tried to address the review comments provided on my
> initial patch [3].
>
> For, one of the Ashutosh's review comments from [3] "the fact that the
> same connection may be used by multiple plan nodes", I tried to have
> few use cases where there exist joins on two foreign tables on the
> same remote server, in a single query, so essentially, the same
> connection was used for multiple plan nodes. In this case we avoid
> retrying for the second GetConnection() request for the second foreign
> table, with the check entry->xact_depth <= 0 , xact_depth after the
> first GetConnection() and the first remote query will become 1 and we
> don't hit the retry logic and seems like we are safe here. Please add
> If I'm missing something here.
>
> Request the community to consider the patch for further review if the
> overall idea seems beneficial.

I think this idea will be generally useful if your work on dropping
stale connections uses idle_connection_timeout or something like that
on the remote server.

About the patch: it seems we could just catch the error from
begin_remote_xact() in GetConnection() and retry the connection if the
error is "bad connection". Retrying using PQreset() might be better
than always calling PQconnect*.
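
Presumably something along these lines (a sketch of the suggestion only,
not the eventual patch; PQreset(), PQstatus() and FlushErrorState() are
standard calls, the rest reuses names already present in connection.c):

    PG_TRY();
    {
        begin_remote_xact(entry);
    }
    PG_CATCH();
    {
        if (entry->xact_depth <= 0 &&
            PQstatus(entry->conn) == CONNECTION_BAD)
        {
            FlushErrorState();          /* discard the "bad connection" error */
            PQreset(entry->conn);       /* reconnect with the stored conninfo */
            begin_remote_xact(entry);   /* retry once */
        }
        else
            PG_RE_THROW();
    }
    PG_END_TRY();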


>
> [1]  https://www.postgresql.org/message-id/CAExHW5t21B_XPQy_hownm1Qq%3DhMrgOhX%2B8gDj3YEKFvpk%3DVRgw%40mail.gmail.com
> [2]  https://www.postgresql.org/message-id/CALj2ACXp6DQ3iLGx5g%2BLgVtGwC4F6K9WzKQJpyR4FfdydQzC_g%40mail.gmail.com
> [3]  https://www.postgresql.org/message-id/CAExHW5u3Gyv6Q1BEr6zMg0t%2B59e8c4KMfKVrV3Z%3D4UKKjJ19nQ%40mail.gmail.com
>
> With Regards,
> Bharath Rupireddy.
> EnterpriseDB: http://www.enterprisedb.com



--
Best Wishes,
Ashutosh Bapat



Re: Retry Cached Remote Connections for postgres_fdw in case remote backend gets killed/goes away

Bharath Rupireddy
>
> You could get a backend's PID using PQbackendPID and then kill it by
> calling pg_terminate_backend() to kill the remote backend to automate
> scenario of remote backend being killed.
>

I already added a test case in the v2 patch itself (and added one more
test case in the v3 patch), using a similar approach.

>
> > For, one of the Ashutosh's review comments from [3] "the fact that the
> > same connection may be used by multiple plan nodes", I tried to have
> > few use cases where there exist joins on two foreign tables on the
> > same remote server, in a single query, so essentially, the same
> > connection was used for multiple plan nodes. In this case we avoid
> > retrying for the second GetConnection() request for the second foreign
> > table, with the check entry->xact_depth <= 0 , xact_depth after the
> > first GetConnection() and the first remote query will become 1 and we
> > don't hit the retry logic and seems like we are safe here. Please add
> > If I'm missing something here.
> >
> > Request the community to consider the patch for further review if the
> > overall idea seems beneficial.
>
> I think this idea will be generally useful if your work on dropping
> stale connection uses idle_connection_timeout or something like that
> on the remote server.

Assuming we use idle_connection_timeout or some other means (as it is
not yet finalized, I will try to respond in that mail thread) to drop
stale/idle connections from the local backend, I think we have two
options: 1) delete that cached entry from the connection cache entirely,
using disconnect_pg_server() and a hash table remove; this frees up some
space, and we don't have to deal with connection invalidations or bother
resetting the cached entry's other parameters such as xact_depth,
have_prep_stmt, etc.; or 2) just drop the connection using
disconnect_pg_server(), retain the hash entry, reset the other
parameters, and deal with the invalidations; that way we keep
unnecessary info in the cached entry, where we actually don't have a
connection at all, and keep holding some space for the entry.

IMO, option 1 is the better way to go (a rough sketch follows below).
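
A rough sketch of option 1, reusing the existing ConnectionHash and
disconnect_pg_server() from connection.c (the call site and the
surrounding conditions are assumptions):

    /* Option 1 (sketch): drop both the connection and its cache entry. */
    disconnect_pg_server(entry);        /* PQfinish() the remote connection */
    hash_search(ConnectionHash, &entry->key, HASH_REMOVE, NULL);
    /* Nothing left to invalidate or reset; the entry is simply gone. */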

Anyway, this connection retry feature does not depend on the
idle_connection_timeout or stale-connection-dropping feature, because
there can be many reasons why remote backends go away or get killed.

If I'm not sidetracking - if we use something like
idle_session_timeout [1] on the remote server, this retry feature will
be very useful.

>
> About the patch. It seems we could just catch the error from
> begin_remote_xact() in GetConnection() and retry connection if the
> error is "bad connection". Retrying using PQreset() might be better
> than calling PQConnect* always.
>

Thanks for the comment; it made life easier. Attached is the patch with
the changes. Please take a look at the v3 patch and let me know if
something still needs to be done.


With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com

Attachment: v3-Retry-Cached-Remote-Connections-For-postgres_fdw.patch (10K)

Re: Retry Cached Remote Connections for postgres_fdw in case remote backend gets killed/goes away

Ashutosh Bapat
Has this been added to the CF, possibly the next CF?

On Tue, Jul 14, 2020 at 7:27 AM Bharath Rupireddy
<[hidden email]> wrote:

>
> >
> > You could get a backend's PID using PQbackendPID and then kill it by
> > calling pg_terminate_backend() to kill the remote backend to automate
> > scenario of remote backend being killed.
> >
>
> I already added the test case in v2 patch itself(added one more test
> case in v3 patch), using the similar approach.
>
> >
> > > For, one of the Ashutosh's review comments from [3] "the fact that the
> > > same connection may be used by multiple plan nodes", I tried to have
> > > few use cases where there exist joins on two foreign tables on the
> > > same remote server, in a single query, so essentially, the same
> > > connection was used for multiple plan nodes. In this case we avoid
> > > retrying for the second GetConnection() request for the second foreign
> > > table, with the check entry->xact_depth <= 0 , xact_depth after the
> > > first GetConnection() and the first remote query will become 1 and we
> > > don't hit the retry logic and seems like we are safe here. Please add
> > > If I'm missing something here.
> > >
> > > Request the community to consider the patch for further review if the
> > > overall idea seems beneficial.
> >
> > I think this idea will be generally useful if your work on dropping
> > stale connection uses idle_connection_timeout or something like that
> > on the remote server.
>
> Assuming we use idle_connection_timeout or some other means(as it is
> not yet finalized, I will try to respond in that mail chain) to drop
> stale/idle connections from the local backend, I think we have two
> options 1) deleting that cached entry from the connection cache
> entirely using disconnect_pg_server() and hash table remove. This
> frees up some space and we don't have to deal with the connection
> invalidations and don't have to bother on resetting cached entry's
> other parameters such as xact_depth, have_prep_stmt etc. 2) or we
> could just drop the connections using disconnect_pg_server(), retain
> the hash entry, reset other parameters, and deal with the
> invalidations. This is like, we maintain unnecessary info in the
> cached entry, where we actually don't have a connection at all and
> keep holding some space for cached entry.
>
> IMO, if we go with option 1, then it will be good.
>
> Anyways, this connection retry feature will not have any dependency on
> the idle_connection_timeout or dropping stale connection feature,
> because there can be many reasons where remote backends go away/get
> killed.
>
> If I'm not sidetracking - if we use something like
> idle_session_timeout [1] on the remote server, this retry feature will
> be very useful.
>
> >
> > About the patch. It seems we could just catch the error from
> > begin_remote_xact() in GetConnection() and retry connection if the
> > error is "bad connection". Retrying using PQreset() might be better
> > than calling PQConnect* always.
> >
>
> Thanks for the comment, it made life easier. Added the patch with the
> changes. Please take a look at the v3 patch and let me know if still
> something needs to be done.
>
> [1] - https://www.postgresql.org/message-id/763A0689-F189-459E-946F-F0EC4458980B%40hotmail.com
>
> With Regards,
> Bharath Rupireddy.
> EnterpriseDB: http://www.enterprisedb.com



--
Best Wishes,
Ashutosh Bapat



Re: Retry Cached Remote Connections for postgres_fdw in case remote backend gets killed/goes away

Bharath Rupireddy
On Tue, Jul 14, 2020 at 6:13 PM Ashutosh Bapat
<[hidden email]> wrote:
>
> Has this been added to CF, possibly next CF?
>

I have not added it yet. I request that it be considered in this CF, as
the final patch for review (the v3 patch) is ready and shared. We can
target the next CF if there are major issues with the patch.

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com



Re: Retry Cached Remote Connections for postgres_fdw in case remote backend gets killed/goes away

Bharath Rupireddy
>
> On Tue, Jul 14, 2020 at 6:13 PM Ashutosh Bapat
> <[hidden email]> wrote:
> >
> > Has this been added to CF, possibly next CF?
> >
>
> I have not added yet. Request to get it done in this CF, as the final
> patch for review(v3 patch) is ready and shared. We can target it to
> the next CF if there are major issues with the patch.
>

I added this feature to the next CF - https://commitfest.postgresql.org/29/2651/

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com



Re: Retry Cached Remote Connections for postgres_fdw in case remote backend gets killed/goes away

Fujii Masao


On 2020/07/17 21:02, Bharath Rupireddy wrote:

>>
>> On Tue, Jul 14, 2020 at 6:13 PM Ashutosh Bapat
>> <[hidden email]> wrote:
>>>
>>> Has this been added to CF, possibly next CF?
>>>
>>
>> I have not added yet. Request to get it done in this CF, as the final
>> patch for review(v3 patch) is ready and shared. We can target it to
>> the next CF if there are major issues with the patch.
>>
>
> I added this feature to the next CF - https://commitfest.postgresql.org/29/2651/

Thanks for the patch! Here are some comments from me.

+ PQreset(entry->conn);

Isn't using PQreset() to reconnect to the foreign server unsafe?
When a new connection is established, some SET commands need to be
issued, as configure_remote_session() does. But PQreset() doesn't do
that at all.


Originally, when GetConnection() establishes a new connection, it
resets all transient state fields, to be sure all are clean (as the
comment says). With the patch, even when reconnecting to the remote
server, shouldn't we do the same, to be safe?
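
For reference, the reset that GetConnection() performs for a brand-new
cache entry looks roughly like this (a paraphrase of connection.c, using
the existing ConnCacheEntry fields):

    /* Reset all transient state fields, to be sure all are clean */
    entry->xact_depth = 0;
    entry->have_prep_stmt = false;
    entry->have_error = false;
    entry->changing_xact_state = false;
    entry->invalidated = false;
    entry->server_hashvalue =
        GetSysCacheHashValue1(FOREIGNSERVEROID,
                              ObjectIdGetDatum(server->serverid));
    entry->mapping_hashvalue =
        GetSysCacheHashValue1(USERMAPPINGOID,
                              ObjectIdGetDatum(user->umid));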


+ PGresult *res = NULL;
+ res = PQgetResult(entry->conn);
+ PQclear(res);

Are these really necessary? I was thinking they are not, because
pgfdw_get_result() and pgfdw_report_error() seem to do that already in
do_sql_command().


+ /* Start a new transaction or subtransaction if needed. */
+ begin_remote_xact(entry);

Even when there is no cached connection and a new connection is made,
if begin_remote_xact() then reports an error, yet another new connection
is attempted. This behavior looks a bit strange.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION



Re: Retry Cached Remote Connections for postgres_fdw in case remote backend gets killed/goes away

Bharath Rupireddy
Thanks for the review comments. I will post a new patch soon
addressing all the comments.

On Tue, Sep 15, 2020 at 2:49 PM Fujii Masao <[hidden email]> wrote:
>
> +                       PQreset(entry->conn);
>
> Isn't using PQreset() to reconnect to the foreign server unsafe?
> When new connection is established, some SET commands seem
> to need to be issued like configure_remote_session() does.
> But PQreset() doesn't do that at all.
>

This is a good catch. Thanks, I missed this point. Indeed we need to
set the session params. We can do this in two ways: 1. use
configure_remote_session() after PQreset(); 2. use connect_pg_server()
instead of PQreset() and configure_remote_session(). One problem I see
with the 2nd way is that we would perform the checks in
connect_pg_server() twice, even though they were already done before
retrying. The required parameters, such as keywords and values, are
still in the entry->conn structure from the first attempt and can safely
be reused by PQreset(). So I would go with the 1st way. Thoughts?

>
> Originally when GetConnection() establishes new connection,
> it resets all transient state fields, to be sure all are clean (as the
> comment says). With the patch, even when reconnecting
> the remote server, shouldn't we do the same, for safe?
>

I guess there is no need for us to reset all the transient state before
we do begin_remote_xact() in the 2nd attempt. We retry the connection
only when entry->xact_depth <= 0, i.e. at the beginning of the remote
txn, and begin_remote_xact() doesn't modify any transient state when
entry->xact_depth <= 0, except for setting entry->changing_xact_state =
true; all the other transient state is intact in the entry structure. In
the error case, we never reach the code after do_sql_command() in
begin_remote_xact(). If needed, we can simply reset
entry->changing_xact_state to false, which is set to true before
do_sql_command():

        entry->changing_xact_state = true;
        do_sql_command(entry->conn, sql);
        entry->xact_depth = 1;
        entry->changing_xact_state = false;

>
> +                       PGresult *res = NULL;
> +                       res = PQgetResult(entry->conn);
> +                       PQclear(res);
> Are these really necessary? I was just thinking that's not because
> pgfdw_get_result() and pgfdw_report_error() seem to do that
> already in do_sql_command().
>

If an error occurs in the first attempt, we return from
pgfdw_get_result()'s if (!PQconsumeInput(conn)) into the catch block we
added, and pgfdw_report_error() never gets called. The code below is
reached only in the scenarios mentioned in its comment. Removing it
might cause problems if we receive errors other than CONNECTION_BAD, or
for subtxns. We just clear off any result and rethrow the error
upstream. I think there is no problem in having this code here.

        else
        {
            /*
             * We are here, due to either some error other than CONNECTION_BAD
             * occurred or connection may have broken during start of a
             * subtransaction. Just clear off any result and try rethrowing the
             * error, so that it will be caught appropriately.
             */
            PGresult *res = NULL;
            res = PQgetResult(entry->conn);
            PQclear(res);
            PG_RE_THROW();
        }

>
> +               /* Start a new transaction or subtransaction if needed. */
> +               begin_remote_xact(entry);
>
> Even when there is no cached connection and new connection is made,
> then if begin_remote_xact() reports an error, another new connection
> tries to be made again. This behavior looks a bit strange.
>

When there is no cached connection, we try to acquire one; if we can't,
then no error is thrown to the user, we just retry one more time. If we
get a connection in the 2nd attempt, fine; if not, we throw the error to
the user. Assume that in the 1st attempt the remote server is
unreachable; we may still hope to connect in the 2nd attempt. I think
there is no problem here.


With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com



Re: Retry Cached Remote Connections for postgres_fdw in case remote backend gets killed/goes away

Fujii Masao


On 2020/09/17 15:44, Bharath Rupireddy wrote:
> Thanks for the review comments. I will post a new patch soon
> addressing all the comments.

Thanks a lot!


>
> On Tue, Sep 15, 2020 at 2:49 PM Fujii Masao <[hidden email]> wrote:
>>
>> +                       PQreset(entry->conn);
>>
>> Isn't using PQreset() to reconnect to the foreign server unsafe?
>> When new connection is established, some SET commands seem
>> to need to be issued like configure_remote_session() does.
>> But PQreset() doesn't do that at all.
>>
>
> This is a good catch. Thanks, I missed this point. Indeed we need to
> set the session params. We can do this in two ways: 1. use
> configure_remote_session() after PQreset(). 2. use connect_pg_server()
> instead of PQreset() and configure_remote_session(). One problem I see
> with the 2nd way is that we will be doing the checks that are being
> performed in connect_pg_server() twice, which we would have done for
> the first time before retrying.  The required parameters such as
> keywords, values, are still in entry->conn structure from the first
> attempt, which can safely be used by PQreset(). So, I will go with the
> 1st way. Thoughts?


With the 1st way, you may also need to call ReleaseExternalFD() when the new
connection fails to be made, as connect_pg_server() does. Also, don't we
need to check that a non-superuser has used a password to make the new
connection, as connect_pg_server() does? I'm concerned about the case where
pg_hba.conf is changed accidentally so that no password is necessary
at the remote server, and the existing connection is terminated. In this case,
if we connect to the local server as a non-superuser, the connection to
the remote server should fail because the remote server doesn't
require a password. But with your patch, we can successfully reconnect
to the remote server.

Therefore I like the 2nd option. Also, maybe disconnect_pg_server() needs to
be called before that. I'm not sure how useful the 1st option is.
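
For context, the check in question inside connect_pg_server() is
essentially the following (a simplified paraphrase; the exact
superuser/password-required conditions and the error fields differ
slightly across versions):

    /*
     * Check that non-superuser has used password to establish connection;
     * otherwise, he's piggybacking on the postgres superuser's ability to
     * establish a connection.
     */
    if (!superuser() && !PQconnectionUsedPassword(conn))
        ereport(ERROR,
                (errmsg("password is required"),
                 errdetail("Non-superuser cannot connect if the server does not request a password.")));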


>
>>
>> Originally when GetConnection() establishes new connection,
>> it resets all transient state fields, to be sure all are clean (as the
>> comment says). With the patch, even when reconnecting
>> the remote server, shouldn't we do the same, for safe?
>>
>
> I guess there is no need for us to reset all the transient state
> before we do begin_remote_xact() in the 2nd attempt. We retry the
> connection only when entry->xact_depth <= 0 i.e. beginning of the
> remote txn and the begin_remote_xact() doesn't modify any transient
> state if entry->xact_depth <= 0, except for entry->changing_xact_state
> = true; all other transient state is intact in entry structure. In the
> error case, we will not reach the code after do_sql_command in
> begin_remote_xact(). If needed, we can only set
> entry->changing_xact_state to false which is set to true before
> do_sql_command().

What if the 2nd attempt happens with have_prep_stmt=true? I'm not sure
if this case really happens, though. But if it can, it's strange to start
a new connection with have_prep_stmt=true even when the caller of
GetConnection() doesn't intend to create any prepared statements.

I think it's safer to do the 2nd attempt in the same way as the 1st one.
Maybe we can simplify the code by making them into a common code block
or function.


>
>          entry->changing_xact_state = true;
>          do_sql_command(entry->conn, sql);
>          entry->xact_depth = 1;
>          entry->changing_xact_state = false;
>
>>
>> +                       PGresult *res = NULL;
>> +                       res = PQgetResult(entry->conn);
>> +                       PQclear(res);
>> Are these really necessary? I was just thinking that's not because
>> pgfdw_get_result() and pgfdw_report_error() seem to do that
>> already in do_sql_command().
>>
>
> If an error occurs in the first attempt, we return from
> pgfdw_get_result()'s  if (!PQconsumeInput(conn)) to the catch block we
> added and pgfdw_report_error() will never get called. And the below
> part of the code is reached only in scenarios as mentioned in the
> comments. Removing this might have problems if we receive errors other
> than CONNECTION_BAD or for subtxns. We could clear if any result and
> just rethrow the error upstream. I think no problem having this code
> here.

But the original code works without this?
Or do you mean that the original code has a bug?


>
>          else
>          {
>              /*
>               * We are here, due to either some error other than CONNECTION_BAD
>               * occurred or connection may have broken during start of a
>               * subtransacation. Just, clear off any result, try rethrowing the
>               * error, so that it will be caught appropriately.
>               */
>              PGresult *res = NULL;
>              res = PQgetResult(entry->conn);
>              PQclear(res);
>              PG_RE_THROW();
>          }
>
>>
>> +               /* Start a new transaction or subtransaction if needed. */
>> +               begin_remote_xact(entry);
>>
>> Even when there is no cached connection and new connection is made,
>> then if begin_remote_xact() reports an error, another new connection
>> tries to be made again. This behavior looks a bit strange.
>>
>
> When there is no cached connection, we try to acquire one, if we
> can't, then no error will be thrown to the user, just we retry one
> more time. If we get in the 2nd attempt, fine, if not, we will throw
> the error to the user. Assume in the 1st attempt the remote server is
> unreachable, we may hope to connect in the 2nd attempt. I think there
> is no problem here.
>
>
> With Regards,
> Bharath Rupireddy.
> EnterpriseDB: http://www.enterprisedb.com
>
>

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION



Re: Retry Cached Remote Connections for postgres_fdw in case remote backend gets killed/goes away

Bharath Rupireddy
On Thu, Sep 17, 2020 at 10:20 PM Fujii Masao <[hidden email]> wrote:
>

> In 1st way, you may also need to call ReleaseExternalFD() when new connection fails
> to be made, as connect_pg_server() does. Also we need to check that
> non-superuser has used password to make new connection,
> as connect_pg_server() does? I'm concerned about the case where
> pg_hba.conf is changed accidentally so that no password is necessary
> at the remote server and the existing connection is terminated. In this case,
> if we connect to the local server as non-superuser, connection to
> the remote server should fail because the remote server doesn't
> require password. But with your patch, we can successfully reconnect
> to the remote server.
>
> Therefore I like 2nd option. Also maybe disconnect_ps_server() needs to
> be called before that. I'm not sure how much useful 1st option is.
>

Thanks. The above points look sensible. +1 for the 2nd option, i.e. instead of PQreset(entry->conn), let's do disconnect_pg_server() and then connect_pg_server().

>
> What if 2nd attempt happens with have_prep_stmt=true? I'm not sure
> if this case really happens, though. But if that can, it's strange to start
> new connection with have_prep_stmt=true even when the caller of
> GetConnection() doesn't intend to create any prepared statements.
>
> I think it's safer to do 2nd attempt in the same way as 1st one. Maybe
> we can simplify the code by making them into common code block
> or function.
>

I don't think have_prep_stmt will be set by the time we make the 2nd attempt, because entry->have_prep_stmt |= will_prep_stmt; is only reached after we have succeeded in either the 1st or the 2nd attempt. I think we don't need to reset all transient state. No transient state other than changing_xact_state changes from the 1st attempt to the 2nd attempt (see below), so let's set only entry->changing_xact_state to false before the 2nd attempt.

1st attempt:
(gdb) p *entry
$3 = {key = 16389, conn = 0x55a896199990, xact_depth = 0, have_prep_stmt = false,
  have_error = false, changing_xact_state = false, invalidated = false,
  server_hashvalue = 3905865521, mapping_hashvalue = 2617776010}

2nd attempt i.e. in retry block:
(gdb) p *entry
$4 = {key = 16389, conn = 0x55a896199990, xact_depth = 0, have_prep_stmt = false,
  have_error = false, changing_xact_state = true, invalidated = false,
  server_hashvalue = 3905865521, mapping_hashvalue = 2617776010}

> > If an error occurs in the first attempt, we return from
> > pgfdw_get_result()'s  if (!PQconsumeInput(conn)) to the catch block we
> > added and pgfdw_report_error() will never get called. And the below
> > part of the code is reached only in scenarios as mentioned in the
> > comments. Removing this might have problems if we receive errors other
> > than CONNECTION_BAD or for subtxns. We could clear if any result and
> > just rethrow the error upstream. I think no problem having this code
> > here.
>
> But the orignal code works without this?
> Or you mean that the original code has the bug?
>

There's no bug in the original code. Sorry, I was wrong in saying that pgfdw_report_error() will never get called with this patch. Indeed it gets called even when the 1st attempt's connection fails. Since we added an extra try-catch block, we do not throw the error to the user; instead we make a 2nd attempt from the catch block.

I'm okay with removing the below part of the code:

> >> +                       PGresult *res = NULL;
> >> +                       res = PQgetResult(entry->conn);
> >> +                       PQclear(res);
> >> Are these really necessary? I was just thinking that's not because
> >> pgfdw_get_result() and pgfdw_report_error() seem to do that
> >> already in do_sql_command().

Please let me know if you are okay with the above agreed points; I will then work on the new patch.

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com