[bug fix] PG10: libpq doesn't connect to alternative hosts when some errors occur


Tsunakawa, Takayuki
Hello,

I found a problem with libpq connection failover.  When a connection attempt to an earlier host in the host list fails after the socket connection has been made, libpq doesn't try the other hosts.  For example, when you specify a wrong port that some non-postgres program is using, or some non-postgres program is unexpectedly using PG's port, you get an error like this:

$ psql -h localhost -p 23
psql: received invalid response to SSL negotiation: 
$ psql -h localhost -p 23 -d "sslmode=disable"
psql: expected authentication request from server, but received 

Likewise, when the first host has already reached max_connections, libpq doesn't attempt a connection against the later hosts.

The attached patch fixes this.  I'll add this item to the PostgreSQL 10 Open Items.


Regards
Takayuki Tsunakawa



--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment: libpq-reconnect-on-error.patch (3K)

Re: [bug fix] PG10: libpq doesn't connect to alternative hosts when some errors occur

Michael Paquier
On Fri, May 12, 2017 at 1:28 PM, Tsunakawa, Takayuki
<[hidden email]> wrote:
> Likewise, when the first host has already reached max_connections, libpq doesn't attempt the connection against later hosts.

It seems to me that the feature is behaving as intended. In short:
attempt to connect to the next host only if a connection cannot be
established. If there is a failure once the exchange with the server
has begun, just consider it a hard failure. This is actually an
important property for authentication and SSL connection failures.
--
Michael
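
To make the behavior described in this message concrete, here is a minimal Python sketch of the rule "skip a host only when the socket connection cannot be established; treat any error sent by the server as a hard failure". It is a simulation, not libpq code: the Host class, its fields, and connect_first_available are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Host:
    # Hypothetical model of one entry in the host list.
    name: str
    reachable: bool                      # False models a failed socket-level connect
    server_error: Optional[str] = None   # error sent by the server once the exchange has begun


def connect_first_available(hosts):
    """Return (host_name, error). Only unreachable hosts are skipped."""
    for host in hosts:
        if not host.reachable:
            continue  # socket-level failure: move on to the next host
        if host.server_error is not None:
            # The server answered with an error: hard failure, later hosts are not tried.
            return None, f"{host.name}: {host.server_error}"
        return host.name, None
    return None, "could not connect to any host"


hosts = [
    Host("down-host", reachable=False),
    Host("full-host", reachable=True, server_error="sorry, too many clients already"),
    Host("good-host", reachable=True),
]
result = connect_first_available(hosts)
print(result)
```

In this model the host that is down is skipped, but the host that has reached max_connections ends the attempt, so "good-host" is never tried. That is the case the original report complains about.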



Re: [bug fix] PG10: libpq doesn't connect to alternative hosts when some errors occur

Tsunakawa, Takayuki
From: Michael Paquier [mailto:[hidden email]]
> It seems to me that the feature is behaving as wanted. Or in short attempt
> to connect to the next host only if a connection cannot be established.
> If there is a failure once the exchange with the server has begun, just
> consider it as a hard failure. This is an important property for
> authentication and SSL connection failures actually.

But PgJDBC behaves as expected -- it attempts a connection to the other hosts (and succeeds).  I believe that's what users would naturally expect.  The current libpq implementation handles only socket-level connect failures.

Regards
Takayuki Tsunakawa



Re: [bug fix] PG10: libpq doesn't connect to alternative hosts when some errors occur

Tom Lane-2
In reply to this post by Michael Paquier
Michael Paquier <[hidden email]> writes:
> On Fri, May 12, 2017 at 1:28 PM, Tsunakawa, Takayuki
> <[hidden email]> wrote:
>> Likewise, when the first host has already reached max_connections, libpq doesn't attempt the connection against later hosts.

> It seems to me that the feature is behaving as wanted. Or in short
> attempt to connect to the next host only if a connection cannot be
> established. If there is a failure once the exchange with the server
> has begun, just consider it as a hard failure. This is an important
> property for authentication and SSL connection failures actually.

I would not really expect that reconnection would retry after arbitrary
failure cases.  Should it retry for "wrong database name", for instance?
It's not hard to imagine that leading to very confusing behavior.

                        regards, tom lane



Re: [bug fix] PG10: libpq doesn't connect to alternative hosts when some errors occur

Michael Paquier
On Fri, May 12, 2017 at 10:44 PM, Tom Lane <[hidden email]> wrote:

> Michael Paquier <[hidden email]> writes:
>> On Fri, May 12, 2017 at 1:28 PM, Tsunakawa, Takayuki
>> <[hidden email]> wrote:
>>> Likewise, when the first host has already reached max_connections, libpq doesn't attempt the connection against later hosts.
>
>> It seems to me that the feature is behaving as wanted. Or in short
>> attempt to connect to the next host only if a connection cannot be
>> established. If there is a failure once the exchange with the server
>> has begun, just consider it as a hard failure. This is an important
>> property for authentication and SSL connection failures actually.
>
> I would not really expect that reconnection would retry after arbitrary
> failure cases.  Should it retry for "wrong database name", for instance?
> It's not hard to imagine that leading to very confusing behavior.

I guess not, either. It would be tricky for the user if the behavior
differed depending on the error returned by the server, which is why
the current code is doing things right IMO. Now, the feature has been
designed similarly to JDBC with its parametrization, so it could be
surprising for users to get different failure handling compared to
that. Not saying that JDBC is doing it wrong, but libpq does nothing
wrong either.
--
Michael



Re: [bug fix] PG10: libpq doesn't connect to alternative hosts when some errors occur

Robert Haas
On Sun, May 14, 2017 at 9:19 PM, Michael Paquier
<[hidden email]> wrote:

> On Fri, May 12, 2017 at 10:44 PM, Tom Lane <[hidden email]> wrote:
>> Michael Paquier <[hidden email]> writes:
>>> On Fri, May 12, 2017 at 1:28 PM, Tsunakawa, Takayuki
>>> <[hidden email]> wrote:
>>>> Likewise, when the first host has already reached max_connections, libpq doesn't attempt the connection against later hosts.
>>
>>> It seems to me that the feature is behaving as wanted. Or in short
>>> attempt to connect to the next host only if a connection cannot be
>>> established. If there is a failure once the exchange with the server
>>> has begun, just consider it as a hard failure. This is an important
>>> property for authentication and SSL connection failures actually.
>>
>> I would not really expect that reconnection would retry after arbitrary
>> failure cases.  Should it retry for "wrong database name", for instance?
>> It's not hard to imagine that leading to very confusing behavior.
>
> I guess not as well. That would be tricky for the user to have a
> different behavior depending on the error returned by the server,
> which is why the current code is doing things right IMO. Now, the
> feature has been designed similarly to JDBC with its parametrization,
> so it could be surprising for users to get a different failure
> handling compared to that. Not saying that JDBC is doing it wrong, but
> libpq does nothing wrong either.

I concur.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [bug fix] PG10: libpq doesn't connect to alternative hosts when some errors occur

Tsunakawa, Takayuki
In reply to this post by Michael Paquier
From: Michael Paquier [mailto:[hidden email]]

> On Fri, May 12, 2017 at 10:44 PM, Tom Lane <[hidden email]> wrote:
> > I would not really expect that reconnection would retry after
> > arbitrary failure cases.  Should it retry for "wrong database name", for
> instance?
> > It's not hard to imagine that leading to very confusing behavior.
>
> I guess not as well. That would be tricky for the user to have a different
> behavior depending on the error returned by the server, which is why the
> current code is doing things right IMO. Now, the feature has been designed
> similarly to JDBC with its parametrization, so it could be surprising for
> users to get a different failure handling compared to that. Not saying that
> JDBC is doing it wrong, but libpq does nothing wrong either.

I didn't intend to make the user have a different behavior depending on the error returned by the server.  I meant attempting connection to alternative hosts when the server returned an error. I thought the new libpq feature tries to connect to other hosts when a connection attempt fails, where the "connection" is the *database connection* (user's perspective), not the *socket connection* (PG developer's perspective).  I think PgJDBC meets the user's desire better -- "Please connect to some host for better HA if a database server is unavailable for some reason."

By the way, could you elaborate on what problem could occur if my solution is applied?  (It isn't easy for me to imagine...)  FYI, as shown below, the case Tom raised didn't cause a problem:

[libpq]
$ psql -h localhost,localhost -p 5450,5451 -d aaa
psql: FATAL:  database "aaa" does not exist
$


[JDBC]
$ java org.hsqldb.cmdline.SqlTool postgres
SqlTool v. 3481.
2017-05-15T10:23:55.991+0900  SEVERE  Connection error:
org.postgresql.util.PSQLException: FATAL: database "aaa" does not exist
  Location: File: postinit.c, Routine: InitPostgres, Line: 846
  Server SQLState: 3D000
        at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2412)
        at org.postgresql.core.v3.QueryExecutorImpl.readStartupMessages(QueryExecutorImpl.java:2538)
        at org.postgresql.core.v3.QueryExecutorImpl.<init>(QueryExecutorImpl.java:122)
        at org.postgresql.core.v3.ConnectionFactoryImpl.openConnectionImpl(ConnectionFactoryImpl.java:227)
        at org.postgresql.core.ConnectionFactory.openConnection(ConnectionFactory.java:49)
        at org.postgresql.jdbc.PgConnection.<init>(PgConnection.java:194)
        at org.postgresql.Driver.makeConnection(Driver.java:431)
        at org.postgresql.Driver.connect(Driver.java:247)
        at java.sql.DriverManager.getConnection(DriverManager.java:664)
        at java.sql.DriverManager.getConnection(DriverManager.java:247)
        at org.hsqldb.lib.RCData.getConnection(Unknown Source)
        at org.hsqldb.cmdline.SqlTool.objectMain(Unknown Source)
        at org.hsqldb.cmdline.SqlTool.main(Unknown Source)

Failed to get a connection to 'jdbc:postgresql://localhost:5450,localhost:5451/aaa' as user "tunakawa".
Cause: FATAL: database "aaa" does not exist
  Location: File: postinit.c, Routine: InitPostgres, Line: 846
  Server SQLState: 3D000
$

Regards
Takayuki Tsunakawa







Re: [bug fix] PG10: libpq doesn't connect to alternative hosts when some errors occur

Robert Haas
On Sun, May 14, 2017 at 9:50 PM, Tsunakawa, Takayuki
<[hidden email]> wrote:

>> I guess not as well. That would be tricky for the user to have a different
>> behavior depending on the error returned by the server, which is why the
>> current code is doing things right IMO. Now, the feature has been designed
>> similarly to JDBC with its parametrization, so it could be surprising for
>> users to get a different failure handling compared to that. Not saying that
>> JDBC is doing it wrong, but libpq does nothing wrong either.
>
> I didn't intend to make the user have a different behavior depending on the error returned by the server.  I meant attempting connection to alternative hosts when the server returned an error. I thought the new libpq feature tries to connect to other hosts when a connection attempt fails, where the "connection" is the *database connection* (user's perspective), not the *socket connection* (PG developer's perspective).  I think PgJDBC meets the user's desire better -- "Please connect to some host for better HA if a database server is unavailable for some reason."
>
> By the way, could you elaborate what problem could occur if my solution is applied?  (it doesn't seem easy for me to imagine...)

Sure.  Imagine that the user thinks that 'foo' and 'bar' are the
relevant database servers for some service and writes 'dbname=quux
host=foo,bar' as a connection string.  However, actually the user has
made a mistake and 'foo' is supporting some other service entirely; it
has no database 'quux'; the database servers which have database
'quux' are in fact 'bar' and 'baz'.  All appears well as long as 'bar'
remains up, because the missing-database error for 'foo' is ignored
and we just connect to 'bar'.  However, when 'bar' goes down then we
are out of service instead of failing over to 'baz' as we should have
done.
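
The scenario above can be sketched as a small Python simulation of a client that moves on to the next host after any failure at all. It does not use libpq; connect_retry_on_any_error and the (name, is_up, databases) tuples are hypothetical, and the host and database names are the ones from the example.

```python
def connect_retry_on_any_error(hosts, dbname):
    """hosts: list of (name, is_up, set_of_databases). Returns (connected_host, skipped)."""
    skipped = []
    for name, is_up, databases in hosts:
        if not is_up:
            skipped.append((name, "host is down"))
        elif dbname not in databases:
            # A configuration error, but under retry-on-anything it is passed over too.
            skipped.append((name, f'database "{dbname}" does not exist'))
        else:
            return name, skipped
    return None, skipped


# The user wrote host=foo,bar, but quux actually lives on bar and baz.
configured = [("foo", True, {"other"}), ("bar", True, {"quux"})]
connected, skipped = connect_retry_on_any_error(configured, "quux")
print(connected, skipped)

# Later, bar goes down. baz was never listed, so there is nothing left to fail over to.
after_outage = [("foo", True, {"other"}), ("bar", False, {"quux"})]
connected_later, skipped_later = connect_retry_on_any_error(after_outage, "quux")
print(connected_later, skipped_later)
```

While bar is up, the first call succeeds and the mistake on foo is visible only in the skipped list, which a real client would normally not surface. Once bar is down, the second call fails even though the user believed a second server was configured.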

Now it's quite possible that the user, if they test carefully, might
realize that things are not working as intended, because the DBA might
say "hey, all of your connections are being directed to 'bar' instead
of being load-balanced properly!".  But even if they are careful
enough to realize this, it may not be clear what has gone wrong.
Under your proposal, the connection to 'foo' could be failing for *any
reason whatsoever* from lack of connectivity to a missing database to
a missing user to a missing CONNECT privilege to an authentication
failure.  If the user looks at the server log and can pick out the
entries from their own connection attempts they can figure it out, but
otherwise they might spend quite a bit of time wondering what's wrong;
after all, libpq will report no error, as long as the connection to
the other server works.

Now, this is all arguable.  You could certainly say -- and you are
saying -- that this feature ought to be defined to retry after any
kind of failure whatsoever.  But I think what Tom and Michael and I
are saying is that this is a failover feature and therefore ought to
try the next server when the first one in the list appears to have
gone down, but not when the first one in the list is unhappy with the
connection request for some other reason.  Who is right is a judgement
call, but I don't think it's self-evident that users want to ignore
anything and everything that might have gone wrong with the connection
to the first server, rather than only those things which resemble a
down server.  It seems quite possible to me that if we had defined it
as you are proposing, somebody would now be arguing for a behavior
change in the other direction.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [bug fix] PG10: libpq doesn't connect to alternative hosts when some errors occur

Tom Lane-2
Robert Haas <[hidden email]> writes:
> On Sun, May 14, 2017 at 9:50 PM, Tsunakawa, Takayuki
> <[hidden email]> wrote:
>> By the way, could you elaborate what problem could occur if my solution is applied?  (it doesn't seem easy for me to imagine...)

> Sure.  Imagine that the user thinks that 'foo' and 'bar' are the
> relevant database servers for some service and writes 'dbname=quux
> host=foo,bar' as a connection string.  However, actually the user has
> made a mistake and 'foo' is supporting some other service entirely; it
> has no database 'quux'; the database servers which have database
> 'quux' are in fact 'bar' and 'baz'.

Even more simply, suppose that your userid is known to host bar but the
DBA has forgotten to create it on foo.  This is surely a configuration
error that ought to be rectified, not just failed past, or else you don't
have any of the redundancy you think you do.

Of course, the user would have to try connections to both foo and bar
to be sure that they're both configured correctly.  But he might try
"host=foo,bar" and "host=bar,foo" and figure he was OK, not noticing
that both connections had silently been made to bar.

The bigger picture here is that we only want to fail past transient
errors, not configuration errors.  I'm willing to err in favor of
regarding doubtful cases as transient, but most server login rejections
aren't for transient causes.

There might be specific post-connection errors that we should consider
retrying; "too many connections" is an obvious case.

                        regards, tom lane



Re: [bug fix] PG10: libpq doesn't connect to alternative hosts when some errors occur

Tsunakawa, Takayuki
Hello Robert, Tom,

Thank you for being kind enough to explain.  I think I could understand your concern.

From: [hidden email]
> [mailto:[hidden email]] On Behalf Of Robert Haas
> Who is right is a judgement call, but I don't think it's self-evident that
> users want to ignore anything and everything that might have gone wrong
> with the connection to the first server, rather than only those things which
> resemble a down server.  It seems quite possible to me that if we had defined
> it as you are proposing, somebody would now be arguing for a behavior change
> in the other direction.

Judgment call... so, I understand that it's a matter of choosing between helping to detect configuration errors early and service continuity.  Hmm, I'd like to know how other databases treat this, but I couldn't find useful information after some Google searching.  I wonder whether I should ask the PgJDBC people if they know something, because they chose service continuity.


From: Tom Lane [mailto:[hidden email]]
> The bigger picture here is that we only want to fail past transient errors,
> not configuration errors.  I'm willing to err in favor of regarding doubtful
> cases as transient, but most server login rejections aren't for transient
> causes.

I took "doubtful cases" to mean ones such as specifying a non-existent host or an unused port number.  In those cases, the configuration error can't be distinguished from a server failure.

What do you think of the following cases?  Don't you want to connect to other servers?

* The DBA shuts down the database.  The server takes a long time to do checkpointing.  During the shutdown checkpoint, libpq tries to connect to the server and receives an error "the database system is shutting down."

* The former primary failed and is now trying to start as a standby, catching up by applying WAL.  During the recovery, libpq tries to connect to the server and receives an error "the database system is performing recovery."

* The database server crashed due to a bug.  Unfortunately, the server takes an unexpectedly long time to shut down because it takes many seconds to write the stats file (as you remember, Tom-san experienced 57 seconds to write the stats file during regression tests.)  During the stats file write, libpq tries to connect to the server and receives an error "the database system is shutting down."

These are equivalent to server failure.  I believe we should prioritize rescuing errors during operation over detecting configuration errors.


> Of course, the user would have to try connections to both foo and bar to
> be sure that they're both configured correctly.  But he might try
> "host=foo,bar" and "host=bar,foo" and figure he was OK, not noticing that
> both connections had silently been made to bar.

In that case, I think he would specify "host=foo" and "host=bar" in turn, because he would be worried about where he's connected if he specified multiple hosts.

Regards
Takayuki Tsunakawa




Re: [bug fix] PG10: libpq doesn't connect to alternative hosts when some errors occur

Robert Haas
On Wed, May 17, 2017 at 3:06 AM, Tsunakawa, Takayuki
<[hidden email]> wrote:
> What do you think of the following cases?  Don't you want to connect to other servers?
>
> * The DBA shuts down the database.  The server takes a long time to do checkpointing.  During the shutdown checkpoint, libpq tries to connect to the server and receive an error "the database system is shutting down."
>
> * The former primary failed and now is trying to start as a standby, catching up by applying WAL.  During the recovery, libpq tries to connect to the server and receive an error "the database system is performing recovery."
>
> * The database server crashed due to a bug.  Unfortunately, the server takes unexpectedly long time to shut down because it takes many seconds to write the stats file (as you remember, Tom-san experienced 57 seconds to write the stats file during regression tests.)  During the stats file write, libpq tries to connect to the server and receive an error "the database system is shutting down."
>
> These are equivalent to server failure.  I believe we should prioritize rescuing errors during operation over detecting configuration errors.

Yeah, you have a point.  I'm willing to admit that we may have defined
the behavior of the feature incorrectly, provided that you're willing
to admit that you're proposing a definition change, not just a bug
fix.

Anybody else want to weigh in with an opinion here?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [bug fix] PG10: libpq doesn't connect to alternative hosts when some errors occur

Tom Lane-2
Robert Haas <[hidden email]> writes:
> Yeah, you have a point.  I'm willing to admit that we may have defined
> the behavior of the feature incorrectly, provided that you're willing
> to admit that you're proposing a definition change, not just a bug
> fix.

> Anybody else want to weigh in with an opinion here?

I'm not really on board with "try each server until you find one where
this dbname+username+password combination works".  That's just a recipe
for trouble, especially the password angle.

I think it's a good point that there are certain server responses that
we should take as equivalent to "server down", but by the same token
there are responses that we should not take that way.

I suggest that we need to conditionalize the decision based on what
SQLSTATE is reported.  Not sure offhand if it's better to have a whitelist
of SQLSTATEs that allow failing over to the next server, or a blacklist of
SQLSTATEs that don't.
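
As a rough illustration of the whitelist variant, here is a Python sketch that decides whether to try the next host based on the SQLSTATE the server reported. The SQLSTATE values are real PostgreSQL error codes, but which codes belong in the retry set is only an assumption made for this example, and RETRYABLE_SQLSTATES and pick_host are hypothetical names, not part of libpq.

```python
# Assumed whitelist: codes that look like "server temporarily unavailable".
RETRYABLE_SQLSTATES = {
    "53300",  # too_many_connections
    "57P03",  # cannot_connect_now (e.g. server starting up or shutting down)
}


def pick_host(attempts):
    """attempts: list of (host, reachable, sqlstate); sqlstate None means the login succeeded.

    Returns (host, error): error is None on success, otherwise the SQLSTATE
    that stopped the search or a text message.
    """
    for host, reachable, sqlstate in attempts:
        if not reachable:
            continue                 # socket-level failure: always try the next host
        if sqlstate is None:
            return host, None        # connected
        if sqlstate in RETRYABLE_SQLSTATES:
            continue                 # treated as equivalent to "server down"
        return None, sqlstate        # e.g. 3D000 or 28P01: report it, do not fail past it
    return None, "no host available"


print(pick_host([("a", True, "57P03"), ("b", True, None)]))
print(pick_host([("a", True, "3D000"), ("b", True, None)]))
print(pick_host([("a", False, None), ("b", True, "53300")]))
```

A blacklist would invert the membership test and fail over on any code not listed. The difference matters mostly for codes nobody thought about in advance: a whitelist reports them to the user, a blacklist fails past them.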

                        regards, tom lane



Re: [bug fix] PG10: libpq doesn't connect to alternative hosts when some errors occur

Stephen Frost
Tom, Robert,

* Tom Lane ([hidden email]) wrote:

> Robert Haas <[hidden email]> writes:
> > Yeah, you have a point.  I'm willing to admit that we may have defined
> > the behavior of the feature incorrectly, provided that you're willing
> > to admit that you're proposing a definition change, not just a bug
> > fix.
>
> > Anybody else want to weigh in with an opinion here?
>
> I'm not really on board with "try each server until you find one where
> this dbname+username+password combination works".  That's just a recipe
> for trouble, especially the password angle.

Agreed.

> I think it's a good point that there are certain server responses that
> we should take as equivalent to "server down", but by the same token
> there are responses that we should not take that way.

Right.

> I suggest that we need to conditionalize the decision based on what
> SQLSTATE is reported.  Not sure offhand if it's better to have a whitelist
> of SQLSTATEs that allow failing over to the next server, or a blacklist of
> SQLSTATEs that don't.

No particular comment on this.  I do wonder about forward/backwards
compatibility in such lists and if SQLSTATE really covers all
cases/distinctions which are interesting when it comes to making this
decision.

Thanks!

Stephen


Re: [bug fix] PG10: libpq doesn't connect to alternative hosts when some errors occur

Tom Lane-2
Stephen Frost <[hidden email]> writes:
> * Tom Lane ([hidden email]) wrote:
>> I suggest that we need to conditionalize the decision based on what
>> SQLSTATE is reported.  Not sure offhand if it's better to have a whitelist
>> of SQLSTATEs that allow failing over to the next server, or a blacklist of
>> SQLSTATEs that don't.

> No particular comment on this.  I do wonder about forward/backwards
> compatibility in such lists and if SQLSTATE really covers all
> cases/distinctions which are interesting when it comes to making this
> decision.

If the server is reporting the same SQLSTATE for server-down types
of conditions as for server-up, then that's a bug and we need to change
the SQLSTATE assigned to one case or the other.  The entire point of
SQLSTATE is that it should generally capture distinctions as finely
as client software is likely to be interested in.

                        regards, tom lane



Re: [bug fix] PG10: libpq doesn't connect to alternative hosts when some errors occur

Robert Haas
In reply to this post by Tom Lane-2
On Wed, May 17, 2017 at 12:48 PM, Tom Lane <[hidden email]> wrote:

> Robert Haas <[hidden email]> writes:
>> Yeah, you have a point.  I'm willing to admit that we may have defined
>> the behavior of the feature incorrectly, provided that you're willing
>> to admit that you're proposing a definition change, not just a bug
>> fix.
>
>> Anybody else want to weigh in with an opinion here?
>
> I'm not really on board with "try each server until you find one where
> this dbname+username+password combination works".  That's just a recipe
> for trouble, especially the password angle.

Sure, I know what *your* opinion is.  And I'm somewhat inclined to
agree, but not to the degree that I don't think we should hear what
other people have to say.

> I suggest that we need to conditionalize the decision based on what
> SQLSTATE is reported.  Not sure offhand if it's better to have a whitelist
> of SQLSTATEs that allow failing over to the next server, or a blacklist of
> SQLSTATEs that don't.

Urgh.  There are two things I don't like about that.  First, it's a
major redesign of this feature at the 11th hour.  Second, if we can't
even agree on the general question of whether all, some, or no server
errors should cause a retry, the chances of agreeing on which SQL
states to include in the retry loop are probably pretty low.  Indeed,
there might not be one answer that will be right for everyone.

One good argument for leaving this alone entirely is that this feature
was committed on November 3rd and this thread began on May 12th.  If
there was ample time before feature freeze to question the design and
nobody did, then I'm not sure why we should disregard the freeze to
start whacking it around now, especially on the strength of one
complaint.  It may be that after we get some field experience with
this the right thing to do will become clearer.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [bug fix] PG10: libpq doesn't connect to alternative hosts when some errors occur

Stephen Frost
Robert,

* Robert Haas ([hidden email]) wrote:
> One good argument for leaving this alone entirely is that this feature
> was committed on November 3rd and this thread began on May 12th.  If
> there was ample time before feature freeze to question the design and
> nobody did, then I'm not sure why we should disregard the freeze to
> start whacking it around now, especially on the strength of one
> complaint.  It may be that after we get some field experience with
> this the right thing to do will become clearer.

I am not particularly convinced by this argument.  As much as we hope
that committers have worked with a variety of people with varying
interests and that individuals who are concerned about such start
testing just as soon as something is committed, that, frankly, isn't how
the world really works, based on my observations, at least.

The point of this period of time between feature freeze and actual
release is, more-or-less, to figure out if the solution we've reached
actually is a good one, and if not, to do something about it.

Thanks!

Stephen


Re: [bug fix] PG10: libpq doesn't connect to alternative hosts when some errors occur

Tom Lane-2
Stephen Frost <[hidden email]> writes:
> * Robert Haas ([hidden email]) wrote:
>> One good argument for leaving this alone entirely is that this feature
>> was committed on November 3rd and this thread began on May 12th.  If
>> there was ample time before feature freeze to question the design and
>> nobody did, then I'm not sure why we should disregard the freeze to
>> start whacking it around now, especially on the strength of one
>> complaint.  It may be that after we get some field experience with
>> this the right thing to do will become clearer.

> I am not particularly convinced by this argument.  As much as we hope
> that committers have worked with a variety of people with varying
> interests and that individuals who are concerned about such start
> testing just as soon as something is committed, that, frankly, isn't how
> the world really works, based on my observations, at least.

> The point of this period of time between feature freeze and actual
> release is, more-or-less, to figure out if the solution we've reached
> actually is a good one, and if not, to do something about it.

Sure, but part of the point of beta testing is to get user feedback.

I agree with Robert's point that major redesign of the feature on the
basis of one complaint isn't necessarily the way to go.  Since the
existing behavior is already out in beta1, let's wait and see if anyone
else complains.  We don't need to fix it Right This Instant.

Maybe add this to the list of open issues to reconsider mid-beta?

                        regards, tom lane



Re: [bug fix] PG10: libpq doesn't connect to alternative hosts when some errors occur

David G Johnston
In reply to this post by Tsunakawa, Takayuki
On Wed, May 17, 2017 at 12:06 AM, Tsunakawa, Takayuki <[hidden email]> wrote:
> From: [hidden email] [mailto:[hidden email]] On Behalf Of Robert Haas
>> Who is right is a judgement call, but I don't think it's self-evident that
>> users want to ignore anything and everything that might have gone wrong
>> with the connection to the first server, rather than only those things which
>> resemble a down server.  It seems quite possible to me that if we had defined
>> it as you are proposing, somebody would now be arguing for a behavior change
>> in the other direction.
>
> Judgment call... so, I understood that it's a matter of choosing between helping to detect configuration errors early or service continuity.

This is how I've been reading this thread, and I'm tending to agree with prioritizing service continuity over configuration error detection.  As a client, if I have an alternative that ends up working, I don't really care whose fault it is that the earlier options weren't.  I don't have enough experience to think up plausible scenarios here, but I'm sold on the theory.

David J.


Re: [bug fix] PG10: libpq doesn't connect to alternative hosts when some errors occur

Stephen Frost
In reply to this post by Tom Lane-2
Tom,

* Tom Lane ([hidden email]) wrote:
> I agree with Robert's point that major redesign of the feature on the
> basis of one complaint isn't necessarily the way to go.  Since the
> existing behavior is already out in beta1, let's wait and see if anyone
> else complains.  We don't need to fix it Right This Instant.

Fair enough.

> Maybe add this to the list of open issues to reconsider mid-beta?

Works for me.

Thanks!

Stephen


Re: [bug fix] PG10: libpq doesn't connect to alternative hosts when some errors occur

Tels
In reply to this post by Robert Haas
Moin,

On Wed, May 17, 2017 12:34 pm, Robert Haas wrote:

> On Wed, May 17, 2017 at 3:06 AM, Tsunakawa, Takayuki
> <[hidden email]> wrote:
>> What do you think of the following cases?  Don't you want to connect to
>> other servers?
>>
>> * The DBA shuts down the database.  The server takes a long time to do
>> checkpointing.  During the shutdown checkpoint, libpq tries to connect
>> to the server and receives an error "the database system is shutting
>> down."
>>
>> * The former primary failed and now is trying to start as a standby,
>> catching up by applying WAL.  During the recovery, libpq tries to
>> connect to the server and receives an error "the database system is
>> performing recovery."
>>
>> * The database server crashed due to a bug.  Unfortunately, the server
>> takes an unexpectedly long time to shut down because it takes many seconds
>> to write the stats file (as you remember, Tom-san experienced 57 seconds
>> to write the stats file during regression tests.)  During the stats file
>> write, libpq tries to connect to the server and receives an error "the
>> database system is shutting down."
>>
>> These are equivalent to server failure.  I believe we should prioritize
>> rescuing errors during operation over detecting configuration errors.
>
> Yeah, you have a point.  I'm willing to admit that we may have defined
> the behavior of the feature incorrectly, provided that you're willing
> to admit that you're proposing a definition change, not just a bug
> fix.
>
> Anybody else want to weigh in with an opinion here?

Hm, to me the feature needs to be reliable (for certain values of
reliable) to be useful.

Consider that you have X hosts (redundancy), and a lot of applications
that want a stable connection to the one that (still) works, whichever
that is.

You can then either:

1. make one primary, the other standby(s) and play DNS tricks or similar
to make it appear that there is only one working host, and have all apps
connect to the "one host" (and reconnect to it upon failure)

2. let each app try each host until it finds a working one, if the
connection breaks, retry with the next host

3. or use libpq and let it try the hosts for you.

However, if I understand it correctly, #3 only works reliably in certain
cases (e.g. host down), but not if the host is "sort of down". In that case
each app would again need code to retry different hosts until it finds a
working one, instead of letting libpq do the work.

That makes #3 hard to deploy in practice, as you might just as easily code
up #1 or #2 and call it a day.
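
To make the difference concrete, here is a minimal sketch of the retry loop
from #2, with a switch for the two behaviors being debated in this thread.
It does not call libpq or any real driver: the exception classes, the
connect_first_available function, the retry_on_rejection flag, and the host
names are all hypothetical and exist only for illustration. A "rejection"
here stands for any error raised after something answered on the port, such
as "the database system is shutting down".

```python
from typing import Callable, Iterable, Optional, Tuple


class HostUnreachable(Exception):
    """Hypothetical: nothing answered on the host/port (host down, refused)."""


class ServerRejected(Exception):
    """Hypothetical: something answered, but the session could not be set up
    (e.g. "the database system is shutting down")."""


def connect_first_available(
    hosts: Iterable[str],
    connect: Callable[[str], object],
    retry_on_rejection: bool,
) -> Tuple[str, object]:
    """Try each host in order; return (host, connection) for the first success.

    retry_on_rejection=False: only HostUnreachable moves on to the next host;
    a ServerRejected error is treated as a hard failure and re-raised.
    retry_on_rejection=True: both kinds of error move on to the next host.
    """
    last_error: Optional[Exception] = None
    for host in hosts:
        try:
            return host, connect(host)
        except HostUnreachable as exc:
            last_error = exc
        except ServerRejected as exc:
            if not retry_on_rejection:
                raise
            last_error = exc
    if last_error is None:
        raise ValueError("no hosts given")
    # Every host failed: report the last failure seen.
    raise last_error


def fake_connect(host: str) -> str:
    # Stand-in for a real driver call; the host names are made up.
    if host == "db1":
        raise ServerRejected("the database system is shutting down")
    if host == "db2":
        raise HostUnreachable("connection refused")
    return f"connection to {host}"
```

With the host list db1, db2, db3 and retry_on_rejection=True, the loop skips
both failing hosts and ends up on db3. With retry_on_rejection=False it stops
at db1 with ServerRejected, which is the "sort of down" case described above
where the app would need its own retry code.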

All the best,

Tels

