Rare SSL failures on eelpout

Rare SSL failures on eelpout

Thomas Munro-3
(Moving this discussion from the pgsql-committers thread "pgsql:
Update ssl test certificates and keys", which is innocent.)

On Wed, Jan 16, 2019 at 10:37 AM Thomas Munro
<[hidden email]> wrote:

> On Fri, Jan 4, 2019 at 10:08 AM Thomas Munro
> <[hidden email]> wrote:
> > Thanks.  FWIW I've just updated eelpout (a Debian testing BF animal
> > that runs all the extra tests including SSL) to use libssl-dev
> > (instead of libssl1.0-dev), and cleared its ccache files.  Let's see
> > if that works...
>
> Since that upgrade (to libssl 1.1.1a-1), there have been a few
> intermittent failures in the SSL tests on that animal (thanks Tom for
> pointing that out off-list).

I reproduced this manually.  From time to time (less than 1% of the
times I tried), the server hangs up without sending a goodbye message,
thereby causing a failure to match the expected psql error message.
From the kernel's perspective on the psql side, a working-as-expected
case looks like this:

sendto(3, "\27\3\3\2\356\313\334\372\201\353\201h\204\353\240A\4\355\232\340\375\0X\220\326V\374\253\222i\2724"...,
1115, MSG_NOSIGNAL, NULL, 0) = 1115
ppoll([{fd=3, events=POLLOUT|POLLERR}], 1, NULL, NULL, 0) = 1 ([{fd=3,
revents=POLLOUT}])
sendto(3, "\27\3\3\0f\335\313\224\263\256r\371\215\177\273,N\345\342\200\257\r\321\6\323@\316\241\316\17\204\26"...,
107, MSG_NOSIGNAL, NULL, 0) = 107
ppoll([{fd=3, events=POLLIN|POLLERR}], 1, NULL, NULL, 0) = 1 ([{fd=3,
revents=POLLIN}])
recvfrom(3, "\27\3\3\0\23", 5, 0, NULL, NULL) = 5
recvfrom(3, "I\"\t;\3006\276\347\344]7>\2\234m\340]B\241", 19, 0,
NULL, NULL) = 19
close(3)                                = 0
write(2, "psql: SSL error: sslv3 alert cer"..., 49psql: SSL error:
sslv3 alert certificate revoked
) = 49

... and the unexpected failure case looks like this:

sendto(3, "\27\3\3\2\356/\254\307\277\342G?&BB\321\f\314\245\22\246U\337\263;\203f\302s\306\37\276"...,
1115, MSG_NOSIGNAL, NULL, 0) = 1115
ppoll([{fd=3, events=POLLOUT|POLLERR}], 1, NULL, NULL, 0) = 1 ([{fd=3,
revents=POLLOUT|POLLERR|POLLHUP}])
sendto(3, "\27\3\3\0f\177\335\20\305\353\234\306\211#\343\321\336\22111J\312\323F\210\6U\331\264GAr"...,
107, MSG_NOSIGNAL, NULL, 0) = -1 ECONNRESET (Connection reset by peer)
write(2, "psql: server closed the connecti"..., 199psql: server closed
the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
could not send startup packet: Connection reset by peer
) = 199

To the kernel on the server side, a good case looks like this:

[pid 13740] sendto(8,
"\27\3\3\0\23\252\21\227\2\232s\354\21\243\236\207\301\3B\341\253\306k\346",
24, 0, NULL, 0) = 24
[pid 13740] write(2, "2019-01-22 09:23:04.425 UTC [137"...,
1112019-01-22 09:23:04.425 UTC [13740] [unknown] LOG:  could not
accept SSL connection: certificate verify failed
) = 111
[pid 13740] exit_group(0)               = ?
[pid 13740] +++ exited with 0 +++

I didn't manage to trace a bad case on the server side (maybe the
strace interferes with the required timing), but I noticed that
SSL_shutdown()'s return code (which we don't check) is always -1, and
that SSL_get_error() after that is always SSL_ERROR_SSL.  I
wondered if we might somehow be reaching exit() when the alert
response is still buffered inside OpenSSL, but that doesn't seem right
-- the port has noblock=0 at that stage.

Hmm.  Why is psql doing two sendto() calls without reading a response
in between, when it's possible for the server to exit after the first,
anyway?  Seems like a protocol violation somewhere?

--
Thomas Munro
http://www.enterprisedb.com

Re: Rare SSL failures on eelpout

Tom Lane-2
Thomas Munro <[hidden email]> writes:
> Hmm.  Why is psql doing two sendto() calls without reading a response
> in between, when it's possible for the server to exit after the first,
> anyway?  Seems like a protocol violation somewhere?

Keep in mind this is all down inside the SSL handshake, so if any
protocol is being violated, it's theirs not ours.

My gut reaction is that this probably indicates that in the "certificate
verify failed" code path, we're exiting the server too soon without
letting openssl finish out its handshake fully.  But that could be all
wet, or even if true it might not be convenient to postpone exit (e.g.,
we'd have to save the SSL error code somewhere, I suspect).

The whole thing reminds me of the recent bug #15598:

https://www.postgresql.org/message-id/87k1iy44fd.fsf%40news-spur.riddles.org.uk

                        regards, tom lane

Re: Rare SSL failures on eelpout

Thomas Munro-3
On Wed, Jan 23, 2019 at 4:07 AM Tom Lane <[hidden email]> wrote:
> Thomas Munro <[hidden email]> writes:
> > Hmm.  Why is psql doing two sendto() calls without reading a response
> > in between, when it's possible for the server to exit after the first,
> > anyway?  Seems like a protocol violation somewhere?
>
> Keep in mind this is all down inside the SSL handshake, so if any
> protocol is being violated, it's theirs not ours.

The sendto() of 1115 bytes is SSL_connect()'s last syscall, just
before it returns 1 to indicate success (even though it wasn't
successful?), without waiting for a further response.  The sendto() of
107 bytes is our start-up packet, which either succeeds and is
followed by reading a "certificate revoked" message from the server,
or fails with ECONNRESET if the socket has already been shut down at
the server end due to the racing exit.

It seems very strange to me that the error report is deferred until we
send our start-up packet.  It seems like a response that belongs to
the connection attempt, not our later data sending.  Bug in OpenSSL?
Unintended consequence of our switch to blocking IO at that point?

I tried to find out how this looked on 1.0.2, but it looks like Debian
has just removed the older version from the buster distro and I'm out
of time to hunt this on other OSes today.

> The whole thing reminds me of the recent bug #15598:
>
> https://www.postgresql.org/message-id/87k1iy44fd.fsf%40news-spur.riddles.org.uk

Yeah, if errors get moved to later exchanges but the server might exit
and close its end of the socket before we can manage to initiate a
later exchange, it starts to look just like that.

A less interesting bug is the appearance of 3 nonsensical "Success"
(glibc) or "No error: 0" (FreeBSD) error messages in the server logs
on systems running OpenSSL 1.1.1, much like this, which I guess might
mean EOF:

https://www.postgresql.org/message-id/CAEepm=3cc5wYv=X4Nzy7VOUkdHBiJs9bpLzqtqJWxdDUp5DiPQ@...

--
Thomas Munro
http://www.enterprisedb.com

Re: Rare SSL failures on eelpout

Thomas Munro-5
On Wed, Jan 23, 2019 at 11:23 AM Thomas Munro
<[hidden email]> wrote:
> On Wed, Jan 23, 2019 at 4:07 AM Tom Lane <[hidden email]> wrote:
> > The whole thing reminds me of the recent bug #15598:
> >
> > https://www.postgresql.org/message-id/87k1iy44fd.fsf%40news-spur.riddles.org.uk
>
> Yeah, if errors get moved to later exchanges but the server might exit
> and close its end of the socket before we can manage to initiate a
> later exchange, it starts to look just like that.

Based on some clues from Andrew Gierth (in the email referenced above
and also in an off-list chat), I did some experimentation that seemed
to confirm a theory of his that Linux might be taking a shortcut when
both sides are local, bypassing the RST step because it can see both
ends (whereas normally the TCP stack should cause the *next* sendto()
to fail IIUC?).  I think this case is essentially the same as bug
#15598, it's just happening at a different time.

With a simple socket test program I can see that if you send a single
packet after the remote end has closed and after it had already read
everything you sent it up to now, you get EPIPE.  If there was some
outstanding data from a previous send that it hadn't read yet when it
closed its end, you get ECONNRESET.  This doesn't happen if client and
server are on different machines, or on FreeBSD even on the same
machine, but does happen if client and server are on the same Linux
system (whether using the loopback interface or a real network
interface).  However, after you get ECONNRESET, you can still read the
final data that was sent by the server before it closed, which
presumably contains the error we want to report to the user.  That
suggests that we could perhaps handle ECONNRESET both at startup
packet send time (for certificate rejection, eelpout's case) and at
initial query send (for idle timeout, bug #15598's case) by attempting
to read.  Does that make sense?  I haven't poked into the libpq state
machine stuff to see if that would be easy or hard.
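
A minimal sketch of that sort of test (not the exact program I used; the
port number and messages here are arbitrary).  On a local Linux system the
second send() fails with ECONNRESET, yet the server's parting message can
still be read afterwards:

    import errno
    import os
    import socket
    import time

    PORT = 54321  # arbitrary

    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    listener.bind(("127.0.0.1", PORT))
    listener.listen(1)

    if os.fork() == 0:
        # Child acts as the "server": send a parting message, then close
        # while the client's data is still sitting unread in our receive
        # buffer, which is what provokes the RST.
        conn, _ = listener.accept()
        conn.send(b"final message from the server\n")
        time.sleep(0.5)
        conn.close()
        os._exit(0)

    listener.close()
    client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    client.connect(("127.0.0.1", PORT))
    client.send(b"data the server never reads\n")
    time.sleep(1)                     # let the server close its end first

    try:
        client.send(b"sent after the server has gone\n")
        print("second send succeeded")        # seen on FreeBSD / remote peers
    except socket.error as e:
        print("second send failed: %s" % e)   # ECONNRESET on local Linux

    # ...but the server's parting message is still readable afterwards.
    try:
        print("recv returned: %r" % client.recv(1024))
    except socket.error as e:
        print("recv failed: %s" % e)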

PS: looking again at the strace output from earlier, it's kinda funny
that it says revents=POLLOUT|POLLERR|POLLHUP, since that seems to be a
contradiction: if this were poll() and not ppoll() I think it might
violate POSIX which says "[POLLHUP] and POLLOUT are
mutually-exclusive; a stream can never be writable if a hangup has
occurred", but I don't see what we could do differently with that
anyway.

--
Thomas Munro
https://enterprisedb.com

Re: Rare SSL failures on eelpout

Tom Lane-2
Thomas Munro <[hidden email]> writes:

> With a simple socket test program I can see that if you send a single
> packet after the remote end has closed and after it had already read
> everything you sent it up to now, you get EPIPE.  If there was some
> outstanding data from a previous send that it hadn't read yet when it
> closed its end, you get ECONNRESET.  This doesn't happen if client and
> server are on different machines, or on FreeBSD even on the same
> machine, but does happen if client and server are on the same Linux
> system (whether using the loopback interface or a real network
> interface).  However, after you get ECONNRESET, you can still read the
> final data that was sent by the server before it closed, which
> presumably contains the error we want to report to the user.  That
> suggests that we could perhaps handle ECONNRESET both at startup
> packet send time (for certificate rejection, eelpout's case) and at
> initial query send (for idle timeout, bug #15598's case) by attempting
> to read.  Does that make sense?

Hmm ... it definitely makes sense that we shouldn't assume that a *write*
failure means there is nothing left to *read*.  I would say that's
generally true regardless of the specific errno.  How about the opposite
case, btw --- should we continue to try to flush data after a read
error?  Or is it safe to assume that means the other end is gone?

> I haven't poked into the libpq state
> machine stuff to see if that would be easy or hard.

Me either.

                        regards, tom lane

Re: Rare SSL failures on eelpout

Tom Lane-2
I wrote:
> Thomas Munro <[hidden email]> writes:
>> That suggests that we could perhaps handle ECONNRESET both at startup
>> packet send time (for certificate rejection, eelpout's case) and at
>> initial query send (for idle timeout, bug #15598's case) by attempting
>> to read.  Does that make sense?

> Hmm ... it definitely makes sense that we shouldn't assume that a *write*
> failure means there is nothing left to *read*.

After staring at the code for awhile, I am thinking that there may be
a bug of that ilk, but if so it's down inside OpenSSL.  Perhaps it's
specific to the OpenSSL version you're using on eelpout?  There is
not anything we could do differently in libpq, AFAICS, because it's
OpenSSL's responsibility to read any data that might be available.

I also looked into the idea that we're doing something wrong on the
server side, allowing the final error message to not get flushed out.
A plausible theory there is that SSL_shutdown is returning a WANT_READ
or WANT_WRITE error and we should retry it ... but that doesn't square
with your observation upthread that it's returning SSL_ERROR_SSL.

It's all very confusing, but I think there's a nontrivial chance
that this is an OpenSSL bug, especially since we haven't been able
to replicate it elsewhere.

                        regards, tom lane

Re: Rare SSL failures on eelpout

Thomas Munro-5
On Tue, Mar 5, 2019 at 10:08 AM Tom Lane <[hidden email]> wrote:

> I wrote:
> > Thomas Munro <[hidden email]> writes:
> >> That suggests that we could perhaps handle ECONNRESET both at startup
> >> packet send time (for certificate rejection, eelpout's case) and at
> >> initial query send (for idle timeout, bug #15598's case) by attempting
> >> to read.  Does that make sense?
>
> > Hmm ... it definitely makes sense that we shouldn't assume that a *write*
> > failure means there is nothing left to *read*.
>
> After staring at the code for awhile, I am thinking that there may be
> a bug of that ilk, but if so it's down inside OpenSSL.  Perhaps it's
> specific to the OpenSSL version you're using on eelpout?  There is
> not anything we could do differently in libpq, AFAICS, because it's
> OpenSSL's responsibility to read any data that might be available.
>
> I also looked into the idea that we're doing something wrong on the
> server side, allowing the final error message to not get flushed out.
> A plausible theory there is that SSL_shutdown is returning a WANT_READ
> or WANT_WRITE error and we should retry it ... but that doesn't square
> with your observation upthread that it's returning SSL_ERROR_SSL.
>
> It's all very confusing, but I think there's a nontrivial chance
> that this is an OpenSSL bug, especially since we haven't been able
> to replicate it elsewhere.

Hmm.  Yes, it is strange that we haven't seen it elsewhere, but
remember that very few animals are running the ssl tests; also it
requires particular timing to hit.

OK, here's something.  I can reproduce it quite easily on this
machine, and I can fix it like this:

diff --git a/src/interfaces/libpq/fe-connect.c b/src/interfaces/libpq/fe-connect.c
index f29202db5f..e9c137f1bd 100644
--- a/src/interfaces/libpq/fe-connect.c
+++ b/src/interfaces/libpq/fe-connect.c
@@ -2705,6 +2705,7 @@ keep_going:                     /* We will come back to here until there is
                                           libpq_gettext("could not send startup packet: %s\n"),
                                           SOCK_STRERROR(SOCK_ERRNO, sebuf, sizeof(sebuf)));
                                        free(startpacket);
+                                       pqHandleSendFailure(conn);
                                        goto error_return;
                                }

If I add some printf debugging in there, I can see that block being
reached every hundred or so times I try to connect with a revoked
certificate, and with that extra call to pqHandleSendFailure() in
there the error comes out as it should:

psql: SSL error: sslv3 alert certificate revoked

Now I'm confused because we already have handling like that in
PQsendQuery(), so I can't explain bug #15598.

--
Thomas Munro
https://enterprisedb.com

Re: Rare SSL failures on eelpout

Tom Lane-2
Thomas Munro <[hidden email]> writes:
> On Tue, Mar 5, 2019 at 10:08 AM Tom Lane <[hidden email]> wrote:
>> It's all very confusing, but I think there's a nontrivial chance
>> that this is an OpenSSL bug, especially since we haven't been able
>> to replicate it elsewhere.

> Hmm.  Yes, it is strange that we haven't seen it elsewhere, but
> remember that very few animals are running the ssl tests; also it
> requires particular timing to hit.

True.  I've spent some time today running the ssl tests on various
machines here, without any luck reproducing.

> OK, here's something.  I can reproduce it quite easily on this
> machine, and I can fix it like this:

>    libpq_gettext("could not send startup packet: %s\n"),
>    SOCK_STRERROR(SOCK_ERRNO, sebuf, sizeof(sebuf)));
>                                         free(startpacket);
> +                                       pqHandleSendFailure(conn);
>                                         goto error_return;
>                                 }

Yeah.  But I don't like pqHandleSendFailure all that much: it has strong
constraints on what state libpq has to be in, as a consequence of which
it's called from a bunch of ad-hoc places, and can't be called from
some others.  It's kind of accidental that it will work here.

I was toying with the idea of getting rid of that function in
favor of a design like this:

* Handle any socket-write failure at some low level of libpq by
recording the fact that the error occurred (plus whatever data we
need to produce a message about it), and then starting to just
discard output data.

* Eventually, we'll try to read.  Process any available input data
as usual (and, if we get a read error, report that).  When no more
input data is available, if a socket write failure has been recorded,
report that much as if it were an incoming ERROR message, and then
shut down the connection.

This would essentially give us pqHandleSendFailure-like behavior
across the board, which might be enough to fix this problem as well as
bug #15598.  Or not ... as you say, we haven't thoroughly understood
either issue, so it's possible this wouldn't do it.

This would have the effect of prioritizing reports of socket read
failures over those of write failures, which is an interesting
behavioral change, but offhand I can't see a problem with it.
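
To sketch that flow concretely (in plain Python rather than libpq C, with
purely illustrative names):

    import errno
    import socket

    class Conn(object):
        """Toy connection that defers reporting of send failures."""

        def __init__(self, sock):
            self.sock = sock
            self.send_error = None          # remembered write failure, if any

        def send(self, data):
            if self.send_error is not None:
                return                      # discard further output
            try:
                self.sock.sendall(data)
            except socket.error as e:
                if e.errno in (errno.EPIPE, errno.ECONNRESET):
                    self.send_error = str(e)   # remember it, report later
                else:
                    raise

        def read_response(self):
            # Read errors propagate immediately; they take priority.
            data = self.sock.recv(8192)
            if data:
                return data                 # e.g. the server's parting ERROR message
            # No more input: only now fall back to the remembered write failure.
            if self.send_error is not None:
                raise RuntimeError("could not send data to server: %s" % self.send_error)
            raise RuntimeError("server closed the connection unexpectedly")

The point is only the ordering: buffered input (and read errors) get reported
in preference to a remembered write failure.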

One thing that isn't real clear to me is how much timing sensitivity
there is in "when no more input data is available".  Can we assume that
if we've gotten ECONNRESET or an allied error from a write, then any
data the far end might've wished to send us is already available to
read?  In principle, since TCP allows shutting down either transmission
direction independently, the server could close the read side of its
socket while continuing to transmit data.  In practice, PG servers
don't do that; but could network timing glitches create the same effect?
Even if it's possible, does it happen enough to worry about?

The reason I'm concerned is that I don't think it'd be bright to ignore a
send error until we see input EOF, which'd be the obvious way to solve a
timing problem if there is one.  If our send error was some transient
glitch and the connection is really still open, then we might not get EOF,
but we won't get a server reply either because no message went to the
server.  You could imagine waiting some small interval of time before
deciding that it's time to report the write failure, but ugh.

In any case, consuming immediately-available data before reporting the
write error would already be a step forward over what we're doing now,
I think.

Thoughts?

                        regards, tom lane

Re: Rare SSL failures on eelpout

Thomas Munro-5
On Tue, Mar 5, 2019 at 11:35 AM Tom Lane <[hidden email]> wrote:

> Thomas Munro <[hidden email]> writes:
> > OK, here's something.  I can reproduce it quite easily on this
> > machine, and I can fix it like this:
>
> >    libpq_gettext("could not send startup packet: %s\n"),
> >    SOCK_STRERROR(SOCK_ERRNO, sebuf, sizeof(sebuf)));
> >                                         free(startpacket);
> > +                                       pqHandleSendFailure(conn);
> >                                         goto error_return;
> >                                 }
>
> Yeah.  But I don't like pqHandleSendFailure all that much: it has strong
> constraints on what state libpq has to be in, as a consequence of which
> it's called from a bunch of ad-hoc places, and can't be called from
> some others.  It's kind of accidental that it will work here.
>
> I was toying with the idea of getting rid of that function in
> favor of a design like this:
>
> * Handle any socket-write failure at some low level of libpq by
> recording the fact that the error occurred (plus whatever data we
> need to produce a message about it), and then starting to just
> discard output data.
>
> * Eventually, we'll try to read.  Process any available input data
> as usual (and, if we get a read error, report that).  When no more
> input data is available, if a socket write failure has been recorded,
> report that much as if it were an incoming ERROR message, and then
> shut down the connection.
>
> This would essentially give us pqHandleSendFailure-like behavior
> across the board, which might be enough to fix this problem as well as
> bug #15598.  Or not ... as you say, we haven't thoroughly understood
> either issue, so it's possible this wouldn't do it.
>
> This would have the effect of prioritizing reports of socket read
> failures over those of write failures, which is an interesting
> behavioral change, but offhand I can't see a problem with it.

That seems to recreate (and extend) the behaviour that we see on
other TCP stacks, or on Linux with a remote connection, namely that
the first sendto() succeeds (even though future sendto() calls may
fail, we don't usually get to that because we read and discover an
application-level error message or whatever).

> One thing that isn't real clear to me is how much timing sensitivity
> there is in "when no more input data is available".  Can we assume that
> if we've gotten ECONNRESET or an allied error from a write, then any
> data the far end might've wished to send us is already available to
> read?  In principle, since TCP allows shutting down either transmission
> direction independently, the server could close the read side of its
> socket while continuing to transmit data.  In practice, PG servers
> don't do that; but could network timing glitches create the same effect?
> Even if it's possible, does it happen enough to worry about?

That is beyond my knowledge of networking; I have CC'd Andrew Gierth
in case he has something to say about that.  It certainly seems
sensible to assume that eg RST must follow any data that the other end
sent before closing, if it did indeed call close(), and in our case we
know that it never calls shutdown(SHUT_RD), so the only other
possibility seems to be that it crashed or the connection was lost.
So it seems safe to assume that we can read the server's parting words
after we've had an EPIPE/ECONNRESET error.  Based on some quick
googling, it looks like SHUT_RD doesn't actually send anything anyway
(unlike SHUT_WR which sends FIN and would result in EOF at the other
end), so I'm not sure if SHUT_RD even "exists" outside the mind of the
TCP stack doing it and therefore I'm not sure if there is any "timing
glitch" that could resemble it.  But I don't know.

> The reason I'm concerned is that I don't think it'd be bright to ignore a
> send error until we see input EOF, which'd be the obvious way to solve a
> timing problem if there is one.  If our send error was some transient
> glitch and the connection is really still open, then we might not get EOF,
> but we won't get a server reply either because no message went to the
> server.  You could imagine waiting some small interval of time before
> deciding that it's time to report the write failure, but ugh.
>
> In any case, consuming immediately-available data before reporting the
> write error would already be a step forward over what we're doing now,
> I think.
>
> Thoughts?

It seems reasonable to me on the grounds that sending data doesn't
mean the other guy got it anyway and we can see that on other TCP
stacks.  Preferring errors resulting from reads seems sensible because
it's exactly what we want in these cases, and I can't immediately
think of a downside.  It's bigger surgery than I was thinking of but
it seems like it removes some pre-existing complicated code and
replaces it with something simple, so that seems to be a win.

--
Thomas Munro
https://enterprisedb.com

Re: Rare SSL failures on eelpout

Thomas Munro-5
In reply to this post by Tom Lane-2
On Tue, Mar 5, 2019 at 11:35 AM Tom Lane <[hidden email]> wrote:
> True.  I've spent some time today running the ssl tests on various
> machines here, without any luck reproducing.

BTW, while looking for other failures on the buildfarm I noticed that
even for eelpout it's only happening on master and REL_11_STABLE:

https://buildfarm.postgresql.org/cgi-bin/show_failures.pl?max_days=200&branch=&stage=sslCheck&filter=Submit

Disappointingly, that turned out to be just because 10 and earlier
didn't care what the error message said.

--
Thomas Munro
https://enterprisedb.com

Re: Rare SSL failures on eelpout

Tom Lane-2
Thomas Munro <[hidden email]> writes:
> BTW, while looking for other failures on the buildfarm I noticed that
> even for eelpout it's only happening on master and REL_11_STABLE:

Yeah, I'd noticed that.

> Disappointingly, that turned out to be just because 10 and earlier
> didn't care what the error message said.

That is, you can reproduce the failure on old branches?  That lets
out a half-theory I'd had, which was that Andres' changes to make
the backend always run its socket in nonblock mode had had something
to do with it.  (Those changes do represent a plausible reason why
SSL_shutdown might be returning WANT_READ/WANT_WRITE; but I'm not
in a hurry to add such code without evidence that it actually
happens and something useful would change if we retry.)

                        regards, tom lane

Re: Rare SSL failures on eelpout

Thomas Munro-5
On Wed, Mar 6, 2019 at 3:33 AM Tom Lane <[hidden email]> wrote:

> Thomas Munro <[hidden email]> writes:
> > Disappointingly, that turned out to be just because 10 and earlier
> > didn't care what the error message said.
>
> That is, you can reproduce the failure on old branches?  That lets
> out a half-theory I'd had, which was that Andres' changes to make
> the backend always run its socket in nonblock mode had had something
> to do with it.  (Those changes do represent a plausible reason why
> SSL_shutdown might be returning WANT_READ/WANT_WRITE; but I'm not
> in a hurry to add such code without evidence that it actually
> happens and something useful would change if we retry.)

Yes, on REL_10_STABLE:

$ for i in `seq 1 1000`; do
    psql "host=localhost port=56024 dbname=certdb user=postgres sslcert=ssl/client-revoked.crt sslkey=ssl/client-revoked.key"
  done
psql: SSL error: sslv3 alert certificate revoked
psql: SSL error: sslv3 alert certificate revoked
psql: SSL error: sslv3 alert certificate revoked
...
psql: SSL error: sslv3 alert certificate revoked
psql: SSL error: sslv3 alert certificate revoked
psql: SSL error: sslv3 alert certificate revoked
psql: server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
could not send startup packet: Connection reset by peer
psql: SSL error: sslv3 alert certificate revoked
psql: SSL error: sslv3 alert certificate revoked
psql: SSL error: sslv3 alert certificate revoked
psql: SSL error: sslv3 alert certificate revoked

Zooming in with strace:

sendto(3, "\27\3\3\2\356\r\214\352@\21\320\202\236}\376\367\262\227\177\255\212\204`q\254\108\326\201+c)"...,
1115, MSG_NOSIGNAL, NULL, 0) = 1115
ppoll([{fd=3, events=POLLOUT|POLLERR}], 1, NULL, NULL, 0) = 1 ([{fd=3,
revents=POLLOUT|POLLERR|POLLHUP}])
sendto(3, "\27\3\3\0cW_\210\337Q\227\360\216k\221\346\372pw\27\325P\203\357\245km\304Rx\355\200"...,
104, MSG_NOSIGNAL, NULL, 0) = -1 ECONNRESET (Connection reset by peer)

You can see that poll() already knew the other end had closed the
socket.  Since this is clearly timing... let's see, yeah, I can make
it fail every time by adding sleep(1) before the comment "Send the
startup packet.".  I assume that'll work on any Linux machine?

To set this test up, I ran a server with the following config:

ssl=on
ssl_ca_file='root+client_ca.crt'
ssl_cert_file='server-cn-only.crt'
ssl_key_file='server-cn-only.key'
ssl_crl_file='root+client.crl'

I copied those files out of src/test/ssl/ssl/.  Then I ran the psql
command shown earlier.  I think I had to chmod 600 the keys.

--
Thomas Munro
https://enterprisedb.com

Re: Rare SSL failures on eelpout

Tom Lane-2
Thomas Munro <[hidden email]> writes:
> You can see that poll() already knew the other end had closed the
> socket.  Since this is clearly timing... let's see, yeah, I can make
> it fail every time by adding sleep(1) before the comment "Send the
> startup packet.".  I assume that'll work on any Linux machine?

Great idea, but no cigar --- doesn't do anything for me except make
the ssl test really slow.  (I tried it on RHEL6 and Fedora 28 and, just
for luck, current macOS.)  What this seems to prove is that the thing
that's different about eelpout is the particular kernel it's running,
and that that kernel has some weird timing behavior in this situation.

I've also been experimenting with reducing libpq's SO_SNDBUF setting
on the socket, with more or less the same idea of making the sending
of the startup packet slower.  No joy there either.

Annoying.  I'd be happier about writing code to fix this if I could
reproduce it :-(

                        regards, tom lane

PS: but now I'm wondering about trying other non-Linux kernels.

Re: Rare SSL failures on eelpout

Thomas Munro-5
On Wed, Mar 6, 2019 at 6:07 AM Tom Lane <[hidden email]> wrote:

> Thomas Munro <[hidden email]> writes:
> > You can see that poll() already knew the other end had closed the
> > socket.  Since this is clearly timing... let's see, yeah, I can make
> > it fail every time by adding sleep(1) before the comment "Send the
> > startup packet.".  I assume that'll work on any Linux machine?
>
> Great idea, but no cigar --- doesn't do anything for me except make
> the ssl test really slow.  (I tried it on RHEL6 and Fedora 28 and, just
> for luck, current macOS.)  What this seems to prove is that the thing
> that's different about eelpout is the particular kernel it's running,
> and that that kernel has some weird timing behavior in this situation.
>
> I've also been experimenting with reducing libpq's SO_SNDBUF setting
> on the socket, with more or less the same idea of making the sending
> of the startup packet slower.  No joy there either.
>
> Annoying.  I'd be happier about writing code to fix this if I could
> reproduce it :-(

Hmm.  Note that eelpout only started doing it with OpenSSL 1.1.1.  But
I just tried the sleep(1) trick on an x86 box running the same version
of Debian, OpenSSL etc and it didn't work.  So eelpout (a super cheap
virtualised 4-core ARMv8 system rented from scaleway.com running
Debian Buster with a kernel identifying itself as 4.9.23-std-1 and
libc6 2.28-7) is indeed starting to look pretty weird.  Let me know if
you want to log in and experiment on that machine.

--
Thomas Munro
https://enterprisedb.com

Re: Rare SSL failures on eelpout

Thomas Munro-5
On Wed, Mar 6, 2019 at 7:05 AM Thomas Munro <[hidden email]> wrote:
> On Wed, Mar 6, 2019 at 6:07 AM Tom Lane <[hidden email]> wrote:
> > Annoying.  I'd be happier about writing code to fix this if I could
> > reproduce it :-(
>
> Hmm.  Note that eelpout only started doing it with OpenSSL 1.1.1.

Bleugh.  I think this OpenSSL package might just be buggy on ARM.  On
x86, apparently the same version of OpenSSL and all other details of
the test the same, I can see that SSL_connect() returns <= 0
(failure), and then we ask for that cert revoked message directly and
never even reach the startup packet sending code.  On ARM,
SSL_connect() returns 1 (success) and then we proceed as discussed and
eventually get the error later (or not).  So I think I should figure
out a minimal repro and report this to them.

--
Thomas Munro
https://enterprisedb.com

Re: Rare SSL failures on eelpout

Tom Lane-2
Thomas Munro <[hidden email]> writes:
> Bleugh.  I think this OpenSSL package might just be buggy on ARM.  On
> x86, apparently the same version of OpenSSL and all other details of
> the test the same, I can see that SSL_connect() returns <= 0
> (failure), and then we ask for that cert revoked message directly and
> never even reach the startup packet sending code.  On ARM,
> SSL_connect() returns 1 (success) and then we proceed as discussed and
> eventually get the error later (or not).  So I think I should figure
> out a minimal repro and report this to them.

Yeah, I've still been unable to reproduce even with the sleep idea,
so eelpout is definitely looking like a special snowflake from here.
In any case, there seems little doubt that getting past SSL_connect()
when the cert check has failed is an OpenSSL bug; I don't feel a need
to create a workaround for it.

The bug #15598 report is more troublesome, as we don't have a strong
reason to believe it's not common on Windows.  However, I wonder whether
we can really do anything at all about that one.  If I understand what
Andrew was hypothesizing in that thread, it was that Windows might be
dropping undelivered data on the floor once the server closes its end
of the connection.  That would be totally broken behavior, but I never
expect anything else from Microsoft :-(.  If that is an accurate theory
then rewriting libpq won't fix it.

I still think the redesign I suggested upthread would make things cleaner,
but I don't have time or interest to make it happen in the near future
if it's not fixing an observable bug.

                        regards, tom lane

Re: Rare SSL failures on eelpout

Thomas Munro-5
On Wed, Mar 6, 2019 at 9:21 AM Tom Lane <[hidden email]> wrote:
> The bug #15598 report is more troublesome, as we don't have a strong
> reason to believe it's not common on Windows.  However, I wonder whether
> we can really do anything at all about that one.  If I understand what
> Andrew was hypothesizing in that thread, it was that Windows might be
> dropping undelivered data on the floor once the server closes its end
> of the connection.  That would be totally broken behavior, but I never
> expect anything else from Microsoft :-(.  If that is an accurate theory
> then rewriting libpq won't fix it.

Here is a stupid Python 2.7 program to try to test that.  Run one copy
of it like this:

$ python ./test.py --server

The server will wait for a client, send a message immediately, and
then close the socket after a second.  The client will connect and
send something once before and twice after the server closes the
socket, and finally see if it can read the message from the server.
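
A minimal version of it looks like this (Python 2.7 like the original; the
port number and exact strings here are placeholders, and the attached
test.py may differ in details):

    import socket
    import sys
    import time

    HOST, PORT = "127.0.0.1", 7777          # placeholder port

    def server():
        listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        listener.bind((HOST, PORT))
        listener.listen(1)
        conn, _ = listener.accept()
        # Send the parting message immediately, never read anything, and
        # close the socket a second later.
        conn.send("This is the server saying goodbye\n")
        time.sleep(1)
        conn.close()

    def client():
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        s.connect((HOST, PORT))
        for label, delay in (("A", 0), ("B", 2), ("C", 0)):
            time.sleep(delay)               # B and C go out after the server closes
            print "Sending %s..." % label
            try:
                print s.send("%s\n" % label)
            except socket.error, e:
                print e
        # Finally, see whether the server's parting message can still be read.
        print s.recv(1024)

    if __name__ == "__main__":
        if "--server" in sys.argv:
            server()
        else:
            client()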

Here's the output I get from the client on some different systems (the
server was running on the same system):

$ uname -a
Linux debian 4.18.0-3-amd64 #1 SMP Debian 4.18.20-2 (2018-11-23)
x86_64 GNU/Linux
$ python ./test.py --client
Sending A...
2
Sending B...
[Errno 104] Connection reset by peer
Sending C...
[Errno 32] Broken pipe
This is the server saying goodbye

$ uname -a
Darwin macaque.local 18.2.0 Darwin Kernel Version 18.2.0: Thu Dec 20
20:46:53 PST 2018; root:xnu-4903.241.1~1/RELEASE_X86_64 x86_64
$ python2.7 ./test.py --client
Sending A...
2
Sending B...
[Errno 32] Broken pipe
Sending C...
[Errno 32] Broken pipe
This is the server saying goodbye

$ uname -a
FreeBSD dogmatix 13.0-CURRENT FreeBSD 13.0-CURRENT c0873ea614a(master)
GENERIC  amd64
$ python2.7 ./test.py --client
Sending A...
2
Sending B...
2
Sending C...
2
This is the server saying goodbye

So... can anyone tell us what happens on Windows?

(A secondary question might be what happens if the server and client
are on different machines since I guess it could be different?)

--
Thomas Munro
https://enterprisedb.com

Attachment: test.py (2K)
Re: Rare SSL failures on eelpout

Shawn Debnath
On Wed, Mar 06, 2019 at 11:13:31AM +1300, Thomas Munro wrote:
> So... can anyone tell us what happens on Windows?

Voila!

C:\Users\Shawn Debnath\Desktop>systeminfo
OS Name:                   Microsoft Windows 10 Pro
OS Version:                10.0.17763 N/A Build 17763
[...]

C:\Users\Shawn Debnath\Desktop>c:\Python27\python.exe tmunro-ssl-test.py --server


C:\Users\Shawn Debnath\Desktop>c:\Python27\python.exe tmunro-ssl-test.py --client
Sending A...
2
Sending B...
[Errno 10054] An existing connection was forcibly closed by the remote host
Sending C...
[Errno 10054] An existing connection was forcibly closed by the remote host
Traceback (most recent call last):
  File "tmunro-ssl-test.py", line 57, in <module>
    client()
  File "tmunro-ssl-test.py", line 51, in client
    print s.recv(1024)
socket.error: [Errno 10054] An existing connection was forcibly closed by the remote host


--
Shawn Debnath
Amazon Web Services (AWS)

Re: Rare SSL failures on eelpout

Thomas Munro-5
On Wed, Mar 6, 2019 at 4:07 PM Shawn Debnath <[hidden email]> wrote:
> On Wed, Mar 06, 2019 at 11:13:31AM +1300, Thomas Munro wrote:
> > So... can anyone tell us what happens on Windows?

> C:\Users\Shawn Debnath\Desktop>c:\Python27\python.exe tmunro-ssl-test.py --client
> Sending A...
> 2
> Sending B...
> [Errno 10054] An existing connection was forcibly closed by the remote host
> Sending C...
> [Errno 10054] An existing connection was forcibly closed by the remote host
> Traceback (most recent call last):
>   File "tmunro-ssl-test.py", line 57, in <module>
>     client()
>   File "tmunro-ssl-test.py", line 51, in client
>     print s.recv(1024)
> socket.error: [Errno 10054] An existing connection was forcibly closed by the remote host

Thanks!  Wow, so not only can we not read the final message sent by
the server, but if we try we get an error.  That's... not what we want.  I
wonder if there might be a way to put the socket into don't-do-that
mode...

--
Thomas Munro
https://enterprisedb.com

Re: Rare SSL failures on eelpout

Thomas Munro-5
In reply to this post by Tom Lane-2
On Wed, Mar 6, 2019 at 9:21 AM Tom Lane <[hidden email]> wrote:

> Thomas Munro <[hidden email]> writes:
> > Bleugh.  I think this OpenSSL package might just be buggy on ARM.  On
> > x86, apparently the same version of OpenSSL and all other details of
> > the test the same, I can see that SSL_connect() returns <= 0
> > (failure), and then we ask for that cert revoked message directly and
> > never even reach the startup packet sending code.  On ARM,
> > SSL_connect() returns 1 (success) and then we proceed as discussed and
> > eventually get the error later (or not).  So I think I should figure
> > out a minimal repro and report this to them.
>
> Yeah, I've still been unable to reproduce even with the sleep idea,
> so eelpout is definitely looking like a special snowflake from here.
> In any case, there seems little doubt that getting past SSL_connect()
> when the cert check has failed is an OpenSSL bug; I don't feel a need
> to create a workaround for it.

I was wrong: it breaks on an x86 system for me too (either with the
sleep, or just with clunky scheduling; I was running psql under strace when
I saw it).  Not sure what I did wrong last time I tried that.  I
opened a bug report with OpenSSL, let's see what they say:

https://github.com/openssl/openssl/issues/8500

--
Thomas Munro
https://enterprisedb.com
