On Nov 11, 2020, at 10:17 AM, Jacob Champion <[hidden email]> wrote:
> > (Two failures left on my machine.) False alarm -- the stderr debugging I'd added in to track down the assertion tripped up the "no stderr" tests. Zero failing tests now. --Jacob |
On Nov 11, 2020, at 10:57 AM, Jacob Champion <[hidden email]> wrote:
> > False alarm -- the stderr debugging I'd added in to track down the > assertion tripped up the "no stderr" tests. Zero failing tests now. I took a look at the OpenSSL interop problems you mentioned upthread. I don't see a hang like you did, but I do see a PR_IO_TIMEOUT_ERROR during connection. I think pgtls_read() needs to treat PR_IO_TIMEOUT_ERROR as if no bytes were read, in order to satisfy its API. There was some discussion on this upthread: On Oct 27, 2020, at 1:07 PM, Daniel Gustafsson <[hidden email]> wrote: > > On 20 Oct 2020, at 21:15, Andres Freund <[hidden email]> wrote: >> >>> + case PR_IO_TIMEOUT_ERROR: >>> + break; >> >> What does this mean? We'll return with a 0 errno here, right? When is >> this case reachable? > > It should, AFAICT, only be reachable when PR_Recv is used with a timeout which > we don't do. It mentioned somewhere that it had happened in no-wait calls due > to a bug, but I fail to find that reference now. Either way, I've removed it > to fall into the default error handling which now sets errno correctly as that > was a paddle short here. machine. It doesn't look like the PR_Recv() API has a choice -- if there's no data, it can't return a positive integer, and returning zero means that the socket has been disconnected. So -1 with a timeout error is the only option. I'm not completely sure why this is exposed so easily with an OpenSSL server -- I'm guessing the implementation slices up its packets differently on the wire, causing a read event before NSS is able to decrypt a full record -- but it's worth noting that this case also shows up during NSS-to-NSS psql connections, when handling notifications at the end of every query. PQconsumeInput() reports a hard failure with the current implementation, but its return value is ignored by PrintNotifications(). Otherwise this probably would have showed up earlier. (What's the best way to test this case? Are there lower-level tests for the protocol/network layer somewhere that I'm missing?) While patching this case, I also noticed that pgtls_read() doesn't call SOCK_ERRNO_SET() for the disconnection case. That is also in the attached patch. --Jacob |
> On 12 Nov 2020, at 23:12, Jacob Champion <[hidden email]> wrote:
> > On Nov 11, 2020, at 10:57 AM, Jacob Champion <[hidden email]> wrote: >> >> False alarm -- the stderr debugging I'd added in to track down the >> assertion tripped up the "no stderr" tests. Zero failing tests now. > > I took a look at the OpenSSL interop problems you mentioned upthread. Great, thanks! > I don't see a hang like you did, but I do see a PR_IO_TIMEOUT_ERROR during > connection. > > I think pgtls_read() needs to treat PR_IO_TIMEOUT_ERROR as if no bytes > were read, in order to satisfy its API. There was some discussion on > this upthread: > > On Oct 27, 2020, at 1:07 PM, Daniel Gustafsson <[hidden email]> wrote: >> >> On 20 Oct 2020, at 21:15, Andres Freund <[hidden email]> wrote: >>> >>>> + case PR_IO_TIMEOUT_ERROR: >>>> + break; >>> >>> What does this mean? We'll return with a 0 errno here, right? When is >>> this case reachable? >> >> It should, AFAICT, only be reachable when PR_Recv is used with a timeout which >> we don't do. It mentioned somewhere that it had happened in no-wait calls due >> to a bug, but I fail to find that reference now. Either way, I've removed it >> to fall into the default error handling which now sets errno correctly as that >> was a paddle short here. > > PR_IO_TIMEOUT_ERROR is definitely returned in no-wait calls on my > machine. It doesn't look like the PR_Recv() API has a choice -- if > there's no data, it can't return a positive integer, and returning zero > means that the socket has been disconnected. So -1 with a timeout error > is the only option. > I'm not completely sure why this is exposed so easily with an OpenSSL > server -- I'm guessing the implementation slices up its packets > differently on the wire, causing a read event before NSS is able to > decrypt a full record -- but it's worth noting that this case also shows > up during NSS-to-NSS psql connections, when handling notifications at > the end of every query. PQconsumeInput() reports a hard failure with the > current implementation, but its return value is ignored by > PrintNotifications(). Otherwise this probably would have showed up > earlier. Should there perhaps be an Assert there to catch those? > (What's the best way to test this case? Are there lower-level tests for > the protocol/network layer somewhere that I'm missing?) Not AFAIK. Having been knee-deep now, do you have any ideas on how to implement? > While patching this case, I also noticed that pgtls_read() doesn't call > SOCK_ERRNO_SET() for the disconnection case. That is also in the > attached patch. Ah yes, nice catch. I've incorporated this patch as well as the previous patch for the assertion failure on private callback data into the attached v19 patchset. I also did a spellcheck and pgindent run on it for ease of review. cheers ./daniel ![]() ![]() ![]() ![]() ![]() |
On Nov 13, 2020, at 4:14 AM, Daniel Gustafsson <[hidden email]> wrote:
>> On 12 Nov 2020, at 23:12, Jacob Champion <[hidden email]> wrote: >> >> I'm not completely sure why this is exposed so easily with an OpenSSL >> server -- I'm guessing the implementation slices up its packets >> differently on the wire, causing a read event before NSS is able to >> decrypt a full record -- but it's worth noting that this case also shows >> up during NSS-to-NSS psql connections, when handling notifications at >> the end of every query. PQconsumeInput() reports a hard failure with the >> current implementation, but its return value is ignored by >> PrintNotifications(). Otherwise this probably would have showed up >> earlier. > > Should there perhaps be an Assert there to catch those? the standpoint of "don't crash when an endpoint outside our control does something strange", I think that's a harder sell. Should the error be bubbled all the way up instead? Or perhaps, if psql isn't supposed to treat notification errors as "hard" failures, it should at least warn the user that something is fishy? >> (What's the best way to test this case? Are there lower-level tests for >> the protocol/network layer somewhere that I'm missing?) > > Not AFAIK. Having been knee-deep now, do you have any ideas on how to > implement? I think that testing these sorts of important edge cases needs a friendly DSL -- something that doesn't want to make devs tear their hair out while building tests. I've been playing a little bit with Scapy [1] to understand more of the libpq v3 protocol; I'll see if that can be adapted for pieces of the TLS handshake in a way that's easy to maintain. If it can be, maybe that'd be a good starting example. > I've incorporated this patch as well as the previous patch for the assertion > failure on private callback data into the attached v19 patchset. I also did a > spellcheck and pgindent run on it for ease of review. Commit 6be725e70 got rid of some psql error messaging that the tests were keying off of, so there are a few new failures after a rebase onto latest master. I've attached a patch that gets the SCRAM tests a little further (certificate hashing was caught in an infinite loop). I also added error checks to those loops, along the lines of the existing OpenSSL implementation: if a suitable digest can't be found, the user will see an error like psql: error: could not find digest for OID 'PKCS #1 SHA-256 With RSA Encryption' It's a little verbose but I don't think this case should come up in normal practice. --Jacob [1] https://scapy.net/ |
> On 16 Nov 2020, at 21:00, Jacob Champion <[hidden email]> wrote:
> On Nov 13, 2020, at 4:14 AM, Daniel Gustafsson <[hidden email]> wrote: >> I've incorporated this patch as well as the previous patch for the assertion >> failure on private callback data into the attached v19 patchset. I also did a >> spellcheck and pgindent run on it for ease of review. > > Commit 6be725e70 got rid of some psql error messaging that the tests > were keying off of, so there are a few new failures after a rebase onto > latest master. > > I've attached a patch that gets the SCRAM tests a little further > (certificate hashing was caught in an infinite loop). I also added error > checks to those loops, along the lines of the existing OpenSSL > implementation: if a suitable digest can't be found, the user will see > an error like > > psql: error: could not find digest for OID 'PKCS #1 SHA-256 With RSA Encryption' > > It's a little verbose but I don't think this case should come up in > normal practice. which also fixes client side error reporting to be more readable. The SCRAM tests are now also hooked up, albeit with SKIP blocks for NSS, so they can start getting fixed. cheers ./daniel ![]() ![]() ![]() ![]() ![]() |
On Nov 17, 2020, at 7:00 AM, Daniel Gustafsson <[hidden email]> wrote:
> > Nice, thanks for the fix! I've incorporated your patch into the attached v20 > which also fixes client side error reporting to be more readable. I was testing handshake failure modes and noticed that some FATAL messages are being sent through to the client in cleartext. The OpenSSL implementation doesn't do this, because it logs handshake problems at COMMERROR level. Should we switch all those ereport() calls in the NSS be_tls_open_server() to COMMERROR as well (and return explicitly), to avoid this? Or was there a reason for logging at FATAL/ERROR level? Related note, at the end of be_tls_open_server(): > ... > port->ssl_in_use = true; > return 0; > > error: > return 1; > } This needs to return -1 in the error case; the only caller of secure_open_server() does a direct `result == -1` comparison rather than checking `result != 0`. --Jacob |
In reply to this post by Daniel Gustafsson
On Tue, 2020-10-27 at 21:07 +0100, Daniel Gustafsson wrote:
> > On 20 Oct 2020, at 21:15, Andres Freund <[hidden email]> wrote: > > > > > +static SECStatus > > > +pg_cert_auth_handler(void *arg, PRFileDesc * fd, PRBool checksig, PRBool isServer) > > > +{ > > > + SECStatus status; > > > + Port *port = (Port *) arg; > > > + CERTCertificate *cert; > > > + char *peer_cn; > > > + int len; > > > + > > > + status = SSL_AuthCertificate(CERT_GetDefaultCertDB(), port->pr_fd, checksig, PR_TRUE); > > > + if (status == SECSuccess) > > > + { > > > + cert = SSL_PeerCertificate(port->pr_fd); > > > + len = strlen(cert->subjectName); > > > + peer_cn = MemoryContextAllocZero(TopMemoryContext, len + 1); > > > + if (strncmp(cert->subjectName, "CN=", 3) == 0) > > > + strlcpy(peer_cn, cert->subjectName + strlen("CN="), len + 1); > > > + else > > > + strlcpy(peer_cn, cert->subjectName, len + 1); > > > + CERT_DestroyCertificate(cert); > > > + > > > + port->peer_cn = peer_cn; > > > + port->peer_cert_valid = true; > > > > Hm. We either should have something similar to > > > > /* > > * Reject embedded NULLs in certificate common name to prevent > > * attacks like CVE-2009-4034. > > */ > > if (len != strlen(peer_cn)) > > { > > ereport(COMMERROR, > > (errcode(ERRCODE_PROTOCOL_VIOLATION), > > errmsg("SSL certificate's common name contains embedded null"))); > > pfree(peer_cn); > > return -1; > > } > > here, or a comment explaining why not. > > We should, but it's proving rather difficult as there is no equivalent API call > to get the string as well as the expected length of it. I'm going to try to tackle this part next. It looks like NSS uses RFC 4514 (or something like it) backslash-quoting, which this code either needs to undo or bypass before performing a comparison. --Jacob |
In reply to this post by Daniel Gustafsson
On Tue, Nov 17, 2020 at 04:00:53PM +0100, Daniel Gustafsson wrote:
> Nice, thanks for the fix! I've incorporated your patch into the attached v20 > which also fixes client side error reporting to be more readable. The SCRAM > tests are now also hooked up, albeit with SKIP blocks for NSS, so they can > start getting fixed. On top of the set of TODO items mentioned in the logs of the patches, this patch set needs a rebase because it does not apply. In order to move on with this set, I would suggest to extract some parts of the patch set independently of the others and have two buildfarm members for the MSVC and non-MSVC cases to stress the parts that can be committed. Just seeing the size, we could move on with: - The ./configure set, with the change to introduce --with-ssl=openssl. - 0004 for strong randoms. - Support for cryptohashes. +/* + * BITS_PER_BYTE is also defined in the NSPR header files, so we need to undef + * our version to avoid compiler warnings on redefinition. + */ +#define pg_BITS_PER_BYTE BITS_PER_BYTE +#undef BITS_PER_BYTE This could be done separately. src/sgml/libpq.sgml needs to document PQdefaultSSLKeyPassHook_nss, no? -- Michael |
> On 18 Jan 2021, at 08:08, Michael Paquier <[hidden email]> wrote:
> > On Tue, Nov 17, 2020 at 04:00:53PM +0100, Daniel Gustafsson wrote: >> Nice, thanks for the fix! I've incorporated your patch into the attached v20 >> which also fixes client side error reporting to be more readable. The SCRAM >> tests are now also hooked up, albeit with SKIP blocks for NSS, so they can >> start getting fixed. > > On top of the set of TODO items mentioned in the logs of the patches, > this patch set needs a rebase because it does not apply. Fixed in the attached, which also addresses the points raised earlier by Jacob as well as adds certificates created entirely by NSS tooling as well as initial cryptohash support. There is something iffy with these certs (the test fails on mismatching ciphers and/or signature algorithms) that I haven't been able to pin down, but to get more eyes on this I'm posting the patch with the test enabled. The NSS toolchain requires interactive input which makes the Makefile a bit hacky, ideas on cleaning that up are appreciated. > In order to > move on with this set, I would suggest to extract some parts of the > patch set independently of the others and have two buildfarm members > for the MSVC and non-MSVC cases to stress the parts that can be > committed. Just seeing the size, we could move on with: > - The ./configure set, with the change to introduce --with-ssl=openssl. > - 0004 for strong randoms. > - Support for cryptohashes. I will leave it to others to decide the feasibility of this, I'm happy to slice and dice the commits into smaller bits to for example separate out the --with-ssl autoconf change into a non NSS dependent commit, if that's wanted. > +/* > + * BITS_PER_BYTE is also defined in the NSPR header files, so we need to undef > + * our version to avoid compiler warnings on redefinition. > + */ > +#define pg_BITS_PER_BYTE BITS_PER_BYTE > +#undef BITS_PER_BYTE > This could be done separately. Based on an offlist discussion I believe this was a misunderstanding, but if I instead misunderstood that feel free to correct me with how you think this should be done. > src/sgml/libpq.sgml needs to document PQdefaultSSLKeyPassHook_nss, no? Good point, fixed. cheers ./daniel ![]() ![]() ![]() ![]() ![]() ![]() |
In reply to this post by Jacob Champion-2
> On 4 Dec 2020, at 01:57, Jacob Champion <[hidden email]> wrote:
> > On Nov 17, 2020, at 7:00 AM, Daniel Gustafsson <[hidden email]> wrote: >> >> Nice, thanks for the fix! I've incorporated your patch into the attached v20 >> which also fixes client side error reporting to be more readable. > > I was testing handshake failure modes and noticed that some FATAL > messages are being sent through to the client in cleartext. The OpenSSL > implementation doesn't do this, because it logs handshake problems at > COMMERROR level. Should we switch all those ereport() calls in the NSS > be_tls_open_server() to COMMERROR as well (and return explicitly), to > avoid this? Or was there a reason for logging at FATAL/ERROR level? The ERROR logging made early development easier but then stuck around, I've changed them to COMMERROR returning an error instead in the v21 patch just sent to the list. > Related note, at the end of be_tls_open_server(): > >> ... >> port->ssl_in_use = true; >> return 0; >> >> error: >> return 1; >> } > > This needs to return -1 in the error case; the only caller of > secure_open_server() does a direct `result == -1` comparison rather than > checking `result != 0`. Fixed. cheers ./daniel |
In reply to this post by Daniel Gustafsson
On Tue, 2021-01-19 at 21:21 +0100, Daniel Gustafsson wrote:
> There is something iffy with these certs (the test fails > on mismatching ciphers and/or signature algorithms) that I haven't been able to > pin down, but to get more eyes on this I'm posting the patch with the test > enabled. Removing `--keyUsage keyEncipherment` from the native_server-* CSR generation seems to let the tests pass for me, but I'm wary of just pushing that as a solution because I don't understand why that would have anything to do with the failure mode (SSL_ERROR_NO_SUPPORTED_SIGNATURE_ALGORITHM). > The NSS toolchain requires interactive input which makes the Makefile > a bit hacky, ideas on cleaning that up are appreciated. Hm. I got nothing, short of a feature request to NSS... --Jacob |
> On 20 Jan 2021, at 01:40, Jacob Champion <[hidden email]> wrote:
Aha, that was a good clue, I had overlooked the required extensions in the CSR.
> > On Tue, 2021-01-19 at 21:21 +0100, Daniel Gustafsson wrote: >> There is something iffy with these certs (the test fails >> on mismatching ciphers and/or signature algorithms) that I haven't been able to >> pin down, but to get more eyes on this I'm posting the patch with the test >> enabled. > > Removing `--keyUsage keyEncipherment` from the native_server-* CSR > generation seems to let the tests pass for me, but I'm wary of just > pushing that as a solution because I don't understand why that would > have anything to do with the failure mode > (SSL_ERROR_NO_SUPPORTED_SIGNATURE_ALGORITHM). Re-reading RFC 5280 it seems we need keyEncipherment, dataEncipherment and digitalSignature to create a valid SSL Server certificate. Adding those indeed make the test pass. Skimming the certutil code *I think* removing it as you did cause a set of defaults to kick in that made it work based on the parameter "--nsCertType sslServer", but it's not entirely easy to make out. Either way, relying on defaults in a test suite seems less than good, so I've extended the Makefile to be explicit about the extensions. The attached v22 rebase incorporates the fixup to the test Makefile, with not further changes on top of that. cheers ./daniel ![]() ![]() ![]() ![]() ![]() ![]() |
On Wed, 2021-01-20 at 12:58 +0100, Daniel Gustafsson wrote:
> Aha, that was a good clue, I had overlooked the required extensions in the CSR. > Re-reading RFC 5280 it seems we need keyEncipherment, dataEncipherment and > digitalSignature to create a valid SSL Server certificate. Adding those indeed > make the test pass. Skimming the certutil code *I think* removing it as you > did cause a set of defaults to kick in that made it work based on the parameter > "--nsCertType sslServer", but it's not entirely easy to make out. Lovely. I didn't expect *removing* an extension to effectively *add* more, but I'm glad it works now. == To continue the Subject Common Name discussion [1] from a different part of the thread: Attached is a v23 version of the patchset that peels the raw Common Name out from a client cert's Subject. This allows the following cases that the OpenSSL implementation currently handles: - subjects that don't begin with a CN - subjects with quotable characters - subjects that have no CN at all Embedded NULLs are now handled in a similar manner to the OpenSSL side, though because this failure happens during the certificate authentication callback, it results in a TLS alert rather than simply closing the connection. For easier review of just the parts I've changed, I've also attached a since-v22.diff, which is part of the 0001 patch. --Jacob [1] https://www.postgresql.org/message-id/7d6a23a7e30540b486abc823f7ced7a93e1da1e8.camel%40vmware.com ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
Free forum by Nabble | Edit this page |