openssl valgrind failures on skink are due to openssl issue

classic Classic list List threaded Threaded
10 messages Options
Reply | Threaded
Open this post in threaded view
|

openssl valgrind failures on skink are due to openssl issue

Andres Freund
Hi,

Skink a few days ago started failing [1][2] with errors like:

==2732== Conditional jump or move depends on uninitialised value(s)
==2732==    at 0x4C612E3: ??? (in /usr/lib/x86_64-linux-gnu/libcrypto.so.1.1)
==2732==    by 0x4C621FA: RAND_DRBG_generate (in /usr/lib/x86_64-linux-gnu/libcrypto.so.1.1)
==2732==    by 0x4C63620: ??? (in /usr/lib/x86_64-linux-gnu/libcrypto.so.1.1)
==2732==    by 0x4C61A09: RAND_DRBG_instantiate (in /usr/lib/x86_64-linux-gnu/libcrypto.so.1.1)
==2732==    by 0x4C62937: ??? (in /usr/lib/x86_64-linux-gnu/libcrypto.so.1.1)
==2732==    by 0x4C62CC9: RAND_DRBG_get0_public (in /usr/lib/x86_64-linux-gnu/libcrypto.so.1.1)
==2732==    by 0x4C62CEF: ??? (in /usr/lib/x86_64-linux-gnu/libcrypto.so.1.1)
==2732==    by 0x69C6AF: pg_strong_random (pg_strong_random.c:135)
==2732==    by 0x4A2841: InitProcessGlobals (postmaster.c:2581)
==2732==    by 0x661137: InitStandaloneProcess (miscinit.c:322)
==2732==    by 0x2943CF: AuxiliaryProcessMain (bootstrap.c:209)
==2732==    by 0x4005D1: main (main.c:220)
==2732==  Uninitialised value was created by a stack allocation
==2732==    at 0x4C633B0: ??? (in /usr/lib/x86_64-linux-gnu/libcrypto.so.1.1)

and then a lot of followup error.

Reproduced that locally to get a nicer trace:
==7146== Conditional jump or move depends on uninitialised value(s)
==7146==    at 0x4B122E3: inc_128 (drbg_ctr.c:32)
==7146==    by 0x4B122E3: drbg_ctr_generate (drbg_ctr.c:330)
==7146==    by 0x4B131FA: RAND_DRBG_generate (drbg_lib.c:638)
==7146==    by 0x4B14620: rand_drbg_get_entropy (rand_lib.c:172)
==7146==    by 0x4B12A09: RAND_DRBG_instantiate (drbg_lib.c:338)
==7146==    by 0x4B13937: drbg_setup (drbg_lib.c:892)
==7146==    by 0x4B13CC9: RAND_DRBG_get0_public (drbg_lib.c:1120)
==7146==    by 0x4B13CC9: RAND_DRBG_get0_public (drbg_lib.c:1109)
==7146==    by 0x4B13CEF: drbg_bytes (drbg_lib.c:963)
==7146==    by 0x87BD60: pg_strong_random (pg_strong_random.c:139)
==7146==    by 0x5CAFFC: InitProcessGlobals (postmaster.c:2581)
==7146==    by 0x81AAD7: InitStandaloneProcess (miscinit.c:322)
==7146==    by 0x681A61: PostgresMain (postgres.c:3732)
==7146==    by 0x4E3D86: main (main.c:224)
==7146==  Uninitialised value was created by a stack allocation
==7146==    at 0x4B143B0: rand_drbg_get_nonce (rand_lib.c:231)

reading through the code lead me to figure out that that's due to a
recent openssl change:
https://github.com/openssl/openssl/commit/b3d113ed2993801ee643126118ccf6592ad18ef7
as explained in
https://github.com/openssl/openssl/issues/8460
and fixed since in
https://github.com/openssl/openssl/commit/15d7e7997e219fc5fef3f6003cc6bd7b2e7379d4

For reasons I do not understand the "cosmetic change" was backpatched
into 1.1.1 And the fix for the cosmetic change, made on master at the
end of March, was only backpatched to 1.1.1 *after* the 1.1.1c release
was made in late May.  I mean, huh.

That release was then installed on skink recently:
2019-06-01 06:39:20 upgrade libssl1.1:amd64 1.1.1b-2 1.1.1c-1

And finally the reason for the issue only being visible on master: I am
stupid, and matched branch names like REL9_6 instead of REL9_6_STABLE
etc to enable openssl (9.4 doesn't work against current openssl).


I can't think of a better way to fix skink for now than just disabling
openssl for skink, until 1.1.1d is released.

Greetings,

Andres Freund

[1] https://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=skink&br=HEAD
[2] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2019-06-10%2001%3A36%3A12


Reply | Threaded
Open this post in threaded view
|

Re: openssl valgrind failures on skink are due to openssl issue

Tom Lane-2
Andres Freund <[hidden email]> writes:
> For reasons I do not understand the "cosmetic change" was backpatched
> into 1.1.1 And the fix for the cosmetic change, made on master at the
> end of March, was only backpatched to 1.1.1 *after* the 1.1.1c release
> was made in late May.  I mean, huh.

Bleah.  Not that we've not made equally dumb mistakes :-(

> I can't think of a better way to fix skink for now than just disabling
> openssl for skink, until 1.1.1d is released.

Couldn't you install a local valgrind exclusion matching this stack trace?

                        regards, tom lane


Reply | Threaded
Open this post in threaded view
|

Re: openssl valgrind failures on skink are due to openssl issue

Andres Freund
Hi,

On 2019-06-11 16:55:28 -0400, Tom Lane wrote:
> Andres Freund <[hidden email]> writes:
> > I can't think of a better way to fix skink for now than just disabling
> > openssl for skink, until 1.1.1d is released.
>
> Couldn't you install a local valgrind exclusion matching this stack trace?

Unfortunately no. The error spreads through significant parts of openssl
*and* postgres, because it taints the returned random value, which then
is used in a number of places. We could try to block all of those, but
that seems fairly painful. And one, to my knowledge, cannot do valgrind
suppressions based on the source of uninitialized memory.

Greetings,

Andres Freund


Reply | Threaded
Open this post in threaded view
|

Re: openssl valgrind failures on skink are due to openssl issue

Michael Paquier-2
On Tue, Jun 11, 2019 at 02:07:29PM -0700, Andres Freund wrote:
> On 2019-06-11 16:55:28 -0400, Tom Lane wrote:
>> Andres Freund <[hidden email]> writes:
>>> I can't think of a better way to fix skink for now than just disabling
>>> openssl for skink, until 1.1.1d is released.

Thanks for digging into the details of that!  I was wondering if we
did something wrong on our side but the backtraces were weird.
--
Michael

signature.asc (849 bytes) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: openssl valgrind failures on skink are due to openssl issue

Andres Freund
In reply to this post by Andres Freund
Hi,

On 2019-06-11 14:07:29 -0700, Andres Freund wrote:

> On 2019-06-11 16:55:28 -0400, Tom Lane wrote:
> > Andres Freund <[hidden email]> writes:
> > > I can't think of a better way to fix skink for now than just disabling
> > > openssl for skink, until 1.1.1d is released.
> >
> > Couldn't you install a local valgrind exclusion matching this stack trace?
>
> Unfortunately no. The error spreads through significant parts of openssl
> *and* postgres, because it taints the returned random value, which then
> is used in a number of places. We could try to block all of those, but
> that seems fairly painful. And one, to my knowledge, cannot do valgrind
> suppressions based on the source of uninitialized memory.

What we could do is add a suppression like:

{
   broken-openssl-accesses-random
   Memcheck:Cond
   ...
   fun:pg_strong_random
   fun:InitProcessGlobals
   fun:PostmasterMain
   fun:main
}

(alternatively one suppression for each RAND_status, RAND_poll,
RAND_bytes(), to avoid suppressing all of pg_strong_random itself)
   
and then prevent spread of the uninitialized memory by adding a
                VALGRIND_MAKE_MEM_DEFINED(buf, len);
after a successful RAND_bytes() call.

I tested that that quiesces the problem locally. Probably not worth
pushing something like that though?

Greetings,

Andres Freund


Reply | Threaded
Open this post in threaded view
|

Re: openssl valgrind failures on skink are due to openssl issue

Tom Lane-2
Andres Freund <[hidden email]> writes:
> What we could do is add a suppression like:

> {
>    broken-openssl-accesses-random
>    Memcheck:Cond
>    ...
>    fun:pg_strong_random
>    fun:InitProcessGlobals
>    fun:PostmasterMain
>    fun:main
> }

> (alternatively one suppression for each RAND_status, RAND_poll,
> RAND_bytes(), to avoid suppressing all of pg_strong_random itself)
   
> and then prevent spread of the uninitialized memory by adding a
> VALGRIND_MAKE_MEM_DEFINED(buf, len);
> after a successful RAND_bytes() call.

> I tested that that quiesces the problem locally. Probably not worth
> pushing something like that though?

Yeah, that seems awfully aggressive to be pushing to machines that
don't have the problem.  Did you get any sense of how fast the
openssl fix is goinng to show up?

                        regards, tom lane


Reply | Threaded
Open this post in threaded view
|

Re: openssl valgrind failures on skink are due to openssl issue

Andres Freund
Hi,

On 2019-06-18 19:25:12 -0400, Tom Lane wrote:
> Did you get any sense of how fast the openssl fix is goinng to show up?

It's merged to both branches that contain the broken code. Now we need
to wait for the next set of openssl releases, and then for distros to
pick that up. Based on the past release cadence
https://www.openssl.org/news/openssl-1.1.1-notes.html
that seems to be likely to happen within 2-3 months.

Greetings,

Andres Freund


Reply | Threaded
Open this post in threaded view
|

Re: openssl valgrind failures on skink are due to openssl issue

Michael Paquier-2
On Tue, Jun 18, 2019 at 04:34:07PM -0700, Andres Freund wrote:
> It's merged to both branches that contain the broken code. Now we need
> to wait for the next set of openssl releases, and then for distros to
> pick that up. Based on the past release cadence
> https://www.openssl.org/news/openssl-1.1.1-notes.html
> that seems to be likely to happen within 2-3 months.

If that's for the buildfarm coverage.  I would be of the opinion to
wait a bit.  Another possibility is that you could compile your own
version of OpenSSL with the patch included, say only 1.1.1c with the
patch.  Still that would cause the plpython tests to complain as the
system's python may still link to the system's OpenSSL which is
broken?

Another possibility would be to move back to 1.1.1b for the time
being...
--
Michael

signature.asc (849 bytes) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: openssl valgrind failures on skink are due to openssl issue

Andres Freund
On 2019-06-19 11:28:16 +0900, Michael Paquier wrote:
> On Tue, Jun 18, 2019 at 04:34:07PM -0700, Andres Freund wrote:
> > It's merged to both branches that contain the broken code. Now we need
> > to wait for the next set of openssl releases, and then for distros to
> > pick that up. Based on the past release cadence
> > https://www.openssl.org/news/openssl-1.1.1-notes.html
> > that seems to be likely to happen within 2-3 months.
>
> If that's for the buildfarm coverage.  I would be of the opinion to
> wait a bit.

Yea. For now I've just disabled ssl support on skink, but that has its
own disadvantages.

> Another possibility is that you could compile your own
> version of OpenSSL with the patch included, say only 1.1.1c with the
> patch.

Really, I can do that?


Reply | Threaded
Open this post in threaded view
|

Re: openssl valgrind failures on skink are due to openssl issue

Michael Paquier-2
On Tue, Jun 18, 2019 at 07:44:26PM -0700, Andres Freund wrote:
> Really, I can do that?

Here is some of the stuff I use, just for the reference:
./Configure linux-x86_64 --prefix=$HOME/stable/openssl/1.1.1/
./config --prefix=$HOME/stable/openssl/1.1.1 shared
--
Michael

signature.asc (849 bytes) Download Attachment