snapper vs. HEAD

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
13 messages Options
Reply | Threaded
Open this post in threaded view
|

snapper vs. HEAD

Tom Lane-2
Buildfarm member snapper has been crashing in the core regression tests
since commit 17a28b0364 (well, there's a bit of a range of uncertainty
there, but 17a28b0364 looks to be the only such commit that could have
affected code in gistget.c where the crash is).  Curiously, its sibling
skate is *not* failing, despite being on the same machine and compiler.

I looked into this by dint of setting up a similar environment in a
qemu VM.  I might not have reproduced things exactly, but I managed
to get the same kind of crash at approximately the same place, and
what it looks like to me is a compiler bug.  It's iterating
gistindex_keytest's loop over search keys

    ScanKey     key = scan->keyData;
    int         keySize = scan->numberOfKeys;

    while (keySize > 0)
    {
        ...
        key++;
        keySize--;
    }

one time too many, and accessing a garbage ScanKey value off the end of
the keyData array, leading to a function call into never-never land.

I'm no expert on Sparc assembly code, but it looks like the compiler
forgot the ",a" (annul) modifier here:

        .loc 1 181 0
        andcc %o7, 64, %g0
        be,pt %icc, .L134         <----
         addcc %l5, -1, %l5
        .loc 1 183 0
        lduh [%i4+16], %o7
        add %i4, %o7, %o7
        lduh [%o7+12], %o7
        andcc %o7, 1, %g0
        bne,pt %icc, .L141
         ld [%fp-32], %g2
        .loc 1 163 0
        ba,pt %xcc, .L134
         addcc %l5, -1, %l5

causing %l5 (which contains the keySize value) to be decremented
an extra time in the case where that branch is not taken and
we fall through as far as the "ba" at the end.  Even that would
not be fatal, perhaps, except that the compiler also decided to
optimize the "keySize > 0" test to "keySize != 0", for ghod only
knows what reason (surely it's not any faster on a RISC machine),
so that arriving at .L134 with %l5 containing -1 allows the loop
to be iterated again.  Kaboom.

It's unclear how 17a28b0364 would have affected this, but there is
an elog() call elsewhere in the same function, so maybe the new
coding for that changed register assignments or some other
phase-of-the-moon effect.

I doubt that anyone's going to take much interest in fixing this
old compiler version, so my recommendation is to back off the
optimization level on snapper to -O1, and probably on skate as
well because there's no obvious reason why the same compiler bug
might not bite skate at some point.  I was able to get through
the core regression tests on my qemu VM after recompiling
gistget.c at -O1 (with other flags the same as snapper is using).

                        regards, tom lane


Reply | Threaded
Open this post in threaded view
|

Re: snapper vs. HEAD

Andres Freund
Hi,

On 2020-03-28 23:50:32 -0400, Tom Lane wrote:
> Buildfarm member snapper has been crashing in the core regression tests
> since commit 17a28b0364 (well, there's a bit of a range of uncertainty
> there, but 17a28b0364 looks to be the only such commit that could have
> affected code in gistget.c where the crash is).  Curiously, its sibling
> skate is *not* failing, despite being on the same machine and compiler.

Hm. There's some difference in code-gen specific options.

snapper has:
'CFLAGS' => '-g -O2 -fstack-protector --param=ssp-buffer-size=4 -Wformat -Werror=format-security ',
'CPPFLAGS' => '-D_FORTIFY_SOURCE=2',
'LDFLAGS' => '-Wl,-z,relro -Wl,-z,now'
and specifies (among others)
                                      '--enable-thread-safety',
                                      '--with-gnu-ld',
whereas skate has --enable-cassert.

Not too hard to imagine that several of these could cause enough
code-gen differences so that one exhibits the bug, and the other
doesn't.


The different commandlines for gistget end up being:

snapper:
ccache gcc-4.7 -std=gnu99 -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Werror=vla -Wendif-labels -Wmissing-format-attribute -Wformat-security -fno-strict-aliasing -fwrapv -fexcess-precision=standard -g -g -O2 -fstack-protector --param=ssp-buffer-size=4 -Wformat -Werror=format-security  -I../../../../src/include  -D_FORTIFY_SOURCE=2 -D_GNU_SOURCE -I/usr/include/libxml2  -I/usr/include/mit-krb5  -c -o gistget.o gistget.c
skate:
ccache gcc-4.7 -std=gnu99 -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Werror=vla -Wendif-labels -Wmissing-format-attribute -Wformat-security -fno-strict-aliasing -fwrapv -fexcess-precision=standard -g -O2 -I../../../../src/include  -D_GNU_SOURCE -I/usr/include/libxml2   -c -o gistget.o gistget.c


> I looked into this by dint of setting up a similar environment in a
> qemu VM.  I might not have reproduced things exactly, but I managed
> to get the same kind of crash at approximately the same place, and
> what it looks like to me is a compiler bug.

What options were you using? Reproducing snapper as exactly as possible?


> It's unclear how 17a28b0364 would have affected this, but there is
> an elog() call elsewhere in the same function, so maybe the new
> coding for that changed register assignments or some other
> phase-of-the-moon effect.

Yea, wouldn't be too surprising.


> I doubt that anyone's going to take much interest in fixing this
> old compiler version, so my recommendation is to back off the
> optimization level on snapper to -O1, and probably on skate as
> well because there's no obvious reason why the same compiler bug
> might not bite skate at some point.  I was able to get through
> the core regression tests on my qemu VM after recompiling
> gistget.c at -O1 (with other flags the same as snapper is using).

If you still have the environment it might make sense to check wether
it's related to one of the other options. But otherwise I wouldn't be
against the proposal.

Greetings,

Andres Freund


Reply | Threaded
Open this post in threaded view
|

Re: snapper vs. HEAD

Tom Lane-2
Andres Freund <[hidden email]> writes:
> On 2020-03-28 23:50:32 -0400, Tom Lane wrote:
>> ... Curiously, its sibling
>> skate is *not* failing, despite being on the same machine and compiler.

> Hm. There's some difference in code-gen specific options.

Yeah, I confirmed that on my VM install too: using our typical codegen
options (just -O2 -g), the regression tests pass, matching skate's
results.  It fell over after I matched snapper's CFLAGS, CPPFLAGS,
and LDFLAGS.  I didn't try to break things down more finely as to
which option(s) trigger the bad code, since it looks like it's probably
some purely-internal compiler state ...

> What options were you using? Reproducing snapper as exactly as possible?

Yeah, see above.

> If you still have the environment it might make sense to check wether
> it's related to one of the other options. But otherwise I wouldn't be
> against the proposal.

I could, but it's mighty slow, so I don't especially want to try all 2^N
combinations.  Do you have any specific cases in mind?

(I guess we can exclude LDFLAGS, since the assembly output is visibly
bad.)

                        regards, tom lane


Reply | Threaded
Open this post in threaded view
|

Re: snapper vs. HEAD

Andres Freund
Hi,

On 2020-03-29 20:25:32 -0400, Tom Lane wrote:
> Andres Freund <[hidden email]> writes:
> > If you still have the environment it might make sense to check wether
> > it's related to one of the other options. But otherwise I wouldn't be
> > against the proposal.
>
> I could, but it's mighty slow, so I don't especially want to try all 2^N
> combinations.  Do you have any specific cases in mind?

I'd be most suspicious of -fstack-protector --param=ssp-buffer-size=4
and -D_FORTIFY_SOURCE=2. The first two have direct codegen implications,
the latter can lead to quite different headers being included and adds a
lot of size tracking to the optimizer.


> (I guess we can exclude LDFLAGS, since the assembly output is visibly
> bad.)

Seems likely.

Is it visibly bad when looking at the .s of gistget.c "directly", or
when disassembling the fiinal binary? Because I've seen linkers screw up
on a larger scale than I'd have expected in the past.

Greetings,

Andres Freund


Reply | Threaded
Open this post in threaded view
|

Re: snapper vs. HEAD

Tom Lane-2
Andres Freund <[hidden email]> writes:
> Is it visibly bad when looking at the .s of gistget.c "directly", or
> when disassembling the fiinal binary? Because I've seen linkers screw up
> on a larger scale than I'd have expected in the past.

Yes, the bogus asm that I showed was from gistget.s, not from
disassembling.

                        regards, tom lane


Reply | Threaded
Open this post in threaded view
|

Re: snapper vs. HEAD

Tom Lane-2
In reply to this post by Andres Freund
Andres Freund <[hidden email]> writes:
> On 2020-03-29 20:25:32 -0400, Tom Lane wrote:
>> I could, but it's mighty slow, so I don't especially want to try all 2^N
>> combinations.  Do you have any specific cases in mind?

> I'd be most suspicious of -fstack-protector --param=ssp-buffer-size=4
> and -D_FORTIFY_SOURCE=2. The first two have direct codegen implications,
> the latter can lead to quite different headers being included and adds a
> lot of size tracking to the optimizer.

It occurred to me that just recompiling gistget.c, rather than the whole
backend, would be enough to prove the point.  So after a few trials:

* Removing "-fstack-protector --param=ssp-buffer-size=4" does nothing;
the generated .o file is bitwise the same.

* Removing -D_FORTIFY_SOURCE=2 does change the bits, but it still
crashes.

So that eliminates all of snapper's special compile options :-(.
I'm forced to the conclusion that the important difference between
snapper and skate is that the latter uses --enable-cassert and the
former doesn't, because that's the only remaining difference between
how I built a working version before and the not-working version
I have right now.  Which means that this really is pretty much a
phase-of-the-moon compiler bug, and we've just been very lucky
that we haven't tripped over it before.

                        regards, tom lane


Reply | Threaded
Open this post in threaded view
|

Re: snapper vs. HEAD

Tom Lane-2
I wrote:
> I'm forced to the conclusion that the important difference between
> snapper and skate is that the latter uses --enable-cassert and the
> former doesn't, because that's the only remaining difference between
> how I built a working version before and the not-working version
> I have right now.

Confirmed: building gistget with --enable-cassert, and all of snapper's
compile/link options, produces something that passes regression.

The generated asm differs in a whole lot of details, but it looks like
the compiler remembers to annul the branch delay slot in all the
relevant places:

        .loc 1 163 0
        addcc %l7, -1, %l7
.L186:
        be,pn %icc, .L80
         add %l6, 48, %l6
...
        .loc 1 189 0
        be,a,pt %icc, .L186
         addcc %l7, -1, %l7
...
        .loc 1 183 0
        lduh [%g4+12], %g4
        andcc %g4, 1, %g0
        be,a,pt %icc, .L186
         addcc %l7, -1, %l7
        andcc %o7, 0xff, %g0
        bne,a,pt %icc, .L186
         addcc %l7, -1, %l7


                        regards, tom lane


Reply | Threaded
Open this post in threaded view
|

Re: snapper vs. HEAD

Tom Turelinckx-2
Hi,

Tom Lane wrote:
> Confirmed: building gistget with --enable-cassert, and all of snapper's
> compile/link options, produces something that passes regression.

Skate uses buildfarm default configuration, whereas snapper uses settings which are used when building postgresql debian packages. Debian packages are built without --enable-cassert, but most buildfarm animals build with --enable-cassert. I specifically configured skate and snapper like that because I ran into such issues in the past (where debian packages would fail to build on sparc, but buildfarm animals on debian sparc did not highlight the issue).

In the past, I've already switched from gcc 4.6 to gcc 4.7 as a workaround for a similar compiler bug, but I can't upgrade to a newer gcc without backporting it myself, so for the moment I've switched snapper to use -O1 instead of -O2, for HEAD only.

Not sure whether wheezy on sparc 32-bit is very relevant today, but it's an exotic platform, so I try to keep those buildfarm animals alive as long as it's possible.

Best regards,
Tom Turelinckx


Reply | Threaded
Open this post in threaded view
|

Re: snapper vs. HEAD

Tom Lane-2
"Tom Turelinckx" <[hidden email]> writes:
> In the past, I've already switched from gcc 4.6 to gcc 4.7 as a workaround for a similar compiler bug, but I can't upgrade to a newer gcc without backporting it myself, so for the moment I've switched snapper to use -O1 instead of -O2, for HEAD only.

Thanks!  But it doesn't seem to have taken: snapper just did a new run
that still failed, and it still seems to be using -O2.

> Not sure whether wheezy on sparc 32-bit is very relevant today, but it's
> an exotic platform, so I try to keep those buildfarm animals alive as
> long as it's possible.

Yeah, I've got a couple of those myself.  But perhaps it'd be sensible
to move to a newer Debian LTS release?  Or have they dropped Sparc
support altogether?

(As of this weekend, it seemed to be impossible to find the wheezy sparc
distribution images on-line anymore.  Fortunately I still had a download
of the dvd-1 image stashed away, or I would not have been able to recreate
my qemu VM for the purpose.  It's going to be very hard for any other PG
hackers to investigate that platform in future.)

                        regards, tom lane


Reply | Threaded
Open this post in threaded view
|

Re: snapper vs. HEAD

Tom Turelinckx-2
Hi,

Tom Lane wrote:
> Thanks! But it doesn't seem to have taken: snapper just did a new run
> that still failed, and it still seems to be using -O2.

Snapper did build using -O1 a few hours ago, but it failed the check stage very early with a different error:

FATAL:  null value in column "classid" of relation "pg_depend" violates not-null constraint

I then cleared out the ccache and forced a build of HEAD: same issue.

Next I cleared out the ccache and forced a build of HEAD with -O2: this is the one you saw.

Finally, I've cleared out both the ccache and the accache and forced a build of HEAD with -O1. It failed the check stage again very early with the above error.

> to move to a newer Debian LTS release? Or have they dropped Sparc
> support altogether?

Wheezy was the last stable release for Debian sparc. Sparc64 is a Debian ports architecture, but there are no stable releases for sparc64. I do maintain private sparc64 repositories for Stretch and Buster, and I could configure buildfarm animals for those (on faster hardware too), but those releases are not officially available.

Best regards,
Tom Turelinckx


Reply | Threaded
Open this post in threaded view
|

Re: snapper vs. HEAD

Andres Freund
In reply to this post by Tom Lane-2
Hi,

On 2020-03-30 12:24:06 -0400, Tom Lane wrote:
> "Tom Turelinckx" <[hidden email]> writes:
> Yeah, I've got a couple of those myself.  But perhaps it'd be sensible
> to move to a newer Debian LTS release?  Or have they dropped Sparc
> support altogether?

I think the 32bit sparc support has been phased out. Sparc64 isn't a
"official port", but there's a port:
https://wiki.debian.org/Sparc64
including seemingly regularly updated images.


> (As of this weekend, it seemed to be impossible to find the wheezy sparc
> distribution images on-line anymore.  Fortunately I still had a download
> of the dvd-1 image stashed away, or I would not have been able to recreate
> my qemu VM for the purpose.  It's going to be very hard for any other PG
> hackers to investigate that platform in future.)

They've been moved to archive.debian.org, but they should still be
downloadable. Seems like the website hasn't been quite updated to that
fact...

The installer downloads are still available at:
https://www.debian.org/releases/wheezy/debian-installer/

but sources.list would need to be pointed at something like

deb http://archive.debian.org/debian/ wheezy contrib main non-free

See also https://www.debian.org/distrib/archive

Greetings,

Andres Freund


Reply | Threaded
Open this post in threaded view
|

Re: snapper vs. HEAD

Tom Lane-2
Andres Freund <[hidden email]> writes:
> On 2020-03-30 12:24:06 -0400, Tom Lane wrote:
>> (As of this weekend, it seemed to be impossible to find the wheezy sparc
>> distribution images on-line anymore.

> The installer downloads are still available at:
> https://www.debian.org/releases/wheezy/debian-installer/

Ah, I should have clarified.  That's 7.11, which I'd tried last time
I was interested in duplicating snapper, and I found that it does
not work under qemu's sparc emulation (my notes don't have much detail,
but some installer component was core-dumping).  What did work, after
some hair pulling, was 7.6 ... and that's the version I can no longer
find on-line.  But it has what seems to be the same gcc release as
snapper is using, so I figured it was close enough.

                        regards, tom lane


Reply | Threaded
Open this post in threaded view
|

Re: snapper vs. HEAD

Tom Lane-2
In reply to this post by Tom Turelinckx-2
"Tom Turelinckx" <[hidden email]> writes:
> Tom Lane wrote:
>> Thanks! But it doesn't seem to have taken: snapper just did a new run
>> that still failed, and it still seems to be using -O2.

> Snapper did build using -O1 a few hours ago, but it failed the check stage very early with a different error:
> FATAL:  null value in column "classid" of relation "pg_depend" violates not-null constraint

Ugh.  Compiler bugs coming out of the woodwork?

Not sure what to suggest here.  It certainly is useful to us to have
sparc buildfarm testing, but probably sparc64 would be more useful
than sparc32.  (It looks like FreeBSD and OpenBSD have dropped sparc32
altogether, and NetBSD has bumped it to tier II status.)  One idea
is to try -O0 ...

                        regards, tom lane