Finding cause of test fails on the cfbot site


Finding cause of test fails on the cfbot site

Peter Smith
Hi -

I saw that one of our commitfest entries (32/2914) is recently
reporting a fail on the cfbot site [1]. I thought this was all ok a
few days ago.

We can see the test log indicating which test failed [2]:
Test Summary Report
-------------------
t/002_twophase_streaming.pl (Wstat: 7424 Tests: 1 Failed: 0)
  Non-zero exit status: 29
  Parse errors: Bad plan.  You planned 2 tests but ran 1.
Files=2, Tests=3,  4 wallclock secs ( 0.03 usr  0.00 sys +  1.59 cusr
0.81 csys =  2.44 CPU)
Result: FAIL
gmake[2]: *** [../../src/makefiles/pgxs.mk:440: check] Error 1
gmake[1]: *** [Makefile:94: check-test_decoding-recurse] Error 2
gmake: *** [GNUmakefile:71: check-world-contrib-recurse] Error 2
*** Error code 2


Is there any other detailed information available anywhere, e.g.
logs, which might help us work out the cause of the test
failure?

Thank you.

---
[1] http://cfbot.cputube.org/
[2] https://api.cirrus-ci.com/v1/task/5352561114873856/logs/test.log

Kind Regards,
Peter Smith.
Fujitsu Australia



Re: Finding cause of test fails on the cfbot site

Tom Lane
Peter Smith <[hidden email]> writes:
> I saw that one of our commitfest entries (32/2914) is recently
> reporting a fail on the cfbot site [1]. I thought this was all ok a
> few days ago.
> ...
> Is there any other detailed information available anywhere, e.g.
> logs?, which might help us work out what was the cause of the test
> failure?

AFAIK the cfbot doesn't capture anything beyond the session typescript.
However, this doesn't look that hard to reproduce locally ... have you
tried, using similar configure options to what that cfbot run did?
Once you did reproduce it, there'd be logs under
contrib/test_decoding/tmp_check/.

                        regards, tom lane



Re: Finding cause of test fails on the cfbot site

Andrew Dunstan

On 2/17/21 11:06 AM, Tom Lane wrote:

> Peter Smith <[hidden email]> writes:
>> I saw that one of our commitfest entries (32/2914) is recently
>> reporting a fail on the cfbot site [1]. I thought this was all ok a
>> few days ago.
>> ...
>> Is there any other detailed information available anywhere, e.g.
>> logs?, which might help us work out what was the cause of the test
>> failure?
> AFAIK the cfbot doesn't capture anything beyond the session typescript.
> However, this doesn't look that hard to reproduce locally ... have you
> tried, using similar configure options to what that cfbot run did?
> Once you did reproduce it, there'd be logs under
> contrib/test_decoding/tmp_check/.
>
>



yeah. The cfbot runs check-world which makes it difficult for it to know
which log files to show when there's an error. That's a major part of
the reason the buildfarm runs a much finer grained set of steps.


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com




Re: Finding cause of test fails on the cfbot site

Thomas Munro
On Thu, Feb 18, 2021 at 9:18 AM Andrew Dunstan <[hidden email]> wrote:

> On 2/17/21 11:06 AM, Tom Lane wrote:
> > Peter Smith <[hidden email]> writes:
> >> I saw that one of our commitfest entries (32/2914) is recently
> >> reporting a fail on the cfbot site [1]. I thought this was all ok a
> >> few days ago.
> >> ...
> >> Is there any other detailed information available anywhere, e.g.
> >> logs?, which might help us work out what was the cause of the test
> >> failure?
> > AFAIK the cfbot doesn't capture anything beyond the session typescript.
> > However, this doesn't look that hard to reproduce locally ... have you
> > tried, using similar configure options to what that cfbot run did?
> > Once you did reproduce it, there'd be logs under
> > contrib/test_decoding/tmp_check/.
>
> yeah. The cfbot runs check-world which makes it difficult for it to know
> which log files to show when there's an error. That's a major part of
> the reason the buildfarm runs a much finer grained set of steps.

Yeah, it's hard to make it print out just the right logs without
dumping so much stuff that it's hard to see the wood for the trees;
perhaps if the Makefile had an option to dump relevant stuff for the
specific tests that failed, or perhaps the buildfarm is already better
at that and cfbot should just use the buildfarm client directly.  Hmm.
Another idea would be to figure out how to make a tarball of all log
files that you can download for inspection with better tools at home
when things go wrong.  It would rapidly blow through the 1GB limit for
stored "artefacts" on open source/community Cirrus accounts though, so
we'd need to figure out how to manage retention carefully.

For what it's worth, I tried to reproduce this particular failure on a
couple of systems, many times, with no luck.  It doesn't look like a
freak CI failure (there have been a few random terminations I can't
explain recently, but they look different; I think there was a Google
Compute Engine outage that might explain those), and it failed in
exactly the same way on Linux and FreeBSD.  I tried locally on FreeBSD,
on top of commit a975ff4980d60f8cbd8d8cbcff70182ea53e787a (which is
what the last cfbot run used), because the patch conflicts with a
recent change so it doesn't apply on the tip of master right now.



Re: Finding cause of test fails on the cfbot site

Thomas Munro
On Thu, Feb 18, 2021 at 9:42 AM Thomas Munro <[hidden email]> wrote:
> ... (there have been a few random terminations I can't explain
> recently, but they look different, I think there was a Google Compute
> Engine outage that might explain that), ...

There's also occasionally a failure like this[1], on FreeBSD only:

t/001_stream_rep.pl .................. ok
t/002_archiving.pl ................... ok
t/003_recovery_targets.pl ............ ok
Bailout called. Further testing stopped: system pg_basebackup failed
FAILED--Further testing stopped: system pg_basebackup failed

I have a sneaking suspicion this is a real problem on master, but more
logs will be needed to guess more about that...

[1] https://cirrus-ci.com/task/5046982650626048?command=test#L733



Re: Finding cause of test fails on the cfbot site

Peter Smith
On Thu, Feb 18, 2021 at 9:55 AM Thomas Munro <[hidden email]> wrote:

>
> On Thu, Feb 18, 2021 at 9:42 AM Thomas Munro <[hidden email]> wrote:
> > ... (there have been a few random terminations I can't explain
> > recently, but they look different, I think there was a Google Compute
> > Engine outage that might explain that), ...
>
> There's also occasionally a failure like this[1], on FreeBSD only:
>
> t/001_stream_rep.pl .................. ok
> t/002_archiving.pl ................... ok
> t/003_recovery_targets.pl ............ ok
> Bailout called. Further testing stopped: system pg_basebackup failed
> FAILED--Further testing stopped: system pg_basebackup failed
>
> I have a sneaking suspicion this is a real problem on master, but more
> logs will be needed to guess more about that...
>
> [1] https://cirrus-ci.com/task/5046982650626048?command=test#L733

Thanks for all the effort spent looking into this.

Meanwhile, since you pointed out that the patch no longer applies on
the tip of master, I can at least address that.

------
Kind Regards,
Peter Smith.
Fujitsu Australia



Re: Finding cause of test fails on the cfbot site

Andrew Dunstan
In reply to this post by Thomas Munro

On 2/17/21 3:42 PM, Thomas Munro wrote:

> On Thu, Feb 18, 2021 at 9:18 AM Andrew Dunstan <[hidden email]> wrote:
>> On 2/17/21 11:06 AM, Tom Lane wrote:
>>> Peter Smith <[hidden email]> writes:
>>>> I saw that one of our commitfest entries (32/2914) is recently
>>>> reporting a fail on the cfbot site [1]. I thought this was all ok a
>>>> few days ago.
>>>> ...
>>>> Is there any other detailed information available anywhere, e.g.
>>>> logs?, which might help us work out what was the cause of the test
>>>> failure?
>>> AFAIK the cfbot doesn't capture anything beyond the session typescript.
>>> However, this doesn't look that hard to reproduce locally ... have you
>>> tried, using similar configure options to what that cfbot run did?
>>> Once you did reproduce it, there'd be logs under
>>> contrib/test_decoding/tmp_check/.
>> yeah. The cfbot runs check-world which makes it difficult for it to know
>> which log files to show when there's an error. That's a major part of
>> the reason the buildfarm runs a much finer grained set of steps.
> Yeah, it's hard to make it print out just the right logs without
> dumping so much stuff that it's hard to see the wood for the trees;
> perhaps if the Makefile had an option to dump relevant stuff for the
> specific tests that failed, or perhaps the buildfarm is already better
> at that and cfbot should just use the buildfarm client directly.  Hmm.
> Another idea would be to figure out how to make a tarball of all log
> files that you can download for inspection with better tools at home
> when things go wrong.  It would rapidly blow through the 1GB limit for
> stored "artefacts" on open source/community Cirrus accounts though, so
> we'd need to figure out how to manage retention carefully.


I did some thinking about this. How about if we have the makefiles and
the MSVC build system create a well-known file with the location(s) to
search for log files if there's an error? Each bit of testing could
overwrite this file before starting, and then tools like cfbot
would know where to look for files to report. To keep things clean for
other users the file would only be created if, say,
PG_NEED_ERROR_LOG_LOCATIONS is set. The well-known location would be
something like "$(top_builddir)/error_log_locations.txt", and individual
Makefiles would have entries something like:


    override ERROR_LOG_LOCATIONS = $(top_builddir)/contrib/test_decoding/tmp_check/log


If this seems like a good idea I can go and try to make that happen.
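For what it's worth, the consumer side of that scheme could be quite small. Here is a hypothetical sketch in Python (the file name, the one-directory-per-line format, and the *.log pattern are just the proposal above, not anything that exists yet):

```python
# Hypothetical consumer of the proposed error_log_locations.txt: read the
# directory paths it lists and collect any log files found there, so a tool
# like cfbot could print or archive them after a failed run.
from pathlib import Path


def collect_error_logs(top_builddir):
    """Return paths of log files named in error_log_locations.txt, if present."""
    locations_file = Path(top_builddir) / "error_log_locations.txt"
    if not locations_file.exists():
        return []
    logs = []
    for line in locations_file.read_text().splitlines():
        line = line.strip()
        if not line:
            continue
        log_dir = Path(line)
        if log_dir.is_dir():
            # Assumes logs are plain *.log files directly in the listed dir.
            logs.extend(sorted(log_dir.glob("*.log")))
    return logs
```

The tool would then just cat or tar whatever paths come back.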


cheers


andrew


--

Andrew Dunstan
EDB: https://www.enterprisedb.com




Re: Finding cause of test fails on the cfbot site

Andrew Dunstan

On 2/18/21 11:01 AM, Andrew Dunstan wrote:

> On 2/17/21 3:42 PM, Thomas Munro wrote:
>> On Thu, Feb 18, 2021 at 9:18 AM Andrew Dunstan <[hidden email]> wrote:
>>> On 2/17/21 11:06 AM, Tom Lane wrote:
>>>> Peter Smith <[hidden email]> writes:
>>>>> I saw that one of our commitfest entries (32/2914) is recently
>>>>> reporting a fail on the cfbot site [1]. I thought this was all ok a
>>>>> few days ago.
>>>>> ...
>>>>> Is there any other detailed information available anywhere, e.g.
>>>>> logs?, which might help us work out what was the cause of the test
>>>>> failure?
>>>> AFAIK the cfbot doesn't capture anything beyond the session typescript.
>>>> However, this doesn't look that hard to reproduce locally ... have you
>>>> tried, using similar configure options to what that cfbot run did?
>>>> Once you did reproduce it, there'd be logs under
>>>> contrib/test_decoding/tmp_check/.
>>> yeah. The cfbot runs check-world which makes it difficult for it to know
>>> which log files to show when there's an error. That's a major part of
>>> the reason the buildfarm runs a much finer grained set of steps.
>> Yeah, it's hard to make it print out just the right logs without
>> dumping so much stuff that it's hard to see the wood for the trees;
>> perhaps if the Makefile had an option to dump relevant stuff for the
>> specific tests that failed, or perhaps the buildfarm is already better
>> at that and cfbot should just use the buildfarm client directly.  Hmm.
>> Another idea would be to figure out how to make a tarball of all log
>> files that you can download for inspection with better tools at home
>> when things go wrong.  It would rapidly blow through the 1GB limit for
>> stored "artefacts" on open source/community Cirrus accounts though, so
>> we'd need to figure out how to manage retention carefully.
>
> I did some thinking about this. How about if we have the make files and
> the msvc build system create a well known file with the location(s) to
> search for log files if there's an error. Each bit of testing could
> overwrite this file before starting testing, and then tools like cfbot
> would know where to look for files to report? To keep things clean for
> other users the file would only be created if, say,
> PG_NEED_ERROR_LOG_LOCATIONS is set. The well known location would be
> something like "$(top_builddir)/error_log_locations.txt", and individual
> Makefiles would have entries something like:
>
>
>     override ERROR_LOG_LOCATIONS = $(top_builddir)/contrib/test_decoding/tmp_check/log
>
>
> If this seems like a good idea I can go and try to make that happen.
>
>
Here's a very small and simple (and possibly naive) POC patch that
demonstrates this and seems to do the right thing.


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com


error_log_locations_POC.patch (2K)

Re: Finding cause of test fails on the cfbot site

Peter Smith
In reply to this post by Thomas Munro
Here is another related question about the cfbot error reporting -

The main cfbot "status page" [1] still shows a couple of fails for the
32/2914 (for freebsd & linux). But looking more closely, those fails
are not from the latest run. e.g. I also found this execution
"history" page [2] for our patch which shows the most recent run was
ok for commit a7e715.

~~

So it seems like there is some kind of rule that says the main status
page will still indicate "recent" errors (even if the latest execution
was ok)...

IIUC that explains the difference between a hollow red 'X' (old fail)
and a solid red 'X' (new fail)? And I am guessing that if our patch
continues to work ok (for how long?) then that hollow red 'X' may
morph into a solid green 'tick' (when the old fail becomes too old to
care about anymore)?

But those are just my guesses based on those icon tooltips. What
*really* are the rules for those main page status indicators, and how
long do the old failure icons linger around before changing to success
icons? (Apologies if a legend for those icons is already described
somewhere - I didn't find it).

Thanks!

------
[1] http://cfbot.cputube.org/
[2] https://cirrus-ci.com/github/postgresql-cfbot/postgresql/commitfest/32/2914

Kind Regards,
Peter Smith.
Fujitsu Australia



Re: Finding cause of test fails on the cfbot site

Thomas Munro
On Sat, Feb 20, 2021 at 10:31 AM Peter Smith <[hidden email]> wrote:

> Here is another related question about the cfbot error reporting -
>
> The main cfbot "status page" [1] still shows a couple of fails for the
> 32/2914 (for freebsd & linux). But looking more closely, those fails
> are not from the latest run. e.g. I also found this execution
> "history" page [2] for our patch which shows the most recent run was
> ok for commit a7e715.
>
> ~~
>
> So it seems like there is some kind of rule that says the main status
> page will still indicate "recent* errors (even if the latest execution
> was ok)...

Hmmph.  It seems like there is indeed some kind of occasional glitch,
possibly a bug in my code for picking up statuses from Cirrus through
their GraphQL API (that's a recent thing; we had to change providers
due to another CI's policy changes in January, and apparently
something in the new pipeline isn't quite fully baked).  Will look
into that this weekend.  Sorry about that, and thanks for letting me
know.

> IIUC that explains the difference between a hollow red 'X' (old fail)
> and a solid red 'X' (new fail)? And I am guessing if our patch
> continues to work ok (for how long?) then that hollow red 'X' may
> morph into a solid green 'tick' (when the old fail becomes too old to
> care about anymore)?
>
> But those are just my guesses based on those icon tooltips. What
> *really* are the rules for those main page status indicators, and how
> long do the old failure icons linger around before changing to success
> icons? (Apologies if a legend for those icons is already described
> somewhere - I didn't find it).

Yeah, I will try to clarify the UI a bit...



Re: Finding cause of test fails on the cfbot site

Thomas Munro
On Sat, Feb 20, 2021 at 11:33 AM Thomas Munro <[hidden email]> wrote:
> On Sat, Feb 20, 2021 at 10:31 AM Peter Smith <[hidden email]> wrote:
> > Here is another related question about the cfbot error reporting -
> >
> > The main cfbot "status page" [1] still shows a couple of fails for the
> > 32/2914 (for freebsd & linux). But looking more closely, those fails
> > are not from the latest run. e.g. I also found this execution
> > "history" page [2] for our patch which shows the most recent run was
> > ok for commit a7e715.

> Hmmph.  It seems like there is indeed some kind of occasional glitch,
> possibly a bug in my code for picking up statuses from Cirrus through
> their GraphQL API (that's a recent thing; we had to change providers
> due to another CI's policy changes in January, and apparently
> something in the new pipeline isn't quite fully baked).  Will look
> into that this weekend.  Sorry about that, and thanks for letting me
> know.

Should be fixed now.  (Embarrassing detail: I used flock(LOCK_EX) to
prevent two copies of a cron job from running at the same time, but
Python helpfully garbage-collected and closed the fd, releasing the
lock.  Rarely, the script would run for long enough to have some state
clobbered by the next run, so you wouldn't see the latest result.)
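For anyone who hasn't hit this before, a minimal Python illustration of the pitfall (Unix only): the flock lock lives only as long as the open file object, so something must keep the object referenced:

```python
# Demonstrates the flock/garbage-collection trap described above.
# flock() is tied to the open file: if nothing references the file object,
# CPython collects it, closes the fd, and silently drops the lock.
import fcntl


def take_lock_broken(path):
    # BUG: the file object created here is unreferenced once flock()
    # returns, so it is closed almost immediately, releasing the lock.
    fcntl.flock(open(path, "w"), fcntl.LOCK_EX | fcntl.LOCK_NB)


def take_lock(path):
    # Fix: return the file object so the caller keeps it (and the lock)
    # alive for as long as mutual exclusion is needed.
    f = open(path, "w")
    fcntl.flock(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
    return f
```

A caller would hold the returned object for the lifetime of the cron job and close it (or just exit) to release the lock.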



Re: Finding cause of test fails on the cfbot site

Thomas Munro
In reply to this post by Andrew Dunstan
On Sat, Feb 20, 2021 at 3:54 AM Andrew Dunstan <[hidden email]> wrote:
> here's a very small and simple (and possibly naive)  POC patch that
> demonstrates this and seems to do the right thing.

As a small variation that might be more parallelism-friendly, would it
be better to touch a file with a known name in any subdirectory that
contains potentially interesting logs, and rm it when a test succeeds?
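To make that variation concrete, here is a rough Python sketch (the marker file name is invented, not an agreed convention): each test touches a marker in its log directory before starting and removes it on success, so any marker left behind flags the logs of a failed or crashed test.

```python
# Sketch of the sentinel-file variation: a marker that survives the test
# run points at a log directory worth reporting.
from pathlib import Path

MARKER = ".test_in_progress"  # hypothetical well-known name


def start_test(log_dir):
    """Called by a test before it starts: leave a marker in its log dir."""
    Path(log_dir, MARKER).touch()


def end_test_ok(log_dir):
    """Called only on success: remove the marker."""
    Path(log_dir, MARKER).unlink(missing_ok=True)


def failed_log_dirs(top_builddir):
    """Directories whose marker was never removed, i.e. failed tests."""
    return sorted(p.parent for p in Path(top_builddir).rglob(MARKER))
```

This parallelizes trivially, since each test only ever touches its own directory, and the reporting tool just walks the tree once at the end.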



Re: Finding cause of test fails on the cfbot site

Andres Freund
In reply to this post by Andrew Dunstan
Hi,

On 2021-02-17 15:18:02 -0500, Andrew Dunstan wrote:
> yeah. The cfbot runs check-world which makes it difficult for it to know
> which log files to show when there's an error. That's a major part of
> the reason the buildfarm runs a much finer grained set of steps.

I really think we need a better solution for this across the different
use-cases of running tests. For development parallel check-world is
important for a decent hack-test loop. But I waste a fair bit of time
scrolling back to find the original source of failures. And on the
buildfarm we waste a significant amount of time by limiting parallelism
due to the non-parallel sequence of finer grained steps.

And it's not just about logs - even just easily seeing the first
reported test failure without needing to search through large amounts of
text would be great.

With, um, more modern build tools (e.g. ninja) you'll at least get the
last failure displayed at the end, instead of seeing a lot of other
things after it like with make.


My suspicion is that, given the need to have this work for both msvc and
make, writing an in-core test-runner script is the only real option to
improve upon the current situation.

For make it'd not be hard to add a recursive 'listchecks' target listing
the individual tests that need to be run. Hacking up vcregress.pl to do
that, instead of what it currently does, shouldn't be too hard either.


Once there's a list of commands that need to be run it's not hard to
write a loop in perl that runs up to N tests in parallel, saving their
output. Which then allows to display the failing test reports at the
end.
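As a rough illustration of that loop (sketched in Python rather than the Perl suggested here; the names and the parallelism default are made up), run up to N commands concurrently, capture their output, and report only the failures at the end:

```python
# Minimal parallel test runner: run shell commands with bounded
# concurrency, capture output, and return only the failing reports.
import subprocess
from concurrent.futures import ThreadPoolExecutor


def run_tests(commands, max_parallel=4):
    """Run each command; return a list of (command, output) for failures."""
    def run(cmd):
        proc = subprocess.run(cmd, shell=True, capture_output=True, text=True)
        return cmd, proc.returncode, proc.stdout + proc.stderr

    failures = []
    # Threads suffice here because the work happens in child processes.
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        for cmd, rc, output in pool.map(run, commands):
            if rc != 0:
                failures.append((cmd, output))
    return failures
```

The real thing would stream progress as tests finish and then replay the saved output of the failures, but the shape of the loop is the same.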


If we then also add a convention that each test outputs something like
TESTLOG: path/to/logfile
...
it'd not be hard to add support for the test runner to list the files
that cfbot et al should output.
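Picking those lines out of captured output would then be trivial; for example (again, just a sketch of the convention floated above, which doesn't exist yet):

```python
# Extract the log paths a test advertised via the proposed
# "TESTLOG: path/to/logfile" convention in its output.
def extract_testlogs(output):
    """Return log file paths named on TESTLOG: lines of a test's output."""
    marker = "TESTLOG: "
    paths = []
    for line in output.splitlines():
        if line.startswith(marker):
            paths.append(line[len(marker):].strip())
    return paths
```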


Looking around the tree, the most annoying bit to implement something
like this is that things below src/bin/, src/interfaces, src/test,
src/pl implement their own check, installcheck targets. Given the number
of these that just boil down to a variant of

check:
        $(pg_regress_check)
        $(prove_check)
installcheck:
        $(pg_regress_installcheck)

it seems we should lift the REGRESS and TAP_TESTS specific logic in
pgxs.mk up into src/Makefile.global, which would then make something
like a global listchecks target easy.

Greetings,

Andres Freund



Re: Finding cause of test fails on the cfbot site

Andrew Dunstan
In reply to this post by Thomas Munro

On 2/21/21 9:28 PM, Thomas Munro wrote:
> On Sat, Feb 20, 2021 at 3:54 AM Andrew Dunstan <[hidden email]> wrote:
>> here's a very small and simple (and possibly naive)  POC patch that
>> demonstrates this and seems to do the right thing.
> As a small variation that might be more parallelism-friendly, would it
> be better to touch a file with a known name in any subdirectory that
> contains potentially interesting logs, and rm it when a test succeeds?



Yes, that sounds better, Thanks.


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com




Re: Finding cause of test fails on the cfbot site

Andrew Dunstan
In reply to this post by Andres Freund

On 2/21/21 10:34 PM, Andres Freund wrote:

> Hi,
>
> On 2021-02-17 15:18:02 -0500, Andrew Dunstan wrote:
>> yeah. The cfbot runs check-world which makes it difficult for it to know
>> which log files to show when there's an error. That's a major part of
>> the reason the buildfarm runs a much finer grained set of steps.
> I really think we need a better solution for this across the different
> use-cases of running tests. For development parallel check-world is
> important for a decent hack-test loop. But I waste a fair bit of time to
> scroll back to find the original source of failures. And on the
> buildfarm we waste a significant amount of time by limiting parallelism
> due to the non-parallel sequence of finer grained steps.


Maybe, but running fast isn't really a design goal of the buildfarm. It's
meant to be automated so it doesn't matter if it takes 10 or 20 or 60
minutes.


Another reason for using fine grained tasks is to be able to
include/exclude them as needed. See the buildfarm's
exclude-steps/only-steps parameters.


That said there is some provision for parallelism, in that multiple
branches and multiple members can be tested at the same time, and
run_branches.pl will manage that fairly nicely for you. See
<https://wiki.postgresql.org/wiki/PostgreSQL_Buildfarm_Howto#Running_in_Parallel>
for details.


> And it's not just about logs - even just easily seeing the first
> reported test failure without needing to search through large amounts of
> text would be great.
>
> With, um, more modern buildtools (e.g. ninja) you'll at least get the
> last failure displayed at the end, instead of seeing a lot of other
> things after it like with make.
>
>
> My suspicion is that, given the need to have this work for both msvc and
> make, writing an in-core test-runner script is the only real option to
> improve upon the current situation.



OK ... be prepared for a non-trivial maintenance cost, however, which
will be borne by those of us fluent in Perl, the only realistic
possibility unless we want to add to the build dependencies. That's far
from everyone.


Part of the problem that this isn't going to solve is the sheer volume
that some tests produce. For example, the pg_dump tests produce about
40k lines / 5MB of log.



> For make it'd not be hard to add a recursive 'listchecks' target listing
> the individual tests that need to be run. Hacking up vcregress.pl to do
> that, instead of what it currently does, shouldn't be too hard either.
>
>
> Once there's a list of commands that need to be run it's not hard to
> write a loop in perl that runs up to N tests in parallel, saving their
> output. Which then allows to display the failing test reports at the
> end.
>
>
> If we then also add a convention that each test outputs something like
> TESTLOG: path/to/logfile
> ...
> it'd not be hard to add support for the test runner to list the files
> that cfbot et al should output.


Yeah, there is code in the buildfarm that contains a lot of building
blocks that can be used for this sort of stuff, see the PGBuild::Utils
and PGBuild::Log modules.


> Looking around the tree, the most annoying bit to implement something
> like this is that things below src/bin/, src/interfaces, src/test,
> src/pl implement their own check, installcheck targets. Given the number
> of these that just boil down to a variant of
>
> check:
>         $(pg_regress_check)
>         $(prove_check)
> installcheck:
>         $(pg_regress_installcheck)
>
> it seems we should lift the REGRESS and TAP_TESTS specific logic in
> pgxs.mk up into src/Makefile.global, which would then make something
> like a global listchecks target easy.
>

Yeah, some of this stuff has grown a bit haphazardly, and maybe needs
some rework.


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com