Hi -
I saw that one of our commitfest entries (32/2914) is recently reporting a fail on the cfbot site [1]. I thought this was all ok a few days ago. We can see the test log indicating what was the test that failed [2] Test Summary Report ------------------- t/002_twophase_streaming.pl (Wstat: 7424 Tests: 1 Failed: 0) Non-zero exit status: 29 Parse errors: Bad plan. You planned 2 tests but ran 1. Files=2, Tests=3, 4 wallclock secs ( 0.03 usr 0.00 sys + 1.59 cusr 0.81 csys = 2.44 CPU) Result: FAIL gmake[2]: *** [../../src/makefiles/pgxs.mk:440: check] Error 1 gmake[1]: *** [Makefile:94: check-test_decoding-recurse] Error 2 gmake: *** [GNUmakefile:71: check-world-contrib-recurse] Error 2 *** Error code 2 Is there any other detailed information available anywhere, e.g. logs?, which might help us work out what was the cause of the test failure? Thankyou. --- [1] http://cfbot.cputube.org/ [2] https://api.cirrus-ci.com/v1/task/5352561114873856/logs/test.log Kind Regards, Peter Smith. Fujitsu Australia |
Peter Smith <[hidden email]> writes:
> I saw that one of our commitfest entries (32/2914) is recently
> reporting a fail on the cfbot site [1]. I thought this was all ok a
> few days ago.
> ...
> Is there any other detailed information available anywhere, e.g.
> logs?, which might help us work out what was the cause of the test
> failure?

AFAIK the cfbot doesn't capture anything beyond the session typescript.
However, this doesn't look that hard to reproduce locally ... have you
tried, using similar configure options to what that cfbot run did?
Once you did reproduce it, there'd be logs under
contrib/test_decoding/tmp_check/.

			regards, tom lane
On 2/17/21 11:06 AM, Tom Lane wrote:
> Peter Smith <[hidden email]> writes:
>> I saw that one of our commitfest entries (32/2914) is recently
>> reporting a fail on the cfbot site [1]. I thought this was all ok a
>> few days ago.
>> ...
>> Is there any other detailed information available anywhere, e.g.
>> logs?, which might help us work out what was the cause of the test
>> failure?
> AFAIK the cfbot doesn't capture anything beyond the session typescript.
> However, this doesn't look that hard to reproduce locally ... have you
> tried, using similar configure options to what that cfbot run did?
> Once you did reproduce it, there'd be logs under
> contrib/test_decoding/tmp_check/.

Yeah. The cfbot runs check-world, which makes it difficult for it to know
which log files to show when there's an error. That's a major part of
the reason the buildfarm runs a much finer-grained set of steps.

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com
On Thu, Feb 18, 2021 at 9:18 AM Andrew Dunstan <[hidden email]> wrote:
> On 2/17/21 11:06 AM, Tom Lane wrote:
> > Peter Smith <[hidden email]> writes:
> >> I saw that one of our commitfest entries (32/2914) is recently
> >> reporting a fail on the cfbot site [1]. I thought this was all ok a
> >> few days ago.
> >> ...
> >> Is there any other detailed information available anywhere, e.g.
> >> logs?, which might help us work out what was the cause of the test
> >> failure?
> > AFAIK the cfbot doesn't capture anything beyond the session typescript.
> > However, this doesn't look that hard to reproduce locally ... have you
> > tried, using similar configure options to what that cfbot run did?
> > Once you did reproduce it, there'd be logs under
> > contrib/test_decoding/tmp_check/.
>
> yeah. The cfbot runs check-world which makes it difficult for it to know
> which log files to show when there's an error. That's a major part of
> the reason the buildfarm runs a much finer grained set of steps.

Yeah, it's hard to make it print out just the right logs without
dumping so much stuff that it's hard to see the wood for the trees;
perhaps the Makefile could have an option to dump relevant stuff for the
specific tests that failed, or perhaps the buildfarm is already better
at that and cfbot should just use the buildfarm client directly. Hmm.
Another idea would be to figure out how to make a tarball of all log
files that you can download for inspection with better tools at home
when things go wrong. It would rapidly blow through the 1GB limit for
stored "artefacts" on open source/community Cirrus accounts, though, so
we'd need to figure out how to manage retention carefully.

For what it's worth, I tried to reproduce this particular failure on a
couple of systems, many times, with no luck. It doesn't look like a
freak CI failure (there have been a few random terminations I can't
explain recently, but they look different; I think there was a Google
Compute Engine outage that might explain those), and it failed in
exactly the same way on Linux and FreeBSD.

I tried locally on FreeBSD, on top of commit
a975ff4980d60f8cbd8d8cbcff70182ea53e787a (which is what the last cfbot
run used), because the patch conflicts with a recent change, so it
doesn't apply on the tip of master right now.
On Thu, Feb 18, 2021 at 9:42 AM Thomas Munro <[hidden email]> wrote:
> ... (there have been a few random terminations I can't explain
> recently, but they look different, I think there was a Google Compute
> Engine outage that might explain that), ...

There's also occasionally a failure like this[1], on FreeBSD only:

t/001_stream_rep.pl .................. ok
t/002_archiving.pl ................... ok
t/003_recovery_targets.pl ............ ok
Bailout called.  Further testing stopped:  system pg_basebackup failed
FAILED--Further testing stopped: system pg_basebackup failed

I have a sneaking suspicion this is a real problem on master, but more
logs will be needed to guess more about that...

[1] https://cirrus-ci.com/task/5046982650626048?command=test#L733
On Thu, Feb 18, 2021 at 9:55 AM Thomas Munro <[hidden email]> wrote:
> On Thu, Feb 18, 2021 at 9:42 AM Thomas Munro <[hidden email]> wrote:
> > ... (there have been a few random terminations I can't explain
> > recently, but they look different, I think there was a Google Compute
> > Engine outage that might explain that), ...
>
> There's also occasionally a failure like this[1], on FreeBSD only:
>
> t/001_stream_rep.pl .................. ok
> t/002_archiving.pl ................... ok
> t/003_recovery_targets.pl ............ ok
> Bailout called. Further testing stopped: system pg_basebackup failed
> FAILED--Further testing stopped: system pg_basebackup failed
>
> I have a sneaking suspicion this is a real problem on master, but more
> logs will be needed to guess more about that...
>
> [1] https://cirrus-ci.com/task/5046982650626048?command=test#L733

Thanks for all the effort spent looking at this.

Meanwhile, since you pointed out that the patch is not applying on the
HEAD tip, I can at least address that.

------
Kind Regards,
Peter Smith.
Fujitsu Australia
On 2/17/21 3:42 PM, Thomas Munro wrote:
> On Thu, Feb 18, 2021 at 9:18 AM Andrew Dunstan <[hidden email]> wrote:
>> On 2/17/21 11:06 AM, Tom Lane wrote:
>>> Peter Smith <[hidden email]> writes:
>>>> I saw that one of our commitfest entries (32/2914) is recently
>>>> reporting a fail on the cfbot site [1]. I thought this was all ok a
>>>> few days ago.
>>>> ...
>>>> Is there any other detailed information available anywhere, e.g.
>>>> logs?, which might help us work out what was the cause of the test
>>>> failure?
>>> AFAIK the cfbot doesn't capture anything beyond the session typescript.
>>> However, this doesn't look that hard to reproduce locally ... have you
>>> tried, using similar configure options to what that cfbot run did?
>>> Once you did reproduce it, there'd be logs under
>>> contrib/test_decoding/tmp_check/.
>> yeah. The cfbot runs check-world which makes it difficult for it to know
>> which log files to show when there's an error. That's a major part of
>> the reason the buildfarm runs a much finer grained set of steps.
> Yeah, it's hard to make it print out just the right logs without
> dumping so much stuff that it's hard to see the wood for the trees;
> perhaps if the Makefile had an option to dump relevant stuff for the
> specific tests that failed, or perhaps the buildfarm is already better
> at that and cfbot should just use the buildfarm client directly. Hmm.
> Another idea would be to figure out how to make a tarball of all log
> files that you can download for inspection with better tools at home
> when things go wrong. It would rapidly blow through the 1GB limit for
> stored "artefacts" on open source/community Cirrus accounts though, so
> we'd need to figure out how to manage retention carefully.

I did some thinking about this. How about if we have the makefiles and
the MSVC build system create a well-known file with the location(s) to
search for log files if there's an error? Each bit of testing could
overwrite this file before starting testing, and then tools like cfbot
would know where to look for files to report. To keep things clean for
other users the file would only be created if, say,
PG_NEED_ERROR_LOG_LOCATIONS is set. The well-known location would be
something like "$(top_builddir)/error_log_locations.txt", and individual
Makefiles would have entries something like:

override ERROR_LOG_LOCATIONS = \
    $(top_builddir)/contrib/test_decoding/tmp_check/log

If this seems like a good idea I can go and try to make that happen.

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com
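The consumer side of this scheme might look roughly like the following Python sketch. The file name and the one-directory-per-line format come from the proposal above; the function itself and its behavior are purely illustrative assumptions, not existing cfbot code:

```python
import pathlib

def collect_error_logs(build_root):
    """Hypothetical consumer of the proposed error_log_locations.txt:
    read the well-known file at the top of the build tree and return
    the log files found in each directory it names, so that a tool like
    cfbot could show just those after a failed test step."""
    locations = pathlib.Path(build_root) / "error_log_locations.txt"
    if not locations.is_file():
        return []  # nothing registered for this test step
    logs = []
    for line in locations.read_text().splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines rather than treating "" as cwd
        logdir = pathlib.Path(line)
        if logdir.is_dir():
            logs.extend(sorted(p for p in logdir.iterdir() if p.is_file()))
    return logs

if __name__ == "__main__":
    for logfile in collect_error_logs("."):
        print("==== %s ====" % logfile)
        print(logfile.read_text())
```

Run against a build tree where a test step has registered its tmp_check/log directory, this would dump only the logs relevant to the failed step.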
On 2/18/21 11:01 AM, Andrew Dunstan wrote:
> On 2/17/21 3:42 PM, Thomas Munro wrote:
>> On Thu, Feb 18, 2021 at 9:18 AM Andrew Dunstan <[hidden email]> wrote:
>>> On 2/17/21 11:06 AM, Tom Lane wrote:
>>>> Peter Smith <[hidden email]> writes:
>>>>> I saw that one of our commitfest entries (32/2914) is recently
>>>>> reporting a fail on the cfbot site [1]. I thought this was all ok a
>>>>> few days ago.
>>>>> ...
>>>>> Is there any other detailed information available anywhere, e.g.
>>>>> logs?, which might help us work out what was the cause of the test
>>>>> failure?
>>>> AFAIK the cfbot doesn't capture anything beyond the session typescript.
>>>> However, this doesn't look that hard to reproduce locally ... have you
>>>> tried, using similar configure options to what that cfbot run did?
>>>> Once you did reproduce it, there'd be logs under
>>>> contrib/test_decoding/tmp_check/.
>>> yeah. The cfbot runs check-world which makes it difficult for it to know
>>> which log files to show when there's an error. That's a major part of
>>> the reason the buildfarm runs a much finer grained set of steps.
>> Yeah, it's hard to make it print out just the right logs without
>> dumping so much stuff that it's hard to see the wood for the trees;
>> perhaps if the Makefile had an option to dump relevant stuff for the
>> specific tests that failed, or perhaps the buildfarm is already better
>> at that and cfbot should just use the buildfarm client directly. Hmm.
>> Another idea would be to figure out how to make a tarball of all log
>> files that you can download for inspection with better tools at home
>> when things go wrong. It would rapidly blow through the 1GB limit for
>> stored "artefacts" on open source/community Cirrus accounts though, so
>> we'd need to figure out how to manage retention carefully.
>
> I did some thinking about this. How about if we have the makefiles and
> the MSVC build system create a well-known file with the location(s) to
> search for log files if there's an error? Each bit of testing could
> overwrite this file before starting testing, and then tools like cfbot
> would know where to look for files to report. To keep things clean for
> other users the file would only be created if, say,
> PG_NEED_ERROR_LOG_LOCATIONS is set. The well-known location would be
> something like "$(top_builddir)/error_log_locations.txt", and individual
> Makefiles would have entries something like:
>
> override ERROR_LOG_LOCATIONS = \
>     $(top_builddir)/contrib/test_decoding/tmp_check/log
>
> If this seems like a good idea I can go and try to make that happen.

here's a very small and simple (and possibly naive) POC patch that
demonstrates this and seems to do the right thing.

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com
Here is another related question about the cfbot error reporting -
The main cfbot "status page" [1] still shows a couple of fails for
32/2914 (for FreeBSD & Linux). But looking more closely, those fails
are not from the latest run. E.g. I also found this execution
"history" page [2] for our patch, which shows the most recent run was
OK for commit a7e715.

~~

So it seems like there is some kind of rule that says the main status
page will still indicate "recent" errors (even if the latest execution
was ok)... IIUC that explains the difference between a hollow red 'X'
(old fail) and a solid red 'X' (new fail)? And I am guessing that if
our patch continues to work ok (for how long?) then that hollow red 'X'
may morph into a solid green 'tick' (when the old fail becomes too old
to care about anymore)?

But those are just my guesses based on those icon tooltips. What
*really* are the rules for those main-page status indicators, and how
long do the old failure icons linger around before changing to success
icons? (Apologies if a legend for those icons is already described
somewhere - I didn't find it).

Thanks!

------
[1] http://cfbot.cputube.org/
[2] https://cirrus-ci.com/github/postgresql-cfbot/postgresql/commitfest/32/2914

Kind Regards,
Peter Smith.
Fujitsu Australia
On Sat, Feb 20, 2021 at 10:31 AM Peter Smith <[hidden email]> wrote:
> Here is another related question about the cfbot error reporting -
>
> The main cfbot "status page" [1] still shows a couple of fails for
> 32/2914 (for FreeBSD & Linux). But looking more closely, those fails
> are not from the latest run. E.g. I also found this execution
> "history" page [2] for our patch, which shows the most recent run was
> OK for commit a7e715.
>
> ~~
>
> So it seems like there is some kind of rule that says the main status
> page will still indicate "recent" errors (even if the latest execution
> was ok)...

Hmmph. It seems like there is indeed some kind of occasional glitch,
possibly a bug in my code for picking up statuses from Cirrus through
their GraphQL API (that's a recent thing; we had to change providers
due to another CI's policy changes in January, and apparently something
in the new pipeline isn't quite fully baked). Will look into that this
weekend. Sorry about that, and thanks for letting me know.

> IIUC that explains the difference between a hollow red 'X' (old fail)
> and a solid red 'X' (new fail)? And I am guessing that if our patch
> continues to work ok (for how long?) then that hollow red 'X' may
> morph into a solid green 'tick' (when the old fail becomes too old to
> care about anymore)?
>
> But those are just my guesses based on those icon tooltips. What
> *really* are the rules for those main-page status indicators, and how
> long do the old failure icons linger around before changing to success
> icons? (Apologies if a legend for those icons is already described
> somewhere - I didn't find it).

Yeah, I will try to clarify the UI a bit...
On Sat, Feb 20, 2021 at 11:33 AM Thomas Munro <[hidden email]> wrote:
> On Sat, Feb 20, 2021 at 10:31 AM Peter Smith <[hidden email]> wrote:
> > Here is another related question about the cfbot error reporting -
> >
> > The main cfbot "status page" [1] still shows a couple of fails for
> > 32/2914 (for FreeBSD & Linux). But looking more closely, those fails
> > are not from the latest run. E.g. I also found this execution
> > "history" page [2] for our patch, which shows the most recent run was
> > OK for commit a7e715.
>
> Hmmph. It seems like there is indeed some kind of occasional glitch,
> possibly a bug in my code for picking up statuses from Cirrus through
> their GraphQL API (that's a recent thing; we had to change providers
> due to another CI's policy changes in January, and apparently
> something in the new pipeline isn't quite fully baked). Will look
> into that this weekend. Sorry about that, and thanks for letting me
> know.

Should be fixed now. (Embarrassing detail: I used flock(LOCK_EX) to
prevent two copies of a cron job from running at the same time, but
Python helpfully garbage-collected and closed the fd, releasing the
lock. Rarely, the script would run for long enough to have some state
clobbered by the next run, so you wouldn't see the latest result.)
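The pitfall described here is easy to reproduce in a few lines of Python. This is an illustrative sketch of the bug class, not the actual cfbot cron script; it relies on CPython's reference-counting and POSIX flock() semantics:

```python
import fcntl

def lock_unsafe(path):
    # BUG (the pitfall described above): no reference to the file object
    # is kept, so CPython's refcounting closes it as soon as this call
    # returns, which silently releases the flock.
    fcntl.flock(open(path, "w"), fcntl.LOCK_EX | fcntl.LOCK_NB)

def lock_safe(path):
    # Keep the file object alive; the lock is held for its lifetime.
    f = open(path, "w")
    fcntl.flock(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
    return f  # caller must hold this for as long as the lock is needed

if __name__ == "__main__":
    lock_unsafe("/tmp/demo.lock")        # the lock is already gone here...
    holder = lock_safe("/tmp/demo.lock")  # ...so this second lock succeeds
    try:
        # flock locks belong to the open file description, so a fresh
        # open of the same file conflicts even within one process:
        fcntl.flock(open("/tmp/demo.lock", "w"),
                    fcntl.LOCK_EX | fcntl.LOCK_NB)
        print("lock was NOT held")
    except BlockingIOError:
        print("lock is held")
```

The fix is simply the `lock_safe` pattern: keep the file object (or fd) referenced for the whole lifetime of the script.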
On Sat, Feb 20, 2021 at 3:54 AM Andrew Dunstan <[hidden email]> wrote:
> here's a very small and simple (and possibly naive) POC patch that
> demonstrates this and seems to do the right thing.

As a small variation that might be more parallelism-friendly, would it
be better to touch a file with a known name in any subdirectory that
contains potentially interesting logs, and rm it when a test succeeds?
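That variation might look roughly like this. The marker name and helper functions are hypothetical, chosen just to illustrate the idea: each test touches the marker before starting and removes it on success, so a post-check-world sweep finds logs only in directories whose test failed:

```python
import pathlib

MARKER = ".needs_attention"  # hypothetical well-known marker name

def mark(logdir):
    """A test would touch this before starting..."""
    pathlib.Path(logdir, MARKER).touch()

def unmark(logdir):
    """...and remove it when it succeeds."""
    pathlib.Path(logdir, MARKER).unlink()

def sweep(build_root):
    """After check-world, directories still containing the marker are
    exactly those whose test failed; collect their log files."""
    failed_logs = []
    for marker in sorted(pathlib.Path(build_root).rglob(MARKER)):
        logdir = marker.parent
        failed_logs.extend(sorted(
            p for p in logdir.iterdir()
            if p.is_file() and p.name != MARKER))
    return failed_logs
```

Because each test touches and removes its own marker in its own directory, parallel test runs never contend for a single shared locations file.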
Hi,
On 2021-02-17 15:18:02 -0500, Andrew Dunstan wrote:
> yeah. The cfbot runs check-world which makes it difficult for it to know
> which log files to show when there's an error. That's a major part of
> the reason the buildfarm runs a much finer grained set of steps.

I really think we need a better solution for this across the different
use-cases of running tests. For development, parallel check-world is
important for a decent hack-test loop, but I waste a fair bit of time
scrolling back to find the original source of failures. And on the
buildfarm we waste a significant amount of time by limiting parallelism
due to the non-parallel sequence of finer-grained steps.

And it's not just about logs - even just easily seeing the first
reported test failure without needing to search through large amounts of
text would be great.

With, um, more modern build tools (e.g. ninja) you'll at least get the
last failure displayed at the end, instead of seeing a lot of other
things after it like with make.

My suspicion is that, given the need to have this work for both msvc and
make, writing an in-core test-runner script is the only real option to
improve upon the current situation.

For make it'd not be hard to add a recursive 'listchecks' target listing
the individual tests that need to be run. Hacking up vcregress.pl to do
that, instead of what it currently does, shouldn't be too hard either.

Once there's a list of commands that need to be run, it's not hard to
write a loop in perl that runs up to N tests in parallel, saving their
output. That then allows displaying the failing test reports at the end.

If we then also add a convention that each test outputs something like

    TESTLOG: path/to/logfile
    ...

it'd not be hard to add support for the test runner to list the files
that cfbot et al should output.

Looking around the tree, the most annoying bit to implement something
like this is that things below src/bin/, src/interfaces, src/test and
src/pl implement their own check and installcheck targets. Given the
number of these that just boil down to a variant of

check:
	$(pg_regress_check)
	$(prove_check)
installcheck:
	$(pg_regress_installcheck)

it seems we should lift the REGRESS- and TAP_TESTS-specific logic in
pgxs.mk up into src/Makefile.global. Which would then make something
like a global listchecks target easy.

Greetings,

Andres Freund
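A minimal sketch of such a runner loop follows, shown in Python rather than the perl suggested above purely for brevity. The "TESTLOG: path" convention is the one proposed here; the function names and the overall shape are invented for illustration:

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

def run_one(cmd):
    # Run a single test command, capturing stdout and stderr together.
    p = subprocess.run(cmd, shell=True, stdout=subprocess.PIPE,
                       stderr=subprocess.STDOUT, text=True)
    return cmd, p.returncode, p.stdout

def run_checks(commands, jobs=4):
    """Run up to `jobs` test commands in parallel, saving their output,
    and report only the failures (plus any log files they announced via
    the proposed "TESTLOG: path" convention) at the end."""
    failures = []
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        for cmd, rc, out in pool.map(run_one, commands):
            if rc != 0:
                logs = [line.split(":", 1)[1].strip()
                        for line in out.splitlines()
                        if line.startswith("TESTLOG:")]
                failures.append((cmd, out, logs))
    for cmd, out, logs in failures:
        print("FAILED: %s" % cmd)
        print(out, end="")
        for log in logs:
            print("  see log: %s" % log)
    return failures
```

Fed the output of a hypothetical 'listchecks' target, a loop like this would give a single failure report at the end, with the relevant log files listed for cfbot to dump.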
On 2/21/21 9:28 PM, Thomas Munro wrote:
> On Sat, Feb 20, 2021 at 3:54 AM Andrew Dunstan <[hidden email]> wrote:
>> here's a very small and simple (and possibly naive) POC patch that
>> demonstrates this and seems to do the right thing.
> As a small variation that might be more parallelism-friendly, would it
> be better to touch a file with a known name in any subdirectory that
> contains potentially interesting logs, and rm it when a test succeeds?

Yes, that sounds better. Thanks.

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com
On 2/21/21 10:34 PM, Andres Freund wrote:
> Hi,
>
> On 2021-02-17 15:18:02 -0500, Andrew Dunstan wrote:
>> yeah. The cfbot runs check-world which makes it difficult for it to know
>> which log files to show when there's an error. That's a major part of
>> the reason the buildfarm runs a much finer grained set of steps.
> I really think we need a better solution for this across the different
> use-cases of running tests. For development, parallel check-world is
> important for a decent hack-test loop, but I waste a fair bit of time
> scrolling back to find the original source of failures. And on the
> buildfarm we waste a significant amount of time by limiting parallelism
> due to the non-parallel sequence of finer-grained steps.

Maybe, but running fast isn't really a design goal of the buildfarm.
It's meant to be automated, so it doesn't matter if it takes 10 or 20
or 60 minutes. Another reason for using fine-grained tasks is to be
able to include/exclude them as needed. See the buildfarm's
exclude-steps/only-steps parameters.

That said, there is some provision for parallelism, in that multiple
branches and multiple members can be tested at the same time, and
run_branches.pl will manage that fairly nicely for you. See
<https://wiki.postgresql.org/wiki/PostgreSQL_Buildfarm_Howto#Running_in_Parallel>
for details.

> And it's not just about logs - even just easily seeing the first
> reported test failure without needing to search through large amounts of
> text would be great.
>
> With, um, more modern build tools (e.g. ninja) you'll at least get the
> last failure displayed at the end, instead of seeing a lot of other
> things after it like with make.
>
> My suspicion is that, given the need to have this work for both msvc and
> make, writing an in-core test-runner script is the only real option to
> improve upon the current situation.

OK ... be prepared for a non-trivial maintenance cost, however, which
will be borne by those of us fluent in perl - the only realistic
possibility unless we want to add to build dependencies. That's far
from everyone.

Part of the problem that this isn't going to solve is the sheer volume
that some tests produce. For example, the pg_dump tests produce about
40k lines / 5MB of log.

> For make it'd not be hard to add a recursive 'listchecks' target listing
> the individual tests that need to be run. Hacking up vcregress.pl to do
> that, instead of what it currently does, shouldn't be too hard either.
>
> Once there's a list of commands that need to be run, it's not hard to
> write a loop in perl that runs up to N tests in parallel, saving their
> output. That then allows displaying the failing test reports at the end.
>
> If we then also add a convention that each test outputs something like
>     TESTLOG: path/to/logfile
>     ...
> it'd not be hard to add support for the test runner to list the files
> that cfbot et al should output.

Yeah, there is code in the buildfarm that contains a lot of building
blocks that can be used for this sort of stuff; see the PGBuild::Utils
and PGBuild::Log modules.

> Looking around the tree, the most annoying bit to implement something
> like this is that things below src/bin/, src/interfaces, src/test and
> src/pl implement their own check and installcheck targets. Given the
> number of these that just boil down to a variant of
>
> check:
> 	$(pg_regress_check)
> 	$(prove_check)
> installcheck:
> 	$(pg_regress_installcheck)
>
> it seems we should lift the REGRESS- and TAP_TESTS-specific logic in
> pgxs.mk up into src/Makefile.global. Which would then make something
> like a global listchecks target easy.

Yeah, some of this stuff has grown a bit haphazardly and maybe needs
some rework.

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com