POC: Better infrastructure for automated testing of concurrency issues

Alexander Korotkov-4
Hackers,

PostgreSQL is a complex multi-process system, and we periodically face complicated concurrency issues. While the postgres community does a great job of investigating and fixing these problems, our ability to reproduce concurrency issues in the source code test suites is limited.

I think we currently have two general ways to reproduce concurrency issues.
1. A text scenario for manual reproduction of the issue, which could involve psql sessions, gdb sessions, etc. A couple of examples are [1] and [2]. This method provides reliable reproduction of concurrency issues. But it's hard to automate, because it requires external instrumentation (a debugger), and it's not stable with respect to postgres code changes (that is, the particular line numbers for breakpoints could change). I think this is why we currently don't have such scenarios among the postgres test suites.
2. Another way is to reproduce the concurrency issue without directly touching the database internals, using pgbench or some other way to simulate the workload (see [3] for example). This way is easier to automate, because it doesn't need external instrumentation and it's not so sensitive to source code changes. But at the same time, this way is not reliable and is resource-consuming.

In view of the above, I'd like to propose a POC patch, which implements new builtin infrastructure for reproducing concurrency issues in automated test suites.  The general idea is so-called "stop events", which are special places in the code where execution can be stopped on some condition.  A stop event also exposes a set of parameters, encapsulated into a jsonb value.  The condition over the stop event parameters is defined using the jsonpath language.

The following functions control the behavior:
 * pg_stopevent_set(stopevent_name, jsonpath_condition): sets the condition for the stop event.  Once the function is executed, all backends that run the given stop event with parameters satisfying the given jsonpath condition will be stopped.
 * pg_stopevent_reset(stopevent_name): resets the stop event.  All backends previously stopped on the given stop event will continue execution.
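
The set/reset semantics described above can be modelled with a small single-process toy in C. This is only a sketch under loud assumptions: it is not the patch's code, the names stopevent_set, stopevent_reach and stopevent_reset are hypothetical, the jsonpath condition is replaced by a plain C predicate over one integer parameter, and "stopping" a backend is reduced to recording its pid in a waiter list.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_WAITERS 8

/* Hypothetical stand-in for a jsonpath condition over the event's params. */
typedef bool (*StopCondition) (int blkno);

typedef struct StopEvent
{
    const char *name;
    StopCondition condition;    /* NULL means no condition is set */
    int         waiters[MAX_WAITERS];   /* pids of "stopped" backends */
    int         nwaiters;
} StopEvent;

/* Models pg_stopevent_set(): install a condition for the event. */
static void
stopevent_set(StopEvent *ev, StopCondition cond)
{
    assert(ev != NULL && cond != NULL);
    ev->condition = cond;
}

/*
 * Models a backend reaching the event.  Returns true if the backend would
 * stop here, i.e. a condition is set and the parameters satisfy it.
 */
static bool
stopevent_reach(StopEvent *ev, int pid, int blkno)
{
    if (ev->condition == NULL || !ev->condition(blkno))
        return false;
    if (ev->nwaiters < MAX_WAITERS)
        ev->waiters[ev->nwaiters++] = pid;
    return true;
}

/* Models pg_stopevent_reset(): clear the condition, release all waiters. */
static int
stopevent_reset(StopEvent *ev)
{
    int         released = ev->nwaiters;

    ev->condition = NULL;
    ev->nwaiters = 0;
    return released;
}

/* Example condition: stop only when the page being visited is block 4. */
static bool
blkno_is_4(int blkno)
{
    return blkno == 4;
}
```

In this model, a backend reaching the event before stopevent_set() or after stopevent_reset() passes straight through, and only parameters matching the condition cause a stop.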

Of course, evaluating stop events incurs CPU overhead.  This is why it's controlled by the enable_stopevents GUC, which is off by default. I expect the overhead with enable_stopevents = off not to be observable.  Even if it were observable, we could enable stop events only via a specific configure parameter.  There is also the trace_stopevents GUC, which traces all stop events to the log at debug2 level.

In the code, stop events are defined using the macro STOPEVENT(event_id, params).  The 'params' should be a function call, and it's evaluated only if stop events are enabled.  pg_isolation_test_session_is_blocked() takes stop events into account, so stop events are suitable for isolation tests.
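
To illustrate why 'params' being a function call matters, here is a minimal C sketch of a macro that evaluates its parameter expression only when a flag is on. It is an assumption-laden illustration, not the macro from the patch: the string event id, build_params(), handle_stopevent() and visit_page() are hypothetical, and the jsonb value is replaced by a plain string.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Stand-in for the enable_stopevents GUC (off by default in the proposal). */
static bool enable_stopevents = false;

/* Counts how often the parameter builder runs, to expose lazy evaluation. */
static int  params_built = 0;

/* Hypothetical parameter builder; the real patch would build a jsonb value. */
static const char *
build_params(int blkno)
{
    static char buf[64];

    params_built++;
    snprintf(buf, sizeof(buf), "{\"blkno\": %d}", blkno);
    return buf;
}

/* Hypothetical handler; a real one would check the condition and wait. */
static void
handle_stopevent(const char *event_id, const char *params)
{
    assert(event_id != NULL && params != NULL);
    printf("stop event %s reached with %s\n", event_id, params);
}

/*
 * The params expression appears only inside the branch, so it is not
 * evaluated at all while enable_stopevents is false.
 */
#define STOPEVENT(event_id, params) \
    do { \
        if (enable_stopevents) \
            handle_stopevent((event_id), (params)); \
    } while (0)

/* Example call site. */
static void
visit_page(int blkno)
{
    STOPEVENT("example_event", build_params(blkno));
}
```

With the flag off, calling visit_page() leaves params_built untouched; the cost at the call site is a single branch on a global variable.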

The POC patch comes with a sample isolation test in src/test/isolation/specs/gin-traverse-deleted-pages.spec, which reproduces the issue described in [2] (a gin scan steps onto a page concurrently deleted by vacuum).

From my point of view, stop events would open great possibilities for improving coverage of concurrency issues.  They allow us to reliably test concurrency issues in both isolation and TAP test suites.  And such test suites don't take extraordinary resources to execute.  The main cost here is maintaining a set of stop events across the codebase.  But I think this cost is justified by much better coverage of concurrency issues.

Feedback is welcome.

Links.
1. https://www.postgresql.org/message-id/4E1DE580.1090905%40enterprisedb.com
2. https://www.postgresql.org/message-id/CAPpHfdvMvsw-NcE5bRS7R1BbvA4BxoDnVVjkXC5W0Czvy9LVrg%40mail.gmail.com
3. https://www.postgresql.org/message-id/BF9B38A4-2BFF-46E8-BA87-A2D00A8047A6%40hintbits.com

------
Regards,
Alexander Korotkov

Attachment: 0001-Stopevents-v1.patch (37K)

Re: POC: Better infrastructure for automated testing of concurrency issues

Álvaro Herrera
On 2020-Nov-25, Alexander Korotkov wrote:

> In the view of above, I'd like to propose a POC patch, which implements new
> builtin infrastructure for reproduction of concurrency issues in automated
> test suites.  The general idea is so-called "stop events", which are
> special places in the code, where the execution could be stopped on some
> condition.  Stop event also exposes a set of parameters, encapsulated into
> jsonb value.  The condition over stop event parameters is defined using
> jsonpath language.

+1 for the idea.  I agree we have a need for something in this area;
there are *many* scenarios currently untested because of the lack of
what you call "stop points".  I don't know if jsonpath is the best way
to implement it, but at least it is readily available and it seems a
decent way to go about it.



Re: POC: Better infrastructure for automated testing of concurrency issues

Peter Geoghegan-4
In reply to this post by Alexander Korotkov-4
On Wed, Nov 25, 2020 at 6:11 AM Alexander Korotkov <[hidden email]> wrote:
> While the postgres community does a great job on investigating and fixing the problems, our ability to reproduce concurrency issues in the source code test suites is limited.

+1. This seems really cool.

> For sure, evaluation of stop events causes a CPU overhead.  This is why it's controlled by enable_stopevents GUC, which is off by default. I expect the overhead with enable_stopevents = off shouldn't be observable.  Even if it would be observable, we could enable stop events only by specific configure parameter.  There is also trace_stopevents GUC, which traces all the stop events to the log with debug2 level.

But why even risk adding noticeable overhead when "enable_stopevents =
off"? Even if it's a very small risk? We can still get most of the
benefit by enabling it only on certain builds and buildfarm animals.
It will be a bit annoying to not have stop events enabled in all
builds, but it avoids the problem of even having to think about the
overhead, now or in the future. I think that that trade-off is a good
one. Even if the performance trade-off is judged perfectly for the
first few tests you add, what are the chances that it will stay that
way as the infrastructure is used in more and more places? What if you
need to add a test to the back branches? Since we don't anticipate any
direct benefit for users (right?), I think that this question is
simple.

I am not arguing for not enabling stop events on standard builds
because the infrastructure isn't useful -- it's *very* useful. Useful
enough that it would be nice to be able to use it extensively without
really thinking about the performance hit each time. I know that I'll
be *far* more likely to use it if I don't have to waste time and
energy on that aspect every single time.

--
Peter Geoghegan


Re: POC: Better infrastructure for automated testing of concurrency issues

Alexander Korotkov-4
In reply to this post by Álvaro Herrera
On Fri, Dec 4, 2020 at 9:29 PM Alvaro Herrera <[hidden email]> wrote:

> On 2020-Nov-25, Alexander Korotkov wrote:
> > In the view of above, I'd like to propose a POC patch, which implements new
> > builtin infrastructure for reproduction of concurrency issues in automated
> > test suites.  The general idea is so-called "stop events", which are
> > special places in the code, where the execution could be stopped on some
> > condition.  Stop event also exposes a set of parameters, encapsulated into
> > jsonb value.  The condition over stop event parameters is defined using
> > jsonpath language.
>
> +1 for the idea.  I agree we have a need for something on this area;
> there are *many* scenarios currently untested because of the lack of
> what you call "stop points".  I don't know if jsonpath is the best way
> to implement it, but at least it is readily available and it seems a
> decent way to go at it.

Thank you for your feedback.  I agree with you regarding jsonpath.  My
initial idea was to use executor expressions.  But executor
expressions require serialization/deserialization, while stop events
need to work cross-database or even with processes not connected to
any database (such as the checkpointer, background writer, etc.).  That
leads to difficulties, while jsonpath appears to be very easy to use
for this use case.

------
Regards,
Alexander Korotkov


Re: POC: Better infrastructure for automated testing of concurrency issues

Alexander Korotkov-4
In reply to this post by Peter Geoghegan-4
On Fri, Dec 4, 2020 at 9:57 PM Peter Geoghegan <[hidden email]> wrote:

> On Wed, Nov 25, 2020 at 6:11 AM Alexander Korotkov <[hidden email]> wrote:
> > While the postgres community does a great job on investigating and fixing the problems, our ability to reproduce concurrency issues in the source code test suites is limited.
>
> +1. This seems really cool.
>
> > For sure, evaluation of stop events causes a CPU overhead.  This is why it's controlled by enable_stopevents GUC, which is off by default. I expect the overhead with enable_stopevents = off shouldn't be observable.  Even if it would be observable, we could enable stop events only by specific configure parameter.  There is also trace_stopevents GUC, which traces all the stop events to the log with debug2 level.
>
> But why even risk adding noticeable overhead when "enable_stopevents =
> off "? Even if it's a very small risk? We can still get most of the
> benefit by enabling it only on certain builds and buildfarm animals.
> It will be a bit annoying to not have stop events enabled in all
> builds, but it avoids the problem of even having to think about the
> overhead, now or in the future. I think that that trade-off is a good
> one. Even if the performance trade-off is judged perfectly for the
> first few tests you add, what are the chances that it will stay that
> way as the infrastructure is used in more and more places? What if you
> need to add a test to the back branches? Since we don't anticipate any
> direct benefit for users (right?), I think that this question is
> simple.
>
> I am not arguing for not enabling stop events on standard builds
> because the infrastructure isn't useful -- it's *very* useful. Useful
> enough that it would be nice to be able to use it extensively without
> really thinking about the performance hit each time. I know that I'll
> be *far* more likely to use it if I don't have to waste time and
> energy on that aspect every single time.

Thank you for your feedback.  We probably can't think everything
through in advance.  We can start with a configure option enabled for
developers and some buildfarm animals.  That causes no risk of overhead
in standard builds.  After some time, we may reconsider enabling stop
events even in standard builds if we see they cause no regression.

------
Regards,
Alexander Korotkov


Re: POC: Better infrastructure for automated testing of concurrency issues

Peter Geoghegan-4
On Fri, Dec 4, 2020 at 1:20 PM Alexander Korotkov <[hidden email]> wrote:
> Thank you for your feedback.  We probably can't think over everything
> in advance.  We can start with configure option enabled for developers
> and some buildfarm animals.  That causes no risk of overhead in
> standard builds.  After some time, we may reconsider to enable stop
> events even in standard build if we see they cause no regression.

I'll start using the configure option for debug builds only as soon as
possible. It will easily work with my existing workflow.

I don't know about anyone else, but for me this is only a very small
inconvenience. Whereas the convenience of not having to think about
the performance impact seems huge.

--
Peter Geoghegan


Re: POC: Better infrastructure for automated testing of concurrency issues

Craig Ringer-5
In reply to this post by Alexander Korotkov-4
On Wed, 25 Nov 2020 at 22:11, Alexander Korotkov <[hidden email]> wrote:
> Hackers,
>
> PostgreSQL is a complex multi-process system, and we are periodically faced with complicated concurrency issues. While the postgres community does a great job on investigating and fixing the problems, our ability to reproduce concurrency issues in the source code test suites is limited.
>
> I think we currently have two general ways to reproduce the concurrency issues.
> 1. A text scenario for manual reproduction of the issue, which could involve psql sessions, gdb sessions etc. Couple of examples are [1] and [2]. This method provides reliable reproduction of concurrency issues. But it's  hard to automate, because it requires external instrumentation (debugger) and it's not stable in terms of postgres code changes (that is particular line numbers for breakpoints could be changed). I think this is why we currently don't have such scenarios among postgres test suites.
> 2. Another way is to reproduce the concurrency issue without directly touching the database internals using pgbench or other way to simulate the workload (see [3] for example). This way is easier to automate, because it doesn't need external instrumentation and it's not so sensitive to source code changes. But at the same time this way is not reliable and is resource-consuming.

Agreed.

For a useful but limited set of cases there's (3) the isolation tester and pg_isolation_regress. But IIRC the patches to teach it to support multiple upstream nodes never got in, so it's essentially useless for any replication related testing.

There's also (4), write a TAP test that uses concurrent psql sessions via IPC::Run. Then play games with heavyweight or advisory lock waits to order events, use instance starts/stops, change ports or connstrings to simulate network issues, use SIGSTOP/SIGCONTs, add src/test/modules extensions that inject faults or provide custom blocking wait functions for the event you want, etc. I've done that more than I'd care to, and I don't want to do it any more than I have to in future.

In some cases I've gone further and written tests that use systemtap in "guru" mode (read/write, with embedded C enabled) to twiddle the memory of the target process(es) when a probe is hit, e.g. to modify a function argument or return value or inject a fault. Not exactly portable or convenient, though very powerful.


> In the view of above, I'd like to propose a POC patch, which implements new builtin infrastructure for reproduction of concurrency issues in automated test suites.  The general idea is so-called "stop events", which are special places in the code, where the execution could be stopped on some condition.  Stop event also exposes a set of parameters, encapsulated into jsonb value.  The condition over stop event parameters is defined using jsonpath language.

The patched PostgreSQL used by 2ndQuadrant internally has a feature called PROBE_POINT()s that is somewhat akin to this. Since it's not a customer facing feature I'm sure I can discuss it here, though I'll need to seek permission before I can show code.

TL;DR: PROBE_POINT()s let you inject ERRORs, sleeps, crashes, and various other behaviour at points in the code marked by name, using GUCs, hooks loaded from test extensions, or even systemtap scripts to control what fires and when. Performance impact is essentially zero when no probes are currently enabled at runtime, so they're fine for cassert builds.

Details:

A PROBE_POINT() is a macro that works as a marker, a bit like a TRACE_POSTGRESQL_.... dtrace macro. But instead of the super lightweight tracepoint that SDT marker points emit, a PROBE_POINT tests an unlikely(probe_points_enabled) flag, and if true, it prepares arguments for the probe handler: A probe name, probe action, sleep duration, and a hit counter.

The default probe action and sleep duration come from GUCs. So your control of the probe is limited to the granularity you can easily manage GUCs at. That's often sufficient.
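
Since the PROBE_POINT() code itself has not been shared, the following is only a guess at the general shape of such a macro, written as a self-contained C sketch. Everything here is hypothetical: the ProbeAction values, the variable names standing in for GUCs, the handler signature, and the choice to keep one hit counter per call site that only advances while probes are enabled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Branch hint; falls back to a plain test on compilers without the builtin. */
#if defined(__GNUC__)
#define unlikely(x) __builtin_expect((x) != 0, 0)
#else
#define unlikely(x) ((x) != 0)
#endif

typedef enum ProbeAction
{
    PROBE_NONE,
    PROBE_SLEEP,
    PROBE_ERROR
} ProbeAction;

/* Stand-ins for the GUCs that enable probes and supply the defaults. */
static bool probe_points_enabled = false;
static ProbeAction probe_default_action = PROBE_NONE;
static int  probe_default_sleep_ms = 0;

/* What the handler last saw, recorded so the behaviour can be inspected. */
static const char *last_probe_name = NULL;
static ProbeAction last_action = PROBE_NONE;
static int  last_sleep_ms = 0;
static unsigned long last_hit_count = 0;

/* Hypothetical handler; a real one would sleep, raise an error, etc. */
static void
probe_point_handler(const char *name, ProbeAction action,
                    int sleep_ms, unsigned long hit_count)
{
    assert(name != NULL);
    last_probe_name = name;
    last_action = action;
    last_sleep_ms = sleep_ms;
    last_hit_count = hit_count;
}

/* The only cost when probes are disabled is the flag test. */
#define PROBE_POINT(name) \
    do { \
        static unsigned long probe_hits_ = 0; \
        if (unlikely(probe_points_enabled)) \
            probe_point_handler((name), probe_default_action, \
                                probe_default_sleep_ms, ++probe_hits_); \
    } while (0)

/* Example call site. */
static void
commit_step(void)
{
    PROBE_POINT("before_commit");
}
```

The handler arguments are prepared only inside the unlikely() branch, which matches the description that skipping a probe is nearly free when nothing is enabled.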

But if you want finer control for something, there are two ways to achieve it.

After picking the default arguments to the handler, the probe point checks for a hook. If defined, it calls it with the probe point name and pointers to the action and sleep duration values, so the hook function can modify them per probe-point hit. That way you can control probes from src/test/modules extensions or your own test extensions, with the probe point name as an argument and the action and sleep duration as out-params, and with access to any global state, custom GUCs you define in your test extension, etc. That's usually enough to target a probe very specifically, but it's a bit of a hassle.
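
The hook mechanism described above follows the usual PostgreSQL pattern of a global function pointer that extensions may set. A minimal sketch of that pattern, with hypothetical names throughout (probe_point_hook, resolve_probe, my_test_hook) and no claim to match the unpublished code:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef enum ProbeAction
{
    PROBE_NONE,
    PROBE_SLEEP,
    PROBE_ERROR
} ProbeAction;

/* Hook signature: name in, action and sleep duration as in/out params. */
typedef void (*probe_point_hook_type) (const char *name,
                                       ProbeAction *action,
                                       int *sleep_ms);

/* NULL until a test extension installs something. */
static probe_point_hook_type probe_point_hook = NULL;

/* Stand-ins for the GUC-provided defaults. */
static ProbeAction probe_default_action = PROBE_NONE;
static int  probe_default_sleep_ms = 0;

typedef struct ProbeDecision
{
    ProbeAction action;
    int         sleep_ms;
} ProbeDecision;

/* Start from the defaults, then let the hook adjust them for this hit. */
static ProbeDecision
resolve_probe(const char *name)
{
    ProbeDecision d = {probe_default_action, probe_default_sleep_ms};

    assert(name != NULL);
    if (probe_point_hook != NULL)
        probe_point_hook(name, &d.action, &d.sleep_ms);
    return d;
}

/* Example hook a test extension might install: target one probe by name. */
static void
my_test_hook(const char *name, ProbeAction *action, int *sleep_ms)
{
    if (strcmp(name, "before_commit") == 0)
    {
        *action = PROBE_SLEEP;
        *sleep_ms = 250;
    }
}
```

A test extension would assign probe_point_hook = my_test_hook when it loads; probes with other names keep the defaults.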

Another option is to use a systemtap script. You can write your code in systemtap with its language. When the systemtap marker for a probe point event fires, decide if it's the one you want and twiddle the target process variables that store the probe action and sleep duration from the systemtap script. I find this much more convenient for day to day testing, but because of systemtap portability challenges I don't find it as useful for writing regression tests for repeat use.

A PROBE_POINT() actually emits dtrace/perf SDT markers if postgres was compiled with --enable-dtrace too, so you can use them with perf, systemtap, bpftrace or whatever for read-only use. Including optional arguments to the probe point. Exactly as if it was a TRACE_POSTGRESQL_foo point, but without needing to hack probes.d for each one.

The PROBE_POINT() implementation can fake signal delivery with signal actions, which has been handy too.

I also have a version of the code that takes arguments to the PROBE_POINT() and passes them to the handler function as a va_list too, with a compile-time-generated array of argument types inferred by C11 _Generic as the first argument. So your handler function can be passed probe-point-specific contextual info like the current xid being committed or whatever. This isn't currently deployed.
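
One way the _Generic idea could work is sketched below in C11. This is an illustration under assumptions, not the code referred to above: the ProbeArgType tags, the fixed two-argument PROBE_POINT2 macro and the handler are all hypothetical, and only a few argument types are covered. An argument of any other type fails to compile rather than being passed with a wrong tag.

```c
#include <assert.h>
#include <stdarg.h>
#include <stddef.h>

typedef enum ProbeArgType
{
    ARG_END = 0,                /* terminates the type array */
    ARG_INT,
    ARG_UINT,
    ARG_DOUBLE,
    ARG_STRING,                 /* char * */
    ARG_CSTRING                 /* const char * */
} ProbeArgType;

/* Map an expression's static type to a tag, without evaluating it. */
#define PROBE_ARG_TYPE(x) _Generic((x), \
        int: ARG_INT, \
        unsigned int: ARG_UINT, \
        double: ARG_DOUBLE, \
        char *: ARG_STRING, \
        const char *: ARG_CSTRING)

/* Last values seen by the handler, kept for inspection. */
static int  last_int;
static unsigned int last_uint;
static double last_double;
static const char *last_string;
static int  args_seen;

/* Walk the va_list using the type array to choose each va_arg type. */
static void
probe_handler(const char *name, const ProbeArgType *types,...)
{
    va_list     ap;

    assert(name != NULL && types != NULL);
    args_seen = 0;
    va_start(ap, types);
    for (size_t i = 0; types[i] != ARG_END; i++)
    {
        switch (types[i])
        {
            case ARG_INT:     last_int = va_arg(ap, int); break;
            case ARG_UINT:    last_uint = va_arg(ap, unsigned int); break;
            case ARG_DOUBLE:  last_double = va_arg(ap, double); break;
            case ARG_STRING:  last_string = va_arg(ap, char *); break;
            case ARG_CSTRING: last_string = va_arg(ap, const char *); break;
            case ARG_END:     break;
        }
        args_seen++;
    }
    va_end(ap);
}

/* Two-argument probe: the type array is derived from the argument types. */
#define PROBE_POINT2(name, a, b) \
    do { \
        const ProbeArgType probe_types_[] = \
            { PROBE_ARG_TYPE(a), PROBE_ARG_TYPE(b), ARG_END }; \
        probe_handler((name), probe_types_, (a), (b)); \
    } while (0)
```

A call such as PROBE_POINT2("xact_commit", xid_as_int, proc_name) would then hand the handler both the values and enough type information to read them back safely.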

The advantage of the PROBE_POINT() approach has been that it's generally very cheap to check whether a probe point should fire, and it's basically free to skip them if there are no probe points enabled right now. If we hashed the probe point names for the initial comparisons it'd be faster still.
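
The hashing idea mentioned above could look roughly like the following C sketch: compare a precomputed hash first and fall back to strcmp() only on a hash match. The choice of 32-bit FNV-1a and the names probe_name_hash, ActiveProbe and probe_matches are assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* 32-bit FNV-1a over the bytes of a NUL-terminated name. */
static uint32_t
probe_name_hash(const char *name)
{
    uint32_t    h = 2166136261u;    /* FNV offset basis */

    assert(name != NULL);
    for (const unsigned char *p = (const unsigned char *) name; *p != '\0'; p++)
    {
        h ^= *p;
        h *= 16777619u;             /* FNV prime */
    }
    return h;
}

/* An enabled probe, with its hash computed once when it is enabled. */
typedef struct ActiveProbe
{
    uint32_t    hash;
    const char *name;
} ActiveProbe;

/*
 * Cheap integer comparison first; the string comparison runs only when the
 * hashes agree, which guards against hash collisions.
 */
static bool
probe_matches(const ActiveProbe *active, const char *name, uint32_t name_hash)
{
    if (active->hash != name_hash)
        return false;
    return strcmp(active->name, name) == 0;
}
```

In a real macro the call site's hash would ideally be computed once (for example in a static variable) rather than on every hit.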

I will seek approval to share the relevant code.

> Following functions control behavior –
>  * pg_stopevent_set(stopevent_name, jsonpath_conditon) – sets condition for the stop event.  Once the function is executed, all the backends, which run a given stop event with parameters satisfying the given jsonpath condition, will be stopped.
>  * pg_stopevent_reset(stopevent_name) – resets stop events.  All the backends previously stopped on a given stop event will continue the execution.

Does that offer any way to affect early startup, late shutdown, servers in warm standby, etc? Or for that matter, any way to manipulate bgworkers and auxprocs or the postmaster itself, things you can't run a query on directly?

Also, based on my experience using PROBE_POINT()s I would suggest that in addition to a stop or start "event", it's desirable to be able to elog(PANIC), elog(ERROR), elog(LOG), and/or sleep() for a certain duration. I've found all to be extremely useful.

> In the code stop events are defined using macro STOPEVENT(event_id, params).  The 'params' should be a function call, and it's evaluated only if stop events are enabled.  pg_isolation_test_session_is_blocked() takes stop events into account.

Oooh, that I like.

PROBE_POINT()s don't do that, and it's annoying.

Re: POC: Better infrastructure for automated testing of concurrency issues

Alexander Korotkov-4
Hi!

On Mon, Dec 7, 2020 at 9:10 AM Craig Ringer
<[hidden email]> wrote:

> On Wed, 25 Nov 2020 at 22:11, Alexander Korotkov <[hidden email]> wrote:
>> PostgreSQL is a complex multi-process system, and we are periodically faced with complicated concurrency issues. While the postgres community does a great job on investigating and fixing the problems, our ability to reproduce concurrency issues in the source code test suites is limited.
>>
>> I think we currently have two general ways to reproduce the concurrency issues.
>> 1. A text scenario for manual reproduction of the issue, which could involve psql sessions, gdb sessions etc. Couple of examples are [1] and [2]. This method provides reliable reproduction of concurrency issues. But it's  hard to automate, because it requires external instrumentation (debugger) and it's not stable in terms of postgres code changes (that is particular line numbers for breakpoints could be changed). I think this is why we currently don't have such scenarios among postgres test suites.
>> 2. Another way is to reproduce the concurrency issue without directly touching the database internals using pgbench or other way to simulate the workload (see [3] for example). This way is easier to automate, because it doesn't need external instrumentation and it's not so sensitive to source code changes. But at the same time this way is not reliable and is resource-consuming.
>
> Agreed.
>
> For a useful but limited set of cases there's (3) the isolation tester and pg_isolation_regress. But IIRC the patches to teach it to support multiple upstream nodes never got in, so it's essentially useless for any replication related testing.
>
> There's also (4), write a TAP test that uses concurrent psql sessions via IPC::Run. Then play games with heavyweight or advisory lock waits to order events, use instance starts/stops, change ports or connstrings to simulate network issues, use SIGSTOP/SIGCONTs, add src/test/modules extensions that inject faults or provide custom blocking wait functions for the event you want, etc. I've done that more than I'd care to, and I don't want to do it any more than I have to in future.

Sure, there are the isolation tester and TAP tests.  I just meant the
scenarios that we can't reliably reproduce using either isolation
tests or TAP tests.  Sorry for the confusion.

> In some cases I've gone further and written tests that use systemtap in "guru" mode (read/write, with embedded C enabled) to twiddle the memory of the target process(es) when a probe is hit, e.g. to modify a function argument or return value or inject a fault. Not exactly portable or convenient, though very powerful.

Exactly, systemtap is good, but we need something more portable and
convenient for builtin test suites.

>> In the view of above, I'd like to propose a POC patch, which implements new builtin infrastructure for reproduction of concurrency issues in automated test suites.  The general idea is so-called "stop events", which are special places in the code, where the execution could be stopped on some condition.  Stop event also exposes a set of parameters, encapsulated into jsonb value.  The condition over stop event parameters is defined using jsonpath language.
>
>
> The patched PostgreSQL used by 2ndQuadrant internally has a feature called PROBE_POINT()s that is somewhat akin to this. Since it's not a customer facing feature I'm sure I can discuss it here, though I'll need to seek permission before I can show code.
>
> TL;DR: PROBE_POINT()s let you inject ERRORs, sleeps, crashes, and various other behaviour at points in the code marked by name, using GUCs, hooks loaded from test extensions, or even systemtap scripts to control what fires and when. Performance impact is essentially zero when no probes are currently enabled at runtime, so they're fine for cassert builds.
>
> Details:
>
> A PROBE_POINT() is a macro that works as a marker, a bit like a TRACE_POSTGRESQL_.... dtrace macro. But instead of the super lightweight tracepoint that SDT marker points emit, a PROBE_POINT tests an unlikely(probe_points_enabled) flag, and if true, it prepares arguments for the probe handler: A probe name, probe action, sleep duration, and a hit counter.
>
> The default probe action and sleep duration come from GUCs. So your control of the probe is limited to the granularity you can easily manage GUCs at. That's often sufficient
>
> But if you want finer control for something, there are two ways to achieve it.
>
> After picking the default arguments to the handler, the probe point checks for a hook. If defined, it calls it with the probe point name and pointers to the action and sleep duration values, so the hook function can modify them per probe-point hit. That way you can use in src/test/modules extensions or your own test extensions first, with the probe point name as an argument and the action and sleep duration as out-params, as well as any accessible global state, custom GUCs you define in your test extension, etc. That's usually enough to target a probe very specifically but it's a bit of a hassle.
>
> Another option is to use a systemtap script. You can write your code in systemtap with its language. When the systemtap marker for a probe point event fires, decide if it's the one you want and twiddle the target process variables that store the probe action and sleep duration from the systemtap script. I find this much more convenient for day to day testing, but because of systemtap portability challenges I don't find it as useful for writing regression tests for repeat use.
>
> A PROBE_POINT() actually emits dtrace/perf SDT markers if postgres was compiled with --enable-dtrace too, so you can use them with perf, systemtap, bpftrace or whatever for read-only use. Including optional arguments to the probe point. Exactly as if it was a TRACE_POSTGRESQL_foo point, but without needing to hack probes.d for each one.
>
> The PROBE_POINT() implementation can fake signal delivery with signal actions, which has been handy too.
>
> I also have a version of the code that takes arguments to the PROBE_POINT() and passes them to the handler function as a va_list too, with a compile-time-generated array of argument types inferred by C11 _Generic as the first argument. So your handler function can be passed probe-point-specific contextual info like the current xid being committed or whatever. This isn't currently deployed.
>
> The advantage of the PROBE_POINT() approach has been that it's generally very cheap to check whether a probe point should fire, and it's basically free to skip them if there are no probe points enabled right now. If we hashed the probe point names for the initial comparisons it'd be faster still.
>
> I will seek approval to share the relevant code.

It's nice to know that you've also worked in this direction.  I was a
bit surprised when I didn't find relevant patches published in the
mailing lists.  I hope you will be able to share the code; it would
be very nice to see.

>> Following functions control behavior –
>>  * pg_stopevent_set(stopevent_name, jsonpath_conditon) – sets condition for the stop event.  Once the function is executed, all the backends, which run a given stop event with parameters satisfying the given jsonpath condition, will be stopped.
>>  * pg_stopevent_reset(stopevent_name) – resets stop events.  All the backends previously stopped on a given stop event will continue the execution.
>
>
> Does that offer any way to affect early startup, late shutdown, servers in warm standby, etc? Or for that matter, any way to manipulate bgworkers and auxprocs or the postmaster itself, things you can't run a query on directly?

With the current version of the patch, you can manipulate bgworkers
and auxprocs as soon as they're connected to shmem.  We can run
queries from another backend, and the setting affects the whole
cluster.  I'm planning to add the ability to access the process
information from the jsonpath condition.  So, we would be able to
choose which process to stop on the stop event.

Early startup, late shutdown, and servers in warm standby are not
supported yet.  I think this could be done using GUCs and hooks +
custom extensions, in a way similar to what you describe for
PROBE_POINT().

Also, I don't think we need to support everything at once.  It would
be nice to get something simple in, as long as we have a clear roadmap
of how to add the rest of the features later.

> Also, based on my experience using PROBE_POINT()s I would suggest that in addition to a stop or start "event", it's desirable to be able to elog(PANIC), elog(ERROR), elog(LOG), and/or sleep() for a certain duration. I've found all to be extremely useful.
>
>> In the code stop events are defined using macro STOPEVENT(event_id, params).  The 'params' should be a function call, and it's evaluated only if stop events are enabled.  pg_isolation_test_session_is_blocked() takes stop events into account.
>
>
> Oooh, that I like.
>
> PROBE_POINT()s don't do that, and it's annoying.

Thank you for your feedback.  I'm looking forward to seeing the
PROBE_POINT() work if you can publish it.

------
Regards,
Alexander Korotkov


Re: POC: Better infrastructure for automated testing of concurrency issues

Andrey Borodin-2
In reply to this post by Alexander Korotkov-4
Hi Alexander!

> On 25 Nov 2020, at 19:10, Alexander Korotkov <[hidden email]> wrote:
>
> In the code stop events are defined using macro STOPEVENT(event_id, params).  The 'params' should be a function call, and it's evaluated only if stop events are enabled.  pg_isolation_test_session_is_blocked() takes stop events into account.  So, stop events are suitable for isolation tests.

Thanks for this infrastructure. It looks like a really nice way to increase test coverage of the most difficult things.

Can we also somehow prove that a test was deterministic? I.e., expect a certain number of blocked backends (if known) or something like that.
I'm not really sure it's useful, just an idea.

Thanks!

Best regards, Andrey Borodin.

Re: POC: Better infrastructure for automated testing of concurrency issues

Alexander Korotkov-4
On Tue, Dec 8, 2020 at 1:26 PM Andrey Borodin <[hidden email]> wrote:
> > On 25 Nov 2020, at 19:10, Alexander Korotkov <[hidden email]> wrote:
> >
> > In the code stop events are defined using macro STOPEVENT(event_id, params).  The 'params' should be a function call, and it's evaluated only if stop events are enabled.  pg_isolation_test_session_is_blocked() takes stop events into account.  So, stop events are suitable for isolation tests.
>
> Thanks for this infrastructure. Looks like a really nice way to increase test coverage of most difficult things.
>
> Can we also somehow prove that test was deterministic? I.e. expect number of blocked backends (if known) or something like that.
> I'm not really sure it's useful, just an idea.

Thank you for your feedback!

I forgot to mention that the patch comes with a pg_stopevents()
function, which returns a rowset (stopevent text, condition jsonpath,
waiters int[]).  Waiters is an array of pids of waiting processes.

Additionally, the isolation tester checks whether a particular backend
is waiting using pg_isolation_test_session_is_blocked(), which works
with stop events too.

------
Regards,
Alexander Korotkov


Re: POC: Better infrastructure for automated testing of concurrency issues

Peter Geoghegan-4
On Tue, Dec 8, 2020 at 2:42 AM Alexander Korotkov <[hidden email]> wrote:
> Thank you for your feedback!

It would be nice to use this patch to test things that are important
but untested inside vacuumlazy.c, such as the rare
HEAPTUPLE_DEAD/tupgone case (grep for "Ordinarily, DEAD tuples would
have been removed by..."). Same is true of the closely related
heap_prepare_freeze_tuple()/heap_tuple_needs_freeze() code.

--
Peter Geoghegan