pgsql: Reorder EPQ work, to fix rowmark related bugs and improve effici

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
10 messages Options
Reply | Threaded
Open this post in threaded view
|

pgsql: Reorder EPQ work, to fix rowmark related bugs and improve effici

Andres Freund
Reorder EPQ work, to fix rowmark related bugs and improve efficiency.

In ad0bda5d24ea I changed the EvalPlanQual machinery to store
substitution tuples in slot, instead of using plain HeapTuples. The
main motivation for that was that using HeapTuples will be inefficient
for future tableams.  But it turns out that that conversion was buggy
for non-locking rowmarks - the wrong tuple descriptor was used to
create the slot.

As a secondary issue 5db6df0c0 changed ExecLockRows() to begin EPQ
earlier, to allow to fetch the locked rows directly into the EPQ
slots, instead of having to copy tuples around. Unfortunately, as Tom
complained, that forces some expensive initialization to happen
earlier.

As a third issue, the test coverage for EPQ was clearly insufficient.

Fixing the first issue is unfortunately not trivial: Non-locked row
marks were fetched at the start of EPQ, and we don't have the type
information for the rowmarks available at that point. While we could
change that, it's not easy. It might be worthwhile to change that at
some point, but to fix this bug, it seems better to delay fetching
non-locking rowmarks when they're actually needed, rather than
eagerly. They're referenced at most once, and in cases where EPQ
fails, might never be referenced. Fetching them when needed also
increases locality a bit.

To be able to fetch rowmarks during execution, rather than
initialization, we need to be able to access the active EPQState, as
that contains necessary data. To do so move EPQ related data from
EState to EPQState, and, only for EStates creates as part of EPQ,
reference the associated EPQState from EState.

To fix the second issue, change EPQ initialization to allow use of
EvalPlanQualSlot() to be used before EvalPlanQualBegin() (but
obviously still requiring EvalPlanQualInit() to have been done).

As these changes made struct EState harder to understand, e.g. by
adding multiple EStates, significantly reorder the members, and add a
lot more comments.

Also add a few more EPQ tests, including one that fails for the first
issue above. More is needed.

Reported-By: yi huang
Author: Andres Freund
Reviewed-By: Tom Lane
Discussion:
    https://postgr.es/m/CAHU7rYZo_C4ULsAx_LAj8az9zqgrD8WDd4hTegDTMM1LMqrBsg@...
    https://postgr.es/m/24530.1562686693@...
Backpatch: 12-, where the EPQ changes were introduced

Branch
------
master

Details
-------
https://git.postgresql.org/pg/commitdiff/27cc7cd2bc8a5e8efc8279bc5d2a8ae42fd8ad33

Modified Files
--------------
src/backend/commands/trigger.c                 |   3 +-
src/backend/executor/execMain.c                | 424 +++++++++++++------------
src/backend/executor/execScan.c                |  66 +++-
src/backend/executor/execUtils.c               |   2 -
src/backend/executor/nodeIndexonlyscan.c       |  24 +-
src/backend/executor/nodeIndexscan.c           |  22 +-
src/backend/executor/nodeLockRows.c            |  14 +-
src/backend/executor/nodeModifyTable.c         |  11 +-
src/include/executor/executor.h                |  10 +-
src/include/nodes/execnodes.h                  |  83 ++++-
src/test/isolation/expected/eval-plan-qual.out | 273 +++++++++++++++-
src/test/isolation/specs/eval-plan-qual.spec   |  46 ++-
12 files changed, 705 insertions(+), 273 deletions(-)

Reply | Threaded
Open this post in threaded view
|

Re: pgsql: Reorder EPQ work, to fix rowmark related bugs and improve effici

Thomas Munro-5
On Tue, Sep 10, 2019 at 12:32 AM Andres Freund <[hidden email]> wrote:
> Reorder EPQ work, to fix rowmark related bugs and improve efficiency.

Did this cause the following failure in eval-plan-qual on REL_12_STABLE?

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prairiedog&dt=2019-09-09%2019%3A08%3A41

--
Thomas Munro
https://enterprisedb.com


Reply | Threaded
Open this post in threaded view
|

Re: pgsql: Reorder EPQ work, to fix rowmark related bugs and improve effici

Andres Freund
Hi,

On September 9, 2019 10:51:26 PM GMT+01:00, Thomas Munro <[hidden email]> wrote:
>On Tue, Sep 10, 2019 at 12:32 AM Andres Freund <[hidden email]>
>wrote:
>> Reorder EPQ work, to fix rowmark related bugs and improve efficiency.
>
>Did this cause the following failure in eval-plan-qual on
>REL_12_STABLE?
>
>https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prairiedog&dt=2019-09-09%2019%3A08%3A41

Argh, yes, probably. But it's just isolationtester I think, not a backed bug. Need to backport the recent fix in master to at least 12.

Tom, do you think we should backpatch both the order fix and the notice improvement, or just the former? And to which version?

Andres

Andres
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.


Reply | Threaded
Open this post in threaded view
|

Re: pgsql: Reorder EPQ work, to fix rowmark related bugs and improve effici

Tom Lane-2
Andres Freund <[hidden email]> writes:
> On September 9, 2019 10:51:26 PM GMT+01:00, Thomas Munro <[hidden email]> wrote:
>> On Tue, Sep 10, 2019 at 12:32 AM Andres Freund <[hidden email]>
>> Did this cause the following failure in eval-plan-qual on
>> REL_12_STABLE?
>> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prairiedog&dt=2019-09-09%2019%3A08%3A41

Just saw that :-(

> Argh, yes, probably. But it's just isolationtester I think, not a backed bug. Need to backport the recent fix in master to at least 12.
> Tom, do you think we should backpatch both the order fix and the notice improvement, or just the former? And to which version?

I think the first question is why is only prairiedog showing this.
Let's not panic until we understand that.

prairiedog looks to be about 20 minutes away from finishing the
isolation tests for master, so that will be a useful data point
as to whether there's any 12/HEAD difference.  I'll dig deeper
as soon as that run finishes.

                        regards, tom lane


Reply | Threaded
Open this post in threaded view
|

Re: pgsql: Reorder EPQ work, to fix rowmark related bugs and improve effici

Andres Freund
Hi,

On September 9, 2019 11:01:28 PM GMT+01:00, Tom Lane <[hidden email]> wrote:

>Andres Freund <[hidden email]> writes:
>> On September 9, 2019 10:51:26 PM GMT+01:00, Thomas Munro
><[hidden email]> wrote:
>>> On Tue, Sep 10, 2019 at 12:32 AM Andres Freund <[hidden email]>
>>> Did this cause the following failure in eval-plan-qual on
>>> REL_12_STABLE?
>>>
>https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prairiedog&dt=2019-09-09%2019%3A08%3A41
>
>Just saw that :-(
>
>> Argh, yes, probably. But it's just isolationtester I think, not a
>backed bug. Need to backport the recent fix in master to at least 12.
>> Tom, do you think we should backpatch both the order fix and the
>notice improvement, or just the former? And to which version?
>
>I think the first question is why is only prairiedog showing this.

We only saw the similar failure on master on a very small number of machines as well. Given The somewhat unusual way packets have to go out to be likely to be problematic, that's not too surprising.


>Let's not panic until we understand that.

If it's the packet ordering thing in isolationtester, then I think there's no need to panic, personally. And the output does look a lot like that.


>prairiedog looks to be about 20 minutes away from finishing the
>isolation tests for master, so that will be a useful data point
>as to whether there's any 12/HEAD difference.  I'll dig deeper
>as soon as that run finishes.

Cool.

Andres
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.


Reply | Threaded
Open this post in threaded view
|

Re: pgsql: Reorder EPQ work, to fix rowmark related bugs and improve effici

Tom Lane-2
Andres Freund <[hidden email]> writes:
> On September 9, 2019 11:01:28 PM GMT+01:00, Tom Lane <[hidden email]> wrote:
>> I think the first question is why is only prairiedog showing this.

> We only saw the similar failure on master on a very small number of machines as well. Given The somewhat unusual way packets have to go out to be likely to be problematic, that's not too surprising.

Actually, now that I look at the prior discussion, it was *only*
prairiedog that we saw that case on.  However, in the last little
while mandrill and bowerbird have come up with failures more or
less like prairiedog's.  Interestingly, bowerbird passed that
test the first time, so it's intermittent there.

I did just see prairiedog get past eval-plan-qual in master,
though it won't finish that run for a good while yet.

It appears to me that this is indeed a case of notice-reporting
timing problems in isolationtester, since these WARNING messages
are handled the same way as notices.  I want to try to reproduce
manually on prairiedog to confirm that, but it seems like a pretty
likely explanation.

If that is the explanation, I agree that there's no need for a
panic response (ie re-wrapping the beta4 tarball).  I doubt that
very many people are going to be testing beta4 on machines that
are slow enough to observe this issue --- and if there are people
testing on slow machines, they probably aren't masochistic enough
to be doing check-world, anyway.

>>> Tom, do you think we should backpatch both the order fix and the notice improvement, or just the former? And to which version?

As for that, now that we realize that this applies to more than
just NOTICEs, I think we should back-patch the code change in
30717637c at least to v11, maybe all the way.  I don't see any
WARNINGs in the isolation expected files before v11, but it
hardly seems unlikely that we might back-patch some future test
that expects those to be printed in a consistent way.

The case for back-patching ebd499282 (allow NOTICEs to print)
is weaker, but it still seems like it might be a hazard for
back-patching test cases if we don't do so.

On balance I'm inclined to back-patch both changes.  Thoughts?

                        regards, tom lane


Reply | Threaded
Open this post in threaded view
|

Re: pgsql: Reorder EPQ work, to fix rowmark related bugs and improve effici

Alvaro Herrera-9
On 2019-Sep-09, Tom Lane wrote:

> As for that, now that we realize that this applies to more than
> just NOTICEs, I think we should back-patch the code change in
> 30717637c at least to v11, maybe all the way.  I don't see any
> WARNINGs in the isolation expected files before v11, but it
> hardly seems unlikely that we might back-patch some future test
> that expects those to be printed in a consistent way.
>
> The case for back-patching ebd499282 (allow NOTICEs to print)
> is weaker, but it still seems like it might be a hazard for
> back-patching test cases if we don't do so.
>
> On balance I'm inclined to back-patch both changes.  Thoughts?

As well as a28e10e82e54, I suppose.  I agree with keeping the tool
similar across branches, if we're going to do this.

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Reply | Threaded
Open this post in threaded view
|

Re: pgsql: Reorder EPQ work, to fix rowmark related bugs and improve effici

Tom Lane-2
Alvaro Herrera <[hidden email]> writes:
> On 2019-Sep-09, Tom Lane wrote:
>> It appears to me that this is indeed a case of notice-reporting
>> timing problems in isolationtester, since these WARNING messages
>> are handled the same way as notices.  I want to try to reproduce
>> manually on prairiedog to confirm that, but it seems like a pretty
>> likely explanation.

Yeah, seems confirmed.  I ran forty cycles of eval-plan-qual
without a failure with current HEAD.  I then reverted 30717637c,
and eval-plan-qual fell over six times out of six, with the same
diffs shown in the buildfarm report.  So that patch is definitely
a prerequisite to making this version of eval-plan-qual stable.

>> On balance I'm inclined to back-patch both changes.  Thoughts?

> As well as a28e10e82e54, I suppose.  I agree with keeping the tool
> similar across branches, if we're going to do this.

Hm, good point.  My first thought was that a28e10e82e54 is just
cosmetic, but it isn't entirely, because it suppresses notice
reports on the control connection.  That means it might actually
be a prerequisite to having stable output with ebd499282 (the
change of client_min_messages).

After reviewing the git log a little more, I'm inclined to think
we should only back-patch this stuff to 9.6, which is where 38f8bdcac
("Modify the isolation tester so that multiple sessions can wait")
and a number of follow-up patches came in.  That was enough of a
quantum jump in flexibility that it'd likely limit our ability to
back-patch tests further than that anyway.  Also I don't think the
patches mentioned here would apply without that ...

                        regards, tom lane


Reply | Threaded
Open this post in threaded view
|

Re: pgsql: Reorder EPQ work, to fix rowmark related bugs and improve effici

Andres Freund
Hi,

On 2019-09-09 22:13:22 -0400, Tom Lane wrote:

> Alvaro Herrera <[hidden email]> writes:
> > On 2019-Sep-09, Tom Lane wrote:
> >> It appears to me that this is indeed a case of notice-reporting
> >> timing problems in isolationtester, since these WARNING messages
> >> are handled the same way as notices.  I want to try to reproduce
> >> manually on prairiedog to confirm that, but it seems like a pretty
> >> likely explanation.
>
> Yeah, seems confirmed.  I ran forty cycles of eval-plan-qual
> without a failure with current HEAD.  I then reverted 30717637c,
> and eval-plan-qual fell over six times out of six, with the same
> diffs shown in the buildfarm report.  So that patch is definitely
> a prerequisite to making this version of eval-plan-qual stable.

Phew. Thanks for testing that.


> >> On balance I'm inclined to back-patch both changes.  Thoughts?
>
> > As well as a28e10e82e54, I suppose.  I agree with keeping the tool
> > similar across branches, if we're going to do this.
>
> Hm, good point.  My first thought was that a28e10e82e54 is just
> cosmetic, but it isn't entirely, because it suppresses notice
> reports on the control connection.  That means it might actually
> be a prerequisite to having stable output with ebd499282 (the
> change of client_min_messages).
>
> After reviewing the git log a little more, I'm inclined to think
> we should only back-patch this stuff to 9.6, which is where 38f8bdcac
> ("Modify the isolation tester so that multiple sessions can wait")
> and a number of follow-up patches came in.  That was enough of a
> quantum jump in flexibility that it'd likely limit our ability to
> back-patch tests further than that anyway.  Also I don't think the
> patches mentioned here would apply without that ...

That seems like a good plan to me.

Greetings,

Andres Freund


Reply | Threaded
Open this post in threaded view
|

Re: pgsql: Reorder EPQ work, to fix rowmark related bugs and improve effici

Tom Lane-2
Andres Freund <[hidden email]> writes:
> On 2019-09-09 22:13:22 -0400, Tom Lane wrote:
>> Alvaro Herrera <[hidden email]> writes:
>>> As well as a28e10e82e54, I suppose.  I agree with keeping the tool
>>> similar across branches, if we're going to do this.

>> Hm, good point.  My first thought was that a28e10e82e54 is just
>> cosmetic, but it isn't entirely, because it suppresses notice
>> reports on the control connection.  That means it might actually
>> be a prerequisite to having stable output with ebd499282 (the
>> change of client_min_messages).
>>
>> After reviewing the git log a little more, I'm inclined to think
>> we should only back-patch this stuff to 9.6, which is where 38f8bdcac
>> ("Modify the isolation tester so that multiple sessions can wait")
>> and a number of follow-up patches came in.  That was enough of a
>> quantum jump in flexibility that it'd likely limit our ability to
>> back-patch tests further than that anyway.  Also I don't think the
>> patches mentioned here would apply without that ...

> That seems like a good plan to me.

Hearing no votes against, I'll go make it so.

                        regards, tom lane