Improve handling of pg_stat_statements handling of bind "IN" variables

classic Classic list List threaded Threaded
7 messages Options
Reply | Threaded
Open this post in threaded view
|

Improve handling of pg_stat_statements handling of bind "IN" variables

Pavel Trukhanov-2
Hi Hackers

I would like to embark on a journey to try to implement this issue I
found on TODO list –
https://www.postgresql.org/message-id/flat/CAM3SWZSpdPB3uErnXWMt3q74y0r%2B84ZsOt2U3HKKes_V7O%2B0Qg%40mail.gmail.com
In short: pgss distinguishes "SELECT * WHERE id IN (1, 2)" and "SELECT
* WHERE id IN (1, 2, 3)" as two separate queryId's, resulting in
separate entries in pgss. While in practice in most cases it should be
considered as the same thing.

Though it was added in TODO by Bruce Momjian some time ago, I
personally have been annoyed by this issue, because we use pgss as a
data source in our monitoring system okmeter.io – so we've been using
some work arounds for this in our system.

The way AFAIU it is suggested to be handled in the previous thread is
to not jumble ArrayExpr recursively and just treat it as "some list of
zero or more nodes".
I have already lurked around related code, but I have stumbled on some
problems with the way I see I can implement this.

So I want to ask for advice and maybe even guidance because I'm new to
PG internals and not a regular in C coding.

1. ArrayExpr
ArrayExpr is used to represent not only "IN" clauses, but also for
example "SELECT ARRAY[1, 2, 3]" and maybe some other cases I didn't
think of.
That brings the question whether "IN (...)" should be handled
separately from actual usage of ARRAY.
Or it is okay for any ARRAY to be jumbled w/o respect to number of
entries in it?
With that, "SELECT ARRAY[1, 2]" becomes undistinguishable from "SELECT
ARRAY[1, 2, 3]" etc in pgss.

I'm asking this because I'm not sure if it would be okay to handle
both cases in the same way.
For example "SELECT ARRAY[1, 2, a] FROM table" and "SELECT ARRAY[b]
FROM table" might end up in the same pgss entry.

While a separate handling for "IN (...)" seems to require lots of
changes – starting from parser (new parser node type) and further.
How should I proceed?

2 Weird arrays - with Consts and Params or const expressions or
different types etc
SELECT * FROM test WHERE a IN (1, $1)
SELECT * FROM test WHERE a IN (1, 3+1)
SELECT * FROM test WHERE a IN (1, 2.1)
SELECT * FROM test WHERE a IN (1.1, 2.1)  etc.
How those should be handled?
Should those be indistinguishable from "IN ($1, $2, $3)" as well?
Or such non realistic usage examples are negligible and no one cares
what happens with them?

3 Tests in pgss.sql/out and Vars
I would like someone to point me in a direction of how could I
implement a test that will query
"SELECT * FROM test WHERE a IN ($1, $2, $3, ...)" with params, not
consts, because I think this is the most common case actually.
And existing tests only check consts in "IN" list.
I don't see a way to implement such a test with the existing test
infrastructure.
Though if that might be considered out of the scope for this TODO, it
would be okay with me.


I would appreciate any feedback.
---
Pavel Trukhanov


Reply | Threaded
Open this post in threaded view
|

Re: Improve handling of pg_stat_statements handling of bind "IN" variables

Dmitry Dolgov
> On Thu, Jun 13, 2019 at 1:38 PM Pavel Trukhanov <[hidden email]> wrote:
>
> Hi Hackers
>
> I would like to embark on a journey to try to implement this issue I
> found on TODO list –
> https://www.postgresql.org/message-id/flat/CAM3SWZSpdPB3uErnXWMt3q74y0r%2B84ZsOt2U3HKKes_V7O%2B0Qg%40mail.gmail.com
> In short: pgss distinguishes "SELECT * WHERE id IN (1, 2)" and "SELECT
> * WHERE id IN (1, 2, 3)" as two separate queryId's, resulting in
> separate entries in pgss. While in practice in most cases it should be
> considered as the same thing.
>
> Though it was added in TODO by Bruce Momjian some time ago, I
> personally have been annoyed by this issue, because we use pgss as a
> data source in our monitoring system okmeter.io – so we've been using
> some work arounds for this in our system.

Thanks! One more time stumbled upon it just now, so I agree it would be nice.

> 1. ArrayExpr
> ArrayExpr is used to represent not only "IN" clauses, but also for
> example "SELECT ARRAY[1, 2, 3]" and maybe some other cases I didn't
> think of.
> That brings the question whether "IN (...)" should be handled
> separately from actual usage of ARRAY.

If I understand correctly, "IN (...)" is reprecented by A_Expr with kind
AEXPR_IN, and only in transformAExprIn converted to ArrayExpr if possible. So
probably it doesn't makes sense to introduce another one.

> For example "SELECT ARRAY[1, 2, a] FROM table" and "SELECT ARRAY[b]
> FROM table" might end up in the same pgss entry.

What are a, b here, parameters?

> 3 Tests in pgss.sql/out and Vars
> I would like someone to point me in a direction of how could I
> implement a test that will query
> "SELECT * FROM test WHERE a IN ($1, $2, $3, ...)" with params, not
> consts

Wouldn't a prepared statement work? It will create an ArrayExpr with Params
inside.


Reply | Threaded
Open this post in threaded view
|

Re: Improve handling of pg_stat_statements handling of bind "IN" variables

Pavel Trukhanov-2
Thanks for your reply

> If I understand correctly, "IN (...)" is reprecented by A_Expr with kind
> AEXPR_IN, and only in transformAExprIn converted to ArrayExpr if possible.
That seems to be correct, yes, thank you.

> So probably it doesn't makes sense to introduce another one.
Though I might've used wrong words to describe my holdback here, what
I meant is that I'll need to create new node type (in primnodes.h?)
for IN-list, that will allow to differentiate it from direct "ARRAY"
usage.
This will require changes to parse_expr.c, execExpr.c, etc, which
seems to be overkill for such issue IMO, hence the question.
Please advise.

> > For example "SELECT ARRAY[1, 2, a] FROM table" and "SELECT ARRAY[b]
> > FROM table" might end up in the same pgss entry.
>
> What are a, b here, parameters?

Here a and b are table columns.
I couldn't come up with other examples of ARRAY usage, would
appreciate any suggestions.


> > 3 Tests in pgss.sql/out and Vars
> > I would like someone to point me in a direction of how could I
> > implement a test that will query
> > "SELECT * FROM test WHERE a IN ($1, $2, $3, ...)" with params, not
> > consts
>
> Wouldn't a prepared statement work? It will create an ArrayExpr with Params
> inside.

Thanks for the tip. It seems to work, at least it looks like it.


Reply | Threaded
Open this post in threaded view
|

Re: Improve handling of pg_stat_statements handling of bind "IN" variables

Tom Lane-2
Pavel Trukhanov <[hidden email]> writes:
> Though I might've used wrong words to describe my holdback here, what
> I meant is that I'll need to create new node type (in primnodes.h?)
> for IN-list, that will allow to differentiate it from direct "ARRAY"
> usage.
> This will require changes to parse_expr.c, execExpr.c, etc, which
> seems to be overkill for such issue IMO, hence the question.

I do not think you need new expression infrastructure.  IMO what's going
on here is that we're indulging in premature optimization in the parser.
It would be better from a structural standpoint if the output of parse
analysis were closer to what the user wrote, and then the business of
separating Vars from Consts and reducing the Consts to an array were
handled in the planner's expression preprocessing phase.

So maybe what you should be thinking about is a preliminary patch that's
mostly in the nature of refactoring, to move that processing to where
it should be.

Of course, life is never quite that simple; there are at least two
issues you'd have to think about.

* The parsing phase is responsible for determining the semantics of
the query, in particular resolving the data types of the IN-list items
and choosing the comparison operators that will be used.  The planner
is not allowed to rethink that.  What I'm not clear about offhand is
whether the existing coding in parse analysis might lead to different
choices of data types/operators than a more straightforward approach
does.  If so, we'd have to decide whether that's okay.

* Postponing this work might make things slower overall, which wouldn't
matter too much for short IN-lists, but you can bet that people who
throw ten-thousand-entry IN-lists at us will notice.  So you'd need to
keep an eye on efficiency and make sure you don't end up repeating
similar processing over and over.

                        regards, tom lane


Reply | Threaded
Open this post in threaded view
|

Re: Improve handling of pg_stat_statements handling of bind "IN" variables

Pavel Trukhanov-2
Thanks for the feedback.

I thought once again about separating IN from ARRAY, with refactoring
etc as Tom suggested, and now I don't think it's worth it to do so.
I've tried to implement that, and besides that it will require to
change things in every part of query processing pipeline, it seems
that most of the times I will have to repeat (copy/paste) for IN case
all the code that now works in for ARRAY. At first I though there will
be simplifications, that will justify such refactoring - i.e. I
thought I could at least drop "multidims" bool that tells ARRAY[] from
ARRAY[ARRAY[]]. But it turns out it's not the case – one can write
something like "x IN (ARRAY[1], ARRAY[1,2])" that will result in
multidim IN-list array.

So I don't think there's actually enough benefit to split those two apart.

Now I want to do this: just add a meta info (basically a bool field)
to the ArrayExpr struct, so on later stages we could tell if that's an
ArrayExpr of an ARRAY or of an IN list. Plus to add ignoring updating
Jumble for expression subtree within IN-list array.

If that approach doesn't seem too bad to anyone, I would like to go
forward and submit a patch – it seems pretty straightforward to
implement that.

Thoughts?

Thank you.
 ---
Pasha Trukhanov


Reply | Threaded
Open this post in threaded view
|

Re: Improve handling of pg_stat_statements handling of bind "IN" variables

Greg Stark


On Sat., Jun. 15, 2019, 12:29 p.m. Pavel Trukhanov, <[hidden email]> wrote:

So I don't think there's actually enough benefit to split those two apart.

Now I want to do this: just add a meta info (basically a bool field)
to the ArrayExpr struct, so on later stages we could tell if that's an
ArrayExpr of an ARRAY or of an IN list. Plus to add ignoring updating
Jumble for expression subtree within IN-list array.

If that approach doesn't seem too bad to anyone, I would like to go
forward and submit a patch – it seems pretty straightforward to
implement that.

So what would this do for someone who explicitly writes:

WHERE col = ANY ?

and pass an array?
Reply | Threaded
Open this post in threaded view
|

Re: Improve handling of pg_stat_statements handling of bind "IN" variables

Greg Stark


On Sat., Jun. 15, 2019, 8:30 p.m. Greg Stark, <[hidden email]> wrote:


So what would this do for someone who explicitly writes:

WHERE col = ANY ?

and pass an array?

Actually thinking about this for two more seconds the question is what it would do with a query like

WHERE col = ANY '1,2,3'::integer[]

Or 

WHERE col = ANY ARRAY[1,2,3]

Whichever the language binding that is failing to do parameterized queries is doing to emulate them.