WIP: cross column correlation ...

classic Classic list List threaded Threaded
72 messages Options
1234
Reply | Threaded
Open this post in threaded view
|

WIP: cross column correlation ...

PostgreSQL - Hans-Jürgen Schönig
hello everbody,

we have spent some time in finally attacking cross column correlation. as this is an issue which keeps bugging us for a couple of applications (some years). this is a WIP patch which can do:

        special cross column correlation specific syntax:
       
                CREATE CROSS COLUMN STATISTICS ON tablename (field, ...);
                DROP CROSS COLUMN STATISTICS ON tablename (field, ...);

we use specific syntax because we simply cannot keep track of all possible correlations in the DB so the admi can take care of things explicitly. some distant day somebody might want to write a mechanism to derive the desired stats automatically but this is beyond the scope of our project for now.

as far as the patch is concerned:
it is patched nicely into clauselist_selectivity(), but has some rough edges, even when a cross-col stat is found, the single col selectivities are still counted ( = lovering the selectivity even more), this is a TODO.
this patch adds the grammar and the start of planner integration with a static selectivity value for now, the previous discussion about cross-column statistics can be continued and perhaps comes to fruition soon.

how does it work? we try to find suitable statistics for an arbitrary length list of conditions so that the planner can use it directly rather than multiplying all the selectivities. this should make estimates a lot more precise.
the current approach can be extended to work with expressions and well as "straight" conditions.

goal: to make cross column correlation work for 9.2 ...

the purpose of this mail is mostly to get the race for a patch going and to see if the approach as such is reasonable / feasible.

        many thanks,

                hans




       



--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt, Austria
Web: http://www.postgresql-support.de



--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

cross-column-v5.patch (39K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: WIP: cross column correlation ...

Robert Haas
2011/2/22 PostgreSQL - Hans-Jürgen Schönig <[hidden email]>:
> how does it work? we try to find suitable statistics for an arbitrary length list of conditions so that the planner can use it directly rather than multiplying all the selectivities. this should make estimates a lot more precise.
> the current approach can be extended to work with expressions and well as "straight" conditions.

/me prepares to go down in flames.

Personally, I think the first thing we ought to do is add a real, bona
fide planner hint to override the selectivity calculation manually,
maybe something like this:

WHERE (x < 5 AND y = 1) SELECTIVITY (0.1);

Then, having provided a method for the DBA to extinguish the raging
flames of searing agony which are consuming them while a crocodile
chews off their leg and their boss asks them why they didn't use
Oracle, we can continue bikeshedding about the best way of fixing this
problem in a more user-transparent fashion.

As to the approach you've proposed here, I'm not sure I understand
what this is actually doing.  Selectivity estimates aren't made
directly for predicates; they're made based on MCV and histogram
information for predicates.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|

Re: WIP: cross column correlation ...

Tom Lane-2
Robert Haas <[hidden email]> writes:
> /me prepares to go down in flames.

> Personally, I think the first thing we ought to do is add a real, bona
> fide planner hint to override the selectivity calculation manually,
> maybe something like this:

> WHERE (x < 5 AND y = 1) SELECTIVITY (0.1);

One of the criteria we've always had for a suitable hint-or-whatever-
you-call-it design is that it *not* involve decorating the queries.
There are a number of reasons for that, some of the killer ones being

(1) People frequently *can't* adjust their queries that way, because
they're coming out of some broken query generator or other.  (Crappy
query generators are of course one of the prime reasons for
poor-performing queries in the first place, so you can't write this off
as not being a key use case.)

(2) Anything we do like that, we'd be locked into supporting forever,
even after we think of better solutions.

(3) People don't like decorating their queries with nonstandard stuff;
it smells of vendor lock-in.  Especially if it's actually SQL syntax
and not comments.  Once you put something into the DML it's just too
hard to fix applications to get rid of it (the inverse case of point
#1).

I haven't looked at Hans' patch in any detail, and don't intend to
do so while the CF is still running; but at least he got this point
right.

                        regards, tom lane

--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|

Re: WIP: cross column correlation ...

Robert Haas
On Tue, Feb 22, 2011 at 9:43 PM, Tom Lane <[hidden email]> wrote:

> Robert Haas <[hidden email]> writes:
>> /me prepares to go down in flames.
>
>> Personally, I think the first thing we ought to do is add a real, bona
>> fide planner hint to override the selectivity calculation manually,
>> maybe something like this:
>
>> WHERE (x < 5 AND y = 1) SELECTIVITY (0.1);
>
> One of the criteria we've always had for a suitable hint-or-whatever-
> you-call-it design is that it *not* involve decorating the queries.
> There are a number of reasons for that, some of the killer ones being
>
> (1) People frequently *can't* adjust their queries that way, because
> they're coming out of some broken query generator or other.  (Crappy
> query generators are of course one of the prime reasons for
> poor-performing queries in the first place, so you can't write this off
> as not being a key use case.)
>
> (2) Anything we do like that, we'd be locked into supporting forever,
> even after we think of better solutions.
>
> (3) People don't like decorating their queries with nonstandard stuff;
> it smells of vendor lock-in.  Especially if it's actually SQL syntax
> and not comments.  Once you put something into the DML it's just too
> hard to fix applications to get rid of it (the inverse case of point
> #1).

Those are real problems, but I still want it.  The last time I hit
this problem I spent two days redesigning my schema and adding
triggers all over the place to make things work.  If I had been
dealing with a 30TB database instead of a 300MB database I would have
been royally up a creek.

To put that another way, it's true that some people can't adjust their
queries, but also some people can.  It's true that nonstandard stuff
sucks, but queries that don't work suck, too.  And as for better
solutions, how many major release cycles do we expect people to wait
for them?  Even one major release cycle is an eternity when you're
trying to get the application working before your company runs out of
money, and this particular problem has had a lot of cycles expended on
it without producing anything very tangible (proposed patch, which
like you I can't spare a lot of cycles to look at just now, possibly
excepted).

I agree that if we can get something that actually works that doesn't
involve decorating the queries, that is better.  But I would surely
rather decorate the queries than rewrite the entire application around
the problem.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|

Re: WIP: cross column correlation ...

Tom Lane-2
Robert Haas <[hidden email]> writes:
> On Tue, Feb 22, 2011 at 9:43 PM, Tom Lane <[hidden email]> wrote:
>> One of the criteria we've always had for a suitable hint-or-whatever-
>> you-call-it design is that it *not* involve decorating the queries.

> [ snip ]
> To put that another way, it's true that some people can't adjust their
> queries, but also some people can.  It's true that nonstandard stuff
> sucks, but queries that don't work suck, too.  And as for better
> solutions, how many major release cycles do we expect people to wait
> for them?

Well, a decorating-the-queries solution that isn't utter crap is not
going to be a small amount of work, either.

                        regards, tom lane

--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|

Re: WIP: cross column correlation ...

Nathan Boley-2
In reply to this post by Robert Haas
> Personally, I think the first thing we ought to do is add a real, bona
> fide planner hint to override the selectivity calculation manually,
> maybe something like this:
>
> WHERE (x < 5 AND y = 1) SELECTIVITY (0.1);
>


If you're going to go that far, why not just collect statistics on
that specific predicate?

ie,  ANALYZE SELECTIVITY ON tablename (x, y) WHERE (x < 5 AND y = 1);

Then it won't fall subject to all of the pitfalls that Tom outlines below.

Selectivities are easy to estimate if we know the predicate. They only
become hard when they have to work for every possible predicate.

Best,
Nathan

--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|

Re: WIP: cross column correlation ...

PostgreSQL - Hans-Jürgen Schönig
In reply to this post by Robert Haas

On Feb 23, 2011, at 2:58 AM, Robert Haas wrote:

> 2011/2/22 PostgreSQL - Hans-Jürgen Schönig <[hidden email]>:
>> how does it work? we try to find suitable statistics for an arbitrary length list of conditions so that the planner can use it directly rather than multiplying all the selectivities. this should make estimates a lot more precise.
>> the current approach can be extended to work with expressions and well as "straight" conditions.
>
> /me prepares to go down in flames.
>
> Personally, I think the first thing we ought to do is add a real, bona
> fide planner hint to override the selectivity calculation manually,
> maybe something like this:
>
> WHERE (x < 5 AND y = 1) SELECTIVITY (0.1);


i thought there was an agreement that we don't want planner hints?
as tom pointed out - many broken queries come out of some query generator where even the design to make the design is broken by design.
personally i like query generators as long as other people use them ... telling people that this is the wrong way to go is actually financing my holiday next week ... ;).  in general - hibernate and stuff like that is a no-go.

personally i like the type of planner hints oleg and teodor came up with - i think we should do more of those hooks they are using but hiding it in some syntax is not a good idea.
it does not change the query and it still gives a lot of room to toy around. it looks like a compromise.

however, oleg's contrib module does not fix the core problem of cross column statistics because a hint is usually static but you want flexible selectivity.

        regards,

                hans

--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt, Austria
Web: http://www.postgresql-support.de


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|

Re: WIP: cross column correlation ...

PostgreSQL - Hans-Jürgen Schönig
In reply to this post by Robert Haas
> Those are real problems, but I still want it.  The last time I hit
> this problem I spent two days redesigning my schema and adding
> triggers all over the place to make things work.  If I had been
> dealing with a 30TB database instead of a 300MB database I would have
> been royally up a creek.
>
> To put that another way, it's true that some people can't adjust their
> queries, but also some people can.  It's true that nonstandard stuff
> sucks, but queries that don't work suck, too.  And as for better
> solutions, how many major release cycles do we expect people to wait
> for them?  Even one major release cycle is an eternity when you're
> trying to get the application working before your company runs out of
> money, and this particular problem has had a lot of cycles expended on
> it without producing anything very tangible (proposed patch, which
> like you I can't spare a lot of cycles to look at just now, possibly
> excepted).



cheapest and easiest solution if you run into this: add "fake" functions which the planner cannot estimate properly.
use OR to artificially prop up estimates or use AND to artificially lower them. there is actually no need to redesign the schema to get around it but it is such an ugly solution that it does not even deserve to be called "ugly" ...
however, fast and reliable way to get around it.

        regards,

                hans

--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt, Austria
Web: http://www.postgresql-support.de


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|

Re: WIP: cross column correlation ...

Robert Haas
In reply to this post by Nathan Boley-2
On Wed, Feb 23, 2011 at 12:50 AM, Nathan Boley <[hidden email]> wrote:

>> Personally, I think the first thing we ought to do is add a real, bona
>> fide planner hint to override the selectivity calculation manually,
>> maybe something like this:
>>
>> WHERE (x < 5 AND y = 1) SELECTIVITY (0.1);
>
> If you're going to go that far, why not just collect statistics on
> that specific predicate?
>
> ie,  ANALYZE SELECTIVITY ON tablename (x, y) WHERE (x < 5 AND y = 1);
>
> Then it won't fall subject to all of the pitfalls that Tom outlines below.
>
> Selectivities are easy to estimate if we know the predicate. They only
> become hard when they have to work for every possible predicate.

Fair point.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|

Re: WIP: cross column correlation ...

Robert Haas
In reply to this post by PostgreSQL - Hans-Jürgen Schönig
2011/2/23 PostgreSQL - Hans-Jürgen Schönig <[hidden email]>:

>> Those are real problems, but I still want it.  The last time I hit
>> this problem I spent two days redesigning my schema and adding
>> triggers all over the place to make things work.  If I had been
>> dealing with a 30TB database instead of a 300MB database I would have
>> been royally up a creek.
>>
>> To put that another way, it's true that some people can't adjust their
>> queries, but also some people can.  It's true that nonstandard stuff
>> sucks, but queries that don't work suck, too.  And as for better
>> solutions, how many major release cycles do we expect people to wait
>> for them?  Even one major release cycle is an eternity when you're
>> trying to get the application working before your company runs out of
>> money, and this particular problem has had a lot of cycles expended on
>> it without producing anything very tangible (proposed patch, which
>> like you I can't spare a lot of cycles to look at just now, possibly
>> excepted).
>
> cheapest and easiest solution if you run into this: add "fake" functions which the planner cannot estimate properly.
> use OR to artificially prop up estimates or use AND to artificially lower them. there is actually no need to redesign the schema to get around it but it is such an ugly solution that it does not even deserve to be called "ugly" ...
> however, fast and reliable way to get around it.

We couldn't possibly design a hint mechanism that would be uglier or
less future-proof than this workaround (which, by the way, I'll keep
in mind for the next time I get bitten by this).

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|

Re: WIP: cross column correlation ...

PostgreSQL - Hans-Jürgen Schönig
>>>
>>
>> cheapest and easiest solution if you run into this: add "fake" functions which the planner cannot estimate properly.
>> use OR to artificially prop up estimates or use AND to artificially lower them. there is actually no need to redesign the schema to get around it but it is such an ugly solution that it does not even deserve to be called "ugly" ...
>> however, fast and reliable way to get around it.
>
> We couldn't possibly design a hint mechanism that would be uglier or
> less future-proof than this workaround (which, by the way, I'll keep
> in mind for the next time I get bitten by this).
>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
>


i think the main issue is: what we do is ugly because of despair and a lack of alternative ... what you proposed is ugly by design ;).
overall: the workaround will win the ugliness contest, however ;).

        many thanks,

                hans

--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt, Austria
Web: http://www.postgresql-support.de


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|

Re: WIP: cross column correlation ...

PostgreSQL - Hans-Jürgen Schönig
In reply to this post by Robert Haas

On Feb 23, 2011, at 3:46 PM, Robert Haas wrote:

> On Wed, Feb 23, 2011 at 12:50 AM, Nathan Boley <[hidden email]> wrote:
>>> Personally, I think the first thing we ought to do is add a real, bona
>>> fide planner hint to override the selectivity calculation manually,
>>> maybe something like this:
>>>
>>> WHERE (x < 5 AND y = 1) SELECTIVITY (0.1);
>>
>> If you're going to go that far, why not just collect statistics on
>> that specific predicate?
>>
>> ie,  ANALYZE SELECTIVITY ON tablename (x, y) WHERE (x < 5 AND y = 1);
>>
>> Then it won't fall subject to all of the pitfalls that Tom outlines below.
>>
>> Selectivities are easy to estimate if we know the predicate. They only
>> become hard when they have to work for every possible predicate.
>
> Fair point.
>
> --
> Robert Haas


basically we got the idea of allowing "expressions" in cross column stuff. i think this can be very useful. it would fix the problem of a query like that:

        SELECT * FROM table WHERE cos(field) = some_number;

this takes a constant fraction of the table which is usually plain wrong as well (and the error tends to multiply inside the plan).
i am just not sure if i have understood all corner cases of that already.
ultimate goal: get it right for join estimates (this is why a syntax extension is definitely needed - you cannot track all of them automatically).

        many thanks,

                hans

--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt, Austria
Web: http://www.postgresql-support.de


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|

Re: WIP: cross column correlation ...

Robert Haas
In reply to this post by PostgreSQL - Hans-Jürgen Schönig
2011/2/23 PostgreSQL - Hans-Jürgen Schönig <[hidden email]>:
> i thought there was an agreement that we don't want planner hints?

Well, I want them.  I think some other people do, too.  Whether those
people are more numerous than than the people who don't want them, and
how much that matters either way, is another question.  I don't want
to have to use them very often, but I like to have an out when I get
desperate.

> as tom pointed out - many broken queries come out of some query generator where even the design to make the design is broken by design.
> personally i like query generators as long as other people use them ... telling people that this is the wrong way to go is actually financing my holiday next week ... ;).  in general - hibernate and stuff like that is a no-go.
>
> personally i like the type of planner hints oleg and teodor came up with - i think we should do more of those hooks they are using but hiding it in some syntax is not a good idea.
> it does not change the query and it still gives a lot of room to toy around. it looks like a compromise.
>
> however, oleg's contrib module does not fix the core problem of cross column statistics because a hint is usually static but you want flexible selectivity.

IIRC, what Teodor and Oleg did was a contrib module that excluded a
certain index from consideration based on a GUC.  That to me is a
little more hacky than just wiring the selectivity estimate.  You're
going to need to set that just before each query that needs it, and
reset it afterwards, so it's actually worse than just decorating the
queries, IMHO.  Also, I haven't run into any actual problems in the
field that would be solved by this approach, though I am sure others
have.  IME, most bad query plans are caused by either incorrect
estimates of selectivity, or wrongheaded notions about what's likely
to be cached.  If we could find a way, automated or manual, of
providing the planner some better information about the facts of life
in those areas, I think we'd be way better off.  I'm open to ideas
about what the best way to do that is.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|

Re: WIP: cross column correlation ...

Bruce Momjian
In reply to this post by PostgreSQL - Hans-Jürgen Schönig
Reply | Threaded
Open this post in threaded view
|

Re: WIP: cross column correlation ...

Bruce Momjian
In reply to this post by Robert Haas
Robert Haas wrote:

> 2011/2/23 PostgreSQL - Hans-J?rgen Sch?nig <[hidden email]>:
> > i thought there was an agreement that we don't want planner hints?
>
> Well, I want them.  I think some other people do, too.  Whether those
> people are more numerous than than the people who don't want them, and
> how much that matters either way, is another question.  I don't want
> to have to use them very often, but I like to have an out when I get
> desperate.
>
> > as tom pointed out - many broken queries come out of some query generator where even the design to make the design is broken by design.
> > personally i like query generators as long as other people use them ... telling people that this is the wrong way to go is actually financing my holiday next week ... ;). ?in general - hibernate and stuff like that is a no-go.
> >
> > personally i like the type of planner hints oleg and teodor came up with - i think we should do more of those hooks they are using but hiding it in some syntax is not a good idea.
> > it does not change the query and it still gives a lot of room to toy around. it looks like a compromise.
> >
> > however, oleg's contrib module does not fix the core problem of cross column statistics because a hint is usually static but you want flexible selectivity.
>
> IIRC, what Teodor and Oleg did was a contrib module that excluded a
> certain index from consideration based on a GUC.  That to me is a
> little more hacky than just wiring the selectivity estimate.  You're
> going to need to set that just before each query that needs it, and
> reset it afterwards, so it's actually worse than just decorating the
> queries, IMHO.  Also, I haven't run into any actual problems in the
> field that would be solved by this approach, though I am sure others
> have.  IME, most bad query plans are caused by either incorrect
> estimates of selectivity, or wrongheaded notions about what's likely
> to be cached.  If we could find a way, automated or manual, of
> providing the planner some better information about the facts of life
> in those areas, I think we'd be way better off.  I'm open to ideas
> about what the best way to do that is.

For me the key is finding a way to get that information to the planner
so all queries can benefit, not just the queries we decorate.

--
  Bruce Momjian  <[hidden email]>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +

--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|

Re: WIP: cross column correlation ...

Josh berkus
In reply to this post by Robert Haas

> Personally, I think the first thing we ought to do is add a real, bona
> fide planner hint to override the selectivity calculation manually,
> maybe something like this:
>
> WHERE (x < 5 AND y = 1) SELECTIVITY (0.1);
>
> Then, having provided a method for the DBA to extinguish the raging
> flames of searing agony which are consuming them while a crocodile
> chews off their leg and their boss asks them why they didn't use
> Oracle, we can continue bikeshedding about the best way of fixing this
> problem in a more user-transparent fashion.

Is there some way we can do that without adding the selectivity hint to
the query itself?  That's the biggest issue with hints.

--
                                  -- Josh Berkus
                                     PostgreSQL Experts Inc.
                                     http://www.pgexperts.com

--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|

Re: WIP: cross column correlation ...

Robert Haas
On Wed, Feb 23, 2011 at 8:09 PM, Josh Berkus <[hidden email]> wrote:

>> Personally, I think the first thing we ought to do is add a real, bona
>> fide planner hint to override the selectivity calculation manually,
>> maybe something like this:
>>
>> WHERE (x < 5 AND y = 1) SELECTIVITY (0.1);
>>
>> Then, having provided a method for the DBA to extinguish the raging
>> flames of searing agony which are consuming them while a crocodile
>> chews off their leg and their boss asks them why they didn't use
>> Oracle, we can continue bikeshedding about the best way of fixing this
>> problem in a more user-transparent fashion.
>
> Is there some way we can do that without adding the selectivity hint to
> the query itself?  That's the biggest issue with hints.

I've been mulling this issue over a bit more - Nathan Boley raised a
similar point upthread.  I think it's useful to consider some concrete
cases which can occur.

1. Default estimate.  The planner tends to estimate that the
selectivity of <something> = <something> is 0.005, and that the
selectivity of <something> != <something> is 0.995, when it doesn't
know any better.  This estimate often sucks.  Sometimes it sucks
because it's too high, other times because it's too low, and of course
sometimes it is close enough for government work.

2. One special customer.  Suppose we have a database that contains
lots and lots of people and associates different attributes to those
people, including customer_id.  We put all of our employees in the
table too, and assign them customer_id = 1, since the record with
customer.id = 1 represents us.  I've built this kind of system for
several different employers over the years.  Turns out, the subset of
the person table with customer_id = 1 looks very different, in terms
of the MCVs on the remaining columns and the distribution of the
values otherwise, than the records with customer_id != 1.  I'm sure
this problem comes up in different forms in other domains; this is
just where I've seen it the most.

3. The mostly-redundant condition.  Something like creation_date >
'some timestamp' AND active.  Turns out, most of the not active stuff
is also... old.  A variant of this is creation_date > 'some timestamp'
AND customer_id = 1, which overlaps #2.  For extra fun the creation
date and customer_id may be in different tables, with some
intermediate join muddying the waters.

4. The condition that's redundant except when it isn't.  The classic
example here is WHERE zipcode = <constant> AND state = <constant>.
Most of the time, the selectivity of the two clauses together is much
higher than the product of their individually selectivities; you might
as well ignore the second part altogether.  But if some numbskull user
enters a state that doesn't match the zipcode, then suddenly it
matters a lot - the selectivity drops to zero when the second part is
added.

5. The bitfield.  Conditions like (x & 64) != 0.  I know disk is
cheap, but people keep doing this.

There are probably some others I'm missing, too.  That's just off the
top of my head.  Now here are some possible approaches to fixing it:

A. Decorate the query.  This would often be useful for case #1, and
some instances of #3 and #5.  It's useless for #2 and #4.

B. Specify a particular predicate and the selectivity thereof.  Like,
whenever you see (x & 64) = 0, assume the selectivity is 0.5.  Upon
reflection, this seems pretty terrible in every respect.  Unless you
only ever issue an extremely limited range of queries, you're going to
be hardwiring a lot of selectivities.  I think this really only
handles case #5 well, and maybe some instances of case #1.

C. Specify an expression and gather statistics on it as if it were a
column: i.e. ALTER TABLE tab ADD VIRTUAL STATISTICS COLUMN x & 64.
This is pretty good.  It is pretty much ideal for #2 and also handles
#5 and some cases of #3 and #1 well.  You could even make it handle
some instances of #4 if you made the virtual column ROW(state,
zipcode) and rewrote the query as a row comparison.

D. N x N implicativeness matrix.  Record for each pair of attributes
the extent to which a given value for A implies a value for B, and
derate the selectivity multipliers based on this information.  This is
an idea of Heikki's.  It seemed good to me when he proposed it, and I
think he proposed it in regards to #4, but I'm not sure we really ever
figured out how to make it work.

E. Given a set of columns (A1, .., An), collect MCVs and make a
histogram for ROW(A1, ..., An), and then use it to handle cases like
#4.  This is similar to C and is intended to handle the zipcode
problem, but it's not as flexible (because you are only specifying
columns, not expressions).  However, it's intended to work without
rewriting the state/zipcode comparisons as a rowcompare.

If you want to take the above as in any way an exhaustive survey of
the landscape (which it isn't), C seems like a standout, maybe
augmented by the making the planner able to notice that A1 = x1 AND A2
= x2 is equivalent to (A1,A2) = (x1, x2) so you don't have to rewrite
queries as much.

I don't really know how to handle the join selectivity problem.  I am
not convinced that there is a better solution to that than decorating
the query.  After all the join selectivity depends not only on the
join clause itself, but also on what you've filtered out of each table
in the meantime.

Note that I am not sure whether any of this is similar to what the WIP
patch already implements, so apologies for possibly rampaging off in a
different direction and/or reinventing your ideas.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|

Re: WIP: cross column correlation ...

Bruce Momjian
Robert Haas wrote:

> If you want to take the above as in any way an exhaustive survey of
> the landscape (which it isn't), C seems like a standout, maybe
> augmented by the making the planner able to notice that A1 = x1 AND A2
> = x2 is equivalent to (A1,A2) = (x1, x2) so you don't have to rewrite
> queries as much.
>
> I don't really know how to handle the join selectivity problem.  I am
> not convinced that there is a better solution to that than decorating
> the query.  After all the join selectivity depends not only on the
> join clause itself, but also on what you've filtered out of each table
> in the meantime.

Thinking some more, I think another downside to the "decorate the query"
idea is that many queries use constants that are supplied only at
runtime, so there would be no way to hard-code a selectivity value into
a query when you don't know the value.  Could a selectivity function
handle that?

--
  Bruce Momjian  <[hidden email]>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +

--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|

Re: WIP: cross column correlation ...

Robert Haas
On Wed, Feb 23, 2011 at 10:30 PM, Bruce Momjian <[hidden email]> wrote:

> Robert Haas wrote:
>> If you want to take the above as in any way an exhaustive survey of
>> the landscape (which it isn't), C seems like a standout, maybe
>> augmented by the making the planner able to notice that A1 = x1 AND A2
>> = x2 is equivalent to (A1,A2) = (x1, x2) so you don't have to rewrite
>> queries as much.
>>
>> I don't really know how to handle the join selectivity problem.  I am
>> not convinced that there is a better solution to that than decorating
>> the query.  After all the join selectivity depends not only on the
>> join clause itself, but also on what you've filtered out of each table
>> in the meantime.
>
> Thinking some more, I think another downside to the "decorate the query"
> idea is that many queries use constants that are supplied only at
> runtime, so there would be no way to hard-code a selectivity value into
> a query when you don't know the value.  Could a selectivity function
> handle that?

Beats me.  What do you have in mind?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|

Re: WIP: cross column correlation ...

PostgreSQL - Hans-Jürgen Schönig
In reply to this post by Josh berkus
On Feb 24, 2011, at 2:09 AM, Josh Berkus wrote:

>
>> Personally, I think the first thing we ought to do is add a real, bona
>> fide planner hint to override the selectivity calculation manually,
>> maybe something like this:
>>
>> WHERE (x < 5 AND y = 1) SELECTIVITY (0.1);
>>
>> Then, having provided a method for the DBA to extinguish the raging
>> flames of searing agony which are consuming them while a crocodile
>> chews off their leg and their boss asks them why they didn't use
>> Oracle, we can continue bikeshedding about the best way of fixing this
>> problem in a more user-transparent fashion.
>
> Is there some way we can do that without adding the selectivity hint to
> the query itself?  That's the biggest issue with hints.
>



well, you could hide this hint in the system table - say; instead of decorating the query you could store the decoration in some system relation ... but, if you get it right, you call this decoration histogram ;).
i think the patch with a multi-dim histogram is good (i have seen something similar for PostGIS).
what is still needed in our patch is a.) multi-dim sampling (no idea how to get it right) and b.) investigating how to deal with joins and expressions (e.g. cos(id) ).
hints into the right direction are highly welcome.

        many thanks,

                hans


--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt, Austria
Web: http://www.postgresql-support.de


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
1234