WIP: Aggregation push-down

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
36 messages Options
12
Reply | Threaded
Open this post in threaded view
|

WIP: Aggregation push-down

Antonin Houska-2
This is a new version of the patch I presented in [1]. A new thread seems
appropriate because the current version can aggregate both base relations and
joins, so the original subject would no longer match.

There's still work to do but I'd consider the patch complete in terms of
concept. A few things worth attention for those who want to look into the
code:

* I've abandoned the concept of aggmultifn proposed in [1], as it doesn't
  appear to be very useful. That implies that a "grouped join" can be formed
  in 2 ways: 1) join a grouped relation to a "plain" (i.e. non-grouped) one,
  2) join 2 plain relations and aggregate the result. However, w/o the
  aggmultifn we can't join 2 grouped relations.

* GroupedVar type is used to propagate the result of partial aggregation from
  to the top-level join. It's conceptually very similar to PlaceHolderVar.

* Although I intended to use the "unique join" feature [2], I postponed it so
  far. The point is that [2] does conflict with my patch and thus I'd have to
  rebase the patch more often. Anyway, the impact of [2] on aggregation
  finalization (i.e. possible avoidance of the "finalize aggregate node"
  setup) is not really specific to my patch.

* Scan of base relation or join result can be partially aggregated for 2
  reasons: 1) it makes the whole plan cheaper because the aggregation takes
  place on remote node and thus the amount of data to be transferred via
  network is significanlty reduced, 2) aggregate functions are rather
  expensive so it makes sense to evaluate them by multiple parallel workers.

  The patch contains both of these features as they are hard to
  separate from each other.

  While 1) needs additional work on postgres_fdw, scripts to simulate 2) are
  attached. Planner settings are such that cost of expression evaluation is
  significant, so that it's worth to engage multiple parallel workers.

  In my environment it yields the following output:

 Parallel Finalize HashAggregate
   Group Key: a.i
   ->  Gather Merge
         Workers Planned: 4
         ->  Merge Join
               Merge Cond: (b.parent = a.i)
               ->  Sort
                     Sort Key: b.parent
                     ->  Parallel Partial HashAggregate
                           Group Key: b.parent
                           ->  Hash Join
                                 Hash Cond: ((b.parent = c.parent) AND (b.j = c.k))
                                 ->  Parallel Seq Scan on b
                                 ->  Hash
                                       ->  Seq Scan on c
               ->  Sort
                     Sort Key: a.i
                     ->  Seq Scan on a

Feedback is appreciated.

[1] https://www.postgresql.org/message-id/29111.1483984605%40localhost

[2] https://commitfest.postgresql.org/13/859/

--
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de, http://www.cybertec.at



--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

agg_pushdown.diff (215K) Download Attachment
test_setup.sql (765 bytes) Download Attachment
query.sql (387 bytes) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: WIP: Aggregation push-down

Antonin Houska-2
Antonin Houska <[hidden email]> wrote:

> This is a new version of the patch I presented in [1].

Rebased.

cat .git/refs/heads/master
b9a3ef55b253d885081c2d0e9dc45802cab71c7b

--
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de, http://www.cybertec.at



--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

agg_pushdown_v2.diff (213K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: WIP: Aggregation push-down

Antonin Houska-2
Antonin Houska <[hidden email]> wrote:

> Antonin Houska <[hidden email]> wrote:
>
> > This is a new version of the patch I presented in [1].
>
> Rebased.
>
> cat .git/refs/heads/master
> b9a3ef55b253d885081c2d0e9dc45802cab71c7b

This is another version of the patch.

Besides other changes, it enables the aggregation push-down for postgres_fdw,
although only for aggregates whose transient aggregation state is equal to the
output type. For other aggregates (like avg()) the remote nodes will have to
return the transient state value in an appropriate form (maybe bytea type),
which does not depend on PG version.

shard.tgz demonstrates the typical postgres_fdw use case. One query shows base
scans of base relation's partitions being pushed to shard nodes, the other
pushes down a join and performs aggregation of the join result on the remote
node. Of course, the query can only references one particular partition, until
the "partition-wise join" [1] patch gets committed and merged with this my
patch.

One thing I'm not sure about is whether the patch should remove
GetForeignUpperPaths function from FdwRoutine, which it essentially makes
obsolete. Or should it only be deprecated so far? I understand that
deprecation is important for "ordinary SQL users", but FdwRoutine is an
interface for extension developers, so the policy might be different.

[1] https://commitfest.postgresql.org/14/994/

Any feedback is appreciated.

--
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de, http://www.cybertec.at



--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

agg_pushdown_v3.tgz (108K) Download Attachment
shard.tgz (2K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: WIP: Aggregation push-down

Jeevan Chalke
Hi Antonin,

To understand the feature you have proposed, I have tried understanding
your patch. Here are few comments so far on it:

1.
+         if (aggref->aggvariadic ||
+             aggref->aggdirectargs || aggref->aggorder ||
+             aggref->aggdistinct || aggref->aggfilter)

I did not understand, why you are not pushing aggregate in above cases?
Can you please explain?

2. "make check" in contrib/postgres_fdw crashes.

  SELECT COUNT(*) FROM ft1 t1;
! server closed the connection unexpectedly
!   This probably means the server terminated abnormally
!   before or while processing the request.
! connection to server was lost

From your given setup, if I wrote a query like:
EXPLAIN VERBOSE SELECT count(*) FROM orders_0;
it crashes.

Seems like missing few checks.

3. In case of pushing partial aggregation on the remote side, you use schema
named "partial", I did not get that change. If I have AVG() aggregate,
then I end up getting an error saying
"ERROR:  schema "partial" does not exist".

4. I am not sure about the code changes related to pushing partial
aggregate on the remote side. Basically, you are pushing a whole aggregate
on the remote side and result of that is treated as partial on the
basis of aggtype = transtype. But I am not sure that is correct unless
I miss something here. The type may be same but the result may not.

5. There are lot many TODO comments in the patch-set, are you working
on those?

Thanks


On Thu, Aug 17, 2017 at 8:52 PM, Antonin Houska <[hidden email]> wrote:
Antonin Houska <[hidden email]> wrote:

> Antonin Houska <[hidden email]> wrote:
>
> > This is a new version of the patch I presented in [1].
>
> Rebased.
>
> cat .git/refs/heads/master
> b9a3ef55b253d885081c2d0e9dc45802cab71c7b

This is another version of the patch.

Besides other changes, it enables the aggregation push-down for postgres_fdw,
although only for aggregates whose transient aggregation state is equal to the
output type. For other aggregates (like avg()) the remote nodes will have to
return the transient state value in an appropriate form (maybe bytea type),
which does not depend on PG version.

shard.tgz demonstrates the typical postgres_fdw use case. One query shows base
scans of base relation's partitions being pushed to shard nodes, the other
pushes down a join and performs aggregation of the join result on the remote
node. Of course, the query can only references one particular partition, until
the "partition-wise join" [1] patch gets committed and merged with this my
patch.

One thing I'm not sure about is whether the patch should remove
GetForeignUpperPaths function from FdwRoutine, which it essentially makes
obsolete. Or should it only be deprecated so far? I understand that
deprecation is important for "ordinary SQL users", but FdwRoutine is an
interface for extension developers, so the policy might be different.

[1] https://commitfest.postgresql.org/14/994/

Any feedback is appreciated.

--
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de, http://www.cybertec.at



--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers




--
Jeevan Chalke
Principal Software Engineer, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company

Reply | Threaded
Open this post in threaded view
|

Re: WIP: Aggregation push-down

Ashutosh Bapat
In reply to this post by Antonin Houska-2
On Thu, Aug 17, 2017 at 8:52 PM, Antonin Houska <[hidden email]> wrote:

> Antonin Houska <[hidden email]> wrote:
>
>> Antonin Houska <[hidden email]> wrote:
>>
>> > This is a new version of the patch I presented in [1].
>>
>> Rebased.
>>
>> cat .git/refs/heads/master
>> b9a3ef55b253d885081c2d0e9dc45802cab71c7b
>
> This is another version of the patch.
>
> Besides other changes, it enables the aggregation push-down for postgres_fdw,
> although only for aggregates whose transient aggregation state is equal to the
> output type. For other aggregates (like avg()) the remote nodes will have to
> return the transient state value in an appropriate form (maybe bytea type),
> which does not depend on PG version.

Having transient aggregation state type same as output type doesn't
mean that we can feed the output of remote aggregation as is to the
finalization stage locally or finalization is not needed at all. I am
not sure why is that test being used to decide whether or not to push
an aggregate (I assume they are partial aggregates) to the foreign
server. postgres_fdw doesn't have a feature to fetch partial aggregate
results from the foreign server. Till we do that, I don't think this
patch can push any partial aggregates down to the foreign server.

>
> shard.tgz demonstrates the typical postgres_fdw use case. One query shows base
> scans of base relation's partitions being pushed to shard nodes, the other
> pushes down a join and performs aggregation of the join result on the remote
> node. Of course, the query can only references one particular partition, until
> the "partition-wise join" [1] patch gets committed and merged with this my
> patch.

Right. Until then joins will not have children.

>
> One thing I'm not sure about is whether the patch should remove
> GetForeignUpperPaths function from FdwRoutine, which it essentially makes
> obsolete. Or should it only be deprecated so far? I understand that
> deprecation is important for "ordinary SQL users", but FdwRoutine is an
> interface for extension developers, so the policy might be different.

I doubt if that's correct. We can do that only when we know that all
kinds aggregates/grouping can be pushed down the join tree, which
looks impossible to me. Apart from that that FdwRoutine handles all
kinds of upper relations like windowing, distinct, ordered which are
not pushed down by this set of patches.

Here's review of the patchset
The patches compile cleanly when applied together.

Testcase "limit" fails in make check.

This is a pretty large patchset and the patches are divided by stages in the
planner rather than functionality. IOW, no single patch provides a full
functionality. Reviewing and probably committing such patches is not easy and
may take quite long. Instead, if the patches are divided by functionality, i.e.
each patch provides some functionality completely, it will be easy to review
and commit such patches with each commit giving something useful. I had
suggested breaking patches on that line at [1]. The patches also refactor
existing code. It's better to move such refactoring in a patch/es by itself.

The patches need more comments, explaining various decisions e.g.
             if (joinrel)
             {
                 /* Create GatherPaths for any useful partial paths for rel */
-                generate_gather_paths(root, joinrel);
+                generate_gather_paths(root, joinrel, false);

                 /* Find and save the cheapest paths for this joinrel */
                 set_cheapest(joinrel);
For base relations, the patch calls generate_gather_paths() with
grouped as true and
false, but for join relations it doesn't do that. There is no comment
explaining why.
OR
in the following code
+    /*
+     * Do partial aggregation at base relation level if the relation is
+     * eligible for it.
+     */
+    if (rel->gpi != NULL)
+        create_grouped_path(root, rel, path, false, true, AGG_HASHED);
Why not to consider AGG_SORTED paths?

The patch maintains an extra rel target, two pathlists, partial pathlist and
non-partial pathlist for grouping in RelOptInfo. These two extra
pathlists require "grouped" argument to be passed to a number of functions.
Instead probably it makes sense to create a RelOptInfo for aggregation/grouping
and set pathlist and partial_pathlist in that RelOptInfo. That will also make a
place for keeping grouped estimates.

For placeholders we have two function add_placeholders_to_base_rels() and
add_placeholders_to_joinrel(). We have add_grouping_info_to_base_rels(), but do
not have corresponding function for joinrels. How do we push aggregates
involving two or more relations to the corresponding joinrels?

Some minor assorted comments
Avoid useless hunk.
@@ -279,8 +301,8 @@ add_paths_to_joinrel(PlannerInfo *root,
      * joins, because there may be no other alternative.
      */
     if (enable_hashjoin || jointype == JOIN_FULL)
-        hash_inner_and_outer(root, joinrel, outerrel, innerrel,
-                             jointype, &extra);
+        hash_inner_and_outer(root, joinrel, outerrel, innerrel, jointype,
+                             &extra);

In the last "regression" patch, there are some code changes (mostly because of
pg_indent run). May be you want to include those in appropriate code patches.

Some quick observations using two tables t1(a int, b int) and t2(a int, b int),
populated as
insert into t1 select i, i % 5 from generate_series(1, 100) i where i % 2 = 0;
insert into t2 select i, i % 5 from generate_series(1, 100) i where i % 3 = 0;

1. The patch pushes aggregates down join in the following query
explain verbose select avg(t2.a) from t1 inner join t2 on t1.b = t2.b group by
t2.b;
but does not push the aggregates down join in the following query
explain verbose select avg(t2.a) from t1 left join t2 on t1.b = t2.b group by
t2.b;
In fact, it doesn't use the optimization for any OUTER join. I think the reason
for this behaviour is that the patch uses equivalence classes to distribute the
aggregates and grouping to base relations and OUTER equi-joins do not form
equivalence classes. But I think it should be possible to push the aggregates
down the OUTER join by adding one row for NULL values if the grouping is pushed
to the inner side. I don't see much change for outer side. This also means that
we have to choose some means other than equivalence class for propagating the
aggregates.

2. Following query throws error with these patches, but not without the
patches.
explain verbose select sum(t1.a + t2.a) from t1, t2, t2 t3 where t1.a
= t2.a
group by t2.a, t1.a;
ERROR:  ORDER/GROUP BY expression not found in targetlist

[1] https://www.postgresql.org/message-id/CAFjFpRejPP4K%3Dg%2B0aaq_US0YrMaZzyM%2BNUCi%3DJgwaxhMUj2Zcg%40mail.gmail.com

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|

Re: WIP: Aggregation push-down

Merlin Moncure-2
In reply to this post by Antonin Houska-2
On Thu, Aug 17, 2017 at 10:22 AM, Antonin Houska <[hidden email]> wrote:
> Antonin Houska <[hidden email]> wrote:
> output type. For other aggregates (like avg()) the remote nodes will have to
> return the transient state value in an appropriate form (maybe bytea type),
> which does not depend on PG version.

Hm, that seems like an awful lot of work (new version agnostic
serialization format) for very little benefit (version independent
type serialization for remote aggregate pushdown).  How about forcing
the version to be the same for the feature to be used?

merlin


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|

Re: WIP: Aggregation push-down

Antonin Houska-2
In reply to this post by Jeevan Chalke
Jeevan Chalke <[hidden email]> wrote:

> 1.
> + if (aggref->aggvariadic ||
> + aggref->aggdirectargs || aggref->aggorder ||
> + aggref->aggdistinct || aggref->aggfilter)
>
> I did not understand, why you are not pushing aggregate in above cases?
> Can you please explain?

Currently I'm trying to implement infrastructure to propagate result of
partial aggregation from base relation or join to the root of the join tree
and to use these as input for the final aggregation. Some special cases are
disabled because I haven't thought about them enough so far. Some of these
restrictions may be easy to relax, others may be hard. I'll get back to them
at some point.

> 2. "make check" in contrib/postgres_fdw crashes.
>
> SELECT COUNT(*) FROM ft1 t1;
> ! server closed the connection unexpectedly
> ! This probably means the server terminated abnormally
> ! before or while processing the request.
> ! connection to server was lost
>
> From your given setup, if I wrote a query like:
> EXPLAIN VERBOSE SELECT count(*) FROM orders_0;
> it crashes.
>
> Seems like missing few checks.
Please consider this a temporary limitation.

> 3. In case of pushing partial aggregation on the remote side, you use schema
> named "partial", I did not get that change. If I have AVG() aggregate,
> then I end up getting an error saying
> "ERROR: schema "partial" does not exist".

Attached to his email is an extension that I hacked quickly at some point, for
experimental purposes only. It's pretty raw but may be useful if you just want
to play with it.

> 4. I am not sure about the code changes related to pushing partial
> aggregate on the remote side. Basically, you are pushing a whole aggregate
> on the remote side and result of that is treated as partial on the
> basis of aggtype = transtype. But I am not sure that is correct unless
> I miss something here. The type may be same but the result may not.

You're right. I mostly used sum() and count() in my examples but these are
special cases. It should also be checked whether aggfinalfn exists or
not. I'll fix it, thanks.

> 5. There are lot many TODO comments in the patch-set, are you working
> on those?

Yes. I wasn't too eager to complete all the TODOs soon because reviews of the
partch may result in a major rework. And if large parts of the code will have
to be wasted, some / many TODOs will be gone as well.

> Thanks
>
> On Thu, Aug 17, 2017 at 8:52 PM, Antonin Houska <[hidden email]> wrote:
>
>  Antonin Houska <[hidden email]> wrote:
>
>  > Antonin Houska <[hidden email]> wrote:
>  >
>  > > This is a new version of the patch I presented in [1].
>  >
>  > Rebased.
>  >
>  > cat .git/refs/heads/master
>  > b9a3ef55b253d885081c2d0e9dc45802cab71c7b
>
>  This is another version of the patch.
>
>  Besides other changes, it enables the aggregation push-down for postgres_fdw,
>  although only for aggregates whose transient aggregation state is equal to the
>  output type. For other aggregates (like avg()) the remote nodes will have to
>  return the transient state value in an appropriate form (maybe bytea type),
>  which does not depend on PG version.
>
>  shard.tgz demonstrates the typical postgres_fdw use case. One query shows base
>  scans of base relation's partitions being pushed to shard nodes, the other
>  pushes down a join and performs aggregation of the join result on the remote
>  node. Of course, the query can only references one particular partition, until
>  the "partition-wise join" [1] patch gets committed and merged with this my
>  patch.
>
>  One thing I'm not sure about is whether the patch should remove
>  GetForeignUpperPaths function from FdwRoutine, which it essentially makes
>  obsolete. Or should it only be deprecated so far? I understand that
>  deprecation is important for "ordinary SQL users", but FdwRoutine is an
>  interface for extension developers, so the policy might be different.
>
>  [1] https://commitfest.postgresql.org/14/994/
>
>  Any feedback is appreciated.
>
>  --
>  Antonin Houska
>  Cybertec Schönig & Schönig GmbH
>  Gröhrmühlgasse 26
>  A-2700 Wiener Neustadt
>  Web: http://www.postgresql-support.de, http://www.cybertec.at
>
>  --
>  Sent via pgsql-hackers mailing list ([hidden email])
>  To make changes to your subscription:
>  http://www.postgresql.org/mailpref/pgsql-hackers
>
> --
> Jeevan Chalke
> Principal Software Engineer, Product Development
> EnterpriseDB Corporation
> The Enterprise PostgreSQL Company
>
>
--
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de, http://www.cybertec.at



--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

partial.tgz (4K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: WIP: Aggregation push-down

Antonin Houska-2
In reply to this post by Ashutosh Bapat
Ashutosh Bapat <[hidden email]> wrote:

> On Thu, Aug 17, 2017 at 8:52 PM, Antonin Houska <[hidden email]> wrote:
> > Antonin Houska <[hidden email]> wrote:
> >
> >> Antonin Houska <[hidden email]> wrote:
> >>
> >> > This is a new version of the patch I presented in [1].
> >>
> >> Rebased.
> >>
> >> cat .git/refs/heads/master
> >> b9a3ef55b253d885081c2d0e9dc45802cab71c7b
> >
> > This is another version of the patch.
> >
> > Besides other changes, it enables the aggregation push-down for postgres_fdw,
> > although only for aggregates whose transient aggregation state is equal to the
> > output type. For other aggregates (like avg()) the remote nodes will have to
> > return the transient state value in an appropriate form (maybe bytea type),
> > which does not depend on PG version.
>
> Having transient aggregation state type same as output type doesn't
> mean that we can feed the output of remote aggregation as is to the
> finalization stage locally or finalization is not needed at all. I am
> not sure why is that test being used to decide whether or not to push
> an aggregate (I assume they are partial aggregates) to the foreign
> server.

I agree. This seems to be the same problem as reported in [2].

> postgres_fdw doesn't have a feature to fetch partial aggregate
> results from the foreign server. Till we do that, I don't think this
> patch can push any partial aggregates down to the foreign server.

Even if the query contains only aggregates like sum(int4), i.e. aggfinalfn
does not exist and the transient type can be transfered from the remote server
in textual form? (Of course there's a question is if such behaviour is
consistent from user's perspective.)

> >
> > One thing I'm not sure about is whether the patch should remove
> > GetForeignUpperPaths function from FdwRoutine, which it essentially makes
> > obsolete. Or should it only be deprecated so far? I understand that
> > deprecation is important for "ordinary SQL users", but FdwRoutine is an
> > interface for extension developers, so the policy might be different.
>
> I doubt if that's correct. We can do that only when we know that all
> kinds aggregates/grouping can be pushed down the join tree, which
> looks impossible to me. Apart from that that FdwRoutine handles all
> kinds of upper relations like windowing, distinct, ordered which are
> not pushed down by this set of patches.

Good point.

> Here's review of the patchset
> The patches compile cleanly when applied together.
>
> Testcase "limit" fails in make check.

If I remember correctly, this test generates different plan because the
aggregate push-down gets involved. I tend to ignore this error until the patch
has cost estimates refined. Then let's see if the test will pass or if the
expected output should be adjusted.

> This is a pretty large patchset and the patches are divided by stages in the
> planner rather than functionality. IOW, no single patch provides a full
> functionality. Reviewing and probably committing such patches is not easy and
> may take quite long. Instead, if the patches are divided by functionality, i.e.
> each patch provides some functionality completely, it will be easy to review
> and commit such patches with each commit giving something useful. I had
> suggested breaking patches on that line at [1]. The patches also refactor
> existing code. It's better to move such refactoring in a patch/es by itself.

I definitely saw commits whose purpose is preparation for something else. But
you're right that some splitting of the actual funcionality wouldn't
harm. I'll at least separate partial aggregation of base relations and that of
joins. And maybe some more preparation steps.

> The patches need more comments, explaining various decisions

o.k.

> The patch maintains an extra rel target, two pathlists, partial pathlist and
> non-partial pathlist for grouping in RelOptInfo. These two extra
> pathlists require "grouped" argument to be passed to a number of functions.
> Instead probably it makes sense to create a RelOptInfo for aggregation/grouping
> and set pathlist and partial_pathlist in that RelOptInfo. That will also make a
> place for keeping grouped estimates.

If grouped paths require a separate RelOptInfo, why the existing partial paths
do not?

> For placeholders we have two function add_placeholders_to_base_rels() and
> add_placeholders_to_joinrel(). We have add_grouping_info_to_base_rels(), but do
> not have corresponding function for joinrels. How do we push aggregates
> involving two or more relations to the corresponding joinrels?

add_grouping_info_to_base_rels() calls prepare_rel_for_grouping() which
actually adds the "grouping info". For join, prepare_rel_for_grouping() is
called from build_join_rel().

While PlaceHolderVars need to be finalized before planner starts to create
joins, the GroupedPathInfo has to fit particular pair of joined relations.

Perhaps create_grouping_info_... would be better, to indicate that the thing
we're adding does not exist yet. I'll think about it.

> Some minor assorted comments ...

o.k., will fix.

> Some quick observations using two tables t1(a int, b int) and t2(a int, b int),
> populated as
> insert into t1 select i, i % 5 from generate_series(1, 100) i where i % 2 = 0;
> insert into t2 select i, i % 5 from generate_series(1, 100) i where i % 3 = 0;
>
> 1. The patch pushes aggregates down join in the following query
> explain verbose select avg(t2.a) from t1 inner join t2 on t1.b = t2.b group by
> t2.b;
> but does not push the aggregates down join in the following query
> explain verbose select avg(t2.a) from t1 left join t2 on t1.b = t2.b group by
> t2.b;

> In fact, it doesn't use the optimization for any OUTER join. I think the reason
> for this behaviour is that the patch uses equivalence classes to distribute the
> aggregates and grouping to base relations and OUTER equi-joins do not form
> equivalence classes. But I think it should be possible to push the aggregates
> down the OUTER join by adding one row for NULL values if the grouping is pushed
> to the inner side. I don't see much change for outer side. This also means that
> we have to choose some means other than equivalence class for propagating the
> aggregates.

The problem is that aggregate pushed down to the nullable side of an outer
join receives different input values than the original aggregate at the top
level of the query. NULL values generated by the OJ make the difference. I
have no idea how to handle this problem. If the aggregation is performed on
the nullable side of the OJ, it can't predict which of the input values don't
match any value of the other side. Suggestions are appreciated.

> 2. Following query throws error with these patches, but not without the
> patches.
> explain verbose select sum(t1.a + t2.a) from t1, t2, t2 t3 where t1.a
> = t2.a
> group by t2.a, t1.a;
> ERROR:  ORDER/GROUP BY expression not found in targetlist

I'll check this. Thanks.

> [1] https://www.postgresql.org/message-id/CAFjFpRejPP4K%3Dg%2B0aaq_US0YrMaZzyM%2BNUCi%3DJgwaxhMUj2Zcg%40mail.gmail.com

[2] https://www.postgresql.org/message-id/CAM2%2B6%3DW2J-iaQBgj-sdMERELQLUm5dvOQEWQ2ho%2BQ7KZgnonkQ%40mail.gmail.com

--
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de, http://www.cybertec.at



--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|

Re: WIP: Aggregation push-down

Ashutosh Bapat
On Fri, Sep 8, 2017 at 7:04 PM, Antonin Houska <[hidden email]> wrote:

> Ashutosh Bapat <[hidden email]> wrote:
>
>> On Thu, Aug 17, 2017 at 8:52 PM, Antonin Houska <[hidden email]> wrote:
>> > Antonin Houska <[hidden email]> wrote:
>> >
>> >> Antonin Houska <[hidden email]> wrote:
>> >>
>> >> > This is a new version of the patch I presented in [1].
>> >>
>> >> Rebased.
>> >>
>> >> cat .git/refs/heads/master
>> >> b9a3ef55b253d885081c2d0e9dc45802cab71c7b
>> >
>> > This is another version of the patch.
>> >
>> > Besides other changes, it enables the aggregation push-down for postgres_fdw,
>> > although only for aggregates whose transient aggregation state is equal to the
>> > output type. For other aggregates (like avg()) the remote nodes will have to
>> > return the transient state value in an appropriate form (maybe bytea type),
>> > which does not depend on PG version.
>>
>> Having transient aggregation state type same as output type doesn't
>> mean that we can feed the output of remote aggregation as is to the
>> finalization stage locally or finalization is not needed at all. I am
>> not sure why is that test being used to decide whether or not to push
>> an aggregate (I assume they are partial aggregates) to the foreign
>> server.
>
> I agree. This seems to be the same problem as reported in [2].
>
>> postgres_fdw doesn't have a feature to fetch partial aggregate
>> results from the foreign server. Till we do that, I don't think this
>> patch can push any partial aggregates down to the foreign server.
>
> Even if the query contains only aggregates like sum(int4), i.e. aggfinalfn
> does not exist and the transient type can be transfered from the remote server
> in textual form? (Of course there's a question is if such behaviour is
> consistent from user's perspective.)

Yes, those functions will work.

>
>> >
>> > One thing I'm not sure about is whether the patch should remove
>> > GetForeignUpperPaths function from FdwRoutine, which it essentially makes
>> > obsolete. Or should it only be deprecated so far? I understand that
>> > deprecation is important for "ordinary SQL users", but FdwRoutine is an
>> > interface for extension developers, so the policy might be different.
>>
>> I doubt if that's correct. We can do that only when we know that all
>> kinds aggregates/grouping can be pushed down the join tree, which
>> looks impossible to me. Apart from that that FdwRoutine handles all
>> kinds of upper relations like windowing, distinct, ordered which are
>> not pushed down by this set of patches.
>
> Good point.
>

I think this is where Jeevan Chalke's partition-wise aggregation
helps. It deals with the cases where your patch can not push the
aggregates down to children of join.

>> The patch maintains an extra rel target, two pathlists, partial pathlist and
>> non-partial pathlist for grouping in RelOptInfo. These two extra
>> pathlists require "grouped" argument to be passed to a number of functions.
>> Instead probably it makes sense to create a RelOptInfo for aggregation/grouping
>> and set pathlist and partial_pathlist in that RelOptInfo. That will also make a
>> place for keeping grouped estimates.
>
> If grouped paths require a separate RelOptInfo, why the existing partial paths
> do not?

partial paths produce the same targetlist and the same relation that
non-partial paths do. A RelOptInfo in planner represents a set of rows
and all paths added to that RelOptInfo produce same whole result
(parameterized paths produces the same result collectively). Grouping
paths however are producing a different result and different
targetlist, so may be it's better to have a separate RelOptInfo.

>
>> For placeholders we have two function add_placeholders_to_base_rels() and
>> add_placeholders_to_joinrel(). We have add_grouping_info_to_base_rels(), but do
>> not have corresponding function for joinrels. How do we push aggregates
>> involving two or more relations to the corresponding joinrels?
>
> add_grouping_info_to_base_rels() calls prepare_rel_for_grouping() which
> actually adds the "grouping info". For join, prepare_rel_for_grouping() is
> called from build_join_rel().

Ok.

>
> While PlaceHolderVars need to be finalized before planner starts to create
> joins, the GroupedPathInfo has to fit particular pair of joined relations.
>
> Perhaps create_grouping_info_... would be better, to indicate that the thing
> we're adding does not exist yet. I'll think about it.
>
>> Some minor assorted comments ...
>
> o.k., will fix.
>
>> Some quick observations using two tables t1(a int, b int) and t2(a int, b int),
>> populated as
>> insert into t1 select i, i % 5 from generate_series(1, 100) i where i % 2 = 0;
>> insert into t2 select i, i % 5 from generate_series(1, 100) i where i % 3 = 0;
>>
>> 1. The patch pushes aggregates down join in the following query
>> explain verbose select avg(t2.a) from t1 inner join t2 on t1.b = t2.b group by
>> t2.b;
>> but does not push the aggregates down join in the following query
>> explain verbose select avg(t2.a) from t1 left join t2 on t1.b = t2.b group by
>> t2.b;
>
>> In fact, it doesn't use the optimization for any OUTER join. I think the reason
>> for this behaviour is that the patch uses equivalence classes to distribute the
>> aggregates and grouping to base relations and OUTER equi-joins do not form
>> equivalence classes. But I think it should be possible to push the aggregates
>> down the OUTER join by adding one row for NULL values if the grouping is pushed
>> to the inner side. I don't see much change for outer side. This also means that
>> we have to choose some means other than equivalence class for propagating the
>> aggregates.
>
> The problem is that aggregate pushed down to the nullable side of an outer
> join receives different input values than the original aggregate at the top
> level of the query. NULL values generated by the OJ make the difference. I
> have no idea how to handle this problem. If the aggregation is performed on
> the nullable side of the OJ, it can't predict which of the input values don't
> match any value of the other side. Suggestions are appreciated.

I haven't thought through this fully as well, but I think this can be
fixed by using some kind of all NULL row on the nullable side. If we
are going to use equivalence classes, we won't be able to extend that
mechanism to OUTER joins, so may be you want to rethink about using
equivalence classes as a method of propagating grouping information.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|

Re: WIP: Aggregation push-down

Antonin Houska-2
In reply to this post by Antonin Houska-2
Antonin Houska <[hidden email]> wrote:

> This is another version of the patch.

> shard.tgz demonstrates the typical postgres_fdw use case. One query shows base
> scans of base relation's partitions being pushed to shard nodes, the other
> pushes down a join and performs aggregation of the join result on the remote
> node. Of course, the query can only references one particular partition, until
> the "partition-wise join" [1] patch gets committed and merged with this my
> patch.

Since [1] is already there, the new version of shard.tgz shows what I consider
the typical use case. (I'm aware of the postgres_fdw regression test failures,
I'll try to fix them all in the next version.)

Besides that:

* A concept of "path unique keys" has been introduced. It helps to find out if
  the final relation appears to generate a distinct set of grouping keys. If
  that happens, the final aggregation is replaced by mere call of aggfinalfn()
  function on each transient state value.

* FDW can sort rows by aggregate.

* enable_agg_pushdown GUC was added. The default value is false.

* I fixed errors reported during the previous CF.

* Added a few more regression tests.


I'm not about to add any other features now. Implementation of the missing
parts (see the TODO comments in the code) is the next step. But what I'd
appreciate most is a feedback on the design. Thanks.

> [1] https://commitfest.postgresql.org/14/994/

--
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de, http://www.cybertec.at



--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

agg_pushdown_v4.tgz (140K) Download Attachment
shard.tgz (1K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: WIP: Aggregation push-down

Michael Paquier
On Sat, Nov 4, 2017 at 12:33 AM, Antonin Houska <[hidden email]> wrote:
> I'm not about to add any other features now. Implementation of the missing
> parts (see the TODO comments in the code) is the next step. But what I'd
> appreciate most is a feedback on the design. Thanks.

I am getting a conflict after applying patch 5 but this did not get
any reviews so moved to next CF with waiting on author as status.
--
Michael

Reply | Threaded
Open this post in threaded view
|

Re: WIP: Aggregation push-down

Antonin Houska-2
Michael Paquier <[hidden email]> wrote:

> On Sat, Nov 4, 2017 at 12:33 AM, Antonin Houska <[hidden email]> wrote:
> > I'm not about to add any other features now. Implementation of the missing
> > parts (see the TODO comments in the code) is the next step. But what I'd
> > appreciate most is a feedback on the design. Thanks.
>
> I am getting a conflict after applying patch 5 but this did not get
> any reviews so moved to next CF with waiting on author as status.

Attached is the next version.

--
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de, http://www.cybertec.at


agg_pushdown_v5.tgz (162K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: WIP: Aggregation push-down

Robert Haas
On Fri, Dec 22, 2017 at 10:43 AM, Antonin Houska <[hidden email]> wrote:

> Michael Paquier <[hidden email]> wrote:
>> On Sat, Nov 4, 2017 at 12:33 AM, Antonin Houska <[hidden email]> wrote:
>> > I'm not about to add any other features now. Implementation of the missing
>> > parts (see the TODO comments in the code) is the next step. But what I'd
>> > appreciate most is a feedback on the design. Thanks.
>>
>> I am getting a conflict after applying patch 5 but this did not get
>> any reviews so moved to next CF with waiting on author as status.
>
> Attached is the next version.

I've been a bit confused for a while about what this patch is trying
to do, so I spent some time today looking at it to try to figure it
out.  There's a lot I don't understand yet, but it seems like the
general idea is to build, potentially for each relation in the join
tree, not only the regular list of paths but also a list of "grouped"
paths.  If the pre-grouped path wins, then we can get a final path
that looks like Finalize Aggregate -> Some Joins -> Partial Aggregate
-> Maybe Some More Joins -> Base Table Scan.  In some cases the patch
seems to replace that uppermost Finalize Aggregate with a Result node.

translate_expression_to_rels() looks unsafe.  Equivalence members are
known to be equal under the semantics of the relevant operator class,
but that doesn't mean that one can be freely substituted for another.
For example:

rhaas=# create table one (a numeric);
CREATE TABLE
rhaas=# create table two (a numeric);
CREATE TABLE
rhaas=# insert into one values ('0');
INSERT 0 1
rhaas=# insert into two values ('0.00');
INSERT 0 1
rhaas=# select one.a, count(*) from one, two where one.a = two.a group by 1;
 a | count
---+-------
 0 |     1
(1 row)

rhaas=# select two.a, count(*) from one, two where one.a = two.a group by 1;
  a   | count
------+-------
 0.00 |     1
(1 row)

There are, admittedly, a large number of data types for which such a
substitution would work just fine.  numeric will not, citext will not,
but many others are fine. Regrettably, we have no framework in the
system for identifying equality operators which actually test identity
versus some looser notion of equality.  Cross-type operators are a
problem, too; if we have foo.x = bar.y in the query, one might be int4
and the other int8, for example, but they can still belong to the same
equivalence class.

Concretely, in your test query "SELECT p.i, avg(c1.v) FROM
agg_pushdown_parent AS p JOIN agg_pushdown_child1 AS c1 ON c1.parent =
p.i GROUP BY p.i" you assume that it's OK to do a Partial
HashAggregate over c1.parent rather than p.i.  This will be false if,
say, c1.parent is of type citext and p.i is of type text; this will
get grouped together that shouldn't.  It will also be false if the
grouping expression is something like GROUP BY length(p.i::text),
because one value could be '0'::numeric and the other '0.00'::numeric.
I can't think of a reason why it would be false if the grouping
expressions are both simple Vars of the same underlying data type, but
I'm a little nervous that I might be wrong even about that case.
Maybe you've handled all of this somehow, but it's not obvious to me
that it has been considered.

Another thing I noticed is that the GroupedPathInfo looks a bit like a
stripped-down RelOptInfo, in that for example it has both a pathlist
and a partial_pathlist. I'm inclined to think that we should build new
RelOptInfos instead.  As you have it, there are an awful lot of things
that seem to know about the grouped/ungrouped distinction, many of
which are quite low-level functions like add_path(),
add_paths_to_append_rel(), try_nestloop_path(), etc.  I think that
some of this would go away if you had separate RelOptInfos instead of
GroupedPathInfo.

Some compiler noise:

allpaths.c:2794:11: error: variable 'partial_target' is used
uninitialized whenever 'if' condition is false
      [-Werror,-Wsometimes-uninitialized]
        else if (rel->gpi != NULL && rel->gpi->target != NULL)
                 ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
allpaths.c:2798:56: note: uninitialized use occurs here
                create_gather_path(root, rel, cheapest_partial_path,
partial_target,

^~~~~~~~~~~~~~
allpaths.c:2794:7: note: remove the 'if' if its condition is always true
        else if (rel->gpi != NULL && rel->gpi->target != NULL)
             ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
allpaths.c:2794:11: error: variable 'partial_target' is used
uninitialized whenever '&&' condition is false
      [-Werror,-Wsometimes-uninitialized]
        else if (rel->gpi != NULL && rel->gpi->target != NULL)
                 ^~~~~~~~~~~~~~~~
allpaths.c:2798:56: note: uninitialized use occurs here
                create_gather_path(root, rel, cheapest_partial_path,
partial_target,

^~~~~~~~~~~~~~
allpaths.c:2794:11: note: remove the '&&' if its condition is always true
        else if (rel->gpi != NULL && rel->gpi->target != NULL)
                 ^~~~~~~~~~~~~~~~~~~
allpaths.c:2773:28: note: initialize the variable 'partial_target' to
silence this warning
        PathTarget *partial_target;
                                  ^
                                   = NULL

Core dump running the regression tests:

  * frame #0: 0x00007fffe633ad42 libsystem_kernel.dylib`__pthread_kill + 10
    frame #1: 0x00007fffe6428457 libsystem_pthread.dylib`pthread_kill + 90
    frame #2: 0x00007fffe62a0420 libsystem_c.dylib`abort + 129
    frame #3: 0x000000010e3ae90e
postgres`ExceptionalCondition(conditionName=<unavailable>,
errorType=<unavailable>, fileName=<unavailable>,
lineNumber=<unavailable>) at assert.c:54 [opt]
    frame #4: 0x000000010e17c23b
postgres`check_list_invariants(list=<unavailable>) at list.c:39 [opt]
    frame #5: 0x000000010e17cdff postgres`list_free [inlined]
list_free_private(list=0x00007fe6a483be40, deep='\0') at list.c:1107
[opt]
    frame #6: 0x000000010e17cdfa
postgres`list_free(list=0x00007fe6a483be40) at list.c:1135 [opt]
    frame #7: 0x000000010e1b365f
postgres`generate_bitmap_or_paths(root=0x00007fe6a68131c8,
rel=0x00007fe6a6815310, clauses=<unavailable>,
other_clauses=<unavailable>, agg_info=0x0000000000000000) at
indxpath.c:1580 [opt]
    frame #8: 0x000000010e1b305c
postgres`create_index_paths(root=0x00007fe6a68131c8,
rel=<unavailable>, agg_info=0x0000000000000000) at indxpath.c:334
[opt]
    frame #9: 0x000000010e1a7cfa postgres`set_rel_pathlist [inlined]
set_plain_rel_pathlist at allpaths.c:745 [opt]
    frame #10: 0x000000010e1a7bc1
postgres`set_rel_pathlist(root=0x00007fe6a68131c8,
rel=0x00007fe6a6815310, rti=1, rte=0x00007fe6a6803270) at
allpaths.c:454 [opt]
    frame #11: 0x000000010e1a4910 postgres`make_one_rel at allpaths.c:312 [opt]
    frame #12: 0x000000010e1a48cc
postgres`make_one_rel(root=0x00007fe6a68131c8,
joinlist=0x00007fe6a6815810) at allpaths.c:182 [opt]
    frame #13: 0x000000010e1cb806
postgres`query_planner(root=0x00007fe6a68131c8, tlist=<unavailable>,
qp_callback=<unavailable>, qp_extra=0x00007fff51ca24d8) at
planmain.c:268 [opt]
    frame #14: 0x000000010e1ce7b4
postgres`grouping_planner(root=<unavailable>, inheritance_update='\0',
tuple_fraction=<unavailable>) at planner.c:1789 [opt]
    frame #15: 0x000000010e1ccb12
postgres`subquery_planner(glob=<unavailable>,
parse=0x00007fe6a6803158, parent_root=<unavailable>,
hasRecursion=<unavailable>, tuple_fraction=0) at planner.c:901 [opt]
    frame #16: 0x000000010e1cba3e
postgres`standard_planner(parse=0x00007fe6a6803158, cursorOptions=256,
boundParams=0x0000000000000000) at planner.c:364 [opt]
    frame #17: 0x000000010e291b3b
postgres`pg_plan_query(querytree=0x00007fe6a6803158,
cursorOptions=256, boundParams=0x0000000000000000) at postgres.c:807
[opt]
    frame #18: 0x000000010e291c6e
postgres`pg_plan_queries(querytrees=<unavailable>, cursorOptions=256,
boundParams=0x0000000000000000) at postgres.c:873 [opt]
    frame #19: 0x000000010e296335
postgres`exec_simple_query(query_string="SELECT p1.oid,
p1.typname\nFROM pg_type as p1\nWHERE p1.typnamespace = 0 OR\n
(p1.typlen <= 0 AND p1.typlen != -1 AND p1.typlen != -2) OR\n
(p1.typtype not in ('b', 'c', 'd', 'e', 'p', 'r')) OR\n    NOT
p1.typisdefined OR\n    (p1.typalign not in ('c', 's', 'i', 'd')) OR\n
   (p1.typstorage not in ('p', 'x', 'e', 'm'));") at postgres.c:1048
[opt]
    frame #20: 0x000000010e295099
postgres`PostgresMain(argc=<unavailable>, argv=<unavailable>,
dbname="regression", username=<unavailable>) at postgres.c:0 [opt]
    frame #21: 0x000000010e20a910 postgres`PostmasterMain [inlined]
BackendRun at postmaster.c:4412 [opt]
    frame #22: 0x000000010e20a741 postgres`PostmasterMain [inlined]
BackendStartup at postmaster.c:4084 [opt]
    frame #23: 0x000000010e20a3bd postgres`PostmasterMain at
postmaster.c:1757 [opt]
    frame #24: 0x000000010e2098ff
postgres`PostmasterMain(argc=1516050552, argv=<unavailable>) at
postmaster.c:1365 [opt]
    frame #25: 0x000000010e177537 postgres`main(argc=<unavailable>,
argv=<unavailable>) at main.c:228 [opt]
    frame #26: 0x00007fffe620c235 libdyld.dylib`start + 1
(lldb) p debug_query_string
(const char *) $0 = 0x00007fe6a401e518 "SELECT p1.oid,
p1.typname\nFROM pg_type as p1\nWHERE p1.typnamespace = 0 OR\n
(p1.typlen <= 0 AND p1.typlen != -1 AND p1.typlen != -2) OR\n
(p1.typtype not in ('b', 'c', 'd', 'e', 'p', 'r')) OR\n    NOT
p1.typisdefined OR\n    (p1.typalign not in ('c', 's', 'i', 'd')) OR\n
   (p1.typstorage not in ('p', 'x', 'e', 'm'));"

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Reply | Threaded
Open this post in threaded view
|

Re: WIP: Aggregation push-down

Antonin Houska-2
Robert Haas <[hidden email]> wrote:

> On Fri, Dec 22, 2017 at 10:43 AM, Antonin Houska <[hidden email]> wrote:
> > Michael Paquier <[hidden email]> wrote:
> >> On Sat, Nov 4, 2017 at 12:33 AM, Antonin Houska <[hidden email]> wrote:
> >> > I'm not about to add any other features now. Implementation of the missing
> >> > parts (see the TODO comments in the code) is the next step. But what I'd
> >> > appreciate most is a feedback on the design. Thanks.
> >>
> >> I am getting a conflict after applying patch 5 but this did not get
> >> any reviews so moved to next CF with waiting on author as status.
> >
> > Attached is the next version.
>
> I've been a bit confused for a while about what this patch is trying
> to do, so I spent some time today looking at it to try to figure it
> out.

Thanks.

> There's a lot I don't understand yet, but it seems like the general idea is
> to build, potentially for each relation in the join tree, not only the
> regular list of paths but also a list of "grouped" paths.  If the
> pre-grouped path wins, then we can get a final path that looks like Finalize
> Aggregate -> Some Joins -> Partial Aggregate -> Maybe Some More Joins ->
> Base Table Scan.

Yes. The regression test output shows some more plans.

> In some cases the patch seems to replace that uppermost Finalize Aggregate
> with a Result node.

This is a feature implemented in 09_avoid_agg_finalization.diff. You can
ignore it for now, it's of lower priority than the preceding parts and needs
more work to be complete.

> translate_expression_to_rels() looks unsafe.  Equivalence members are
> known to be equal under the semantics of the relevant operator class,
> but that doesn't mean that one can be freely substituted for another.
> For example:
>
> rhaas=# create table one (a numeric);
> CREATE TABLE
> rhaas=# create table two (a numeric);
> CREATE TABLE
> rhaas=# insert into one values ('0');
> INSERT 0 1
> rhaas=# insert into two values ('0.00');
> INSERT 0 1
> rhaas=# select one.a, count(*) from one, two where one.a = two.a group by 1;
>  a | count
> ---+-------
>  0 |     1
> (1 row)
>
> rhaas=# select two.a, count(*) from one, two where one.a = two.a group by 1;
>   a   | count
> ------+-------
>  0.00 |     1
> (1 row)
>
> There are, admittedly, a large number of data types for which such a
> substitution would work just fine.  numeric will not, citext will not,
> but many others are fine. Regrettably, we have no framework in the
> system for identifying equality operators which actually test identity
> versus some looser notion of equality.  Cross-type operators are a
> problem, too; if we have foo.x = bar.y in the query, one might be int4
> and the other int8, for example, but they can still belong to the same
> equivalence class.
>
> Concretely, in your test query "SELECT p.i, avg(c1.v) FROM
> agg_pushdown_parent AS p JOIN agg_pushdown_child1 AS c1 ON c1.parent =
> p.i GROUP BY p.i" you assume that it's OK to do a Partial
> HashAggregate over c1.parent rather than p.i.  This will be false if,
> say, c1.parent is of type citext and p.i is of type text; this will
> get grouped together that shouldn't.

I've really missed this problem, thanks for bringing it up! Your
considerations made me think of the specifics of the types like numeric or
citext. Although your example does not demonstrate what I consider the core
issue, I believe we eventually think of the same.

Consider this example query (using the tables you defined upthread):

SELECT      one.a
FROM        one, two
WHERE       one.a = two.a AND scale(two.a) > 1
GROUP BY    1

I we first try to aggregate "two", then evaluate WHERE clause and then
finalize the aggregation, the partial aggregation of "two" can put various
binary values of the "a" column (0, 0.0, etc.) into the same group so the
scale of the output (of the partial agregation) will be undefined. Thus the
aggregation push-down will change the input of the WHERE clause.

So one problem is that the grouping expression can be inappropriate for
partial aggregation even if there's no type change during the
translation. What I consider typical for this case is that the equality
operator used to identify groups (SortGroupClause.eqop) can return true even
if binary (stored) values of the inputs differ. The partial aggregation could
only take place if we had a special equality operator which distinguishes the
*binary* values (I don't know yet how to store this operator the catalog ---
in pg_opclass recors for the hash opclasses?)..

On the other hand, if the grouping expression is not a plain Var and there's
no type change during the translation and the expression output type is not of
the "identity-unsafe type" (numeric, citext, etc.), input vars of the
"identity-unsafe type"should not prevent us from using the expression for
partial aggregation: in such a case the grouping keys are computed in the same
way they would be w/o the partial aggregation.

The other problem is which type changes are legal. We should not allow any
type-specific information (such as scale or precision of the numeric type) to
bet lost or ambiguous. In this example

> It will also be false if the grouping expression is something like GROUP BY
> length(p.i::text), because one value could be '0'::numeric and the other
> '0.00'::numeric.

you show that not only numeric -> int is a problem but also int ->
numeric. This is the tricky part.

One of my ideas is to check whether the source and destination types are
binary coercible (i.e. pg_cast(castfunc) = 0) but this might be a misuse of
the binary coercibility. Another idea is to allow only such changes that the
destination type is in the same operator class as the source, and explicitly
enumerate the "safe opclasses". But that would mean that user cannot define
new opclasses within which the translation is possible --- not sure this is a
serious issue.

Perhaps the most consistent solution is that the translation is not performed
if any input variable of the grouping expression should change its data type.

> Another thing I noticed is that the GroupedPathInfo looks a bit like a
> stripped-down RelOptInfo, in that for example it has both a pathlist
> and a partial_pathlist. I'm inclined to think that we should build new
> RelOptInfos instead.  As you have it, there are an awful lot of things
> that seem to know about the grouped/ungrouped distinction, many of
> which are quite low-level functions like add_path(),
> add_paths_to_append_rel(), try_nestloop_path(), etc.  I think that
> some of this would go away if you had separate RelOptInfos instead of
> GroupedPathInfo.

This was proposed by Ashutosh Bapat upthread. I actually did this in the early
(not published) prototype of the patch and I considered it too invasive for
standard_join_search() and subroutines. IIRC an array like simple_rel_array
had to be introduced for the grouped relation in which the meaning of NULL was
twofold: either the relation is not a base relation or it is base relation but
no grouped relation exists for it.  Also I thought there would be too much
duplicate data if the grouped relation had a separate RelOptInfo. Now that I'm
done with one approach I can consider the original approach aggain.

> Some compiler noise:

> Core dump running the regression tests:

I'll fix these, thanks.

--
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26, A-2700 Wiener Neustadt
Web: https://www.cybertec-postgresql.com


Reply | Threaded
Open this post in threaded view
|

Re: WIP: Aggregation push-down

Robert Haas
On Fri, Jan 26, 2018 at 8:04 AM, Antonin Houska <[hidden email]> wrote:
> So one problem is that the grouping expression can be inappropriate for
> partial aggregation even if there's no type change during the
> translation. What I consider typical for this case is that the equality
> operator used to identify groups (SortGroupClause.eqop) can return true even
> if binary (stored) values of the inputs differ. The partial aggregation could
> only take place if we had a special equality operator which distinguishes the
> *binary* values (I don't know yet how to store this operator the catalog ---
> in pg_opclass recors for the hash opclasses?)..

We don't have an operator that tests for binary equality, but it's
certainly testable from C; see datumIsEqual.  I'm not sure how much
this really helps, though.  I think it would be safe to build an
initial set of groups based on datumIsEqual comparisons and then
arrange to later merge some of the groups.  But that's not something
we ever do in the executor today, so it might require quite a lot of
hacking.  Also, it seems like it might really suck in some cases.  For
instance, consider something like SELECT scale(one.a), sum(two.a) FROM
one, two WHERE one.a = two.a GROUP BY 1.  Doing a Partial Aggregate on
two.a using datumIsEqual semantics, joining to a, and then doing a
Finalize Aggregate looks legal, but the Partial Aggregate may produce
a tremendous number of groups compared to the Finalize Aggregate.  In
other words, this technique wouldn't merge any groups that shouldn't
be merged, but it might fail to merge groups that really need to be
merged to get good performance.

> One of my ideas is to check whether the source and destination types are
> binary coercible (i.e. pg_cast(castfunc) = 0) but this might be a misuse of
> the binary coercibility.

Yeah, binary coercibility has nothing to do with this; that tells you
whether the two types are the same on disk, not whether they have the
same equality semantics.  For instance, I think text and citext are
binary coercible, but their equality semantics are not the same.

> Another idea is to allow only such changes that the
> destination type is in the same operator class as the source, and explicitly
> enumerate the "safe opclasses". But that would mean that user cannot define
> new opclasses within which the translation is possible --- not sure this is a
> serious issue.

Enumerating specific opclasses in the source code is a non-starter --
Tom Lane would roll over in his grave if he weren't still alive.  What
we could perhaps consider doing is adding some mechanism for an
opclass or opfamily to say whether its equality semantics happen to be
exactly datumIsEqual() semantics.  That's a little grotty because it
leaves data types for which that isn't true out in the cold, but on
the other hand it seems like it would be useful as a way of optimizing
a bunch of things other than this.  Maybe it could also include a way
to specify that the comparator happens to have the semantics as C's
built-in < and > operators, which seems like a case that might also
lend itself to some optimizations.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Reply | Threaded
Open this post in threaded view
|

Re: WIP: Aggregation push-down

Antonin Houska-2
Robert Haas <[hidden email]> wrote:

> On Fri, Jan 26, 2018 at 8:04 AM, Antonin Houska <[hidden email]> wrote:
> > So one problem is that the grouping expression can be inappropriate for
> > partial aggregation even if there's no type change during the
> > translation. What I consider typical for this case is that the equality
> > operator used to identify groups (SortGroupClause.eqop) can return true even
> > if binary (stored) values of the inputs differ. The partial aggregation could
> > only take place if we had a special equality operator which distinguishes the
> > *binary* values (I don't know yet how to store this operator the catalog ---
> > in pg_opclass recors for the hash opclasses?)..
>
> We don't have an operator that tests for binary equality, but it's
> certainly testable from C; see datumIsEqual.  I'm not sure how much
> this really helps, though.  I think it would be safe to build an
> initial set of groups based on datumIsEqual comparisons and then
> arrange to later merge some of the groups.  But that's not something
> we ever do in the executor today, so it might require quite a lot of
> hacking.  Also, it seems like it might really suck in some cases.  For
> instance, consider something like SELECT scale(one.a), sum(two.a) FROM
> one, two WHERE one.a = two.a GROUP BY 1.  Doing a Partial Aggregate on
> two.a using datumIsEqual semantics, joining to a, and then doing a
> Finalize Aggregate looks legal, but the Partial Aggregate may produce
> a tremendous number of groups compared to the Finalize Aggregate.  In
> other words, this technique wouldn't merge any groups that shouldn't
> be merged, but it might fail to merge groups that really need to be
> merged to get good performance.

I don't insist on doing Partial Aggregate in any case. If we wanted to group
by the binary value, we'd probably have to enhance statistics accordingly. The
important thing is to recognize the special case like this. Rejection of the
Partial Aggregate would be the default response.

> > Another idea is to allow only such changes that the
> > destination type is in the same operator class as the source, and explicitly
> > enumerate the "safe opclasses". But that would mean that user cannot define
> > new opclasses within which the translation is possible --- not sure this is a
> > serious issue.
>
> Enumerating specific opclasses in the source code is a non-starter --
> Tom Lane would roll over in his grave if he weren't still alive.  What
> we could perhaps consider doing is adding some mechanism for an
> opclass or opfamily to say whether its equality semantics happen to be
> exactly datumIsEqual() semantics.  That's a little grotty because it
> leaves data types for which that isn't true out in the cold, but on
> the other hand it seems like it would be useful as a way of optimizing
> a bunch of things other than this.  Maybe it could also include a way
> to specify that the comparator happens to have the semantics as C's
> built-in < and > operators, which seems like a case that might also
> lend itself to some optimizations.

I think of a variant of this: implement an universal function that tests the
binary values for equality (besides the actual arguments, caller would have to
pass info like typlen, typalign, typbyval for each argument, and cache these
for repeated calls) and set pg_proc(oprcode) to 0 wherever this function is
sufficient. Thus the problematic cases like numeric, citext, etc. would be
detected by their non-zero oprcode.

--
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26, A-2700 Wiener Neustadt
Web: https://www.cybertec-postgresql.com

Reply | Threaded
Open this post in threaded view
|

Re: WIP: Aggregation push-down

Chapman Flack
On 01/29/18 03:32, Antonin Houska wrote:
> Robert Haas <[hidden email]> wrote:

>>> only take place if we had a special equality operator which distinguishes the
>>> *binary* values (I don't know yet how to store this operator the catalog ---
...
>> We don't have an operator that tests for binary equality, but it's
>> certainly testable from C; see datumIsEqual.

Disclaimer: I haven't been following the whole thread closely. But could
there be some way to exploit the composite-type *= operator?

https://www.postgresql.org/docs/current/static/functions-comparisons.html#COMPOSITE-TYPE-COMPARISON

-Chap

Reply | Threaded
Open this post in threaded view
|

Re: WIP: Aggregation push-down

Robert Haas
In reply to this post by Antonin Houska-2
On Mon, Jan 29, 2018 at 3:32 AM, Antonin Houska <[hidden email]> wrote:
> I think of a variant of this: implement an universal function that tests the
> binary values for equality (besides the actual arguments, caller would have to
> pass info like typlen, typalign, typbyval for each argument, and cache these
> for repeated calls) and set pg_proc(oprcode) to 0 wherever this function is
> sufficient. Thus the problematic cases like numeric, citext, etc. would be
> detected by their non-zero oprcode.

I don't think that's a good option, because we don't want int4eq to
have to go through a code path that has branches to support varlenas
and cstrings.  Andres is busy trying to speed up expression evaluation
by removing unnecessary code branches; adding new ones would be
undoing that valuable work.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Reply | Threaded
Open this post in threaded view
|

Re: WIP: Aggregation push-down

Antonin Houska-2
Robert Haas <[hidden email]> wrote:

> On Mon, Jan 29, 2018 at 3:32 AM, Antonin Houska <[hidden email]> wrote:
> > I think of a variant of this: implement an universal function that tests the
> > binary values for equality (besides the actual arguments, caller would have to
> > pass info like typlen, typalign, typbyval for each argument, and cache these
> > for repeated calls) and set pg_proc(oprcode) to 0 wherever this function is
> > sufficient. Thus the problematic cases like numeric, citext, etc. would be
> > detected by their non-zero oprcode.
>
> I don't think that's a good option, because we don't want int4eq to
> have to go through a code path that has branches to support varlenas
> and cstrings.  Andres is busy trying to speed up expression evaluation
> by removing unnecessary code branches; adding new ones would be
> undoing that valuable work.

I spent some more time thinking about this. What about adding a new strategy
number for hash index operator classes, e.g. HTBinaryEqualStrategyNumber? For
most types both HTEqualStrategyNumber and HTBinaryEqualStrategyNumber strategy
would point to the same operator, but types like numeric would naturally have
them different.

Thus the pushed-down partial aggregation can only use the
HTBinaryEqualStrategyNumber's operator to compare grouping expressions. In the
initial version (until we have useful statistics for the binary values) we can
avoid the aggregation push-down if the grouping expression output type has the
two strategies implemented using different functions because, as you noted
upthread, grouping based on binary equality can result in excessive number of
groups.

One open question is whether the binary equality operator needs a separate
operator class or not. If an opclass cares only about the binary equality, its
hash function(s) can be a lot simpler.

--
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26, A-2700 Wiener Neustadt
Web: https://www.cybertec-postgresql.com

Reply | Threaded
Open this post in threaded view
|

Re: WIP: Aggregation push-down

Robert Haas
On Fri, Feb 23, 2018 at 11:08 AM, Antonin Houska <[hidden email]> wrote:

> I spent some more time thinking about this. What about adding a new strategy
> number for hash index operator classes, e.g. HTBinaryEqualStrategyNumber? For
> most types both HTEqualStrategyNumber and HTBinaryEqualStrategyNumber strategy
> would point to the same operator, but types like numeric would naturally have
> them different.
>
> Thus the pushed-down partial aggregation can only use the
> HTBinaryEqualStrategyNumber's operator to compare grouping expressions. In the
> initial version (until we have useful statistics for the binary values) we can
> avoid the aggregation push-down if the grouping expression output type has the
> two strategies implemented using different functions because, as you noted
> upthread, grouping based on binary equality can result in excessive number of
> groups.
>
> One open question is whether the binary equality operator needs a separate
> operator class or not. If an opclass cares only about the binary equality, its
> hash function(s) can be a lot simpler.

Hmm.  How about instead adding another regproc field to pg_type which
stores the OID of a function that tests binary equality for that
datatype?  If that happens to be equal to the OID you got from the
opclass, then you're all set.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

12