(select query)/relation as first class citizen

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
8 messages Options
Reply | Threaded
Open this post in threaded view
|

(select query)/relation as first class citizen

Roman Pekar
Hello,

Just a bit of background - I currently work as a full-time db developer, mostly with Ms Sql server but I like Postgres a lot, especially because I really program in sql all the time and type system / plpgsql language of Postgres seems to me more suitable for actual programming then t-sql.

Here's the problem - current structure of the language doesn't allow to decompose the code well and split calculations and data into different modules.

For example. Suppose I have a table employee and I have a function like this (I'll skip definition of return types for the sake of simplicity):

create function departments_salary ()
returns  table (...)
as
return $$
    select department, sum(salary) as salary from employee group by department;
$$;

so that's fine, but what if I want to run this function on filtered employee? I can adjust the function of course, but it implies I can predict all possible filters I'm going to need in the future.
And logically, function itself doesn't have to be run on employee table, anything with department and salary columns will fit.
So it'd be nice to be able to define the function like this:

create function departments_salary(_employee query)
returns table (...)
as
return $$
    select department, sum(salary) as salary from _employee group by department;
$$;

and then call it like this:

declare _employee query;
...
_poor_employee = (select salary, department from employee where salary < 1000);
select * from  departments_salary( _poor_employee);

And just to be clear, the query is not really invoked until the last line, so re-assigning _employee variable is more like building query expression.

As far as I understand the closest way to do this is to put the data into temporary table and use this temporary table inside of the function. It's not exactly the same of course, cause in case of temporary tables data should be transferred to temporary table, while it will might be filtered later. So it's something like array vs generator in python, or List vs IQueryable in C#.

Adding this functionality will allow much better decomposition of the program's logic.
What do you think about the idea itself? If you think the idea is worthy, is it even possible to implement it?

Regards,
Roman Pekar


Reply | Threaded
Open this post in threaded view
|

Re: (select query)/relation as first class citizen

Pavel Stehule
Hi

ne 7. 7. 2019 v 14:54 odesílatel Roman Pekar <[hidden email]> napsal:
Hello,

Just a bit of background - I currently work as a full-time db developer, mostly with Ms Sql server but I like Postgres a lot, especially because I really program in sql all the time and type system / plpgsql language of Postgres seems to me more suitable for actual programming then t-sql.

Here's the problem - current structure of the language doesn't allow to decompose the code well and split calculations and data into different modules.

For example. Suppose I have a table employee and I have a function like this (I'll skip definition of return types for the sake of simplicity):

create function departments_salary ()
returns  table (...)
as
return $$
    select department, sum(salary) as salary from employee group by department;
$$;

so that's fine, but what if I want to run this function on filtered employee? I can adjust the function of course, but it implies I can predict all possible filters I'm going to need in the future.
And logically, function itself doesn't have to be run on employee table, anything with department and salary columns will fit.
So it'd be nice to be able to define the function like this:

create function departments_salary(_employee query)
returns table (...)
as
return $$
    select department, sum(salary) as salary from _employee group by department;
$$;

and then call it like this:

declare _employee query;
...
_poor_employee = (select salary, department from employee where salary < 1000);
select * from  departments_salary( _poor_employee);

And just to be clear, the query is not really invoked until the last line, so re-assigning _employee variable is more like building query expression.

As far as I understand the closest way to do this is to put the data into temporary table and use this temporary table inside of the function. It's not exactly the same of course, cause in case of temporary tables data should be transferred to temporary table, while it will might be filtered later. So it's something like array vs generator in python, or List vs IQueryable in C#.

Adding this functionality will allow much better decomposition of the program's logic.
What do you think about the idea itself? If you think the idea is worthy, is it even possible to implement it?

If we talk about plpgsql, then I afraid so this idea can disallow plan caching - or significantly increase the cost of plan cache.

There are two possibilities of implementation - a) query like cursor - unfortunately it effectively disables any optimization and it carry ORM performance to procedures. This usage is known performance antipattern, b) query like view - it should not to have a performance problems with late optimization, but I am not sure about possibility to reuse execution plans.

Currently PLpgSQL is compromise between performance and dynamic (PLpgSQL is really static language). Your proposal increase much more dynamic behave, but performance can be much more worse.

More - with this behave, there is not possible to do static check - so you have to find bugs only at runtime. I afraid about performance of this solution.

Regards

Pavel



Regards,
Roman Pekar


Reply | Threaded
Open this post in threaded view
|

Re: (select query)/relation as first class citizen

Roman Pekar
Hi,

Yes, I'm thinking about 'query like a view', 'query like a cursor' is probably possible even now in ms sql server (not sure about postgresql), but it requires this paradygm shift from set-based thinking to row-by-row thinking which I'd not want to do.

I completely agree with your points of plan caching and static checks. With static checks, though it might be possible to do if the query would be defined as typed, so all the types of the columns is known in advance.
In certain cases having possibility of much better decomposition is might be more important than having cached plan. Not sure how often these cases appear in general, but personally for me it'd be awesome to have this possibility.

Regards,
Roman Pekar

On Sun, 7 Jul 2019 at 15:39, Pavel Stehule <[hidden email]> wrote:
Hi

ne 7. 7. 2019 v 14:54 odesílatel Roman Pekar <[hidden email]> napsal:
Hello,

Just a bit of background - I currently work as a full-time db developer, mostly with Ms Sql server but I like Postgres a lot, especially because I really program in sql all the time and type system / plpgsql language of Postgres seems to me more suitable for actual programming then t-sql.

Here's the problem - current structure of the language doesn't allow to decompose the code well and split calculations and data into different modules.

For example. Suppose I have a table employee and I have a function like this (I'll skip definition of return types for the sake of simplicity):

create function departments_salary ()
returns  table (...)
as
return $$
    select department, sum(salary) as salary from employee group by department;
$$;

so that's fine, but what if I want to run this function on filtered employee? I can adjust the function of course, but it implies I can predict all possible filters I'm going to need in the future.
And logically, function itself doesn't have to be run on employee table, anything with department and salary columns will fit.
So it'd be nice to be able to define the function like this:

create function departments_salary(_employee query)
returns table (...)
as
return $$
    select department, sum(salary) as salary from _employee group by department;
$$;

and then call it like this:

declare _employee query;
...
_poor_employee = (select salary, department from employee where salary < 1000);
select * from  departments_salary( _poor_employee);

And just to be clear, the query is not really invoked until the last line, so re-assigning _employee variable is more like building query expression.

As far as I understand the closest way to do this is to put the data into temporary table and use this temporary table inside of the function. It's not exactly the same of course, cause in case of temporary tables data should be transferred to temporary table, while it will might be filtered later. So it's something like array vs generator in python, or List vs IQueryable in C#.

Adding this functionality will allow much better decomposition of the program's logic.
What do you think about the idea itself? If you think the idea is worthy, is it even possible to implement it?

If we talk about plpgsql, then I afraid so this idea can disallow plan caching - or significantly increase the cost of plan cache.

There are two possibilities of implementation - a) query like cursor - unfortunately it effectively disables any optimization and it carry ORM performance to procedures. This usage is known performance antipattern, b) query like view - it should not to have a performance problems with late optimization, but I am not sure about possibility to reuse execution plans.

Currently PLpgSQL is compromise between performance and dynamic (PLpgSQL is really static language). Your proposal increase much more dynamic behave, but performance can be much more worse.

More - with this behave, there is not possible to do static check - so you have to find bugs only at runtime. I afraid about performance of this solution.

Regards

Pavel



Regards,
Roman Pekar


Reply | Threaded
Open this post in threaded view
|

Re: (select query)/relation as first class citizen

Roman Pekar
Hi,

what do you think about this idea in general? If you don't have to think about implementation for now? From my point of view writing Sql queries is very close to how functional language work if you treat "select" queries as functions without side-effects, and having query being first-class-citizen could move this even further.

Regards,
Roman

On Sun, 7 Jul 2019 at 16:22, Roman Pekar <[hidden email]> wrote:
Hi,

Yes, I'm thinking about 'query like a view', 'query like a cursor' is probably possible even now in ms sql server (not sure about postgresql), but it requires this paradygm shift from set-based thinking to row-by-row thinking which I'd not want to do.

I completely agree with your points of plan caching and static checks. With static checks, though it might be possible to do if the query would be defined as typed, so all the types of the columns is known in advance.
In certain cases having possibility of much better decomposition is might be more important than having cached plan. Not sure how often these cases appear in general, but personally for me it'd be awesome to have this possibility.

Regards,
Roman Pekar

On Sun, 7 Jul 2019 at 15:39, Pavel Stehule <[hidden email]> wrote:
Hi

ne 7. 7. 2019 v 14:54 odesílatel Roman Pekar <[hidden email]> napsal:
Hello,

Just a bit of background - I currently work as a full-time db developer, mostly with Ms Sql server but I like Postgres a lot, especially because I really program in sql all the time and type system / plpgsql language of Postgres seems to me more suitable for actual programming then t-sql.

Here's the problem - current structure of the language doesn't allow to decompose the code well and split calculations and data into different modules.

For example. Suppose I have a table employee and I have a function like this (I'll skip definition of return types for the sake of simplicity):

create function departments_salary ()
returns  table (...)
as
return $$
    select department, sum(salary) as salary from employee group by department;
$$;

so that's fine, but what if I want to run this function on filtered employee? I can adjust the function of course, but it implies I can predict all possible filters I'm going to need in the future.
And logically, function itself doesn't have to be run on employee table, anything with department and salary columns will fit.
So it'd be nice to be able to define the function like this:

create function departments_salary(_employee query)
returns table (...)
as
return $$
    select department, sum(salary) as salary from _employee group by department;
$$;

and then call it like this:

declare _employee query;
...
_poor_employee = (select salary, department from employee where salary < 1000);
select * from  departments_salary( _poor_employee);

And just to be clear, the query is not really invoked until the last line, so re-assigning _employee variable is more like building query expression.

As far as I understand the closest way to do this is to put the data into temporary table and use this temporary table inside of the function. It's not exactly the same of course, cause in case of temporary tables data should be transferred to temporary table, while it will might be filtered later. So it's something like array vs generator in python, or List vs IQueryable in C#.

Adding this functionality will allow much better decomposition of the program's logic.
What do you think about the idea itself? If you think the idea is worthy, is it even possible to implement it?

If we talk about plpgsql, then I afraid so this idea can disallow plan caching - or significantly increase the cost of plan cache.

There are two possibilities of implementation - a) query like cursor - unfortunately it effectively disables any optimization and it carry ORM performance to procedures. This usage is known performance antipattern, b) query like view - it should not to have a performance problems with late optimization, but I am not sure about possibility to reuse execution plans.

Currently PLpgSQL is compromise between performance and dynamic (PLpgSQL is really static language). Your proposal increase much more dynamic behave, but performance can be much more worse.

More - with this behave, there is not possible to do static check - so you have to find bugs only at runtime. I afraid about performance of this solution.

Regards

Pavel



Regards,
Roman Pekar


Reply | Threaded
Open this post in threaded view
|

Re: (select query)/relation as first class citizen

Pavel Stehule
Hi

po 8. 7. 2019 v 9:33 odesílatel Roman Pekar <[hidden email]> napsal:
Hi,

what do you think about this idea in general? If you don't have to think about implementation for now? From my point of view writing Sql queries is very close to how functional language work if you treat "select" queries as functions without side-effects, and having query being first-class-citizen could move this even further.

first - please, don't send top posts.

second - my opinion is not clear. I can imagine benefits - on second hand, the usage is relative too close to one antipattern - only one query wrapped by functions. I see your proposal as little bit more dynamic (with little bit different syntax) views.

With my experience I really afraid about it - it can be very effective (from developer perspective) and very slow (from customer perspective). This is example of tool that looks nice on paper, but can be very badly used.

Maybe I am not the best man for this topic - I like some functional programming concepts, but I use it locally - your proposal moves SQL to some unexplored areas - and I think so it can be interesting as real research topic, but not today Postgres's theme.

The basic question is why extend SQL and don't use some native functional language. Postgres should to implement ANSI SQL - and there is not a space for big experiments. I am sceptic about it - relational databases are static, SQL is static language, so it is hard to implement some dynamic system over it - SQL language is language over relation algebra - it is not functional language, I afraid so introduction another concept to this do more bad than good.

Regards

Pavel
 



Regards,
Roman

On Sun, 7 Jul 2019 at 16:22, Roman Pekar <[hidden email]> wrote:
Hi,

Yes, I'm thinking about 'query like a view', 'query like a cursor' is probably possible even now in ms sql server (not sure about postgresql), but it requires this paradygm shift from set-based thinking to row-by-row thinking which I'd not want to do.

I completely agree with your points of plan caching and static checks. With static checks, though it might be possible to do if the query would be defined as typed, so all the types of the columns is known in advance.
In certain cases having possibility of much better decomposition is might be more important than having cached plan. Not sure how often these cases appear in general, but personally for me it'd be awesome to have this possibility.

Regards,
Roman Pekar

On Sun, 7 Jul 2019 at 15:39, Pavel Stehule <[hidden email]> wrote:
Hi

ne 7. 7. 2019 v 14:54 odesílatel Roman Pekar <[hidden email]> napsal:
Hello,

Just a bit of background - I currently work as a full-time db developer, mostly with Ms Sql server but I like Postgres a lot, especially because I really program in sql all the time and type system / plpgsql language of Postgres seems to me more suitable for actual programming then t-sql.

Here's the problem - current structure of the language doesn't allow to decompose the code well and split calculations and data into different modules.

For example. Suppose I have a table employee and I have a function like this (I'll skip definition of return types for the sake of simplicity):

create function departments_salary ()
returns  table (...)
as
return $$
    select department, sum(salary) as salary from employee group by department;
$$;

so that's fine, but what if I want to run this function on filtered employee? I can adjust the function of course, but it implies I can predict all possible filters I'm going to need in the future.
And logically, function itself doesn't have to be run on employee table, anything with department and salary columns will fit.
So it'd be nice to be able to define the function like this:

create function departments_salary(_employee query)
returns table (...)
as
return $$
    select department, sum(salary) as salary from _employee group by department;
$$;

and then call it like this:

declare _employee query;
...
_poor_employee = (select salary, department from employee where salary < 1000);
select * from  departments_salary( _poor_employee);

And just to be clear, the query is not really invoked until the last line, so re-assigning _employee variable is more like building query expression.

As far as I understand the closest way to do this is to put the data into temporary table and use this temporary table inside of the function. It's not exactly the same of course, cause in case of temporary tables data should be transferred to temporary table, while it will might be filtered later. So it's something like array vs generator in python, or List vs IQueryable in C#.

Adding this functionality will allow much better decomposition of the program's logic.
What do you think about the idea itself? If you think the idea is worthy, is it even possible to implement it?

If we talk about plpgsql, then I afraid so this idea can disallow plan caching - or significantly increase the cost of plan cache.

There are two possibilities of implementation - a) query like cursor - unfortunately it effectively disables any optimization and it carry ORM performance to procedures. This usage is known performance antipattern, b) query like view - it should not to have a performance problems with late optimization, but I am not sure about possibility to reuse execution plans.

Currently PLpgSQL is compromise between performance and dynamic (PLpgSQL is really static language). Your proposal increase much more dynamic behave, but performance can be much more worse.

More - with this behave, there is not possible to do static check - so you have to find bugs only at runtime. I afraid about performance of this solution.

Regards

Pavel



Regards,
Roman Pekar


Reply | Threaded
Open this post in threaded view
|

Re: (select query)/relation as first class citizen

denty
Hi Roman, Pavel,

I was interested in this post, as it’s a topic I’ve stumbled upon in the past. 

There are two topics at play here:

1. The ability to flexibly craft queries from procedural language functions

2. Support for pipelined access to SETOF/TABLEs from procedural language functions

Postgres has parameterised functions that can return a query and optimise it in context of the overall statement, but only for the SQL language. Absence of support for other languages is a curious gap.

And it leads to fact that, even in presence of only static parameters, only one “shape” of query can ever be returned. (No IF/ELSE IF logic can alter the query, unless you dive down the rat hole of encoding it into the query itself.) So that is another gap.

Postgres has some relevant first class types: TABLE/SETOF and REFCURSOR. TABLE/SETOF are output only, materialised always and optimiser fences. Current syntax supports pipelined output (via RETURN NEXT), and docs call out the fact that it might in future not be materialised. I suspect an executor change is needed to support it, as well as plpgsql change. 

Their output-only nature is an odd gap. REFCURSOR is not materialised, and is also input-capable. If SETOF/TABLE were made both, then there would be a curious type system duplication. 

However REFCURSOR is pretty awkward to use from SQL. The fact you can’t cast or convert it to a SETOF/TABLE and SELECT FROM a REFCURSOR in native SQL is weird, and a gap, IMHO.

On the input aide, REFCURSOR is neat. Despite the above limitation, it can become bound to a query before being OPENed for execution and fetching. If only the optimiser could “see” that pre-OPENed state, as with parameterised views, then, in principle, there would be nothing stopping some other outer function consuming it, SELECTing FROM it, and perhaps even returning a new query, and then the optimiser would be able to see and optimise the final global statement. Okay: this is a biggie, but it’s still a gap, in my view. 

So my view is that Postgres already has types that are close to what is asked for. It also has tools that look ripe to be plumbed together. Problem is, when they are combined, they don’t fit well, and when they are made to fit, the fence, materialisation always and curious output-only nature leads developers to create un-performant messes. :-)

I think some of this could be fixed quite easily. The executor already (obviously) can pipeline. PLs can’t today save and restore their context to support pipelining, but it is not impossible. REFCURSOR can’t be cast to a TABLE/SETOF, not meaningfully be SELECTed FROM, but that can’t be too hard either.

Exposing the pre-OPENed query for optimisation is another thing. But here again, I see it as a challenge of mental gymnastics rather than actually hard in terms of code factoring — much of what is needed is surely already there in the way of VIEW rewriting. 

Regarding demand for the #2 feature set, this somewhat dated tread is suggestive of a niche use case: https://www.postgresql.org/message-id/flat/005701c6dc2c%2449011fc0%240a00a8c0%40trivadis.com.

d.

On 8 Jul 2019, at 10:19, Pavel Stehule <[hidden email]> wrote:

Hi

po 8. 7. 2019 v 9:33 odesílatel Roman Pekar <[hidden email]> napsal:
Hi,

what do you think about this idea in general? If you don't have to think about implementation for now? From my point of view writing Sql queries is very close to how functional language work if you treat "select" queries as functions without side-effects, and having query being first-class-citizen could move this even further.

first - please, don't send top posts.

second - my opinion is not clear. I can imagine benefits - on second hand, the usage is relative too close to one antipattern - only one query wrapped by functions. I see your proposal as little bit more dynamic (with little bit different syntax) views.

With my experience I really afraid about it - it can be very effective (from developer perspective) and very slow (from customer perspective). This is example of tool that looks nice on paper, but can be very badly used.

Maybe I am not the best man for this topic - I like some functional programming concepts, but I use it locally - your proposal moves SQL to some unexplored areas - and I think so it can be interesting as real research topic, but not today Postgres's theme.

The basic question is why extend SQL and don't use some native functional language. Postgres should to implement ANSI SQL - and there is not a space for big experiments. I am sceptic about it - relational databases are static, SQL is static language, so it is hard to implement some dynamic system over it - SQL language is language over relation algebra - it is not functional language, I afraid so introduction another concept to this do more bad than good.

Regards

Pavel
 



Regards,
Roman

On Sun, 7 Jul 2019 at 16:22, Roman Pekar <[hidden email]> wrote:
Hi,

Yes, I'm thinking about 'query like a view', 'query like a cursor' is probably possible even now in ms sql server (not sure about postgresql), but it requires this paradygm shift from set-based thinking to row-by-row thinking which I'd not want to do.

I completely agree with your points of plan caching and static checks. With static checks, though it might be possible to do if the query would be defined as typed, so all the types of the columns is known in advance.
In certain cases having possibility of much better decomposition is might be more important than having cached plan. Not sure how often these cases appear in general, but personally for me it'd be awesome to have this possibility.

Regards,
Roman Pekar

On Sun, 7 Jul 2019 at 15:39, Pavel Stehule <[hidden email]> wrote:
Hi

ne 7. 7. 2019 v 14:54 odesílatel Roman Pekar <[hidden email]> napsal:
Hello,

Just a bit of background - I currently work as a full-time db developer, mostly with Ms Sql server but I like Postgres a lot, especially because I really program in sql all the time and type system / plpgsql language of Postgres seems to me more suitable for actual programming then t-sql.

Here's the problem - current structure of the language doesn't allow to decompose the code well and split calculations and data into different modules.

For example. Suppose I have a table employee and I have a function like this (I'll skip definition of return types for the sake of simplicity):

create function departments_salary ()
returns  table (...)
as
return $$
    select department, sum(salary) as salary from employee group by department;
$$;

so that's fine, but what if I want to run this function on filtered employee? I can adjust the function of course, but it implies I can predict all possible filters I'm going to need in the future.
And logically, function itself doesn't have to be run on employee table, anything with department and salary columns will fit.
So it'd be nice to be able to define the function like this:

create function departments_salary(_employee query)
returns table (...)
as
return $$
    select department, sum(salary) as salary from _employee group by department;
$$;

and then call it like this:

declare _employee query;
...
_poor_employee = (select salary, department from employee where salary < 1000);
select * from  departments_salary( _poor_employee);

And just to be clear, the query is not really invoked until the last line, so re-assigning _employee variable is more like building query expression.

As far as I understand the closest way to do this is to put the data into temporary table and use this temporary table inside of the function. It's not exactly the same of course, cause in case of temporary tables data should be transferred to temporary table, while it will might be filtered later. So it's something like array vs generator in python, or List vs IQueryable in C#.

Adding this functionality will allow much better decomposition of the program's logic.
What do you think about the idea itself? If you think the idea is worthy, is it even possible to implement it?

If we talk about plpgsql, then I afraid so this idea can disallow plan caching - or significantly increase the cost of plan cache.

There are two possibilities of implementation - a) query like cursor - unfortunately it effectively disables any optimization and it carry ORM performance to procedures. This usage is known performance antipattern, b) query like view - it should not to have a performance problems with late optimization, but I am not sure about possibility to reuse execution plans.

Currently PLpgSQL is compromise between performance and dynamic (PLpgSQL is really static language). Your proposal increase much more dynamic behave, but performance can be much more worse.

More - with this behave, there is not possible to do static check - so you have to find bugs only at runtime. I afraid about performance of this solution.

Regards

Pavel



Regards,
Roman Pekar


Reply | Threaded
Open this post in threaded view
|

Re: (select query)/relation as first class citizen

Roman Pekar
Hi, John,

I think you've outlined the problem and possible solutions quite well. It's great to see that the goal might be not that far from implementing.

Reply | Threaded
Open this post in threaded view
|

Re: (select query)/relation as first class citizen

denty
On 19 Aug 2019, at 15:16, Roman Pekar <[hidden email]> wrote:

Hi, John,

I think you've outlined the problem and possible solutions quite well. It's great to see that the goal might be not that far from implementing.


Thanks for the prompt, Roman. I meant to have a bit of a play, and your message reminded me.

I was intrigued by the gap in how REFCURSOR is exposed. I’ve made two very rough patches to illustrate a possible solution. Down the line, I am wondering if there is appetite to receive these into core code.

First is a variant of UNNEST that accepts a REFCURSOR, allowing the results to be processed in a normal query, such as SELECT. To illustrate how it works, consider the following:

postgres=# create or replace function test (src_tab text) returns refcursor immutable language plpgsql as $$ begin return refcursor_from_query ('select * from ' || src_tab); end; $$;
CREATE FUNCTION
postgres=# select key, count (value), min (value), max (value) from unnest (array['test1', 'test2', 'test3']) tab, lateral unnest (test (tab.tab)) as (key text, value numeric) group by key;
  key   | count | min | max 
--------+-------+-----+-----
 ITEM_A |   100 |   0 |  99
 ITEM_C |    50 | -50 |  -1
 ITEM_B |   200 |   0 | 199
(3 rows)

postgres=# explain select key, count (value), min (value), max (value) from unnest (array['test1', 'test2', 'test3']) tab, lateral unnest (test (tab.tab)) as (key text, value numeric) group by key;
psql: WARNING:  cache reference leak: cache pg_proc (43), tuple 11/9 has count 1
                                 QUERY PLAN                                 
----------------------------------------------------------------------------
 HashAggregate  (cost=9.29..10.29 rows=100 width=104)
   Group Key: unnest.key
   ->  Nested Loop  (cost=0.26..6.29 rows=300 width=64)
         ->  Function Scan on unnest tab  (cost=0.00..0.03 rows=3 width=32)
         ->  Function Scan on unnest  (cost=0.25..1.25 rows=100 width=64)
(5 rows)

The example requires the following setup:

postgres=# create table test1 (key text, value numeric);
CREATE TABLE
postgres=# insert into test1 select 'ITEM_A', generate_series (0, 99);
INSERT 0 100
postgres=# create table test2 (key text, value numeric);
CREATE TABLE
postgres=# insert into test2 select 'ITEM_B', generate_series (0, 199);
INSERT 0 200
postgres=# create table test3 (key text, value numeric);
CREATE TABLE
postgres=# insert into test1 select 'ITEM_C', generate_series (-50, -1);
INSERT 0 50
postgres=# create or replace function refcursor_from_query (qry text) returns refcursor immutable language plpgsql as $$ declare cur refcursor; begin open cur for execute qry; return cur; end; $$;
CREATE FUNCTION

Obviously this kind of construction is open to wide variety of attacks, and a more realistic example would need to defend against inappropriate input.

My code is really really rough, and also yields a WARNING about cache leaks — which obviously needs fixing — but demonstrates the point. This variant is provided in unnest-refcursor.patch.

The example is pretty contrived, but I think there is general utility in having a way of processing output from REFCURSORs. Arguably, as I mentioned in my previous comments, this overlaps the plpgsql’s RETURN QUERY capability. However, unlike RETURN QUERY, the result set is not materialised before returning.

It is also interesting that a RECORD-returning UNNEST requires the row type to be declared explicitly (hence the as (key text, value numeric) clause). This seems a less than ideal syntax, but I’m not sure there is much alternative.

Second is another variant of UNNEST which attempts to retrieve the query that the REFCURSOR is OPEN’ed for, and inlines the SQL text into the query being planned. (Exactly as may be done with the text of certain SQL language FUNCTIONs.)

Again, an example probably best illustrates what is going on:

postgres=# explain select key, count (value), min (value), max (value) from unnest (test ('test1')) as (key text, value numeric) where value > 50 group by key;
                         QUERY PLAN                          
-------------------------------------------------------------
 HashAggregate  (cost=3.37..3.39 rows=2 width=79)
   Group Key: test1.key
   ->  Seq Scan on test1  (cost=0.00..2.88 rows=49 width=11)
         Filter: (value > '50'::numeric)
(4 rows)

postgres=# select key, count (value), min (value), max (value) from unnest (test ('test1')) as (key text, value numeric) where value > 50 group by key; 
  key   | count | min | max 
--------+-------+-----+-----
 ITEM_A |    49 |  51 |  99
(1 row)

There are two interesting things going on here. First is that the query returned by test() is rewritten and consumed as if it were verbatim in the query text. Second is that the outer filter (value > 50) can now be pushed down by the planner, potentially yielding a much more efficient plan.

This variant is provided in unnest-rewrite-refcursor.patch.

My code here is even rougher than the first. I stopped short of creating a new ’state’ for REFCURSOR that is not yet ‘OPEN’ but allows the query text to be yielded. This ultimately turned out quite non-trivial for a POC. (It does, though, seem ultimately feasible.) The consequence is that the REFCURSOR query is planned before being returned, only for that plan to be ignored and the query text consumed into the outer plan, thus being re-planned. In any significant use, this would be a huge inefficiency, and a shortcoming that would need to be addressed.

Both patches should apply against 12beta2. unnest-refcursor should be applied first, as the second depends upon its foundation.

I’m wondering what do you think of the concept, Roman and Pavel?

denty.




unnest-refcursor.patch (5K) Download Attachment
unnest-rewrite-refcursor.patch (5K) Download Attachment