Prompted originally by a post by Roman Pekar , I wanted to share a revised version of a patch that allows REFCURSOR results to be consumed as data in a regular SQL query as well as my thoughts on how to improve the area as a whole.
In order to be clear about the purpose and how I see it fitting into a broader context, I’ve started a new thread and I’d welcome discussion about it.
The ambition of this contribution is to make PostgreSQL able to efficiently support procedural language functions that either produce or consume sets (or both).
PostgreSQL already has some support for this in functions that return SETOFs. However, as my review  identified, there are some gaps in PostgreSQL’s current capability, as well as scope for extension to improve its overall capability.
This first patch addresses only a small part of the overall ambition, but I wanted to share both the patch and the overall ambition as work in progress, and I’d welcome comments on both. (The patch is still based on 12beta2.)
Problems to be solved
1. Allow procedural languages (PLs) to efficiently consume sets of records
PostgreSQL does allow PL functions to consume sets, however it does so be feeding records to the function, one row per function invocation. REFCURSORs, however can be supplied as input parameters and their content consumed by the function as it wishes, but only if the PL supports the REFCURSOR concept.
Typically, PLs do allow SQL queries to be executed within the PL function [5, 6, 7]. However REFCURSOR results cannot be effectively consumed in a regular SQL SELECT, significantly limiting their use.
2. Allow plpgsql functions to efficiently return sets of records
By ‘efficiently’, I mean that a large result set should not be required to be staged before the executor is able to process it. Staging is not an issue for small sets, but for large sets and especially if they are subject to further processing, intermediate staging it is a performance disadvantage.
PostgreSQL already has some support for this functions that return SETOFs. At present, plpgsql cannot take advantage of this support while also achieving the efficiency criteria because, as the documentation  notes, all returned data is staged before it is retuned.
Addressing this limitation could also of benefit to other PLs, however a quick survey finds at least PL Python is already well-adapted to efficiently return SETOFs.
3. Allow optimisation of a returned query
plpgsql offers a syntactic shortcut to return the results of a SQL query directly. Despite appearing to return a query, the RETURN QUERY syntax actually returns the /results/ of the query . This means the optimiser has no opportunity to influence its execution, such as by pushing down expressions into its WHERE clause, or taking advantage of alternative indexes to modify its sort order.
Other PLs are similarly constrained. Most PLs lack plpgsql’s syntactic sugar, but even though some PLs are better able to efficiently return SETOFs row-by-row, the optimiser cannot see “inside” the query that the PL executes even if its intent is to return the results directly.
Only SQL language functions are afforded the luxury of integration into then outer statement’s plan , but even SQL language functions are somewhat constrained in the types of dynamism that are permitted.
4. Allow a set-returning query to be supplied as an input parameter
It is possible to supply a scalar value, or function call that returns a scalar value as an input parameter. And, with problems 1 & 2 addressed, sets can be supplied as input parameters. However, a literal set-returning SQL query cannot be supplied as a parameter (not without PostgreSQL invoking the ‘calling’ function for each row in the set). REFCURSORs cannot be constructed natively from SQL.
A simplistic response would provide a trivial constructor for REFCURSORs, accepting the query as a text parameter. However it is quite unnatural to supply SQL in textual form, more especially to do so safely. So the challenge is to allow a set-returning subquery to be provided as a parameter in literal form.
Why are these problems important?
My personal wish is for PostgreSQL to offer a feature set that is consistent with itself and without arbitrary limitation. A better argument might be that that it is desirable to match the features of other RDBMSs , or for reason of the use cases they address  that are new or push the boundaries of what PostgreSQL can do , or that are important such as fulfilling a DW/ETL need , or to more directly address approaches touted of NoSQL such as Map Reduce .
Design and implementation
1. Set returning function (SRF) for REFCURSOR
Tackling first problems 1 and (part of) 2, it seems easy and obvious to allow that REFCURSORs can be consumed in a SELECT query.
PostgreSQL already allows an array to be consumed one record per entry via the UNNEST(anyarray) built-in function . Overloading UNNEST() to accept a REFCURSOR argument can be done easily, and the executor’s SRF machinery allows the result set to be consumed efficiently.
With such an implementation, and given a REFCURSOR-returning function, kv() the following syntax illustrates trivial usage:
With this UNNEST() construct, it is possible to consume a returned REFCURSOR inline in a single SQL statement.
To complete the example, the function kv() might trivially be defined as:
Other obvious example setup includes:
It is also possible to accept a REFCURSOR as an input parameter:
Having this construct, it is possible for plpgsql FUNCTION’s to both accept and return REFCURSOR variables. In plpgsql, is would be unnecessary to cast the REFCURSOR to and from text, but other PLs, presumably lacking first class knowledge of the REFCURSOR type, probably need to do so. In above example, limit_val() illustrates how a REFCURSOR can be accepted in text form.
In my patch, I’ve used the SPI APIs to access the Portal which lies behind the REFCURSOR. Although SPI seems to offer an easy interface, and it’s also what plpgsql uses internally, I’m not sure it wouldn’t be better to access the Portal directly.
It is interesting to note that Oracle names its similar construct TABLE() , rather than UNNEST(), and in later releases, its use is optional. TABLE is a reserved word and it would be unusual to overload it, although we could educate the parser to treat it specially. Oracle compatibility is an important consideration, but this is a niche area.
If we continue to use REFCURSOR, it is difficult to make some function call-like construct optional because it is already syntactically possible to select FROM a REFCURSOR-returning function. For example, SELECT * FROM kv (‘A’), is a valid and effective expression, despite being of questionable construction and utility.
An alternative might build on top of existing support for returning SETOFs, which already requires no UNNEST()-like construct. This is attractive in principle, but it makes some of the further extensions discussed below more awkward (in my opinion).
2. Query-bound REFCURSORs
Problem 3 could be addressed by educating the planner in how to extract the query inside the Portal behind the REFCURSOR.
At present, REFCURSORs are only bound to a query once they are OPENed, but when they are OPENed, the query also is fully planned, ready for immediate execution. The ambition is to allow the REFCURSOR’s query to be inlined within the outer query’s plan, so it seems wasteful to expend planner cycles, only for the plan to be thrown away.
The proposed implementation would (1) create an intermediate BOUND state for REFCURSORs, and (2) educate plpgsql about how to BIND unbound_cursorvar FOR query.
My first idea was to modify the REFCURSOR type itself, creating a new state, and adding storage for the BOUND query, but this seems unfeasible without extensive hackery. The REFCURSOR type is a trivial veneer atop a C string (which contains the Portal name), so there is no internal structure to extend.
So my plan is to retain the direct coupling of REFCURSOR<->Portal, and to allow plpgsql to set the query text at BIND time via PortalDefineQuery(). Existing plpgsql code should be unaffected as it need know nothing about the new BOUND state.
In order for any of this to work, the planner has to be able to extract the query from the returned Portal. It seems inline_set_returning_function() is the right place to make this extraction. Adding specific recognition for a function call to UNNEST() with single argument of type REFCURSOR is easy, and existing eval_const_expressions() semantics determine whether the single argument expression can be evaluated at plan time. (Of course, if it cannot, then it falls through to be processed at execution time by the REFCURSOR set returning function (SRF) described above.)
It feels slightly uncomfortable to have UNNEST(REFCURSOR) operate as a regular function, and also have specific recognition for UNNEST() elsewhere in the planner machinery. Arguably, this is already a kind of specific knowledge that inline_set_returning_function() has for SQL language FUNCTIONs, but the recognition I propose for UNNEST(REFCURSOR) is much narrower. An alternative might be to introduce a new type that inline_set_returning_function() can recognise (for example, INLINEABLE_QUERY), or to entirely separate the SRF case from the inlining case at a syntax level (for example, retaining UNNEST(REFCURSOR) as the SRF, but using INLINE(REFCURSOR) for the case at hand).
I’d welcome input here. Although the implementation seems quite feasible, the SQL and plpgsql syntax is less obvious.
3. Literal subquery type
Problem 4 could be addressed by educating the parser specially about the REFCURSOR type when faced with a literal query.
Consider this example:
The REFCURSOR(literal_query) construct could be made to result in a BOUND REFCURSOR, in this case, with SELECT key || '_COPY', value FROM kv_table_a, and then passed as a constant parameter to limit_val().
Usefully, at present, the construct yields a syntax error: although REFCURSOR(string) is an already valid construct (being equivalent to CAST (string AS REFCURSOR)), it’s not acceptable to provide a literal subquery without parenthesis. So, while REFCURSOR ((SELECT 'some_cursor')) is roughly equivalent to CAST ('some_cursor' AS REFCURSOR), the near-alternative of REFCURSOR (SELECT 'some_cursor') is quite simply not valid.
If I’m right, the task is simply a matter of ‘plumbing through’ special knowledge of a REFCURSOR(literal_subquery) construct through the parser. It seems tricky as there are many affected code sites, but the work seems uncomplicated.
Educating the parser about special types seems, again, slightly uncomfortable. An alternative might be to create an intermediate construct: for example, QUERY(literal_subquery) might be made to return the parse tree as a pg_node_tree (similar to how VIEWs are exposed in the system catalogue), and REFCURSOR(pg_node_tree) could consume it, yielding a joined-up construct of REFCURSOR(QUERY(literal_subquery)). However, we might also simply accept REFCURSOR(literal_subquery) to be special, and if/when other need is found for a literal subquery as a parameter, then this becomes the way to supply it.
For point of reference, Oracle seems to have gone the way of making CURSOR(literal_subquery) do something similar, yielding a REF CURSOR, which allows the CURSOR to be passed by reference .
1. This contribution does not actually address the limitation in plpgsql, that the “current implementation of RETURN NEXT and RETURN QUERY stores the entire result set before returning from the function” . My original investigation  presumed this limitation to apply generally to all PLs, but I now realise this is not the case: at least the Python PL allows efficient return of SETOFs . With this in mind, I see the plpgsql limitation as less encumbering (as plpython is presumably broadly available) but I’d be interested to know if this view is shared.
2. A perhaps more significant problem is the apparent duplication of plpgsql’s RETURN QUERY syntax. One could perhaps conceive that plpgsql supported an additional marker, for example, RETURN [INLINEABLE] QUERY. It is difficult to see this fitting well with other PLs.
3. The current proposal also requires to declare the expected record with an AS (...) construction. This is rather inconvenient, but it is difficult to see how it could be avoided.
4. Other PLs can use REFCURSORS by virtue of REFCURSOR being a thin veneer atop string. It is coherent if one understands how the PostgreSQL type system works, but quite strange otherwise. Better integration into other PLs and their type systems might be important.
5. As mentioned, SETOF is a near-equivalent for some use cases. There’s no way to cast the results of a function RETURNING SETOF to a REFCURSOR without something like REFCURSOR (SELECT * from <some function>(...)). It might be useful to offer a little syntactic sugar. Perhaps we could invent NEST(SETOF <some RECORD type>) could return REFCURSOR.
unnest-refcursor-v2.patch (7K) Download Attachment
Thanks for pushing this, for me it looks like promising start! I need a bit more time to go through the code (and I'm not an expert in Postgres internals in any way) but I really appreciate you doing this.
I’ve made a revision of this patch.
The significant change is to access the Portal using Portal APIs rather than through SPI. It seems the persisted state necessary to survive being used to retrieve a row at a time inside an SRF just isn’t a good fit for SPI.
It turned out there was upstream machinery in the FunctionScan node that prevented Postgres being able to pipeline SRFs, even if they return ValuePerCall. So, in practice, this patch is of limited benefit without another patch that changes that behaviour (see ). Nevertheless, the code is independent so I’m submitting the two changes separately.
I’ll push this into the Jan commit fest.
unnest-refcursor-v3.patch (23K) Download Attachment
Dent John wrote:
> I’ve made a revision of this patch.
* the commitfest app did not extract up the patch from the mail,
possibly because it's buried in the MIME structure of the mail
(using plain text instead of HTML messages might help with that).
The patch has no status in http://commitfest.cputube.org/
probably because of this too.
* unnest-refcursor-v3.patch needs a slight rebase because this chunk
in the Makefile fails
- regexp.o regproc.o ri_triggers.o rowtypes.o ruleutils.o \
+ refcursor.o regexp.o regproc.o ri_triggers.o rowtypes.o ruleutils.o \
* I'm under the impression that refcursor_unnest() is functionally
equivalent to a plpgsql implementation like this:
create function unnest(x refcursor) returns setof record as $$
fetch x into r;
exit when not found;
return next r;
end $$ language plpgsql;
but it would differ in performance, because internally a materialization step
could be avoided, but only when the other patch "Allow FunctionScans to
pipeline results" gets in?
Or are performance benefits expected right away with this patch?
@@ -5941,7 +5941,7 @@ array_fill_internal(ArrayType *dims, ArrayType *lbs,
- * UNNEST
+ * UNNEST (array)
This chunk looks unnecessary?
* some user-facing doc would be needed.
* Is it good to overload "unnest" rather than choosing a specific
PostgreSQL-powered mailer: http://www.manitou-mail.org
> On 9 Jan 2020, at 17:43, Daniel Verite <[hidden email]> wrote:
> (using plain text instead of HTML messages might help with that).
Thanks. I’ll do that next time.
> * unnest-refcursor-v3.patch needs a slight rebase because this chunk
> in the Makefile fails
> - regexp.o regproc.o ri_triggers.o rowtypes.o ruleutils.o \
> + refcursor.o regexp.o regproc.o ri_triggers.o rowtypes.o ruleutils.o \
Likewise I’ll make that rebase in the next version.
> * I'm under the impression that refcursor_unnest() is functionally
> equivalent to a plpgsql implementation like this:
> but it would differ in performance, because internally a materialization step
> could be avoided, but only when the other patch "Allow FunctionScans to
> pipeline results" gets in?
Yes. That’s at least true if unnest(x) is used in the FROM. If it’s used in the SELECT, actually it can get the performance benefit right away. However, in the SELECT case, there’s a bit of a gotcha because anonymous records can’t easily be manipulated because they have no type information available. So to make a useful performance contribution, it does need to be combined with another change — either to make it FROM pipeline as in my other patch, or perhaps enabling anonymous record types to be cast or otherwise manipulated.
> - * UNNEST
> + * UNNEST (array)
> This chunk looks unnecessary?
It was for purpose of disambiguating. But indeed it is unnecessary. Choosing a different name would avoid need for it.
> * some user-facing doc would be needed.
Indeed. I fully intend that. I figured I’d get the concept on a firmer footing first.
> * Is it good to overload "unnest" rather than choosing a specific
> function name?
Yeah. I wondered about that. A couple of syntactically obvious ideas were:
SELECT … FROM TABLE (x) (which is what I think Oracle does, but is reserved)
SELECT … FROM CURSOR (x) (which seems likely to confuse, but, surprisingly, isn’t actually reserved)
SELECT … FROM FETCH (x) (which I quite like, but is reserved)
SELECT … FROM ROWS_FROM (x) (is okay, but conflicts with our ROWS FROM construct)
So I kind of landed on UNNEST(x) out of lack of better idea. EXPAND(x) could be an option. Actually ROWS_IN(x) or ROWS_OF(x) might work.
Do you have any preference or suggestion?
Thanks a lot for the feedback.
Dent John wrote:
> Yes. That’s at least true if unnest(x) is used in the FROM. If it’s used in
> the SELECT, actually it can get the performance benefit right away
At a quick glance, I don't see it called in the select-list in any
of the regression tests. When trying it, it appears to crash (segfault):
postgres=# declare c cursor for select oid::int as i, relname::text as r from
postgres=# select unnest('c'::refcursor);
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
The build is configured with:
./configure --enable-debug --with-icu --with-perl --enable-depend
PostgreSQL-powered mailer: http://www.manitou-mail.org
> On 10 Jan 2020, at 15:45, Daniel Verite <[hidden email]> wrote:
> At a quick glance, I don't see it called in the select-list in any
> of the regression tests. […]
Yep. I didn’t test it because I figured it wasn’t particularly useful in that context. I’ll add some tests for that too once I get to the root of the problem.
> postgres=# select unnest('c'::refcursor);
> server closed the connection unexpectedly
> This probably means the server terminated abnormally
> before or while processing the request.
> The connection to the server was lost. Attempting reset: Failed.
Okay. That’s pretty bad, isn’t it.
It’s crashing when it’s checking that the returned tuple matches the declared return type in rsinfo->setDesc. Seems rsinfo->setDesc gets overwritten. So I think I have a memory management problem.
To be honest, I wasn’t fully sure I’d got a clear understanding of what is in what memory context, but things seemed to work so I figured it was close. Seems I was wrong. I need a bit of time to review. Leave it with me, but I guess it’ll take to next weekend before I get more time.
Dent John wrote:
> It’s crashing when it’s checking that the returned tuple matches the
> declared return type in rsinfo->setDesc. Seems rsinfo->setDesc gets
> overwritten. So I think I have a memory management problem.
What is the expected result anyway? A single column with a "record"
type? FWIW I notice that with plpgsql, this is not allowed to happen:
CREATE FUNCTION cursor_unnest(x refcursor) returns setof record
fetch x into r;
exit when not found;
return next r;
end $$ language plpgsql;
declare c cursor for select oid::int as i, relname::text as r from pg_class;
ERROR: set-valued function called in context that cannot accept a set
CONTEXT: PL/pgSQL function cursor_unnest(refcursor) line 8 at RETURN NEXT
PostgreSQL-powered mailer: http://www.manitou-mail.org
> On 14 Jan 2020, at 14:53, Daniel Verite <[hidden email]> wrote:
> What is the expected result anyway? A single column with a "record"
> type? FWIW I notice that with plpgsql, this is not allowed to happen:
Hmm. How interesting.
I had not really investigated what happens in the case of a function returning SETOF (untyped) RECORD in a SELECT clause because, whatever the result, there’s no mechanism to access the individual fields.
As you highlight, it doesn’t work at all in plpgsql, and plperl is the same.
However, SQL language functions get away with it. For example, inspired by _pg_expandarray():
CREATE OR REPLACE FUNCTION public.my_pg_expandarray(anyarray)
RETURNS SETOF record
IMMUTABLE PARALLEL SAFE STRICT
select $1[s], s - pg_catalog.array_lower($1,1) + 1
pg_catalog.array_upper($1,1), 1) as g(s)
postgres=# select my_pg_expandarray (array[0, 1, 2, 3, 4]);
Back in the FROM clause, it’s possible to manipulate the individual fields:
postgres=# select b, a from my_pg_expandarray (array[0, 1, 2, 3, 4]) as r(a int, b int);
b | a
1 | 0
2 | 1
3 | 2
4 | 3
5 | 4
It’s quite interesting. All the other PLs make explicit checks for rsinfo.expectedDesc being non-NULL, but fmgr_sql() explicitly calls out the contrary: “[…] note we do not require caller to provide an expectedDesc.” So I guess either there’s something special about the SQL PL, or perhaps the other PLs are just inheriting a pattern of being cautious.
Either way, though, there’s no way that I can see to "get at” the fields inside the anonymous record that is returned when the function is in the SELECT list.
But back to the failure, I still need to make it not crash. I guess it doesn’t matter whether I simply refuse to work if called from the SELECT list, or just return an anonymous record, like fmgr_sql() does.
In reply to this post by denty
> On 11 Jan 2020, at 12:04, Dent John <[hidden email]> wrote:I’ve addressed the issue, which was due to me allocating the TupleDesc in the multi_call_memory_ctx, which seemed quite reasonable, but it actually needs to be in ecxt_per_query_memory. It seems tSRF-mode queries are much more sensitive to the misstep.
>> On 10 Jan 2020, at 15:45, Daniel Verite <[hidden email]> wrote:
>> postgres=# select unnest('c'::refcursor);
>> server closed the connection unexpectedly
>> This probably means the server terminated abnormally
>> before or while processing the request.
>> The connection to the server was lost. Attempting reset: Failed.
> Okay. That’s pretty bad, isn’t it.
A v4 patch is attached, which also renames UNNEST(REFCURSOR) to ROWS_IN(REFCURSOR), adds a test case for use in tSRF mode, and makes some minor fixes to the support function.
I have not yet made steps towards documentation, nor yet rebased, so the Makefile chunk will probably still fail.
unnest-refcursor-v4.patch (26K) Download Attachment
|Free forum by Nabble||Edit this page|