COPY FROM WHEN condition

classic Classic list List threaded Threaded
40 messages Options
12
Reply | Threaded
Open this post in threaded view
|

COPY FROM WHEN condition

Surafel Temesgen

Hello,

Currently we can not moves data from a file to a table based on some condition on a certain column but I think there are many use case for it that worth supporting. Attache is a patch for escaping a row that does not satisfy WHEN condition from inserting into a table and its work on the top of commit b68ff3ea672c06

and the syntax is

COPY table_name [ ( column_name [, ...] ) ]

FROM { 'filename' | PROGRAM 'command' | STDIN }

[ [ WITH ] ( option [, ...] ) ]

[ WHEN condition ]

comment ?

Regards

Surafel


copy_from_when_con_v1.patch (12K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: COPY FROM WHEN condition

Christoph Moench-Tegeder
## Surafel Temesgen ([hidden email]):

> Currently we can not moves data from a file to a table based on some
> condition on a certain column

You can:
  COPY ( query ) TO 'filename';

There's even an example in the documentation:
https://www.postgresql.org/docs/10/static/sql-copy.html
"To copy into a file just the countries whose names start with 'A':
 COPY (SELECT * FROM country WHERE country_name LIKE 'A%') TO '/usr1/proj/bray/sql/a_list_countries.copy';"

Regards,
Christoph

--
Spare Space.

Reply | Threaded
Open this post in threaded view
|

Re: COPY FROM WHEN condition

Surafel Temesgen


On Thu, Oct 11, 2018 at 12:00 PM Christoph Moench-Tegeder <[hidden email]> wrote:
You can:
  COPY ( query ) TO 'filename';
it is for COPY FROM

regards
Surafel
Reply | Threaded
Open this post in threaded view
|

Re: COPY FROM WHEN condition

Corey Huinker
On Thu, Oct 11, 2018 at 5:04 AM Surafel Temesgen <[hidden email]> wrote:


On Thu, Oct 11, 2018 at 12:00 PM Christoph Moench-Tegeder <[hidden email]> wrote:
You can:
  COPY ( query ) TO 'filename';
it is for COPY FROM

regards
Surafel

It didn't get far, but you may want to take a look at a rejected patch for copy_srf() (set returning function)

Having a set returning function gives you the full expressiveness of SQL, at the cost of an extra materialization step.


 
Reply | Threaded
Open this post in threaded view
|

Re: COPY FROM WHEN condition

David Fetter
On Thu, Oct 11, 2018 at 05:12:48AM -0400, Corey Huinker wrote:

> On Thu, Oct 11, 2018 at 5:04 AM Surafel Temesgen <[hidden email]>
> wrote:
>
> >
> >
> > On Thu, Oct 11, 2018 at 12:00 PM Christoph Moench-Tegeder <
> > [hidden email]> wrote:
> >
> >> You can:
> >>   COPY ( query ) TO 'filename';
> >>
> > it is for COPY FROM
> >
> > regards
> > Surafel
> >
>
> It didn't get far, but you may want to take a look at a rejected patch for
> copy_srf() (set returning function)
> https://www.postgresql.org/message-id/CADkLM%3DdoeiWQX4AGtDNG4PsWfSXz3ai7kY%3DPZm3sUhsUeev9Bg%40mail.gmail.com
> https://commitfest.postgresql.org/12/869/
>
> Having a set returning function gives you the full expressiveness of SQL,
> at the cost of an extra materialization step.

I wonder whether something JIT-like could elide this. A very
interesting subset of such WHEN clauses could be pretty
straight-forward to implement in a pretty efficient way.

Best,
David.
--
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate

Reply | Threaded
Open this post in threaded view
|

RE: COPY FROM WHEN condition

myungkyu.lim
Hello,

I want test this patch, but can't apply it.

(master)$ git apply copy_from_when_con_v1.patch
error: patch failed: src/backend/commands/copy.c:849
error: src/backend/commands/copy.c: patch does not apply

fix or another way, let me know.

Best regards,
Myungkyu, Lim


Reply | Threaded
Open this post in threaded view
|

Re: COPY FROM WHEN condition

Surafel Temesgen


On Fri, Oct 26, 2018 at 9:09 AM 임명규 <[hidden email]> wrote:
Hello,

I want test this patch, but can't apply it.

(master)$ git apply copy_from_when_con_v1.patch
i use patch and it work
try (master)$ patch -p1 < copy_from_when_con_v1.patch
regards 
Surafel 
Reply | Threaded
Open this post in threaded view
|

Re: COPY FROM WHEN condition

Tomas Vondra-4
In reply to this post by Surafel Temesgen
Hi,

I've taken a quick look at this on the way back from pgconf.eu, and it
seems like a nice COPY improvement in a fairly good shape.

Firstly, I think it's a valuable because it allows efficiently importing
a subset of data. Currently, we either have to create an intermediate
table, copy all the data into that, do the filtering, and then insert
the subset into the actual target table. Another common approach that I
see in practice is using file_fdw to do this, but it's not particularly
straight-forward (having to create the FDW servers etc. first) not
efficient (particularly for large amounts of data). This feature allows
skipping this extra step (at least in simpler ETL cases).

So, I like the idea and I think it makes sense.


A couple of comments regarding the code/docs:


1) I think this deserves at least some regression tests. Plenty of tests
already use COPY, but there's no coverage for the new piece. So let's
add a new test suite, or maybe add a couple of tests into copy2.sql.


2) In copy.sqml, the new item is defined like this

    <term><literal>WHEN Clause</literal></term>

I suggest we use just <term><literal>WHEN</literal></term>, that's what
the other items do (see ENCODING).

The other thing is that this does not say what expressions are allowed
in the WHEN clause. It seems pretty close to WHEN clause for triggers,
which e.g. mentions that subselects are not allowed. I'm pretty sure
that's true here too, because it fails like this (118 = T_SubLink):

    test=# copy t(a,b,c) from '/tmp/t.data' when ((select 1) < 10);
    ERROR:  unrecognized node type: 118

So, the patch needs to detect this, produce a reasonable error message
and document the limitations in copy.sqml, just like we do for CREATE
TRIGGER.


3) For COPY TO, the WHEN clause is accepted but ignored, leading to
confusing cases like this:

    test=# copy t(a,b,c) to '/tmp/t.data' when ((select 100) < 10);
    COPY 151690

So, it contains subselect, but unlike COPY FROM it does not fail
(because we never execute it). The fun part is that the expression is
logically false, so a user might expect it to filter rows, yet we copy
everything.

IMHO we need to either error-out in these cases, complaining about WHEN
not being supported for COPY TO, or make it work (effectively treating
it as a simpler alternative to COPY (subselect) TO).


4) There are some minor code style issues in copy.c - the variable is
misnamed as when_cluase, there are no spaces after 'if' etc. See the
attached patch fixing this.

AFAICS we could just get rid of the extra when_cluase variable and mess
with the cstate->whenClause directly, depending on how (3) gets fixed.


5) As I mentioned, the CREATE TRIGGER already has WHEN clause, but it
requires it to be 'WHEN (expr)'. I suggest we do the same thing here,
requiring the parentheses.


6) The skip logic in CopyFrom() seems to be slightly wrong. It does
work, but the next_record label is defined after CHECK_FOR_INTERRUPTS()
so a COPY will not respond to Ctrl-C unless it finds a row matching the
WHEN condition. If you have a highly selective condition, that's a bit
inconvenient.

It also skips

    MemoryContextSwitchTo(GetPerTupleMemoryContext(estate));

so I wonder what the heap_form_tuple() right after the next_record label
will use for tuples right after a skipped one. I'd bet it'll use the
oldcontext (essentially the long-lived context), essentially making it
a memory leak.

So I suggest to get rid of the next_record label, and use 'continue'
instead of the 'goto next_record'.


regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

0002-review.patch (4K) Download Attachment
0001-rebased-patch.patch (9K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

RE: COPY FROM WHEN condition

myungkyu.lim
Hello,

Basically, this patch worked very well in my tests.

> 3) For COPY TO, the WHEN clause is accepted but ignored, leading to confusing cases like this:

I found same issue.

    postgres=# copy t1 to '/home/lmk/t1.data' when c1 < 1;

In the 'COPY TO' statement, 'WHEN clause' does not do any extra work.
(include error or hint)

> 4) There are some minor code style issues in copy.c - the variable is misnamed as when_cluase, there are no spaces after 'if' etc. See the attached patch fixing this.

It is recommended to use 'pg tool' when you finish development.

    src/tools/pgindent/pgindent

    pgindent is used to fix the source code style to conform to pg standards.

Best regards,
Myungkyu, Lim


Reply | Threaded
Open this post in threaded view
|

Re: COPY FROM WHEN condition

Surafel Temesgen
In reply to this post by Tomas Vondra-4

Hi,
Thank you for looking at it .
On Sun, Oct 28, 2018 at 7:19 PM Tomas Vondra <[hidden email]> wrote: 

1) I think this deserves at least some regression tests. Plenty of tests
already use COPY, but there's no coverage for the new piece. So let's
add a new test suite, or maybe add a couple of tests into copy2.sql.


2) In copy.sqml, the new item is defined like this

    <term><literal>WHEN Clause</literal></term>

I suggest we use just <term><literal>WHEN</literal></term>, that's what
the other items do (see ENCODING).

The other thing is that this does not say what expressions are allowed
in the WHEN clause. It seems pretty close to WHEN clause for triggers,
which e.g. mentions that subselects are not allowed. I'm pretty sure
that's true here too, because it fails like this (118 = T_SubLink):

    test=# copy t(a,b,c) from '/tmp/t.data' when ((select 1) < 10);
    ERROR:  unrecognized node type: 118

So, the patch needs to detect this, produce a reasonable error message
and document the limitations in copy.sqml, just like we do for CREATE
TRIGGER.
fixed 

3) For COPY TO, the WHEN clause is accepted but ignored, leading to
confusing cases like this:

    test=# copy t(a,b,c) to '/tmp/t.data' when ((select 100) < 10);
    COPY 151690

So, it contains subselect, but unlike COPY FROM it does not fail
(because we never execute it). The fun part is that the expression is
logically false, so a user might expect it to filter rows, yet we copy
everything.

IMHO we need to either error-out in these cases, complaining about WHEN
not being supported for COPY TO, or make it work (effectively treating
it as a simpler alternative to COPY (subselect) TO).

English is not my first language but I chose error-out because WHEN condition for COPY TO seems to me semantically incorrect


AFAICS we could just get rid of the extra when_cluase variable and mess
with the cstate->whenClause directly, depending on how (3) gets fixed.

I did it this way because CopyState structure memory allocate and initialize in BeginCopyFrom but the analysis done before it


5) As I mentioned, the CREATE TRIGGER already has WHEN clause, but it
requires it to be 'WHEN (expr)'. I suggest we do the same thing here,
requiring the parentheses.


6) The skip logic in CopyFrom() seems to be slightly wrong. It does
work, but the next_record label is defined after CHECK_FOR_INTERRUPTS()
so a COPY will not respond to Ctrl-C unless it finds a row matching the
WHEN condition. If you have a highly selective condition, that's a bit
inconvenient.

It also skips

    MemoryContextSwitchTo(GetPerTupleMemoryContext(estate));

so I wonder what the heap_form_tuple() right after the next_record label
will use for tuples right after a skipped one. I'd bet it'll use the
oldcontext (essentially the long-lived context), essentially making it
a memory leak.

So I suggest to get rid of the next_record label, and use 'continue'
instead of the 'goto next_record'.

fixed 
regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

copy_from_when_con_v2.patch (20K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: COPY FROM WHEN condition

Masahiko Sawada
On Tue, Oct 30, 2018 at 11:47 PM Surafel Temesgen <[hidden email]> wrote:

>
>
> Hi,
> Thank you for looking at it .
> On Sun, Oct 28, 2018 at 7:19 PM Tomas Vondra <[hidden email]> wrote:
>>
>>
>> 1) I think this deserves at least some regression tests. Plenty of tests
>> already use COPY, but there's no coverage for the new piece. So let's
>> add a new test suite, or maybe add a couple of tests into copy2.sql.
>>
>>
>> 2) In copy.sqml, the new item is defined like this
>>
>>     <term><literal>WHEN Clause</literal></term>
>>
>> I suggest we use just <term><literal>WHEN</literal></term>, that's what
>> the other items do (see ENCODING).
>>
>> The other thing is that this does not say what expressions are allowed
>> in the WHEN clause. It seems pretty close to WHEN clause for triggers,
>> which e.g. mentions that subselects are not allowed. I'm pretty sure
>> that's true here too, because it fails like this (118 = T_SubLink):
>>
>>     test=# copy t(a,b,c) from '/tmp/t.data' when ((select 1) < 10);
>>     ERROR:  unrecognized node type: 118
>>
>> So, the patch needs to detect this, produce a reasonable error message
>> and document the limitations in copy.sqml, just like we do for CREATE
>> TRIGGER.
>
> fixed
>>
>>
>> 3) For COPY TO, the WHEN clause is accepted but ignored, leading to
>> confusing cases like this:
>>
>>     test=# copy t(a,b,c) to '/tmp/t.data' when ((select 100) < 10);
>>     COPY 151690
>>
>> So, it contains subselect, but unlike COPY FROM it does not fail
>> (because we never execute it). The fun part is that the expression is
>> logically false, so a user might expect it to filter rows, yet we copy
>> everything.
>>
>> IMHO we need to either error-out in these cases, complaining about WHEN
>> not being supported for COPY TO, or make it work (effectively treating
>> it as a simpler alternative to COPY (subselect) TO).
>
> English is not my first language but I chose error-out because WHEN condition for COPY TO seems to me semantically incorrect
>>
>>
>> AFAICS we could just get rid of the extra when_cluase variable and mess
>> with the cstate->whenClause directly, depending on how (3) gets fixed.
>
> I did it this way because CopyState structure memory allocate and initialize in BeginCopyFrom but the analysis done before it
>>
>>
>> 5) As I mentioned, the CREATE TRIGGER already has WHEN clause, but it
>> requires it to be 'WHEN (expr)'. I suggest we do the same thing here,
>> requiring the parentheses.
>>
>>
>> 6) The skip logic in CopyFrom() seems to be slightly wrong. It does
>> work, but the next_record label is defined after CHECK_FOR_INTERRUPTS()
>> so a COPY will not respond to Ctrl-C unless it finds a row matching the
>> WHEN condition. If you have a highly selective condition, that's a bit
>> inconvenient.
>>
>> It also skips
>>
>>     MemoryContextSwitchTo(GetPerTupleMemoryContext(estate));
>>
>> so I wonder what the heap_form_tuple() right after the next_record label
>> will use for tuples right after a skipped one. I'd bet it'll use the
>> oldcontext (essentially the long-lived context), essentially making it
>> a memory leak.
>>
>> So I suggest to get rid of the next_record label, and use 'continue'
>> instead of the 'goto next_record'.
>>
> fixed
>>

I've looked at this patch and tested.

When I use a function returning string in WHEN clause I got the following error:

=# copy test from '/tmp/aaa.csv' (format 'csv') when (lower(t) = 'hello');
ERROR:  could not determine which collation to use for lower() function
HINT:  Use the COLLATE clause to set the collation explicitly.
CONTEXT:  COPY hoge, line 1: "1,hoge,2018-01-01"

And then although I specified COLLATE I got an another error (127 =
T_CollateExpr):

=# copy test from '/tmp/aaa.csv' (format 'csv') when (lower(t) collate
"en_US" = 'hello');
ERROR:  unrecognized node type: 127

This error doesn't happen if I put the similar condition in WHEN
clause for triggers. I think the patch needs to produce a reasonable
error message.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

Reply | Threaded
Open this post in threaded view
|

Re: COPY FROM WHEN condition

Nasby, Jim-2
In reply to this post by David Fetter
On Oct 11, 2018, at 10:35 AM, David Fetter <[hidden email]> wrote:

>
>> It didn't get far, but you may want to take a look at a rejected patch for
>> copy_srf() (set returning function)
>> https://www.postgresql.org/message-id/CADkLM%3DdoeiWQX4AGtDNG4PsWfSXz3ai7kY%3DPZm3sUhsUeev9Bg%40mail.gmail.com
>> https://commitfest.postgresql.org/12/869/
>>
>> Having a set returning function gives you the full expressiveness of SQL,
>> at the cost of an extra materialization step.
>
> I wonder whether something JIT-like could elide this. A very
> interesting subset of such WHEN clauses could be pretty
> straight-forward to implement in a pretty efficient way.

Are you thinking something like having a COPY command that provides results in such a way that they could be referenced in a FROM clause (perhaps a COPY that defines a cursor…)?
Reply | Threaded
Open this post in threaded view
|

Re: COPY FROM WHEN condition

David Fetter
On Wed, Oct 31, 2018 at 11:21:33PM +0000, Nasby, Jim wrote:

> On Oct 11, 2018, at 10:35 AM, David Fetter <[hidden email]> wrote:
> >
> >> It didn't get far, but you may want to take a look at a rejected patch for
> >> copy_srf() (set returning function)
> >> https://www.postgresql.org/message-id/CADkLM%3DdoeiWQX4AGtDNG4PsWfSXz3ai7kY%3DPZm3sUhsUeev9Bg%40mail.gmail.com
> >> https://commitfest.postgresql.org/12/869/
> >>
> >> Having a set returning function gives you the full expressiveness of SQL,
> >> at the cost of an extra materialization step.
> >
> > I wonder whether something JIT-like could elide this. A very
> > interesting subset of such WHEN clauses could be pretty
> > straight-forward to implement in a pretty efficient way.
>
> Are you thinking something like having a COPY command that provides
> results in such a way that they could be referenced in a FROM clause
> (perhaps a COPY that defines a cursor…)?

That would also be nice, but what I was thinking of was that some
highly restricted subset of cases of SQL in general could lend
themselves to levels of optimization that would be impractical in
other contexts.

Best,
David.
--
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate

Reply | Threaded
Open this post in threaded view
|

Re: COPY FROM WHEN condition

Corey Huinker
> Are you thinking something like having a COPY command that provides
> results in such a way that they could be referenced in a FROM clause
> (perhaps a COPY that defines a cursor…)?

That would also be nice, but what I was thinking of was that some
highly restricted subset of cases of SQL in general could lend
themselves to levels of optimization that would be impractical in
other contexts.

If COPY (or a syntactical equivalent) can return a result set, then the whole of SQL is available to filter and aggregate the results and we don't have to invent new syntax, or endure confusion whenCOPY-WHEN syntax behaves subtly different from a similar FROM-WHERE.

Also, what would we be saving computationally? The whole file (or program output) has to be consumed no matter what, the columns have to be parsed no matter what. At least some of the columns have to be converted to their assigned datatypes enough to know whether or not to filter the row, but we might be able push that logic inside a copy. I'm thinking of something like this:

SELECT x.a, sum(x.b)
FROM ( COPY INLINE '/path/to/foo.txt' FORMAT CSV ) as x( a integer, b numeric, c text, d date, e json) )
WHERE x.d >= '2018-11-01'

In this case, there is the opportunity to see the following optimizations:
- columns c and e are never referenced, and need never be turned into a datum (though we might do so just to confirm that they conform to the data type)
- if column d is converted first, we can filter on it and avoid converting columns a,b
- whatever optimizations we can infer from knowing that the two surviving columns will go directly into an aggregate

If we go this route, we can train the planner to notice other optimizations and add those mechanisms at that time, and then existing code gets faster.

If we go the COPY-WHEN route, then we have to make up new syntax for every possible future optimization.
Reply | Threaded
Open this post in threaded view
|

Re: COPY FROM WHEN condition

Pavel Stehule


pá 2. 11. 2018 v 3:57 odesílatel Corey Huinker <[hidden email]> napsal:
> Are you thinking something like having a COPY command that provides
> results in such a way that they could be referenced in a FROM clause
> (perhaps a COPY that defines a cursor…)?

That would also be nice, but what I was thinking of was that some
highly restricted subset of cases of SQL in general could lend
themselves to levels of optimization that would be impractical in
other contexts.

If COPY (or a syntactical equivalent) can return a result set, then the whole of SQL is available to filter and aggregate the results and we don't have to invent new syntax, or endure confusion whenCOPY-WHEN syntax behaves subtly different from a similar FROM-WHERE.

Also, what would we be saving computationally? The whole file (or program output) has to be consumed no matter what, the columns have to be parsed no matter what. At least some of the columns have to be converted to their assigned datatypes enough to know whether or not to filter the row, but we might be able push that logic inside a copy. I'm thinking of something like this:

SELECT x.a, sum(x.b)
FROM ( COPY INLINE '/path/to/foo.txt' FORMAT CSV ) as x( a integer, b numeric, c text, d date, e json) )
WHERE x.d >= '2018-11-01'

Without some special feature this example is not extra useful. It is based on copy on server that can use only super user with full rights.

What should be benefit of this feature?

Regards

Pavel


In this case, there is the opportunity to see the following optimizations:
- columns c and e are never referenced, and need never be turned into a datum (though we might do so just to confirm that they conform to the data type)
- if column d is converted first, we can filter on it and avoid converting columns a,b
- whatever optimizations we can infer from knowing that the two surviving columns will go directly into an aggregate

If we go this route, we can train the planner to notice other optimizations and add those mechanisms at that time, and then existing code gets faster.

If we go the COPY-WHEN route, then we have to make up new syntax for every possible future optimization.
Reply | Threaded
Open this post in threaded view
|

Re: COPY FROM WHEN condition

Surafel Temesgen
In reply to this post by Masahiko Sawada
hi,

On Wed, Oct 31, 2018 at 10:54 AM Masahiko Sawada <[hidden email]> wrote:
On Tue, Oct 30, 2018 at 11:47 PM Surafel Temesgen <[hidden email]> wrote:

I've looked at this patch and tested.

When I use a function returning string in WHEN clause I got the following error:

=# copy test from '/tmp/aaa.csv' (format 'csv') when (lower(t) = 'hello');
ERROR:  could not determine which collation to use for lower() function
HINT:  Use the COLLATE clause to set the collation explicitly.
CONTEXT:  COPY hoge, line 1: "1,hoge,2018-01-01"

And then although I specified COLLATE I got an another error (127 =
T_CollateExpr):

=# copy test from '/tmp/aaa.csv' (format 'csv') when (lower(t) collate
"en_US" = 'hello');
ERROR:  unrecognized node type: 127

This error doesn't happen if I put the similar condition in WHEN
clause for triggers. I think the patch needs to produce a reasonable
error message.

The attached patch include a fix for it .can you confirm it
regards
Surafel

copy_from_when_con_v3.patch (21K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: COPY FROM WHEN condition

Daniel Verite
In reply to this post by Pavel Stehule
        Pavel Stehule wrote:

> > SELECT x.a, sum(x.b)
> > FROM ( COPY INLINE '/path/to/foo.txt' FORMAT CSV ) as x( a integer, b
> > numeric, c text, d date, e json) )
> > WHERE x.d >= '2018-11-01'
> >
> >
> Without some special feature this example is not extra useful. It is based
> on copy on server that can use only super user with full rights.

And if superuser, one might use the file data wrapper [1]  to get
the same results without the need for the explicit COPY in the query.

BTW, this brings into question whether the [WHEN condition] clause
should be included in the options of file_fdw, as it supports pretty
much all COPY options.

Also, psql's \copy should gain the option too?


[1] https://www.postgresql.org/docs/current/static/file-fdw.html


Best regards,
--
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite

Reply | Threaded
Open this post in threaded view
|

Re: COPY FROM WHEN condition

Tomas Vondra-4
In reply to this post by Corey Huinker


On 11/02/2018 03:57 AM, Corey Huinker wrote:

>     > Are you thinking something like having a COPY command that provides
>     > results in such a way that they could be referenced in a FROM clause
>     > (perhaps a COPY that defines a cursor…)?
>
>     That would also be nice, but what I was thinking of was that some
>     highly restricted subset of cases of SQL in general could lend
>     themselves to levels of optimization that would be impractical in
>     other contexts.
>
>
> If COPY (or a syntactical equivalent) can return a result set, then the
> whole of SQL is available to filter and aggregate the results and we
> don't have to invent new syntax, or endure confusion whenCOPY-WHEN
> syntax behaves subtly different from a similar FROM-WHERE.
>
> Also, what would we be saving computationally? The whole file (or
> program output) has to be consumed no matter what, the columns have to
> be parsed no matter what. At least some of the columns have to be
> converted to their assigned datatypes enough to know whether or not to
> filter the row, but we might be able push that logic inside a copy. I'm
> thinking of something like this:
>
>     SELECT x.a, sum(x.b)
>     FROM ( COPY INLINE '/path/to/foo.txt' FORMAT CSV ) as x( a integer,
>     b numeric, c text, d date, e json) )
>     WHERE x.d >= '2018-11-01'
>
>
> In this case, there is the /opportunity/ to see the following optimizations:
> - columns c and e are never referenced, and need never be turned into a
> datum (though we might do so just to confirm that they conform to the
> data type)
> - if column d is converted first, we can filter on it and avoid
> converting columns a,b
> - whatever optimizations we can infer from knowing that the two
> surviving columns will go directly into an aggregate
>
> If we go this route, we can train the planner to notice other
> optimizations and add those mechanisms at that time, and then existing
> code gets faster.
>
> If we go the COPY-WHEN route, then we have to make up new syntax for
> every possible future optimization.

IMHO those two things address vastly different use-cases. The COPY WHEN
case deals with filtering data while importing them into a database,
while what you're describing seems to be more about querying data stored
in a CSV file. But we already do have a solution for that - FDW, and I'd
say it's working pretty well. And AFAIK it does give you tools to
implement most of what you're asking for. I don't see why should we bolt
this on top of COPY, or how is it an alternative to COPY WHEN.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Reply | Threaded
Open this post in threaded view
|

Re: COPY FROM WHEN condition

David Fetter
In reply to this post by Corey Huinker
On Thu, Nov 01, 2018 at 10:57:25PM -0400, Corey Huinker wrote:

> >
> > > Are you thinking something like having a COPY command that provides
> > > results in such a way that they could be referenced in a FROM clause
> > > (perhaps a COPY that defines a cursor…)?
> >
> > That would also be nice, but what I was thinking of was that some
> > highly restricted subset of cases of SQL in general could lend
> > themselves to levels of optimization that would be impractical in
> > other contexts.
>
> If COPY (or a syntactical equivalent) can return a result set, then the
> whole of SQL is available to filter and aggregate the results and we don't
> have to invent new syntax, or endure confusion whenCOPY-WHEN syntax behaves
> subtly different from a similar FROM-WHERE.

That's an excellent point.

> Also, what would we be saving computationally? The whole file (or program
> output) has to be consumed no matter what, the columns have to be parsed no
> matter what. At least some of the columns have to be converted to their
> assigned datatypes enough to know whether or not to filter the row, but we
> might be able push that logic inside a copy. I'm thinking of something like
> this:
>
> SELECT x.a, sum(x.b)
> FROM ( COPY INLINE '/path/to/foo.txt' FORMAT CSV ) as x( a integer, b numeric, c text, d date, e json) )

Apologies for bike-shedding, but wouldn't the following be a better
fit with the current COPY?

    COPY t(a integer, b numeric, c text, d date, e json) FROM '/path/to/foo.txt' WITH (FORMAT CSV, INLINE)

> WHERE x.d >= '2018-11-01'
>
>
> In this case, there is the *opportunity* to see the following optimizations:
> - columns c and e are never referenced, and need never be turned into a
> datum (though we might do so just to confirm that they conform to the data
> type)

That sounds like something that could go inside the WITH extension
I'm proposing above.

[STRICT_TYPE boolean DEFAULT true]?

This might not be something that has to be in version 1.

Best,
David.
--
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate

Reply | Threaded
Open this post in threaded view
|

Re: COPY FROM WHEN condition

David Fetter
In reply to this post by Daniel Verite
On Fri, Nov 02, 2018 at 12:58:12PM +0100, Daniel Verite wrote:

> Pavel Stehule wrote:
>
> > > SELECT x.a, sum(x.b)
> > > FROM ( COPY INLINE '/path/to/foo.txt' FORMAT CSV ) as x( a integer, b
> > > numeric, c text, d date, e json) )
> > > WHERE x.d >= '2018-11-01'
> > >
> > >
> > Without some special feature this example is not extra useful. It is based
> > on copy on server that can use only super user with full rights.
>
> And if superuser, one might use the file data wrapper [1]  to get
> the same results without the need for the explicit COPY in the query.
>
> BTW, this brings into question whether the [WHEN condition] clause
> should be included in the options of file_fdw, as it supports pretty
> much all COPY options.
>
> Also, psql's \copy should gain the option too?

tl;dr: \copy support is a very large can of worms.

psql's \copy is something which should probably be handled separately
from COPY, as it's both a way to access the filesystem without
superuser permission and an interface to the COPY part of the
protocol.  It seems like poor design to add grammar to support a
single client, so we'd need to think about this in terms of what we
want to support on the client side independent of specific clients. It
also seems like a violation of separation of concerns to couple FEBE
to grammar, so there'd need to be some way to do those things
separately, too.

Best,
David.
--
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate

12