Allow COPY's 'text' format to output a header

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
15 messages Options
Reply | Threaded
Open this post in threaded view
|

Allow COPY's 'text' format to output a header

Simon Muller
This patch adds the capability to use the HEADER feature with the "text" format of the COPY command. The patch includes the related update to documentation and an additional regression test for this feature.

Currently you can only add a header line (which lists the column names) when exporting with COPY to the CSV format, but I much prefer using the default "text" format. This feature is also currently listed on the to-do list (https://wiki.postgresql.org/wiki/Todo#COPY) where it seems to have been requested some years ago.

Hopefully I've done everything correctly and the patch is acceptable enough to be considered for application.

Simon Muller


0001-Allow-COPY-s-text-format-to-output-a-header.patch (4K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Allow COPY's 'text' format to output a header

David Steele
Hi Simon,

On 5/13/18 6:18 PM, Simon Muller wrote:

> This patch adds the capability to use the HEADER feature with the "text"
> format of the COPY command. The patch includes the related update to
> documentation and an additional regression test for this feature.
>
> Currently you can only add a header line (which lists the column names)
> when exporting with COPY to the CSV format, but I much prefer using the
> default "text" format. This feature is also currently listed on the
> to-do list (https://wiki.postgresql.org/wiki/Todo#COPY) where it seems
> to have been requested some years ago.
>
> Hopefully I've done everything correctly and the patch is acceptable
> enough to be considered for application.

This patch makes sense to me and looks reasonable.

We're in the middle of a feature freeze that will last most of the
summer, so be sure to enter your patch into the next commitfest so it
can be considered when the freeze is over.

https://commitfest.postgresql.org/18/

Regards,
--
-David
[hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: Allow COPY's 'text' format to output a header

Michael Paquier-2
On Sun, May 13, 2018 at 07:01:00PM -0400, David Steele wrote:
> This patch makes sense to me and looks reasonable.

One "potential" problem is if a relation has a full set of column which
allows the input of text-like data: if the header has been added with
COPY TO, and that the user forgets to add again the header option with
COPY FROM, then an extra row will be generated but there is the same
problem with CSV format :)

One comment I have about the patch is that there is no test for
COPY FROM with an output file which has a header.  In this case if
HEADER is true then the file can be loaded.  If HEADER is wrong, an
error should normally be raised because of the format (well, let's
discard the case of the relation with text-only columns).  So the tests
could be extended a bit even for CSV.

> We're in the middle of a feature freeze that will last most of the
> summer, so be sure to enter your patch into the next commitfest so it
> can be considered when the freeze is over.
>
> https://commitfest.postgresql.org/18/

Yes, you will need to be patient a couple of months here.
--
Michael

signature.asc (849 bytes) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Allow COPY's 'text' format to output a header

Simon Muller
Okay, I've added this to the next commitfest at https://commitfest.postgresql.org/18/1629/.

Thanks both Michael and David for the feedback so far.


On 14 May 2018 at 02:37, Michael Paquier <[hidden email]> wrote:
On Sun, May 13, 2018 at 07:01:00PM -0400, David Steele wrote:
> This patch makes sense to me and looks reasonable.

One "potential" problem is if a relation has a full set of column which
allows the input of text-like data: if the header has been added with
COPY TO, and that the user forgets to add again the header option with
COPY FROM, then an extra row will be generated but there is the same
problem with CSV format :)

One comment I have about the patch is that there is no test for
COPY FROM with an output file which has a header.  In this case if
HEADER is true then the file can be loaded.  If HEADER is wrong, an
error should normally be raised because of the format (well, let's
discard the case of the relation with text-only columns).  So the tests
could be extended a bit even for CSV.

> We're in the middle of a feature freeze that will last most of the
> summer, so be sure to enter your patch into the next commitfest so it
> can be considered when the freeze is over.
>
> https://commitfest.postgresql.org/18/

Yes, you will need to be patient a couple of months here.
--
Michael

Reply | Threaded
Open this post in threaded view
|

Re: Allow COPY's 'text' format to output a header

Andrew Dunstan-8


On 05/14/2018 02:35 AM, Simon Muller wrote:
> Okay, I've added this to the next commitfest at
> https://commitfest.postgresql.org/18/1629/.
>
> Thanks both Michael and David for the feedback so far.
>
>


(Please don't top-post on PostgreSQL lists.)

I'm not necessarily opposed to this, but I'm not certain about the use
case either. The original request seemed to stem from a false impression
that CSV mode can't produce or consume tab-delimited files. But it can,
and in fact it's saner for almost all uses than text format. Postgres'
text format is really intended for Postgres' use. CSV format is more
appropriate for dealing with external programs, whether the delimiter be
a tab or a comma.

cheers

andrew

--
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Reply | Threaded
Open this post in threaded view
|

Re: Allow COPY's 'text' format to output a header

Garick Hamlin-2
In reply to this post by Michael Paquier-2
On Mon, May 14, 2018 at 09:37:07AM +0900, Michael Paquier wrote:
> On Sun, May 13, 2018 at 07:01:00PM -0400, David Steele wrote:
> > This patch makes sense to me and looks reasonable.
>
> One "potential" problem is if a relation has a full set of column which
> allows the input of text-like data: if the header has been added with
> COPY TO, and that the user forgets to add again the header option with
> COPY FROM, then an extra row will be generated but there is the same
> problem with CSV format :)

Yeah, I wonder if that can be addressed.  

I wonder if there was a way to let COPY FROM detect or ignore headers
as appropriate and rather than cause silently result in headers being
added as data.

Maybe a blank line after the header line could prevent this confusion?

Garick

Reply | Threaded
Open this post in threaded view
|

Re: Allow COPY's 'text' format to output a header

David G Johnston
On Mon, May 14, 2018 at 11:44 AM, Garick Hamlin <[hidden email]> wrote:
I wonder if there was a way to let COPY FROM detect or ignore headers
as appropriate and rather than cause silently result in headers being
added as data.

​Not reliably
 
Maybe a blank line after the header line could prevent this confusion

​No​

+1 for allowing HEADER with FORMAT text.  It doesn't interfere with COPY and even if I were to agree that CSV format is the better one this seems like an unnecessary area to impose preferences.  If TSV with Header meets someone's need providing a minimal (and consistent with expectations) syntax to accomplish that goal seems reasonable, as does the patch.

David J.

Reply | Threaded
Open this post in threaded view
|

Re: Allow COPY's 'text' format to output a header

Isaac Morland
While we're discussing COPY options, what do people think of an option for COPY FROM with header to require that the headers match the target column names? This would help to ensure that the file is actually the right one.

On 14 May 2018 at 14:55, David G. Johnston <[hidden email]> wrote:
On Mon, May 14, 2018 at 11:44 AM, Garick Hamlin <[hidden email]> wrote:
I wonder if there was a way to let COPY FROM detect or ignore headers
as appropriate and rather than cause silently result in headers being
added as data.

​Not reliably
 
Maybe a blank line after the header line could prevent this confusion

​No​

+1 for allowing HEADER with FORMAT text.  It doesn't interfere with COPY and even if I were to agree that CSV format is the better one this seems like an unnecessary area to impose preferences.  If TSV with Header meets someone's need providing a minimal (and consistent with expectations) syntax to accomplish that goal seems reasonable, as does the patch.

David J.


Reply | Threaded
Open this post in threaded view
|

Re: Allow COPY's 'text' format to output a header

Michael Paquier-2
On Mon, May 14, 2018 at 04:08:47PM -0400, Isaac Morland wrote:
> While we're discussing COPY options, what do people think of an option for
> COPY FROM with header to require that the headers match the target column
> names? This would help to ensure that the file is actually the right one.

I am personally not much into such sanity check logics in COPY FWIW if
we can live without.
--
Michael

signature.asc (849 bytes) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Allow COPY's 'text' format to output a header

Daniel Verite
In reply to this post by Andrew Dunstan-8
        Andrew Dunstan wrote:

> I'm not necessarily opposed to this, but I'm not certain about the use
> case either.

+1.
The downside is that it would create the need, when using COPY TO,
to know whether an input file was generated with or without header,
and a hazard on mistakes.
If you say it was and it wasn't, you quietly loose the first row of data.
If you say it wasn't and in fact it was, either there's a
datatype mismatch or you quietly get a spurious row of data.

This complication should be balanced by some advantage.
What can we do with the header?
If you already have the table ready to COPY in, you don't
need that information. The only reason why COPY TO
needs to know about the header is to throw it away.
And if you don't have the table created yet, a header
with just the column names is hardly sufficient to create it,
isn't it?


Best regards,
--
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite

Reply | Threaded
Open this post in threaded view
|

Re: Allow COPY's 'text' format to output a header

Isaac Morland
On 15 May 2018 at 10:26, Daniel Verite <[hidden email]> wrote:
        Andrew Dunstan wrote:

> I'm not necessarily opposed to this, but I'm not certain about the use
> case either.

+1.
The downside is that it would create the need, when using COPY TO,
to know whether an input file was generated with or without header,
and a hazard on mistakes.
If you say it was and it wasn't, you quietly loose the first row of data.
If you say it wasn't and in fact it was, either there's a
datatype mismatch or you quietly get a spurious row of data.

Just to be clear, we're talking about my "header match" feature, not the basic idea of allowing a header in text format?

You already need to know whether or not there is a header, no matter what: there is no way to avoid needing to know the format of the data to be imported. And certainly if "header" is an option, one has to know whether or not to set it in any given situation.

The "header match" helps ensure the file is the right one by requiring the header contents to match the field names, rather than just being thrown away.

I don't view it as a way to avoid pre-defining the table. It just increases the chance that the wrong file won't load but will instead trigger an error condition immediately.

Note that this advantage includes what happens if you specify header but the file has no header: as long as you actually specified header match, the error will be caught unless the first row of actual data happens to match the field names, which is almost always highly unlikely and frequently impossible (e.g., a person with firstname "firstname", surname "surname", birthday "birthday" and so on).

One can imagine extensions of the idea: for example, the header could actually be used to identify the columns, so the column order in the file doesn't matter. There could also be an "AS" syntax to allow the target field names to be different from the field names in the header. I have occasionally found myself wanting to ignore certain columns of the file. But these are all significantly more complicated than just looking at the header and requiring it to match the target field names.

If one specifies no header but there actually is a header in the file, then loading will fail in many cases but it depends on what the header in the file looks like. This part is unaffected by my idea.
 
This complication should be balanced by some advantage.
What can we do with the header?
If you already have the table ready to COPY in, you don't
need that information. The only reason why COPY TO
needs to know about the header is to throw it away.
And if you don't have the table created yet, a header
with just the column names is hardly sufficient to create it,
isn't it?
Reply | Threaded
Open this post in threaded view
|

Re: Allow COPY's 'text' format to output a header

Tom Lane-2
Isaac Morland <[hidden email]> writes:
> On 15 May 2018 at 10:26, Daniel Verite <[hidden email]> wrote:
>> Andrew Dunstan wrote:
>>> I'm not necessarily opposed to this, but I'm not certain about the use
>>> case either.

>> The downside is that it would create the need, when using COPY TO,
>> to know whether an input file was generated with or without header,
>> and a hazard on mistakes.
>> If you say it was and it wasn't, you quietly loose the first row of data.
>> If you say it wasn't and in fact it was, either there's a
>> datatype mismatch or you quietly get a spurious row of data.

> Just to be clear, we're talking about my "header match" feature, not the
> basic idea of allowing a header in text format?

AFAICS, Daniel's just reacting to the basic idea of a header line.
I agree that by itself that's not worth much.  However, if we added
your proposed option to insist that the column names match during COPY
IN, I think that that could have some value.  It would allow
forestalling one common type of pilot error, ie copying the wrong file
entirely.  (It'd also prevent copying in data that has the wrong column
order, but I think that's a less common scenario.  I might be wrong
about that.)

> One can imagine extensions of the idea: for example, the header could
> actually be used to identify the columns, so the column order in the file
> doesn't matter. There could also be an "AS" syntax to allow the target
> field names to be different from the field names in the header. I have
> occasionally found myself wanting to ignore certain columns of the file.
> But these are all significantly more complicated than just looking at the
> header and requiring it to match the target field names.

Yeah, and every bit of flexibility you add raises the chance of an
undetected error.  COPY isn't intended as a general ETL facility,
so I'd mostly be -1 on adding such things.  But I can see the value
of confirming that you're copying the right file, and a header match
check would go a long way towards doing that.

                        regards, tom lane

Reply | Threaded
Open this post in threaded view
|

Re: Allow COPY's 'text' format to output a header

David G Johnston
On Tuesday, May 15, 2018, Tom Lane <[hidden email]> wrote:

AFAICS, Daniel's just reacting to the basic idea of a header line.
I agree that by itself that's not worth much.  However, if we added
your proposed option to insist that the column names match during COPY
IN, I think that that could have some value. 

I'm fine for adding it without the added matching behavior, though turning the boolean into an enum is appealing.

HEADER { true | false | match }

Though we'd need to accept all variants of Boolean for compatability...

I'm of the opinion that text and csv should be the same excepting their defaults for some of the options.

David J.
Reply | Threaded
Open this post in threaded view
|

Re: Allow COPY's 'text' format to output a header

Daniel Verite
In reply to this post by Isaac Morland
        Isaac Morland wrote:

> Just to be clear, we're talking about my "header match" feature, not the
> basic idea of allowing a header in text format?

For my reply it was on merely allowing it, as does the current
patch at https://commitfest.postgresql.org/18/1629

Best regards,
--
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite

Reply | Threaded
Open this post in threaded view
|

Re: Allow COPY's 'text' format to output a header

Robert Haas
In reply to this post by Tom Lane-2
On Tue, May 15, 2018 at 12:06 PM, Tom Lane <[hidden email]> wrote:

>> One can imagine extensions of the idea: for example, the header could
>> actually be used to identify the columns, so the column order in the file
>> doesn't matter. There could also be an "AS" syntax to allow the target
>> field names to be different from the field names in the header. I have
>> occasionally found myself wanting to ignore certain columns of the file.
>> But these are all significantly more complicated than just looking at the
>> header and requiring it to match the target field names.
>
> Yeah, and every bit of flexibility you add raises the chance of an
> undetected error.  COPY isn't intended as a general ETL facility,
> so I'd mostly be -1 on adding such things.  But I can see the value
> of confirming that you're copying the right file, and a header match
> check would go a long way towards doing that.

True.

FWIW, I'm +1 on this idea.  I think a header line is a pretty common
need, and if you're exporting a large amount of data, it could be
pretty annoying to have to first run COPY, and then do

(echo blah,blah1,blah2; cat copyoutput.txt)>whatireallywant.txt

There's a lot of value in being able to export from program A
*exactly* what program B wants to import, rather than something that
is close but has to be massaged.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Previous Thread Next Thread