BUG #16300: Text line order corruption with COPY command

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
5 messages Options
Reply | Threaded
Open this post in threaded view
|

BUG #16300: Text line order corruption with COPY command

apt.postgresql.org Repository Update
The following bug has been logged on the website:

Bug reference:      16300
Logged by:          Hans Buschmann
Email address:      [hidden email]
PostgreSQL version: 12.2
Operating system:   Windows Server 2019 64bit
Description:        

A reproducable line order corruption occurs when copying a quite large test
file into Postgres.

I was trying to import and parse a big .xml file (about 41 MB, 643407 lines)
into a simple import table using the following sequence:


create database x86db template=template0 encoding 'UTF8' lc_collate='C';

\c x86db

create table uops_imp2 (
cline varchar
)
;

copy uops_imp2 from 'N:/downloads/uops_info_instructions_200226.xml';
or
copy uops_imp2 from '/usr/local/hb/uops_info_instructions_200226.xml';

This was tested on different machines under Windows Server 2019 64bit and
Fedora 31 x86-64 under Postgres 12.2 respective 12.1:

x86db=# select version ();
                          version
------------------------------------------------------------
 PostgreSQL 12.2, compiled by Visual C++ build 1914, 64-bit
(1 row)

x86db=# select version ();
                                                version
--------------------------------------------------------------------------------------------------------
 PostgreSQL 12.1 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 9.2.1
20190827 (Red Hat 9.2.1-1), 64-bit
(1 row)

The original order of the input lines from the original file was verified
under 2 different editors under Windows:

notepad++ 7.8.5 x64
notepad (as build in), with status line turned on to show line numbers

Here are shown the line 627365 til 627392: (the correct original)

        <doc TP="1.0"/>
      </architecture>
    </instruction>
    <instruction asm="VPMADDWD" category="AVX512" cpl="3" evex="1"
extension="AVX512EVEX" iclass="VPMADDWD"
iform="VPMADDWD_ZMMi32_MASKmskw_ZMMi16_MEMi16_AVX512" isa-set="AVX512BW_512"
mask="0" string="VPMADDWD (ZMM, ZMM, M512)" zeroing="0">
      <operand idx="1" name="REG0" type="reg" w="1" width="512"
xtype="i32">ZMM0,ZMM1,ZMM2,ZMM3,ZMM4,ZMM5,ZMM6,ZMM7,ZMM8,ZMM9,ZMM10,ZMM11,ZMM12,ZMM13,ZMM14,ZMM15,ZMM16,ZMM17,ZMM18,ZMM19,ZMM20,ZMM21,ZMM22,ZMM23,ZMM24,ZMM25,ZMM26,ZMM27,ZMM28,ZMM29,ZMM30,ZMM31</operand>
      <operand idx="2" name="REG2" r="1" type="reg" width="512"
xtype="i16">ZMM0,ZMM1,ZMM2,ZMM3,ZMM4,ZMM5,ZMM6,ZMM7,ZMM8,ZMM9,ZMM10,ZMM11,ZMM12,ZMM13,ZMM14,ZMM15,ZMM16,ZMM17,ZMM18,ZMM19,ZMM20,ZMM21,ZMM22,ZMM23,ZMM24,ZMM25,ZMM26,ZMM27,ZMM28,ZMM29,ZMM30,ZMM31</operand>
      <operand idx="3" memory-prefix="zmmword ptr" name="MEM0" r="1"
type="mem" width="512" xtype="i16"/>
      <architecture name="SKX">
        <IACA TP="0.50" TP_ports="0.50" fusion_occurred="1"
ports="1*p05+1*p23" uops="2" version="2.3"/>
        <IACA TP="0.50" TP_ports="0.50" fusion_occurred="1"
ports="1*p05+1*p23" uops="2" version="3.0"/>
        <measurement TP="0.54" TP_ports="0.50" ports="1*p05+1*p23" uops="2"
uops_retire_slots="1">
          <latency cycles="5" start_op="2" target_op="1"/>
          <latency cycles_addr="13" cycles_addr_is_upper_bound="1"
cycles_addr_same_reg="14" cycles_addr_same_reg_is_upper_bound="1"
cycles_mem="10" cycles_mem_is_upper_bound="1" start_op="3" target_op="1"/>
        </measurement>
      </architecture>
      <architecture name="CNL">
        <measurement TP="1.00" TP_ports="1.00" ports="1*p0+1*p23" uops="2"
uops_retire_slots="1">
          <latency cycles="5" start_op="2" target_op="1"/>
          <latency cycles_addr="13" cycles_addr_is_upper_bound="1"
cycles_mem="10" cycles_mem_is_upper_bound="1" start_op="3" target_op="1"/>
        </measurement>
      </architecture>
      <architecture name="ICL">
        <measurement TP="1.00" TP_ports="1.00" ports="1*p0+1*p23" uops="2"
uops_retire_slots="1">
          <latency cycles="5" start_op="2" target_op="1"/>
          <latency cycles_addr="13" cycles_addr_is_upper_bound="1"
cycles_mem="10" cycles_mem_is_upper_bound="1" start_op="3" target_op="1"/>
        </measurement>
        <doc TP="1.0"/>
      </architecture>
    </instruction>

when querying the table by

select * from uops_imp2 offset 627365 limit 27;

I get a different part from the original lines with another line mangled in
between (see ###)
x86db=#
x86db=# select * from uops_imp2 offset 627365 limit 27;
                                                                           
                                                       cline                
                                             
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
           <latency cycles="5" start_op="4" target_op="1"/>
         </measurement>
         <doc TP="1.0"/>
       </architecture>
     </instruction>
     <instruction asm="VPMADDWD" category="AVX512" cpl="3" evex="1"
extension="AVX512EVEX" iclass="VPMADDWD"
iform="VPMADDWD_ZMMi32_MASKmskw_ZMMi16_MEMi16_AVX512" isa-set="AVX512BW_512"
mask="0" string="VPMADDWD (ZMM, ZMM, M512)" zeroing="0">
       <operand idx="1" name="REG0" type="reg" w="1" width="512"
xtype="i32">ZMM0,ZMM1,ZMM2,ZMM3,ZMM4,ZMM5,ZMM6,ZMM7,ZMM8,ZMM9,ZMM10,ZMM11,ZMM12,ZMM13,ZMM14,ZMM15,ZMM16,ZMM17,ZMM18,ZMM19,ZMM20,ZMM21,ZMM22,ZMM23,ZMM24,ZMM25,ZMM26,ZMM27,ZMM28,ZMM29,ZMM30,ZMM31</operand>
###           <latency cycles="6" start_op="2" target_op="1"/>
       <operand idx="2" name="REG2" r="1" type="reg" width="512"
xtype="i16">ZMM0,ZMM1,ZMM2,ZMM3,ZMM4,ZMM5,ZMM6,ZMM7,ZMM8,ZMM9,ZMM10,ZMM11,ZMM12,ZMM13,ZMM14,ZMM15,ZMM16,ZMM17,ZMM18,ZMM19,ZMM20,ZMM21,ZMM22,ZMM23,ZMM24,ZMM25,ZMM26,ZMM27,ZMM28,ZMM29,ZMM30,ZMM31</operand>
       <operand idx="3" memory-prefix="zmmword ptr" name="MEM0" r="1"
type="mem" width="512" xtype="i16"/>
       <architecture name="SKX">
         <IACA TP="0.50" TP_ports="0.50" fusion_occurred="1"
ports="1*p05+1*p23" uops="2" version="2.3"/>
         <IACA TP="0.50" TP_ports="0.50" fusion_occurred="1"
ports="1*p05+1*p23" uops="2" version="3.0"/>
         <measurement TP="0.54" TP_ports="0.50" ports="1*p05+1*p23" uops="2"
uops_retire_slots="1">
           <latency cycles="5" start_op="2" target_op="1"/>
           <latency cycles_addr="13" cycles_addr_is_upper_bound="1"
cycles_addr_same_reg="14" cycles_addr_same_reg_is_upper_bound="1"
cycles_mem="10" cycles_mem_is_upper_bound="1" start_op="3" target_op="1"/>
         </measurement>
       </architecture>
       <architecture name="CNL">
         <measurement TP="1.00" TP_ports="1.00" ports="1*p0+1*p23" uops="2"
uops_retire_slots="1">
           <latency cycles="5" start_op="2" target_op="1"/>
           <latency cycles_addr="13" cycles_addr_is_upper_bound="1"
cycles_mem="10" cycles_mem_is_upper_bound="1" start_op="3" target_op="1"/>
         </measurement>
       </architecture>
       <architecture name="ICL">
         <measurement TP="1.00" TP_ports="1.00" ports="1*p0+1*p23" uops="2"
uops_retire_slots="1">
           <latency cycles="5" start_op="2" target_op="1"/>
(27 rows)


In all cases i tried the original order of the lines was not preserved and
the disorder was the same.

The count of all lines seems correct:

x86db=# select count(*) from uops_imp2;
 count
--------
 643407
(1 row)

The same error occurred when using \copy on the psql client side.

To reproduce, the XML-file is directly downloadable under the following
address:

https://uops.info/xml.html

and choosing the file instructions.xml

I have not further analyzed other regions of line order corruption because
it is very difficult when you cant rely on postgres COPY.

I fear similar problems could occur when restoring a pg_dump file, which
also relies on copy commands.

Thanks in advance

Hans Buschmann

Reply | Threaded
Open this post in threaded view
|

Re: BUG #16300: Text line order corruption with COPY command

David G Johnston
On Thu, Mar 12, 2020 at 1:05 PM PG Bug reporting form <[hidden email]> wrote:
The following bug has been logged on the website:

Bug reference:      16300
Logged by:          Hans Buschmann
Email address:      [hidden email]
PostgreSQL version: 12.2
Operating system:   Windows Server 2019 64bit
Description:       

A reproducable line order corruption occurs when copying a quite large test
file into Postgres.


IIUC you copied data into PostgreSQL and then immediately read it out using SELECT without an ORDER BY and expected to get the same physical row order as the inserted data.  Nothing promises that things will work this way though often times they do.

If order matters to you you need to add an ORDER BY.  Since there is nothing natural to order by here you have to decide on how you want to change your setup.  I believe that adding a bigserial column to the table will result in the sequence generator applying numbers in strictly ascending order matching the input order and then you can sort on that column.  This is probably the simplest solution.  You can add row numbers to the source file just before importing it.  Or you can import the entire file to a text field and then split_to_array and rely upon that ordering.

David J.

Reply | Threaded
Open this post in threaded view
|

AW: BUG #16300: Text line order corruption with COPY command

Hans Buschmann


Von: David G. Johnston <[hidden email]>
Gesendet: Donnerstag, 12. März 2020 21:42
An: Hans Buschmann; PostgreSQL mailing lists
Betreff: Re: BUG #16300: Text line order corruption with COPY command
 


Thank you for the quick reply.

 

When looking into the documentation I find under SQL COPY command:

 

"If a column list is specified, COPY TO copies only the data in the specified columns to the file.
For COPY FROM, each field in the file is inserted, in order, into the specified column.
Table columns not specified in the COPY FROM column list will receive their default values. "

 

So I expected the insertion in order as said above.

 

In my opinion it is essential to preserve the order of textfile input in COPY FROM. This also holds true when copying from another source like a program, where often it is not practicable to add an orderable column on input.

 

Not preserving (and rendering on a select) the order makes the COPY FROM mostly unusable for cases where the order must be preserved. (Think of XML, JSON, Source code, Log files, Disassembly etc.).

 

The size of such kind of files can it make also impractible to take the mentioned circumventing methods you provided.

 

I have not inspected the resulting tuple orders on the physical file.

 

I will try to nail down the first occurence of the disordering during the import by comparing import and export.

 

My concern also goes to pg_dump and friends, which could change such not natural orderable tables through pg_dump/pg_restore.

 

Hans Buschmann

 

Reply | Threaded
Open this post in threaded view
|

Re: BUG #16300: Text line order corruption with COPY command

Francisco Olarte
Hans:

On Fri, Mar 13, 2020 at 10:05 AM Hans Buschmann <[hidden email]> wrote:
...
> When looking into the documentation I find under SQL COPY command:
> "If a column list is specified, COPY TO copies only the data in the specified columns to the file.
> For COPY FROM, each field in the file is inserted, in order, into the specified column.
> Table columns not specified in the COPY FROM column list will receive their default values. "
> So I expected the insertion in order as said above.

And the column order ( which is what you have quoted here ) is
preserved ( easily, since you only have one).

> In my opinion it is essential to preserve the order of textfile input in COPY FROM. This also holds true when copying from another source like a program, where often it is not practicable to add an orderable column on input.

You are talking about ROW order. This have been discussed countless
time for all sort of sql databases. SQL TABLES are SETS of TUPLES
(rows), without order.

> Not preserving (and rendering on a select) the order makes the COPY FROM mostly unusable for cases where the order must be preserved. (Think of XML, JSON, Source code, Log files, Disassembly etc.).

I think you are not using SQL correctly. When you put "files" into the
database ( XML, JSON, Source Code, Disassembly above), the correct
behaviour is normally to put each file in a single field in a single
row. For Log files the thing is different. I normally do not want to
preserve import order on these, I send an explicit order, typically by
the timestamp for similar field, when reading, so that even if I
import the weekend files in sunday-saturday order I get them right,
and I can mix logs from several sources and see them ordered ( that's
why I import them when I (rarely) do ).

If you want to model TEXT files as a table of lines you need to use a
line number of similar thing, and then you have to deal with several
problems ( in your XML example, I can delete some clines in the midle
of your xml, you cannot detect it. If I preserve insertion order and
want to INSERT a line in the middle of the document, how do O do it?
). Having a text file is generally bad idea.

> The size of such kind of files can it make also impractible to take the mentioned circumventing methods you provided.

Putting big text files in the database is not generally a good idea.
The only thing you can do with those is read the lines sequentally,
the plain file excels at this.


> I have not inspected the resulting tuple orders on the physical file.
> I will try to nail down the first occurence of the disordering during the import by comparing import and export.

You can do these, but it will be useless. Tuples can move, actually
they only do as a result of some operations, but nothing guarantees
they are not going to do it on their own for some reason in the
future. AFAIK, except for updates, they do not move, and I think
vacuum full preserves ordering, but nothing guarantees you postgres 42
is not going to gain a "repack" function which moves tuples to
increase the fill factor in the first pages of the files destroying
your order.

You do not have a command to ask an sql database for the tuples in
insertion order, orderless select just mean any order. Even if they
were ordered in the file you could have something like this:
- Connection 1 ask for all the lines of a fairly huge table ( much
bigger than caches), the engine is free to return them in any order so
it peeks the easier, file order, and starts returning them.
- When c1 is 20% into the job, c2 ask for the same rows. Engine
notices c1 is already doing the same and starts sending the rows from
20-100% of the file to both connections.
- c1 gets all of them, engine rereads rows in the 0-20% range and
sends them to c2. Huge win, only 120% of data read, instead of 200% (
remember the "much bigger than cache" note ) ( engine could have
stopped c1 and reread 0-20% to let them sync, but we will have a very
unhappy c1 them ).

I'm not saying postgres does this, but one of the reason sql language
does not impose any unrequested order is to let it do it.

> My concern also goes to pg_dump and friends, which could change such not natural orderable tables through pg_dump/pg_restore.

There is not such thing as a natural orderable table. Number them or
you are in for a lot of pain. Or use some kind of database built for
text line processing, not a relational one.

Francisco Olarte.


Reply | Threaded
Open this post in threaded view
|

Re: BUG #16300: Text line order corruption with COPY command

David G Johnston
In reply to this post by Hans Buschmann
On Fri, Mar 13, 2020 at 1:50 AM Hans Buschmann <[hidden email]> wrote:


Von: David G. Johnston <[hidden email]>
Gesendet: Donnerstag, 12. März 2020 21:42
An: Hans Buschmann; PostgreSQL mailing lists
Betreff: Re: BUG #16300: Text line order corruption with COPY command
 


Thank you for the quick reply.

 

When looking into the documentation I find under SQL COPY command:

 

"If a column list is specified, COPY TO copies only the data in the specified columns to the file.
For COPY FROM, each field in the file is inserted, in order, into the specified column.
Table columns not specified in the COPY FROM column list will receive their default values. "

 

So I expected the insertion in order as said above.


The above is talking about COLUMNS.  You are talking about ROWS.

In my opinion it is essential to preserve the order of textfile input in COPY FROM. This also holds true when copying from another source like a program, where often it is not practicable to add an orderable column on input.

SQL does not do this.  You are free to either adapt your processing to conform to what SQL does provide or choose a different language that better meets your needs.  PostgreSQL is simply adhering to a fundamental property of the SQL language.

Not preserving (and rendering on a select) the order makes the COPY FROM mostly unusable for cases where the order must be preserved. (Think of XML, JSON, Source code, Log files, Disassembly etc.).


If you are storing source code, json, xml, etc... into an SQL database where each line in the original file ends up being a single record in the database I propose that you are doing something fundamentally wrong.  Either your solution is ill-designed or you've chosen the wrong tool for the job.  So yes, it (copy and SQL generally) is unusable for those cases, but it is perfectly usable for many others and does so with efficiency.  Trying to make it work for something like this would compromise that efficiency for use cases that are questionable.

My concern also goes to pg_dump and friends, which could change such not natural orderable tables through pg_dump/pg_restore.

If order matters there must be, somewhere, an identifier indicating the position of each record in that sequence.  SQL requires the user to be explicit in defining that identifier instead of magically creating one.  It provide a sequence feature to bridge the gap.

David J.