Parallel copy


Re: Parallel copy

Ants Aasma-2
On Tue, 18 Feb 2020 at 15:21, Amit Kapila <[hidden email]> wrote:

>
> On Tue, Feb 18, 2020 at 5:59 PM Ants Aasma <[hidden email]> wrote:
> >
> > On Tue, 18 Feb 2020 at 12:20, Amit Kapila <[hidden email]> wrote:
> > > This is something similar to what I had also in mind for this idea.  I
> > > had thought of handing over complete chunk (64K or whatever we
> > > decide).  The one thing that slightly bothers me is that we will add
> > > some additional overhead of copying to and from shared memory which
> > > was earlier from local process memory.  And, the tokenization (finding
> > > line boundaries) would be serial.  I think that tokenization should be
> > > a small part of the overall work we do during the copy operation, but
> > > will do some measurements to ascertain the same.
> >
> > I don't think any extra copying is needed.
> >
>
> I am talking about access to shared memory instead of the process
> local memory.  I understand that an extra copy won't be required.
>
> > The reader can directly
> > fread()/pq_copymsgbytes() into shared memory, and the workers can run
> > CopyReadLineText() inner loop directly off of the buffer in shared memory.
> >
>
> I am slightly confused here.  AFAIU, the for(;;) loop in
> CopyReadLineText is about finding the line endings which we thought
> that the reader process will do.

Indeed, I somehow misread the code while scanning over it. So CopyReadLineText
currently copies data from cstate->raw_buf to the StringInfo in
cstate->line_buf. In parallel mode it would copy it from the shared data buffer
to local line_buf until it hits the line end found by the data reader. The
amount of copying done is still exactly the same as it is now.
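
To make that concrete, here is a rough sketch of the worker-side step.
The SharedCopyState structure and the reader-filled offsets are
hypothetical names for illustration, not actual PostgreSQL code:

#include "postgres.h"
#include "lib/stringinfo.h"

typedef struct SharedCopyState
{
    char       *buf;            /* shared data buffer filled by the reader */
    uint64      line_start;     /* offset of the worker's next line */
    uint64      line_end;       /* offset of its newline, found by the reader */
} SharedCopyState;

static void
copy_line_to_local(SharedCopyState *ss, StringInfo line_buf)
{
    /* Same volume of copying as today's raw_buf -> line_buf path,
     * just reading from shared instead of process-local memory. */
    resetStringInfo(line_buf);
    appendBinaryStringInfo(line_buf,
                           ss->buf + ss->line_start,
                           (int) (ss->line_end - ss->line_start));
}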

Regards,
Ants Aasma



Re: Parallel copy

David Fetter
In reply to this post by akapila
On Tue, Feb 18, 2020 at 06:51:29PM +0530, Amit Kapila wrote:

> On Tue, Feb 18, 2020 at 5:59 PM Ants Aasma <[hidden email]> wrote:
> >
> > On Tue, 18 Feb 2020 at 12:20, Amit Kapila <[hidden email]> wrote:
> > > This is something similar to what I had also in mind for this idea.  I
> > > had thought of handing over complete chunk (64K or whatever we
> > > decide).  The one thing that slightly bothers me is that we will add
> > > some additional overhead of copying to and from shared memory which
> > > was earlier from local process memory.  And, the tokenization (finding
> > > line boundaries) would be serial.  I think that tokenization should be
> > > a small part of the overall work we do during the copy operation, but
> > > will do some measurements to ascertain the same.
> >
> > I don't think any extra copying is needed.
>
> I am talking about access to shared memory instead of the process
> local memory.  I understand that an extra copy won't be required.

Isn't accessing shared memory from different pieces of execution what
threads were designed to do?

Best,
David.
--
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate



Re: Parallel copy

akapila
On Tue, Feb 18, 2020 at 8:41 PM David Fetter <[hidden email]> wrote:

>
> On Tue, Feb 18, 2020 at 06:51:29PM +0530, Amit Kapila wrote:
> > On Tue, Feb 18, 2020 at 5:59 PM Ants Aasma <[hidden email]> wrote:
> > >
> > > On Tue, 18 Feb 2020 at 12:20, Amit Kapila <[hidden email]> wrote:
> > > > This is something similar to what I had also in mind for this idea.  I
> > > > had thought of handing over complete chunk (64K or whatever we
> > > > decide).  The one thing that slightly bothers me is that we will add
> > > > some additional overhead of copying to and from shared memory which
> > > > was earlier from local process memory.  And, the tokenization (finding
> > > > line boundaries) would be serial.  I think that tokenization should be
> > > > a small part of the overall work we do during the copy operation, but
> > > > will do some measurements to ascertain the same.
> > >
> > > I don't think any extra copying is needed.
> >
> > I am talking about access to shared memory instead of the process
> > local memory.  I understand that an extra copy won't be required.
>
> Isn't accessing shared memory from different pieces of execution what
> threads were designed to do?
>

Sorry, but I don't understand what you mean by the above.  We are
going to use background workers (which are processes) for parallel
workers.  In general, accessing shared memory might not make a big
difference compared to local memory, especially because the cost of
the other work done during the copy is relatively high.  But still, it
is a point to consider.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Parallel copy

akapila
In reply to this post by Ants Aasma-2
On Tue, Feb 18, 2020 at 8:08 PM Ants Aasma <[hidden email]> wrote:

>
> On Tue, 18 Feb 2020 at 15:21, Amit Kapila <[hidden email]> wrote:
> >
> > On Tue, Feb 18, 2020 at 5:59 PM Ants Aasma <[hidden email]> wrote:
> > >
> > > On Tue, 18 Feb 2020 at 12:20, Amit Kapila <[hidden email]> wrote:
> > > > This is something similar to what I had also in mind for this idea.  I
> > > > had thought of handing over complete chunk (64K or whatever we
> > > > decide).  The one thing that slightly bothers me is that we will add
> > > > some additional overhead of copying to and from shared memory which
> > > > was earlier from local process memory.  And, the tokenization (finding
> > > > line boundaries) would be serial.  I think that tokenization should be
> > > > a small part of the overall work we do during the copy operation, but
> > > > will do some measurements to ascertain the same.
> > >
> > > I don't think any extra copying is needed.
> > >
> >
> > I am talking about access to shared memory instead of the process
> > local memory.  I understand that an extra copy won't be required.
> >
> > > The reader can directly
> > > fread()/pq_copymsgbytes() into shared memory, and the workers can run
> > > CopyReadLineText() inner loop directly off of the buffer in shared memory.
> > >
> >
> > I am slightly confused here.  AFAIU, the for(;;) loop in
> > CopyReadLineText is about finding the line endings which we thought
> > that the reader process will do.
>
> Indeed, I somehow misread the code while scanning over it. So CopyReadLineText
> currently copies data from cstate->raw_buf to the StringInfo in
> cstate->line_buf. In parallel mode it would copy it from the shared data buffer
> to local line_buf until it hits the line end found by the data reader. The
> amount of copying done is still exactly the same as it is now.
>

Yeah, on a broader level it will be something like that, but the actual
details might vary during implementation.  BTW, have you given any
thought to the other approach I shared above [1]?  We might not
go with that idea, but it is better to discuss different ideas and
evaluate their pros and cons.

[1] - https://www.postgresql.org/message-id/CAA4eK1LyAyPCtBk4rkwomeT6%3DyTse5qWws-7i9EFwnUFZhvu5w%40mail.gmail.com

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Parallel copy

akapila
In reply to this post by Mike Blackwell-2
On Tue, Feb 18, 2020 at 7:51 PM Mike Blackwell <[hidden email]> wrote:
>
> On Sun, Feb 16, 2020 at 12:51 AM Andrew Dunstan <[hidden email]> wrote:
> >
> > IIRC, in_quote only matters here in CSV mode (because CSV fields can
> > have embedded newlines). So why not just forbid parallel copy in CSV
> > mode, at least for now? I guess it depends on the actual use case. If we
> > expect to be parallel loading humungous CSVs then that won't fly.
>
> Loading large CSV files is pretty common here.  I hope this can be supported.

Thank you for your input.  It is important and valuable.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: Parallel copy

Ants Aasma-2
In reply to this post by akapila
On Wed, 19 Feb 2020 at 06:22, Amit Kapila <[hidden email]> wrote:

>
> On Tue, Feb 18, 2020 at 8:08 PM Ants Aasma <[hidden email]> wrote:
> >
> > On Tue, 18 Feb 2020 at 15:21, Amit Kapila <[hidden email]> wrote:
> > >
> > > On Tue, Feb 18, 2020 at 5:59 PM Ants Aasma <[hidden email]> wrote:
> > > >
> > > > On Tue, 18 Feb 2020 at 12:20, Amit Kapila <[hidden email]> wrote:
> > > > > This is something similar to what I had also in mind for this idea.  I
> > > > > had thought of handing over complete chunk (64K or whatever we
> > > > > decide).  The one thing that slightly bothers me is that we will add
> > > > > some additional overhead of copying to and from shared memory which
> > > > > was earlier from local process memory.  And, the tokenization (finding
> > > > > line boundaries) would be serial.  I think that tokenization should be
> > > > > a small part of the overall work we do during the copy operation, but
> > > > > will do some measurements to ascertain the same.
> > > >
> > > > I don't think any extra copying is needed.
> > > >
> > >
> > > I am talking about access to shared memory instead of the process
> > > local memory.  I understand that an extra copy won't be required.
> > >
> > > > The reader can directly
> > > > fread()/pq_copymsgbytes() into shared memory, and the workers can run
> > > > CopyReadLineText() inner loop directly off of the buffer in shared memory.
> > > >
> > >
> > > I am slightly confused here.  AFAIU, the for(;;) loop in
> > > CopyReadLineText is about finding the line endings which we thought
> > > that the reader process will do.
> >
> > Indeed, I somehow misread the code while scanning over it. So CopyReadLineText
> > currently copies data from cstate->raw_buf to the StringInfo in
> > cstate->line_buf. In parallel mode it would copy it from the shared data buffer
> > to local line_buf until it hits the line end found by the data reader. The
> > amount of copying done is still exactly the same as it is now.
> >
>
> Yeah, on a broader level it will be something like that, but the actual
> details might vary during implementation.  BTW, have you given any
> thought to the other approach I shared above [1]?  We might not
> go with that idea, but it is better to discuss different ideas and
> evaluate their pros and cons.
>
> [1] - https://www.postgresql.org/message-id/CAA4eK1LyAyPCtBk4rkwomeT6%3DyTse5qWws-7i9EFwnUFZhvu5w%40mail.gmail.com

It seems to me that, at least for the general CSV case, tokenization into
tuples is an inherently serial task. Adding thread synchronization to that path
for coordinating between multiple workers is only going to make it slower. It
may be possible to enforce limitations on the input (e.g. no quotes allowed) or
do some speculative tokenization (e.g. if we encounter a quote before a newline,
assume the chunk started in a quoted section) to make it possible to do the
tokenization in parallel. But given that the simpler and more fully featured
approach of handling it in a single reader process looks to be fast enough, I
don't see the point. I rather think that the next big step would be to overlap
reading input and tokenization, hopefully by utilizing Andres's work on asyncio.
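
For what it's worth, the speculative guess could be as simple as the
sketch below; the guess would still have to be verified (and the chunk
re-tokenized on a miss) once the preceding chunk's true end state is
known:

#include <stdbool.h>
#include <stddef.h>

/* Speculate on whether a chunk starts inside a quoted CSV section:
 * if a quote appears before any newline, assume it does. */
static bool
guess_starts_in_quote(const char *chunk, size_t len)
{
    for (size_t i = 0; i < len; i++)
    {
        if (chunk[i] == '\n')
            return false;       /* newline first: assume unquoted start */
        if (chunk[i] == '"')
            return true;        /* quote first: assume quoted section */
    }
    return false;
}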

Regards,
Ants Aasma



Re: Parallel copy

Tomas Vondra-4
On Wed, Feb 19, 2020 at 11:02:15AM +0200, Ants Aasma wrote:

>On Wed, 19 Feb 2020 at 06:22, Amit Kapila <[hidden email]> wrote:
>>
>> On Tue, Feb 18, 2020 at 8:08 PM Ants Aasma <[hidden email]> wrote:
>> >
>> > On Tue, 18 Feb 2020 at 15:21, Amit Kapila <[hidden email]> wrote:
>> > >
>> > > On Tue, Feb 18, 2020 at 5:59 PM Ants Aasma <[hidden email]> wrote:
>> > > >
>> > > > On Tue, 18 Feb 2020 at 12:20, Amit Kapila <[hidden email]> wrote:
>> > > > > This is something similar to what I had also in mind for this idea.  I
>> > > > > had thought of handing over complete chunk (64K or whatever we
>> > > > > decide).  The one thing that slightly bothers me is that we will add
>> > > > > some additional overhead of copying to and from shared memory which
>> > > > > was earlier from local process memory.  And, the tokenization (finding
>> > > > > line boundaries) would be serial.  I think that tokenization should be
>> > > > > a small part of the overall work we do during the copy operation, but
>> > > > > will do some measurements to ascertain the same.
>> > > >
>> > > > I don't think any extra copying is needed.
>> > > >
>> > >
>> > > I am talking about access to shared memory instead of the process
>> > > local memory.  I understand that an extra copy won't be required.
>> > >
>> > > > The reader can directly
>> > > > fread()/pq_copymsgbytes() into shared memory, and the workers can run
>> > > > CopyReadLineText() inner loop directly off of the buffer in shared memory.
>> > > >
>> > >
>> > > I am slightly confused here.  AFAIU, the for(;;) loop in
>> > > CopyReadLineText is about finding the line endings which we thought
>> > > that the reader process will do.
>> >
>> > Indeed, I somehow misread the code while scanning over it. So CopyReadLineText
>> > currently copies data from cstate->raw_buf to the StringInfo in
>> > cstate->line_buf. In parallel mode it would copy it from the shared data buffer
>> > to local line_buf until it hits the line end found by the data reader. The
>> > amount of copying done is still exactly the same as it is now.
>> >
>>
>> Yeah, on a broader level it will be something like that, but the actual
>> details might vary during implementation.  BTW, have you given any
>> thought to the other approach I shared above [1]?  We might not
>> go with that idea, but it is better to discuss different ideas and
>> evaluate their pros and cons.
>>
>> [1] - https://www.postgresql.org/message-id/CAA4eK1LyAyPCtBk4rkwomeT6%3DyTse5qWws-7i9EFwnUFZhvu5w%40mail.gmail.com
>
>It seems to me that, at least for the general CSV case, tokenization into
>tuples is an inherently serial task. Adding thread synchronization to that path
>for coordinating between multiple workers is only going to make it slower. It
>may be possible to enforce limitations on the input (e.g. no quotes allowed) or
>do some speculative tokenization (e.g. if we encounter a quote before a newline,
>assume the chunk started in a quoted section) to make it possible to do the
>tokenization in parallel. But given that the simpler and more fully featured
>approach of handling it in a single reader process looks to be fast enough, I
>don't see the point. I rather think that the next big step would be to overlap
>reading input and tokenization, hopefully by utilizing Andres's work on asyncio.
>

I generally agree with the impression that parsing CSV is tricky and
unlikely to benefit from parallelism in general. There may be cases with
restrictions making it easier (e.g. restrictions on the format) but that
might be a bit too complex to start with.

For example, I had an idea to parallelise the parsing by splitting it
into two phases:

1) indexing

Split the CSV file into equally-sized chunks and make each worker just
scan through its chunk, storing the positions of delimiters, quotes,
newlines etc. This is probably the most expensive part of the parsing
(essentially going char by char), and we'd speed it up linearly.

2) merge

Combine the information from (1) in a single process, and actually parse
the CSV data - we would not have to inspect each character, because we'd
know the positions of the interesting characters, so this should be fast.
We might have to recheck some things (e.g. escaping), but it should still
be much faster.
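
A sketch of what the phase-1 worker might record is below. The types are
hypothetical; quoting context is deliberately ignored here and resolved
later by the serial merge phase using the recorded quote positions:

#include <stdint.h>
#include <stddef.h>

typedef struct ChunkIndex
{
    uint64_t   *newlines;       /* offsets of '\n' within the file */
    uint64_t   *quotes;         /* offsets of '"' */
    uint64_t   *delims;         /* offsets of ',' */
    size_t      nnewlines;
    size_t      nquotes;
    size_t      ndelims;
} ChunkIndex;

/* Phase 1: scan one chunk char by char, recording interesting positions.
 * The caller is assumed to size the arrays for the worst case (len). */
static void
index_chunk(const char *chunk, size_t len, uint64_t base, ChunkIndex *idx)
{
    for (size_t i = 0; i < len; i++)
    {
        char        c = chunk[i];

        if (c == '\n')
            idx->newlines[idx->nnewlines++] = base + i;
        else if (c == '"')
            idx->quotes[idx->nquotes++] = base + i;
        else if (c == ',')
            idx->delims[idx->ndelims++] = base + i;
    }
}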

But yes, this may be a bit complex and I'm not sure it's worth it.

The one piece of information I'm missing here is at least a very rough
quantification of the individual steps of CSV processing - for example
if parsing takes only 10% of the time, it's pretty pointless to start by
parallelising this part and we should focus on the rest. If it's 50% it
might be a different story. Has anyone done any measurements?


regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Parallel copy

akapila
On Wed, Feb 19, 2020 at 4:08 PM Tomas Vondra
<[hidden email]> wrote:
>
> The one piece of information I'm missing here is at least a very rough
> quantification of the individual steps of CSV processing - for example
> if parsing takes only 10% of the time, it's pretty pointless to start by
> parallelising this part and we should focus on the rest. If it's 50% it
> might be a different story.
>

Right, this is important information to know.

> Has anyone done any measurements?
>

Not yet, but planning to work on it.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Parallel copy

David Fetter
In reply to this post by akapila
On Fri, Feb 14, 2020 at 01:41:54PM +0530, Amit Kapila wrote:
> This work is to parallelize the copy command and in particular "Copy
> <table_name> from 'filename' Where <condition>;" command.

Apropos of the initial parsing issue generally, there's an interesting
approach taken here: https://github.com/robertdavidgraham/wc2

Best,
David.
--
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate



Re: Parallel copy

akapila
On Thu, Feb 20, 2020 at 5:12 AM David Fetter <[hidden email]> wrote:
>
> On Fri, Feb 14, 2020 at 01:41:54PM +0530, Amit Kapila wrote:
> > This work is to parallelize the copy command and in particular "Copy
> > <table_name> from 'filename' Where <condition>;" command.
>
> Apropos of the initial parsing issue generally, there's an interesting
> approach taken here: https://github.com/robertdavidgraham/wc2
>

Thanks for sharing.  I might be missing something, but I can't figure
out how this can help here.  Does this in some way help to allow
multiple workers to read and tokenize the chunks?

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Parallel copy

Tomas Vondra-4
On Thu, Feb 20, 2020 at 04:11:39PM +0530, Amit Kapila wrote:

>On Thu, Feb 20, 2020 at 5:12 AM David Fetter <[hidden email]> wrote:
>>
>> On Fri, Feb 14, 2020 at 01:41:54PM +0530, Amit Kapila wrote:
>> > This work is to parallelize the copy command and in particular "Copy
>> > <table_name> from 'filename' Where <condition>;" command.
>>
>> Apropos of the initial parsing issue generally, there's an interesting
>> approach taken here: https://github.com/robertdavidgraham/wc2
>>
>
>Thanks for sharing.  I might be missing something, but I can't figure
>out how this can help here.  Does this in some way help to allow
>multiple workers to read and tokenize the chunks?
>

I think wc2 is showing that, instead of parallelizing the parsing, we
might try using a different tokenizer/parser and make the implementation
more efficient rather than just throwing more CPUs at it.

I don't know whether our code is similar to what wc does; maybe parsing
CSV is more complicated than what wc does.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Parallel copy

David Fetter
On Thu, Feb 20, 2020 at 02:36:02PM +0100, Tomas Vondra wrote:

> On Thu, Feb 20, 2020 at 04:11:39PM +0530, Amit Kapila wrote:
> > On Thu, Feb 20, 2020 at 5:12 AM David Fetter <[hidden email]> wrote:
> > >
> > > On Fri, Feb 14, 2020 at 01:41:54PM +0530, Amit Kapila wrote:
> > > > This work is to parallelize the copy command and in particular "Copy
> > > > <table_name> from 'filename' Where <condition>;" command.
> > >
> > > Apropos of the initial parsing issue generally, there's an interesting
> > > approach taken here: https://github.com/robertdavidgraham/wc2
> > >
> >
> > Thanks for sharing.  I might be missing something, but I can't figure
> > out how this can help here.  Does this in some way help to allow
> > multiple workers to read and tokenize the chunks?
>
> I think wc2 is showing that, instead of parallelizing the parsing, we
> might try using a different tokenizer/parser and make the implementation
> more efficient rather than just throwing more CPUs at it.

That was what I had in mind.

> I don't know whether our code is similar to what wc does; maybe parsing
> CSV is more complicated than what wc does.

CSV parsing differs from wc in that there are more states in the state
machine, but I don't see anything fundamentally different.
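
For concreteness, a minimal record-boundary state machine for quoted CSV
might look like the sketch below (simplified for illustration; this is
not PostgreSQL's actual CopyReadLineText):

#include <stdbool.h>

typedef enum { S_FIELD, S_QUOTED, S_QUOTE_IN_QUOTED } CsvState;

/* Advance one byte; return true when a record boundary is reached.
 * S_QUOTE_IN_QUOTED handles the "" escape inside quoted fields. */
static bool
csv_step(CsvState *state, char c)
{
    switch (*state)
    {
        case S_FIELD:
            if (c == '"')
                *state = S_QUOTED;
            else if (c == '\n')
                return true;
            break;
        case S_QUOTED:
            if (c == '"')
                *state = S_QUOTE_IN_QUOTED;
            break;
        case S_QUOTE_IN_QUOTED:
            if (c == '"')
                *state = S_QUOTED;      /* escaped quote, still inside */
            else
            {
                *state = S_FIELD;       /* the quote closed the field */
                if (c == '\n')
                    return true;
            }
            break;
    }
    return false;
}

Note that the per-byte dependency on *state is exactly the serial chain
discussed in the next message.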

Best,
David.
--
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate



Re: Parallel copy

Ants Aasma-2
On Thu, 20 Feb 2020 at 18:43, David Fetter <[hidden email]> wrote:

> On Thu, Feb 20, 2020 at 02:36:02PM +0100, Tomas Vondra wrote:
> > I think wc2 is showing that, instead of parallelizing the parsing, we
> > might try using a different tokenizer/parser and make the implementation
> > more efficient rather than just throwing more CPUs at it.
>
> That was what I had in mind.
>
> > I don't know whether our code is similar to what wc does; maybe parsing
> > CSV is more complicated than what wc does.
>
> CSV parsing differs from wc in that there are more states in the state
> machine, but I don't see anything fundamentally different.

The trouble with a state-machine-based approach is that the state
transitions form a dependency chain, which means that at best the
processing rate will be 4-5 cycles per byte (L1 latency to fetch the
next state).

I whipped together a quick prototype that uses SIMD and bitmap
manipulations to do the equivalent of CopyReadLineText() in CSV mode,
including quote and escape handling; this runs at 0.25-0.5 cycles per
byte.
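
The attachment is not reproduced here, but the core idea can be roughly
illustrated as below. This is a simplified sketch, not the attached
prototype: SSE2 compares build per-block bitmaps, and a quote-parity
mask filters out newlines inside quoted fields (escape handling and the
faster carry-less-multiply parity trick are omitted):

#include <emmintrin.h>
#include <stdbool.h>
#include <stdint.h>

/* Build 16-bit bitmaps of quote and newline positions in a 16-byte block. */
static void
scan_block(const char *p, uint16_t *quotes, uint16_t *newlines)
{
    __m128i     chunk = _mm_loadu_si128((const __m128i *) p);

    *quotes = (uint16_t) _mm_movemask_epi8(_mm_cmpeq_epi8(chunk, _mm_set1_epi8('"')));
    *newlines = (uint16_t) _mm_movemask_epi8(_mm_cmpeq_epi8(chunk, _mm_set1_epi8('\n')));
}

/* Turn quote positions into an "inside quotes" mask via prefix parity;
 * *carry says whether the previous block ended inside a quoted field. */
static uint16_t
in_quotes_mask(uint16_t quotes, bool *carry)
{
    uint16_t    mask = 0;
    bool        inside = *carry;

    for (int i = 0; i < 16; i++)
    {
        if (quotes & (1u << i))
            inside = !inside;
        if (inside)
            mask |= (uint16_t) (1u << i);
    }
    *carry = inside;
    return mask;
}

/* Record separators in one block: newlines not inside quoted sections. */
static uint16_t
record_ends(const char *p, bool *carry)
{
    uint16_t    quotes,
                newlines;

    scan_block(p, &quotes, &newlines);
    return newlines & ~in_quotes_mask(quotes, carry);
}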

Regards,
Ants Aasma

[Attachment: simdcopy.c (4K)]

Re: Parallel copy

Tomas Vondra-4
On Fri, Feb 21, 2020 at 02:54:31PM +0200, Ants Aasma wrote:

>On Thu, 20 Feb 2020 at 18:43, David Fetter <[hidden email]> wrote:
>> On Thu, Feb 20, 2020 at 02:36:02PM +0100, Tomas Vondra wrote:
>> > I think wc2 is showing that, instead of parallelizing the parsing, we
>> > might try using a different tokenizer/parser and make the implementation
>> > more efficient rather than just throwing more CPUs at it.
>>
>> That was what I had in mind.
>>
>> > I don't know whether our code is similar to what wc does; maybe parsing
>> > CSV is more complicated than what wc does.
>>
>> CSV parsing differs from wc in that there are more states in the state
>> machine, but I don't see anything fundamentally different.
>
>The trouble with a state-machine-based approach is that the state
>transitions form a dependency chain, which means that at best the
>processing rate will be 4-5 cycles per byte (L1 latency to fetch the
>next state).
>
>I whipped together a quick prototype that uses SIMD and bitmap
>manipulations to do the equivalent of CopyReadLineText() in CSV mode,
>including quote and escape handling; this runs at 0.25-0.5 cycles per
>byte.
>

Interesting. How does that compare to what we currently have?


regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Parallel copy

Robert Haas
In reply to this post by akapila
On Tue, Feb 18, 2020 at 6:51 PM Amit Kapila <[hidden email]> wrote:
> I am talking about access to shared memory instead of the process
> local memory.  I understand that an extra copy won't be required.

You make it sound like there is some performance penalty for accessing
shared memory, but I don't think that's true. It's true that
*contended* access to shared memory can be slower, because if multiple
processes are trying to access the same memory, and especially if
multiple processes are trying to write the same memory, then the cache
lines have to be shared and that has a cost. However, I don't think
that would create any noticeable effect in this case. First, there's
presumably only one writer process. Second, you wouldn't normally have
multiple readers working on the same part of the data at the same
time.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Parallel copy

Andres Freund
In reply to this post by Tomas Vondra-4
Hi,

On 2020-02-19 11:38:45 +0100, Tomas Vondra wrote:
> I generally agree with the impression that parsing CSV is tricky and
> unlikely to benefit from parallelism in general. There may be cases with
> restrictions making it easier (e.g. restrictions on the format) but that
> might be a bit too complex to start with.
>
> For example, I had an idea to parallelise the parsing by splitting it
> into two phases:

FWIW, I think we ought to rewrite our COPY parsers before we go for
complex schemes. They're way slower than a decent green-field
CSV/... parser.


> The one piece of information I'm missing here is at least a very rough
> quantification of the individual steps of CSV processing - for example
> if parsing takes only 10% of the time, it's pretty pointless to start by
> parallelising this part and we should focus on the rest. If it's 50% it
> might be a different story. Has anyone done any measurements?

Not recently, but I'm pretty sure that I've observed CSV parsing to be
way more than 10%.

Greetings,

Andres Freund



Re: Parallel copy

Tomas Vondra-4
On Sun, Feb 23, 2020 at 05:09:51PM -0800, Andres Freund wrote:

>Hi,
>
>On 2020-02-19 11:38:45 +0100, Tomas Vondra wrote:
>> I generally agree with the impression that parsing CSV is tricky and
>> unlikely to benefit from parallelism in general. There may be cases with
>> restrictions making it easier (e.g. restrictions on the format) but that
>> might be a bit too complex to start with.
>>
>> For example, I had an idea to parallelise the parsing by splitting it
>> into two phases:
>
>FWIW, I think we ought to rewrite our COPY parsers before we go for
>complex schemes. They're way slower than a decent green-field
>CSV/... parser.
>

Yep, that's quite possible.

>
>> The one piece of information I'm missing here is at least a very rough
>> quantification of the individual steps of CSV processing - for example
>> if parsing takes only 10% of the time, it's pretty pointless to start by
>> parallelising this part and we should focus on the rest. If it's 50% it
>> might be a different story. Has anyone done any measurements?
>
>Not recently, but I'm pretty sure that I've observed CSV parsing to be
>way more than 10%.
>

Perhaps. I guess it'll depend on the CSV file (number of fields, ...),
so I still think we need to do some measurements first. I'm willing to
do that, but (a) I doubt I'll have time for that until after 2020-03,
and (b) it'd be good to agree on some set of typical CSV files.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Parallel copy

akapila
On Tue, Feb 25, 2020 at 9:30 PM Tomas Vondra
<[hidden email]> wrote:

>
> On Sun, Feb 23, 2020 at 05:09:51PM -0800, Andres Freund wrote:
> >Hi,
> >
> >> The one piece of information I'm missing here is at least a very rough
> >> quantification of the individual steps of CSV processing - for example
> >> if parsing takes only 10% of the time, it's pretty pointless to start by
> >> parallelising this part and we should focus on the rest. If it's 50% it
> >> might be a different story. Has anyone done any measurements?
> >
> >Not recently, but I'm pretty sure that I've observed CSV parsing to be
> >way more than 10%.
> >
>
> Perhaps. I guess it'll depend on the CSV file (number of fields, ...),
> so I still think we need to do some measurements first.
>

Agreed.

> I'm willing to
> do that, but (a) I doubt I'll have time for that until after 2020-03,
> and (b) it'd be good to agree on some set of typical CSV files.
>

Right, I'm not sure what the best way to define that is.  I can think
of the below tests.

1. A table with 10 columns (with datatypes such as integer, date, text).
It has one index (unique/primary). Load it with 1 million rows (the data
should be roughly 5-10 GB).
2. A table with 10 columns (with datatypes such as integer, date, text).
It has three indexes, one of which can be unique/primary. Load it with 1
million rows (the data should be roughly 5-10 GB).
3. A table with 10 columns (with datatypes such as integer, date, text).
It has three indexes, one of which can be unique/primary. It has before
and after triggers. Load it with 1 million rows (the data should be
roughly 5-10 GB).
4. A table with 10 columns (with datatypes such as integer, date, text).
It has five or six indexes, one of which can be unique/primary. Load it
with 1 million rows (the data should be roughly 5-10 GB).

For all these tests, we can check how much time we spend reading and
parsing the CSV file vs. the rest of the execution.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Parallel copy

Alastair Turner-2
On Wed, 26 Feb 2020 at 10:54, Amit Kapila <[hidden email]> wrote:
>
> On Tue, Feb 25, 2020 at 9:30 PM Tomas Vondra
> <[hidden email]> wrote:
> >
...

> >
> > Perhaps. I guess it'll depend on the CSV file (number of fields, ...),
> > so I still think we need to do some measurements first.
> >
>
> Agreed.
>
> > I'm willing to
> > do that, but (a) I doubt I'll have time for that until after 2020-03,
> > and (b) it'd be good to agree on some set of typical CSV files.
> >
>
> Right, I'm not sure what the best way to define that is.  I can think
> of the below tests.
>
> 1. A table with 10 columns (with datatypes such as integer, date, text).
> It has one index (unique/primary). Load it with 1 million rows (the data
> should be roughly 5-10 GB).
> 2. A table with 10 columns (with datatypes such as integer, date, text).
> It has three indexes, one of which can be unique/primary. Load it with 1
> million rows (the data should be roughly 5-10 GB).
> 3. A table with 10 columns (with datatypes such as integer, date, text).
> It has three indexes, one of which can be unique/primary. It has before
> and after triggers. Load it with 1 million rows (the data should be
> roughly 5-10 GB).
> 4. A table with 10 columns (with datatypes such as integer, date, text).
> It has five or six indexes, one of which can be unique/primary. Load it
> with 1 million rows (the data should be roughly 5-10 GB).
>
> For all these tests, we can check how much time we spend reading and
> parsing the CSV file vs. the rest of the execution.

That's a good set of tests of what happens after the parse. Two
simpler test runs may provide useful baselines: no constraints/indexes
with all columns varchar, and no constraints/indexes with columns
correctly typed.

For testing the impact of various parts of the parse process, my idea would be:
 - A base dataset with 10 columns including int, date and text. One
text field quoted and containing both delimiters and line terminators
 - A derivative to measure just line parsing - strip the quotes around
the text field and quote the whole row as one text field
 - A derivative to measure the impact of quoted fields - clean up the
text field so it doesn't require quoting
 - A derivative to measure the impact of row length - run ten rows
together to make 100 column rows, but only a tenth as many rows

If that sounds reasonable, I'll try to knock up a generator.

--
Alastair



Re: Parallel copy

Ants Aasma-2
In reply to this post by Tomas Vondra-4
On Tue, 25 Feb 2020 at 18:00, Tomas Vondra <[hidden email]> wrote:
> Perhaps. I guess it'll depend on the CSV file (number of fields, ...),
> so I still think we need to do some measurements first. I'm willing to
> do that, but (a) I doubt I'll have time for that until after 2020-03,
> and (b) it'd be good to agree on some set of typical CSV files.

I agree that a nice, varied dataset would be good, including things
like narrow integer-only tables, strings with newlines and escapes in
them, and extremely wide rows.

I tried to capture a quick profile just to see what it looks like. I
grabbed a random open data set from the web, about 800 MB of
narrow-row CSV [1].

Script:
CREATE TABLE census (year int,age int,ethnic int,sex int,area text,count text);
COPY census FROM '.../Data8277.csv' WITH (FORMAT 'csv', HEADER true);

Profile:
# Samples: 59K of event 'cycles:u'
# Event count (approx.): 57644269486
#
# Overhead  Command   Shared Object       Symbol
# ........  ........  ..................  .......................................
#
    18.24%  postgres  postgres            [.] CopyReadLine
     9.23%  postgres  postgres            [.] NextCopyFrom
     8.87%  postgres  postgres            [.] NextCopyFromRawFields
     5.82%  postgres  postgres            [.] pg_verify_mbstr_len
     5.45%  postgres  postgres            [.] pg_strtoint32
     4.16%  postgres  postgres            [.] heap_fill_tuple
     4.03%  postgres  postgres            [.] heap_compute_data_size
     3.83%  postgres  postgres            [.] CopyFrom
     3.78%  postgres  postgres            [.] AllocSetAlloc
     3.53%  postgres  postgres            [.] heap_form_tuple
     2.96%  postgres  postgres            [.] InputFunctionCall
     2.89%  postgres  libc-2.30.so        [.] __memmove_avx_unaligned_erms
     1.82%  postgres  libc-2.30.so        [.] __strlen_avx2
     1.72%  postgres  postgres            [.] AllocSetReset
     1.72%  postgres  postgres            [.] RelationPutHeapTuple
     1.47%  postgres  postgres            [.] heap_prepare_insert
     1.31%  postgres  postgres            [.] heap_multi_insert
     1.25%  postgres  postgres            [.] textin
     1.24%  postgres  postgres            [.] int4in
     1.05%  postgres  postgres            [.] tts_buffer_heap_clear
     0.85%  postgres  postgres            [.] pg_any_to_server
     0.80%  postgres  postgres            [.] pg_comp_crc32c_sse42
     0.77%  postgres  postgres            [.] cstring_to_text_with_len
     0.69%  postgres  postgres            [.] AllocSetFree
     0.60%  postgres  postgres            [.] appendBinaryStringInfo
     0.55%  postgres  postgres            [.] tts_buffer_heap_materialize.part.0
     0.54%  postgres  postgres            [.] palloc
     0.54%  postgres  libc-2.30.so        [.] __memmove_avx_unaligned
     0.51%  postgres  postgres            [.] palloc0
     0.51%  postgres  postgres            [.] pg_encoding_max_length
     0.48%  postgres  postgres            [.] enlargeStringInfo
     0.47%  postgres  postgres            [.] ExecStoreVirtualTuple
     0.45%  postgres  postgres            [.] PageAddItemExtended

So that confirms that parsing is a huge chunk of the overhead, with the
current splitting into lines being the largest portion. Amdahl's law
says that splitting into tuples needs to be made fast before
parallelizing the rest makes any sense.
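
As rough Amdahl arithmetic from the profile above: if the ~18% spent in
CopyReadLine stays serial while everything else parallelizes perfectly,
the overall speedup is capped at 1/0.18, i.e. about 5.5x, no matter how
many workers are added. That is why the serial splitting step itself
has to get faster (e.g. with the SIMD approach upthread) before adding
workers pays off.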

Regards,
Ants Aasma

[1] https://www3.stats.govt.nz/2018census/Age-sex-by-ethnic-group-grouped-total-responses-census-usually-resident-population-counts-2006-2013-2018-Censuses-RC-TA-SA2-DHB.zip

