Bulk Inserts

Souvik Bhattacherjee
Hi,

I'm trying to compare the performance of multiple txns inserting tuples into a table concurrently vs. a single txn doing the whole insertion.

new table created as:
create table tab2 (
id serial,
attr1 integer not null,
attr2 integer not null,
primary key(id)
);

EXP 1: inserts with multiple txn:
insert into tab2 (attr1, attr2) (select attr1, attr2 from tab1 where attr2 = 10);
insert into tab2 (attr1, attr2) (select attr1, attr2 from tab1 where attr2 = 20);

note: attr2 has only two values 10 and 20

EXP 2: inserts with a single txn:
insert into tab2 (attr1, attr2) (select attr1, attr2 from tab1);

I also performed another experiment as follows:
EXP 3: select attr1, attr2 into tab2 from tab1;

The observation here is that EXP 3 is much faster than EXP 2, probably due to bulk inserts used by Postgres. However, I could not find a way to insert id values in tab2 using EXP 3. Also, select .. into .. from .. throws an error if we create the table first and then try to populate it using that command.

I have the following questions:
1. Is it possible to have an id column in tab2 and perform a bulk insert using select .. into .. from .. or using some other means?
2. If a table is already created, is it possible to do bulk inserts via multiple txns inserting into the same table (EXP 3)?

Best,
-SB

Re: Bulk Inserts

Adrian Klaver-4
On 8/9/19 3:06 PM, Souvik Bhattacherjee wrote:

> Hi,
>
> I'm trying to measure the performance of the following: Multiple txns
> inserting tuples into a table concurrently vs single txn doing the whole
> insertion.
>
> *new table created as:*
> create table tab2 (
> id serial,
> attr1 integer not null,
> attr2 integer not null,
> primary key(id)
> );
>
> *EXP 1: inserts with multiple txn:*
> insert into tab2 (attr1, attr2) (select attr1, attr2 from tab1 where
> attr2 = 10);
> insert into tab2 (attr1, attr2) (select attr1, attr2 from tab1 where
> attr2 = 20);
>
> note: attr2 has only two values 10 and 20
>
> *EXP 2: inserts with a single txn:*
> insert into tab2 (attr1, attr2) (select attr1, attr2 from tab1);
>
> I also performed another experiment as follows:
> *EXP 3:* select attr1, attr2 into tab2 from tab1;
>
> The observation here is EXP 3  is much faster than EXP 2 probably due to
> bulk inserts used by Postgres. However I could not find a way to insert
> id values in tab2 using EXP 3. Also select .. into .. from .. throws an
> error if we create a table first and then populate the tuples using the
> command.

Yes, as SELECT INTO is functionally the same as CREATE TABLE AS; both create the target table themselves:

https://www.postgresql.org/docs/11/sql-selectinto.html

>
> I have the following questions:
> 1. Is it possible to have an id column in tab2 and perform a bulk insert
> using select .. into .. from .. or using some other means?

Not using SELECT INTO for reasons given above.
Though it is possible to SELECT INTO as you show in EXP 3 and then:
        alter table tab2 add column id serial primary key;
EXP 2 shows the other means.

> 2. If a table is already created, is it possible to do bulk inserts via
> multiple txns inserting into the same table (EXP 3)?

Yes, but you will need some code, via a client or a function, that batches the
inserts for you.

>
> Best,
> -SB


--
Adrian Klaver
[hidden email]


Re: Bulk Inserts

Souvik Bhattacherjee
Hi Adrian,

Thanks for the response.

> Yes, but you will need some code, via a client or a function, that batches the
> inserts for you.

Could you please elaborate a bit on how EXP 1 could be performed such that it uses bulk inserts?

Best,
-SB

Re: Bulk Inserts

lup


Top-posting (i.e. putting your reply at the top) is discouraged here.
Does this appeal to you:
COPY (SELECT * FROM relation) TO ... (https://www.postgresql.org/docs/10/sql-copy.html)

Re: Bulk Inserts

Souvik Bhattacherjee
> Does this appeal to you:
> COPY (SELECT * FROM relation) TO ... (https://www.postgresql.org/docs/10/sql-copy.html)

Not sure if COPY can be used to transfer data between tables.

Quoting style (was: Bulk Inserts)

Peter J. Holzer
In reply to this post by lup
On 2019-08-10 21:01:50 -0600, Rob Sargent wrote:

>     On Aug 10, 2019, at 8:47 PM, Souvik Bhattacherjee <[hidden email]>
>     wrote:
>
>     Hi Adrian,
>
>     Thanks for the response.
>
>     > Yes, but you will some code via client or function that batches the
>     > inserts for you.
>
>     Could you please elaborate a bit on how EXP 1 could be performed such that
>     it uses bulk inserts?
>
>     Best,
>     -SB
>
>     On Fri, Aug 9, 2019 at 7:26 PM Adrian Klaver <[hidden email]>
>     wrote:
>
[70 lines of full quote removed]


> Top-posting (i.e. putting your reply at the top) is discouraged here.

He didn't really top-post. He quoted the relevant part of Adrian's
posting and then wrote his reply below that. This is the style I prefer,
because it makes it really clear what one is replying to.

After his reply, he quoted Adrian's posting again, this time completely.
I think this is unnecessary and confusing (you apparently didn't even
see that he quoted something above his reply). But it's not as bad as
quoting everything below the answer (or - as you did - quoting
everything before the answer which I think is even worse: If I don't see
any original content within the first 100 lines or so I usually skip the
rest).

        hp

--
   _  | Peter J. Holzer    | we build much bigger, better disasters now
|_|_) |                    | because we have much more sophisticated
| |   | [hidden email]         | management tools.
__/   | http://www.hjp.at/ | -- Ross Anderson <https://www.edge.org/>

Re: Quoting style (was: Bulk Inserts)

lup
Sorry. I thought I had cut most of the redundancy.

Re: Bulk Inserts

Adrian Klaver-4
In reply to this post by Souvik Bhattacherjee
On 8/10/19 7:47 PM, Souvik Bhattacherjee wrote:
> Hi Adrian,
>
> Thanks for the response.
>
>  > Yes, but you will need some code, via a client or a function, that batches the
>  > inserts for you.
>
> Could you please elaborate a bit on how EXP 1 could be performed such
> that it uses bulk inserts?

I guess it comes down to what you define as bulk inserts. From your OP:

EXP 1: inserts with multiple txn:
insert into tab2 (attr1, attr2) (select attr1, attr2 from tab1 where
attr2 = 10);
insert into tab2 (attr1, attr2) (select attr1, attr2 from tab1 where
attr2 = 20);

If the selects are returning more than one row then you are already
doing bulk inserts. If they are returning single rows or you want to
batch them then you need some sort of code to do that. Something
like (pseudo-Python-like code):

attr2_vals = [(10, 20, 30, 40), (50, 60, 70, 80)]

for val_batch in attr2_vals:
        BEGIN
        for id in val_batch:
                insert into tab2 (attr1, attr2) (select attr1, attr2
                 from tab1 where attr2 = id)
        COMMIT
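
A runnable variant of the pseudocode above could look like the following sketch. It is an illustration under assumptions, not part of the original reply: it uses Python's built-in sqlite3 module as a stand-in so that it runs without a server, the helper name copy_in_batches and the sample data are made up, and the id column is declared the SQLite way instead of as serial. Against PostgreSQL the same loop would go through a client driver.

```python
import sqlite3


def copy_in_batches(conn, batches):
    """Run one INSERT ... SELECT per attr2 value, one transaction per batch."""
    cur = conn.cursor()
    for batch in batches:
        cur.execute("BEGIN")
        try:
            for value in batch:
                cur.execute(
                    "INSERT INTO tab2 (attr1, attr2) "
                    "SELECT attr1, attr2 FROM tab1 WHERE attr2 = ?",
                    (value,),
                )
            cur.execute("COMMIT")
        except Exception:
            # Undo only the batch in progress, then report the error.
            cur.execute("ROLLBACK")
            raise


# isolation_level=None disables implicit transactions, so the explicit
# BEGIN/COMMIT statements above define the transaction boundaries.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.executescript("""
    CREATE TABLE tab1 (attr1 INTEGER NOT NULL, attr2 INTEGER NOT NULL);
    CREATE TABLE tab2 (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        attr1 INTEGER NOT NULL,
        attr2 INTEGER NOT NULL
    );
""")

# Made-up sample data: 80 rows, attr2 cycling through 10, 20, ..., 80.
conn.executemany(
    "INSERT INTO tab1 VALUES (?, ?)",
    [(i, 10 * (i % 8 + 1)) for i in range(80)],
)

copy_in_batches(conn, [(10, 20, 30, 40), (50, 60, 70, 80)])
copied = conn.execute("SELECT count(*) FROM tab2").fetchone()[0]
print(copied)
```

Each inner tuple becomes one transaction, so a failure rolls back only the batch in progress while earlier batches stay committed.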

>
> Best,
> -SB



--
Adrian Klaver
[hidden email]


Re: Bulk Inserts

Souvik Bhattacherjee
> If the selects are returning more than one row then you are already
> doing bulk inserts. If they are returning single rows or you want to
> batch them then you need some sort of code to do that. Something
> like (pseudo-Python-like code):
>
> attr2_vals = [(10, 20, 30, 40), (50, 60, 70, 80)]
>
> for val_batch in attr2_vals:
>         BEGIN
>         for id in val_batch:
>                 insert into tab2 (attr1, attr2) (select attr1, attr2
>                  from tab1 where attr2 = id)
>         COMMIT

For EXP 1: inserts with multiple txn:
insert into tab2 (attr1, attr2) (select attr1, attr2 from tab1 where attr2 = 10);
insert into tab2 (attr1, attr2) (select attr1, attr2 from tab1 where attr2 = 20);

tab1 has ~6M rows and there are only two values for the attribute attr2 in
tab1 which are evenly distributed. So, yes, I guess I'm already doing batching
here.

Also, I ran the following two statements to see if their performances are comparable.
STMT 1 always runs faster on my machine, but the two seem to differ
by a couple of seconds at most.

STMT 1: select attr1, attr2 into tab2 from tab1;
STMT 2: insert into tab2 (select attr1, attr2 from tab1);

However, adding the serial id column with an ALTER TABLE statement actually takes more time
than inserting the tuples, so the combined total time is more than double the time taken to insert
the tuples into tab2 without the serial id column.

Best,
-SB



Re: Bulk Inserts

Adrian Klaver-4
On 8/13/19 6:34 AM, Souvik Bhattacherjee wrote:

>  > If the selects are returning more than one row then you are already
>  > doing bulk inserts. If they are returning single rows or you want to
>  > batch them then you need some sort of code to do that. Something
>  > like (pseudo-Python-like code):
>
>  > attr2_vals= [(10, 20, 30, 40), (50, 60, 70, 80)]
>
>  > for val_batch in attr2_vals:
>          BEGIN
>          for id in val_batch:
>                  insert into tab2 (attr1, attr2) (select attr1, attr2
>                   from tab1 where attr2 = id)
>           COMMIT
>
> For *EXP 1: inserts with multiple txn:*
> insert into tab2 (attr1, attr2) (select attr1, attr2 from tab1 where
> attr2 = 10);
> insert into tab2 (attr1, attr2) (select attr1, attr2 from tab1 where
> attr2 = 20);
>
> tab1 has ~6M rows and there are only two values for the attribute attr2 in
> tab1 which are evenly distributed. So, yes, I guess I'm already doing
> batching
> here.
>
> Also, I ran the following two statements to see if their performances
> are comparable.
> While STMT 1 always runs faster in my machine but their performances
> seem to differ
> by a couple of seconds at most.
>
> STMT 1: select attr1, attr2 into tab2 from tab1;
> STMT 2: insert into tab2 (select attr1, attr2 from tab1);

All I have left is:

select row_number() OVER (order by attr2, attr1) AS id, attr1,
attr2 into tab2 from tab1;

That will not create a serial type in the id column though. You can
attach a sequence to that column. Something like:

1) create sequence tab2_id start <max id + 1> owned by tab2.id;

2) alter table tab2 alter COLUMN id set default nextval('tab2_id');
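
The row_number() approach can be tried end to end with the sketch below. It is only an illustration under assumptions: Python's built-in sqlite3 module stands in for PostgreSQL (window functions need SQLite 3.25 or later), CREATE TABLE ... AS is used because SQLite has no SELECT INTO, and the four sample rows are made up. SQLite has no sequences, so steps 1) and 2) are only built as strings, with the <max id + 1> placeholder filled in from the new table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tab1 (attr1 INTEGER NOT NULL, attr2 INTEGER NOT NULL)")
conn.executemany(
    "INSERT INTO tab1 VALUES (?, ?)",
    [(7, 20), (3, 10), (5, 20), (1, 10)],  # made-up sample rows
)

# Create tab2 and number its rows in a single statement.
conn.execute(
    "CREATE TABLE tab2 AS "
    "SELECT row_number() OVER (ORDER BY attr2, attr1) AS id, attr1, attr2 "
    "FROM tab1"
)
rows = conn.execute("SELECT id, attr1, attr2 FROM tab2 ORDER BY id").fetchall()

# Value for the <max id + 1> placeholder in step 1).
next_id = conn.execute("SELECT max(id) + 1 FROM tab2").fetchone()[0]

# PostgreSQL statements from steps 1) and 2); built as text, not executed here.
postgres_followup = [
    f"create sequence tab2_id start {next_id} owned by tab2.id;",
    "alter table tab2 alter COLUMN id set default nextval('tab2_id');",
]

print(rows)
print("\n".join(postgres_followup))
```

As with the original statement, the id column gets no primary key constraint this way; if one is wanted it would have to be added separately.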



>
> However adding the serial id column as an ALTER TABLE statement actually
> takes more time
> than inserting the tuples, so the combined total time is more than
> double the time taken to insert
> the tuples into tab2 without serial id column.
>
> Best,
> -SB
>
>
>



--
Adrian Klaver
[hidden email]


Re: Bulk Inserts

Souvik Bhattacherjee
> All I have left is:

> select row_number() OVER (order by attr2, attr1) AS id, attr1,
> attr2 into tab2 from tab1;

> That will not create a serial type in the id column though. You can
> attach a sequence to that column. Something like:

> 1) create sequence tab2_id start <max id + 1> owned by tab2.id;

> 2) alter table tab2 alter COLUMN id set default nextval('tab2_id');

Thanks. This is a bit indirect but works fine. Performance-wise, this turns
out to be the best option when inserting rows from one table into another (new) table
with a serial id column in the new table.

Best,
-SB
