Batch insert in CTAS/MatView code


Batch insert in CTAS/MatView code

Paul Guo-2
Hello, Postgres hackers,

The COPY code already uses batch insert, via heap_multi_insert(), to speed things up. It seems that CREATE TABLE AS and materialized view creation could leverage that code to boost performance as well. Attached is a patch that implements this. It was done by Taylor (cc-ed) and me.

The patch also modifies heap_multi_insert() a bit, as a further code-level optimization: it uses static memory instead of a memory context and dynamic allocation. For ModifyTable inserts there seem to be more limitations on batch insert (triggers, etc.?), but it should still be possible to batch-insert in the cases where we can.
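
Roughly, the idea on the CTAS side looks like this (a simplified sketch against the slot-based API; the struct, function and constant names here are illustrative rather than exactly what the patch uses):

/*
 * Simplified sketch of batching in the CTAS DestReceiver (illustrative;
 * the real code lives in createas.c and needs executor/tuptable.h,
 * access/heapam.h and access/tableam.h).
 */
#define MAX_BATCH_TUPLES 1000

typedef struct BatchInsertState
{
    Relation        rel;
    CommandId       cid;
    int             options;
    BulkInsertState bistate;
    int             nslots;                     /* tuples buffered so far */
    TupleTableSlot *slots[MAX_BATCH_TUPLES];    /* buffered tuple copies */
} BatchInsertState;

static void
batch_insert_receive(BatchInsertState *state, TupleTableSlot *slot)
{
    TupleTableSlot *batchslot;

    /*
     * Copy the incoming tuple into a slot owned by the target relation,
     * because the input slot's contents may not survive until the flush.
     */
    if (state->slots[state->nslots] == NULL)
        state->slots[state->nslots] = table_slot_create(state->rel, NULL);
    batchslot = state->slots[state->nslots];
    ExecCopySlot(batchslot, slot);
    state->nslots++;

    /* Flush a full batch with one multi-insert call instead of N singles. */
    if (state->nslots >= MAX_BATCH_TUPLES)
    {
        heap_multi_insert(state->rel, state->slots, state->nslots,
                          state->cid, state->options, state->bistate);
        state->nslots = 0;
    }
}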

By the way, while looking at the code, I noticed that there are 9 large local arrays in toast_insert_or_update(), which seems to be a risk of stack overflow. Maybe we should make them static or global.

Here is a quick, simple performance test on a mirrorless (no standby) Postgres instance, using the SQL below. The tests cover tables with small columns, large columns, and TOAST.

-- tuples with small size.
drop table if exists t1;
create table t1 (a int);

insert into t1 select * from generate_series(1, 10000000);
drop table if exists t2;
\timing
create table t2 as select * from t1;
\timing

-- tuples that are untoasted and data that is 1664 bytes wide
drop table if exists t1;
create table t1 (a name, b name, c name, d name, e name, f name, g name, h name, i name, j name, k name, l name, m name, n name, o name, p name, q name, r name, s name, t name, u name, v name, w name, x name, y name, z name);

insert into t1 select 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z' from generate_series(1, 500000);
drop table if exists t2;
\timing
create table t2 as select * from t1;
\timing

-- tuples that are toastable.
drop table if exists t1;
create table t1 (a text, b text, c text, d text, e text, f text, g text, h text, i text, j text, k text, l text, m text, n text, o text, p text, q text, r text, s text, t text, u text, v text, w text, x text, y text, z text);

insert into t1 select i, i, i, i, i, i, i, i, i, i, i, i, i, i, i, i, i, i, i, i, i, i, i, i, i, i from (select repeat('123456789', 10000) from generate_series(1,2000)) i;
drop table if exists t2;
\timing
create table t2 as select * from t1;
\timing

Here are the timing results:

With the patch,

Time: 4728.142 ms (00:04.728)
Time: 14203.983 ms (00:14.204)
Time: 1008.669 ms (00:01.009)

Baseline,
Time: 11096.146 ms (00:11.096)
Time: 13106.741 ms (00:13.107)
Time: 1100.174 ms (00:01.100)

For the TOAST and large-column cases the difference is within 10%, but for the small-column case the improvement is very good. Actually, if I hardcode the batch count as 4, all test cases improve, but the small-column improvement is then smaller than with the current patch. The number 4 is quite case specific, so I can not hardcode it in the patch. Of course we could tune it further, but the current value seems to be a good trade-off?

Thanks.


Attachment: 0001-Heap-batch-insert-for-CTAS.patch (9K)

Re: Batch insert in CTAS/MatView code

Michael Paquier-2
Hi,

On Wed, Mar 06, 2019 at 10:06:27PM +0800, Paul Guo wrote:
> The copy code has used batch insert with function heap_multi_insert() to
> speed up. It seems that Create Table As or Materialized View could leverage
> that code also to boost the performance also. Attached is a patch to
> implement that. That was done by Taylor (cc-ed) and me.

Please note that we are currently in the last commit fest of Postgres
12, and it is too late to propose new features.  Please feel free to
add an entry to the commit fest happening afterwards.
--
Michael


Re: Batch insert in CTAS/MatView code

Heikki Linnakangas
In reply to this post by Paul Guo-2
On 06/03/2019 22:06, Paul Guo wrote:
> The patch also modifies heap_multi_insert() a bit to do a bit further
> code-level optimization by using static memory, instead of using memory
> context and dynamic allocation.

If toasting is required, heap_prepare_insert() creates a palloc'd tuple.
That is still leaked to the current memory context.

Leaking into the current memory context is not a bad thing, because
resetting a memory context is faster than doing a lot of pfree() calls.
The callers just need to be prepared for that, and use a short-lived
memory context.
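
For example, the caller-side pattern could look roughly like this (an illustrative sketch, not code from the patch):

/*
 * Illustrative sketch: run each multi-insert batch inside a short-lived
 * memory context, so anything palloc'd during toasting is released by one
 * cheap reset instead of many pfree() calls.
 */
static void
flush_batch(Relation rel, TupleTableSlot **slots, int nslots,
            CommandId cid, int options, BulkInsertState bistate,
            MemoryContext batchcxt)
{
    MemoryContext oldcxt = MemoryContextSwitchTo(batchcxt);

    heap_multi_insert(rel, slots, nslots, cid, options, bistate);

    MemoryContextSwitchTo(oldcxt);
    MemoryContextReset(batchcxt);       /* drops any leaked toasted tuples */
}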

> By the way, while looking at the code, I noticed that there are 9 local
> arrays with large length in toast_insert_or_update() which seems to be a
> risk of stack overflow. Maybe we should put it as static or global.

Hmm. We currently reserve 512 kB between the kernel's limit, and the
limit we check in check_stack_depth(). See STACK_DEPTH_SLOP. Those
arrays add up to 52800 bytes on a 64-bit machine, if I did my math
right (presumably nine MaxHeapAttributeNumber-sized arrays, i.e.
1600 * (8 + 8 + 8 + 4 + 5 * 1) = 52800 bytes). So there's still a lot of
headroom. I agree that it nevertheless seems a bit excessive, though.

> With the patch,
>
> Time: 4728.142 ms (00:04.728)
> Time: 14203.983 ms (00:14.204)
> Time: 1008.669 ms (00:01.009)
>
> Baseline,
> Time: 11096.146 ms (00:11.096)
> Time: 13106.741 ms (00:13.107)
> Time: 1100.174 ms (00:01.100)

Nice speedup!

> While for toast and large column size there is < 10% decrease but for
> small column size the improvement is super good. Actually if I hardcode
> the batch count as 4 all test cases are better but the improvement for
> small column size is smaller than that with current patch. Pretty much
> the number 4 is quite case specific so I can not hardcode that in the
> patch. Of course we could further tune that but the current value seems
> to be a good trade-off?

Have you done any profiling, on why the multi-insert is slower with
large tuples? In principle, I don't see why it should be slower.

- Heikki


Re: Batch insert in CTAS/MatView code

Paul Guo-2
Sorry for the late reply.

To Michael: thank you. I know this commitfest is ongoing and I'm not targeting it.

On Thu, Mar 7, 2019 at 4:54 PM Heikki Linnakangas <[hidden email]> wrote:
On 06/03/2019 22:06, Paul Guo wrote:
> The patch also modifies heap_multi_insert() a bit to do a bit further
> code-level optimization by using static memory, instead of using memory
> context and dynamic allocation.

If toasting is required, heap_prepare_insert() creates a palloc'd tuple.
That is still leaked to the current memory context.


Thanks. I checked the code for that, but apparently I missed that one. I'll see what memory context is appropriate for CTAS. For the COPY code, maybe I'll just revert my change.

 
Leaking into the current memory context is not a bad thing, because
resetting a memory context is faster than doing a lot of pfree() calls.
The callers just need to be prepared for that, and use a short-lived
memory context.

> By the way, while looking at the code, I noticed that there are 9 local
> arrays with large length in toast_insert_or_update() which seems to be a
> risk of stack overflow. Maybe we should put it as static or global.

Hmm. We currently reserve 512 kB between the kernel's limit, and the
limit we check in check_stack_depth(). See STACK_DEPTH_SLOP. Those
arrays add up to 52800 bytes on a 64-bit machine, if I did my math
right. So there's still a lot of headroom. I agree that it nevertheless
seems a bit excessive, though.
 
I was worried about it being called recursively, but there is probably no such concern for toast_insert_or_update().


> With the patch,
>
> Time: 4728.142 ms (00:04.728)
> Time: 14203.983 ms (00:14.204)
> Time: 1008.669 ms (00:01.009)
>
> Baseline,
> Time: 11096.146 ms (00:11.096)
> Time: 13106.741 ms (00:13.107)
> Time: 1100.174 ms (00:01.100)

Nice speedup!

> While for toast and large column size there is < 10% decrease but for
> small column size the improvement is super good. Actually if I hardcode
> the batch count as 4 all test cases are better but the improvement for
> small column size is smaller than that with current patch. Pretty much
> the number 4 is quite case specific so I can not hardcode that in the
> patch. Of course we could further tune that but the current value seems
> to be a good trade-off?

Have you done any profiling, on why the multi-insert is slower with
large tuples? In principle, I don't see why it should be slower.

Thanks for the suggestion. I'll explore this a bit more.
 

- Heikki

Re: Batch insert in CTAS/MatView code

David Fetter
In reply to this post by Paul Guo-2
On Wed, Mar 06, 2019 at 10:06:27PM +0800, Paul Guo wrote:
> Hello, Postgres hackers,
>
> The copy code has used batch insert with function heap_multi_insert() to
> speed up. It seems that Create Table As or Materialized View could leverage
> that code also to boost the performance also. Attached is a patch to
> implement that.

This is great!

Is this optimization doable for multi-row INSERTs, either with tuples
spelled out in the body of the query or in constructs like INSERT ...
SELECT ...?

Best,
David.
--
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate


Re: Batch insert in CTAS/MatView code

Paul Guo-2


On Mon, Mar 11, 2019 at 2:58 AM David Fetter <[hidden email]> wrote:
On Wed, Mar 06, 2019 at 10:06:27PM +0800, Paul Guo wrote:
> Hello, Postgres hackers,
>
> The copy code has used batch insert with function heap_multi_insert() to
> speed up. It seems that Create Table As or Materialized View could leverage
> that code also to boost the performance also. Attached is a patch to
> implement that.

This is great!

Is this optimization doable for multi-row INSERTs, either with tuples
spelled out in the body of the query or in constructs like INSERT ...
SELECT ...?

Yes. That's the "batch insert" for ModifyTable nodes which I mentioned in the first email.
By the way, batching is a common optimization for an iteration-style model (like the Postgres executor),
so in theory batching should benefit many other executor nodes as well.
 

Best,
David.
--
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate

Re: Batch insert in CTAS/MatView code

Paul Guo-2
In reply to this post by Heikki Linnakangas
Hi all,

I've been working on other things and only recently restarted this work, profiling & refactoring the code.
It's been a long time since the last patch was proposed. The new patch has first been refactored on top of
4da597edf1bae0cf0453b5ed6fc4347b6334dfe1 (Make TupleTableSlots extensible, finish split of existing slot type).

Now that a TupleTableSlot, instead of a HeapTuple, is the argument of intorel_receive(), we can not get the
tuple length directly. The patch now uses the tuple length directly if we know all columns have fixed widths; otherwise
it calculates an average tuple length from the first MAX_MULTI_INSERT_SAMPLES (defined as 1000) tuples
and uses that to estimate the total length of the tuples in a batch.

I noticed that batch insert sometimes needs an additional memory copy compared with "single insert"
(that should be the reason we previously saw small regressions), so a good solution seems to be to fall back
to "single insert" if the tuple length is larger than a threshold. I set this to 2000 after quick testing.
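
In code, the decision looks roughly like this (a sketch; the constants match what I describe above, but the field names and bookkeeping are illustrative, not necessarily exactly what the patch does):

/* Sketch of the width estimation and the single-insert fallback. */
#define MAX_MULTI_INSERT_SAMPLES      1000  /* tuples sampled to estimate width */
#define MAX_TUP_LEN_FOR_MULTI_INSERT  2000  /* above this, fall back to single insert */

if (!myState->tup_len_known)
{
    bool        shouldFree;
    HeapTuple   tuple = ExecFetchSlotHeapTuple(slot, true, &shouldFree);

    /* accumulate sampled lengths for variable-width rows */
    myState->sampled_len += tuple->t_len;
    if (++myState->sampled_cnt >= MAX_MULTI_INSERT_SAMPLES)
    {
        myState->avg_tup_len = myState->sampled_len / myState->sampled_cnt;
        myState->tup_len_known = true;
    }
    if (shouldFree)
        heap_freetuple(tuple);
}

/* Wide rows gain little from batching and may pay an extra copy. */
if (myState->tup_len_known &&
    myState->avg_tup_len > MAX_TUP_LEN_FOR_MULTI_INSERT)
    use_single_insert = true;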

To make the test stable and strict, I run a checkpoint before each CTAS; the test script looks like this:
 
checkpoint;
\timing
create table tt as select a,b,c from t11;
\timing
drop table tt;

Also, previously I only tested the BufferHeapTupleTableSlot case (i.e. create table tt as select * from t11);
this time I additionally test the VirtualTupleTableSlot case (i.e. create table tt as select a,b,c from t11).
It seems that VirtualTupleTableSlot is very common in real cases.

I tested four kinds of tables; see the SQL below.

-- tuples with small size.
create table t11 (a int, b int, c int, d int);
insert into t11 select s,s,s,s from generate_series(1, 10000000) s;
analyze t11;

-- tuples that are untoasted and tuple size is 1984 bytes.
create table t12 (a name, b name, c name, d name, e name, f name, g name, h name, i name, j name, k name, l name, m name, n name, o name, p name, q name, r name, s name, t name, u name, v name, w name, x name, y name, z name, a1 name, a2 name, a3 name, a4 name, a5 name);
insert into t12 select 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', 'a', 'b', 'c', 'd', 'e' from generate_series(1, 500000);
analyze t12;

-- tuples that are untoasted and tuple size is 2112 bytes.
create table t13 (a name, b name, c name, d name, e name, f name, g name, h name, i name, j name, k name, l name, m name, n name, o name, p name, q name, r name, s name, t name, u name, v name, w name, x name, y name, z name, a1 name, a2 name, a3 name, a4 name, a5 name, a6 name, a7 name);
insert into t13 select 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', 'a', 'b', 'c', 'd', 'e', 'f', 'g' from generate_series(1, 500000);
analyze t13;

-- tuples that are toastable and tuple compressed size is 1084.
create table t14 (a text, b text, c text, d text, e text, f text, g text, h text, i text, j text, k text, l text, m text, n text, o text, p text, q text, r text, s text, t text, u text, v text, w text, x text, y text, z text);
insert into t14 select i, i, i, i, i, i, i, i, i, i, i, i, i, i, i, i, i, i, i, i, i, i, i, i, i, i from (select repeat('123456789', 10000) from generate_series(1,5000)) i;
analyze t14;


I also tested two scenarios for each test.

One is to clean up all kernel caches (page & inode & dentry on Linux) using the command below and then run the test:
    sync; echo 3 > /proc/sys/vm/drop_caches
After running all tests, all relation files are in the kernel cache (my test system has enough memory to hold all relation files), and then I run the tests again. I run it this way because in a real scenario the result should fall somewhere between these two. I also rerun each test and report the averaged results as the experiment results. Below are some results:


scenario 1: all related kernel caches are cleaned up (the first two columns are times in seconds).

baseline    patch   diff%   SQL
10.1        5.57    44.85%  create table tt as select * from t11;
10.7        5.52    48.41%  create table tt as select a,b,c from t11;
9.57        10.2    -6.58%  create table tt as select * from t12;
9.64        8.63    10.48%  create table tt as select a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z,a1,a2,a3,a4 from t12;
14.2        14.46   -1.83%  create table tt as select * from t13;
11.88       12.05   -1.43%  create table tt as select a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z,a1,a2,a3,a4,a5,a6 from t13;
3.17        3.25    -2.52%  create table tt as select * from t14;
2.93        3.12    -6.48%  create table tt as select a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y from t14;

scenario 2: all related kernel caches are populated by the previous run.

baseline    patch   diff%   SQL
9.6         4.97    48.23%  create table tt as select * from t11;
10.41       5.32    48.90%  create table tt as select a,b,c from t11;
9.12        9.52    -4.38%  create table tt as select * from t12;
9.66        8.6     10.97%  create table tt as select a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z,a1,a2,a3,a4 from t12;
13.56       13.6    -0.30%  create table tt as select * from t13;
11.36       11.7    -2.99%  create table tt as select a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z,a1,a2,a3,a4,a5,a6 from t13;
3.08        3.13    -1.62%  create table tt as select * from t14;
2.95        3.03    -2.71%  create table tt as select a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y from t14;


From the above we can draw some tentative conclusions:

1. t11: for narrow tables, batch insert helps a lot (40%+).

2. t12: for BufferHeapTupleTableSlot the patch slows down by 4.x%-6.x%, but for VirtualTupleTableSlot it improves by 10.x%.
Looking at execTuples.c, this seems to be closely related to the additional memory copy. VirtualTupleTableSlot seems to be
more common than BufferHeapTupleTableSlot, so the current code should probably be fine for most real cases. Alternatively, we
could decide whether to multi-insert based on the input slot type (see the sketch after this list), but that seems ugly in code;
or we could keep lowering the threshold a bit so that "create table tt as select * from t12;" also improves, although that hurts
the VirtualTupleTableSlot case.

3. t13: the new code still uses single insert here, so the difference should be small. I just wanted to see whether there is any regression even when we use "single insert".

4. t14 (TOAST): the degradation is small, not a big deal.
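
For reference, the slot-type check mentioned in point 2 would be something like this (illustrative only; the patch does not do this):

/*
 * Illustrative: pick the insert path from the incoming slot type
 * (the TTS_IS_* macros are in executor/tuptable.h).
 */
if (TTS_IS_VIRTUAL(slot))
    use_multi_insert = true;        /* copying out of a virtual slot is cheap */
else if (TTS_IS_BUFFERTUPLE(slot))
    use_multi_insert = false;       /* avoid the extra materialize/copy */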

By the way, have we tried or thought about allowing better prefetch (on Linux) for seqscans, i.e. using POSIX_FADV_SEQUENTIAL in posix_fadvise() to enlarge the kernel readahead window? This should help when sequential tuple handling is faster than the default kernel readahead setting.
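
Something along these lines (a plain C sketch, not code from the tree):

#include <fcntl.h>

/*
 * Illustrative only: hint the kernel that a relation file will be read
 * strictly sequentially, so it can use a larger readahead window.
 */
static void
advise_sequential_read(int fd)
{
    /* offset 0 and len 0 mean "the whole file" */
    (void) posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);
}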


v2 patch is attached.



Attachment: v2-0001-Heap-batch-insert-for-CTAS-MatView.patch (17K)

Re: Batch insert in CTAS/MatView code

Paul Guo-2


On Mon, Jun 17, 2019 at 8:53 PM Paul Guo <[hidden email]> wrote:
> 2. t12: For BufferHeapTupleTableSlot, the patch slows down 4.x%-6.x%, but for VirtualTupleTableSlot it improves 10.x%.
> If we look at execTuples.c, it looks like this is quite relevant to additional memory copy. It seems that VirtualTupleTableSlot is
> more common than the BufferHeapTupleTableSlot so probably the current code should be fine for most real cases. Or it's possible
> to determine multi-insert also according to the input slot tuple but this seems to be ugly in code. Or continue to lower the threshold a bit
> so that "create table tt as select * from t12;" also improves although this hurts the VirtualTupleTableSlot case.


To alleviate this, I tuned MAX_TUP_LEN_FOR_MULTI_INSERT a bit, lowering it from 2000 to 1600. With a table with 24 name-typed columns (total size 1536), I tried both:
case1: create table tt as select * from t12;
case2: create table tt as select a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w from t12;

The patch improves performance for both. Note that, of course, this change (MAX_TUP_LEN_FOR_MULTI_INSERT) does not affect the previous t11, t13, t14 test results in theory, since their code path is not affected.

kernel caches cleaned up:

        baseline(s)   patch(s)   diff%
case1:  7.65          7.30        4.6%
case2:  7.75          6.80       12.2%

relation files are in cache:

        baseline(s)   patch(s)   diff%
case1:  7.09          6.66        6.1%
case2:  7.49          6.83        8.8%

We do not need to look for a larger threshold that merely brings the case1 improvement down to near zero, since on other test environments the best threshold might differ a bit; it should be treated as a rough value, and 1600 seems to benefit most cases.

I attached the v3 patch which just has the MAX_TUP_LEN_FOR_MULTI_INSERT change.

Thanks.

 

Attachment: v3-0001-Heap-batch-insert-for-CTAS-MatView.patch (17K)

Re: Batch insert in CTAS/MatView code

Heikki Linnakangas
In reply to this post by Paul Guo-2
On 17/06/2019 15:53, Paul Guo wrote:
> I noticed that to do batch insert, we might need additional memory copy
> sometimes comparing with "single insert"
> (that should be the reason that we previously saw a bit regressions) so a
> good solution seems to fall back
> to "single insert" if the tuple length is larger than a threshold. I set
> this as 2000 after quick testing.

Where does the additional memory copy come from? Can we avoid doing it
in the multi-insert case?

- Heikki



Re: Batch insert in CTAS/MatView code

Paul Guo-2


On Fri, Aug 2, 2019 at 2:55 AM Heikki Linnakangas <[hidden email]> wrote:
On 17/06/2019 15:53, Paul Guo wrote:
> I noticed that to do batch insert, we might need additional memory copy
> sometimes comparing with "single insert"
> (that should be the reason that we previously saw a bit regressions) so a
> good solution seems to fall back
> to "single insert" if the tuple length is larger than a threshold. I set
> this as 2000 after quick testing.

Where does the additional memory copy come from? Can we avoid doing it
in the multi-insert case?

Hi Heikki,

Sorry for the late reply. I took some time to look at & debug the TupleTableSlotOps code
of the various TupleTableSlot types carefully, especially the BufferHeapTupleTableSlot case, on which
we seemed to see a regression if no threshold is set, and to debug & test the CTAS case some more.
I found that my previous phrase "additional memory copy" (mainly a tuple content copy compared with single insert)
is wrong based on the latest code (and probably was wrong with the previous code too). So in theory
we should not worry about additional tuple copy overhead now, and I then tried the patch without a
multi-insert threshold, as attached.

To make the test results more stable, this time I run a simple 'select count(*) from tbl' before each CTAS to
warm up the shared buffers, run a checkpoint before each CTAS, disable autovacuum by setting
'autovacuum = off', and set shared_buffers larger (but < 25% of total memory, as recommended
by the PG docs) so that CTAS hits shared buffers whenever warm buffers exist (double-checked via
explain(analyze, buffers)). These seem to be reasonable settings for performance testing. Each kind of CTAS
test is run three times (and before each run we warm up and checkpoint as mentioned).

I mainly tested the t12 case (normal table with tuple size ~2k), since for the others the patch either
performs better or about the same. Times below are in milliseconds.

Patch:                   1st_run    2nd_run    3rd_run
t12_BufferHeapTuple      7883.400   7549.966   8090.080
t12_Virtual              8041.637   8191.317   8182.404

Baseline:                1st_run    2nd_run    3rd_run
t12_BufferHeapTuple      8264.290   7508.410   7681.702
t12_Virtual              8167.792   7970.537   8106.874

I also roughly tested the other tables mentioned before (t11 and t14); the results and conclusions are the same.
t12_BufferHeapTuple means: create table tt as select * from t12;
t12_Virtual means: create table tt as select <partial columns> from t12;

So for t12 the results of the patch and the baseline are similar, so not setting a threshold seems fine,
though the t12_BufferHeapTuple results vary quite a lot (by up to 0.5 seconds) for both the patch and the
baseline, while the virtual case is quite stable.

This confused me a bit, given that we've cached the source table in shared buffers. I suspected checkpoint
effects, so I tried to avoid checkpoints during CTAS by setting max_wal_size = 3000; the BufferHeapTuple case
(see below) still varies somewhat. I'm not sure of the reason, but this does not seem to be a blocker for the patch.

Patch:                   1st_run    2nd_run    3rd_run
t12_BufferHeapTuple      7717.304   7413.259   7452.773
t12_Virtual              7445.742   7483.148   7593.583

Baseline:                1st_run    2nd_run    3rd_run
t12_BufferHeapTuple      8186.302   7736.541   7759.056
t12_Virtual              8004.880   8096.712   7961.483



Attachment: v4-0001-Multi-insert-in-Create-Table-As.patch (13K)

Re: Batch insert in CTAS/MatView code

Asim R P



On Mon, Sep 9, 2019 at 4:02 PM Paul Guo <[hidden email]> wrote:
>
> So in theory
> we should not worry about additional tuple copy overhead now, and then I tried the patch without setting
> multi-insert threshold as attached.
>

I reviewed your patch today.  It looks good overall.  My concern is that the ExecFetchSlotHeapTuple call does not seem appropriate.  In a generic place such as createas.c, we should be using generic tableam API only.  However, I can also see that there is no better alternative.  We need to compute the size of accumulated tuples so far, in order to decide whether to stop accumulating tuples.  There is no convenient way to obtain the length of the tuple, given a slot.  How about making that decision solely based on number of tuples, so that we can avoid ExecFetchSlotHeapTuple call altogether?
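
That is, the flush condition would become roughly (a sketch; the field names are only illustrative):

/* Sketch of a count-only batching rule, with no per-tuple length check. */
ExecCopySlot(myState->mi_slots[myState->mi_nslots], slot);
myState->mi_nslots++;

if (myState->mi_nslots >= MAX_MULTI_INSERT_TUPLES)
{
    table_multi_insert(myState->rel, myState->mi_slots, myState->mi_nslots,
                       myState->output_cid, myState->ti_options,
                       myState->bistate);
    myState->mi_nslots = 0;
}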

The multi-insert code in COPY also deals with index tuples, which I don't see in the patch.  Don't we need to consider populating indexes?

Asim


Re: Batch insert in CTAS/MatView code

Alvaro Herrera-9
On 2019-Sep-25, Asim R P wrote:

> I reviewed your patch today.  It looks good overall.  My concern is that
> the ExecFetchSlotHeapTuple call does not seem appropriate.  In a generic
> place such as createas.c, we should be using generic tableam API only.
> However, I can also see that there is no better alternative.  We need to
> compute the size of accumulated tuples so far, in order to decide whether
> to stop accumulating tuples.  There is no convenient way to obtain the
> length of the tuple, given a slot.  How about making that decision solely
> based on number of tuples, so that we can avoid ExecFetchSlotHeapTuple call
> altogether?

... maybe we should add a new operation to slots, that returns the
(approximate?) size of a tuple?  That would make this easy.  (I'm not
sure however what to do about TOAST considerations -- is it size in
memory that we're worried about?)

Also:

+       myState->mi_slots_size >= 65535)

This magic number should be placed in a define next to the other one,
but I'm not sure that heapam.h is a good location, since surely this
applies to matviews in other table AMs too.

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Batch insert in CTAS/MatView code

Paul Guo-2
In reply to this post by Asim R P

Asim, thanks for the review.

On Wed, Sep 25, 2019 at 6:39 PM Asim R P <[hidden email]> wrote:



On Mon, Sep 9, 2019 at 4:02 PM Paul Guo <[hidden email]> wrote:
>
> So in theory
> we should not worry about additional tuple copy overhead now, and then I tried the patch without setting
> multi-insert threshold as attached.
>

I reviewed your patch today.  It looks good overall.  My concern is that the ExecFetchSlotHeapTuple call does not seem appropriate.  In a generic place such as createas.c, we should be using generic tableam API only.  However, I can also see that there is no better alternative.  We need to compute the size of accumulated tuples so far, in order to decide whether to stop accumulating tuples.  There is no convenient way to obtain the length of the tuple, given a slot.  How about making that decision solely based on number of tuples, so that we can avoid ExecFetchSlotHeapTuple call altogether?

For heapam, ExecFetchSlotHeapTuple() will be called again in heap_multi_insert() to prepare the final multi-insert. If we look at ExecFetchSlotHeapTuple(), we can see that calling it multiple times adds very little overhead in the BufferHeapTuple case. Note that for the virtual tuple case the second ExecFetchSlotHeapTuple() call still copies the slot contents, but we have already called ExecCopySlot(batchslot, slot) to copy into a buffer-heap slot, so there is no worry for the virtual tuple case (as a source).

Previously (long ago) I probably understood the code incorrectly and had the same concern. I used sampling to handle that (for variable-length tuples), but now apparently we do not need it.

The multi insert copy code deals with index tuples also, which I don't see in the patch.  Don't we need to consider populating indexes?

CREATE TABLE AS / CREATE MATERIALIZED VIEW DDL does not involve index creation for the table/matview. The code could perhaps also be used in REFRESH MATERIALIZED VIEW; for that we would need to consider whether to use multi-insert there as well.
 

Re: Batch insert in CTAS/MatView code

Paul Guo-2
In reply to this post by Alvaro Herrera-9


On Thu, Sep 26, 2019 at 9:43 PM Alvaro Herrera <[hidden email]> wrote:

Also:

+       myState->mi_slots_size >= 65535)

This magic number should be placed in a define next to the other one,
but I'm not sure that heapam.h is a good location, since surely this
applies to matviews in other table AMs too.

Yes, giving 65535 a defined name seems better. Let's fix this later once we have more feedback. Thanks.

Re: Batch insert in CTAS/MatView code

Asim R P
In reply to this post by Alvaro Herrera-9


On Thu, Sep 26, 2019 at 7:13 PM Alvaro Herrera <[hidden email]> wrote:

> ... maybe we should add a new operation to slots, that returns the
> (approximate?) size of a tuple?  That would make this easy.  (I'm not
> sure however what to do about TOAST considerations -- is it size in
> memory that we're worried about?)

That will help.  For slots containing heap tuples, heap_compute_data_size() is what we need; an approximate size is better than nothing.
In the CTAS case, we are dealing with slots returned by a scan node.  Wouldn't TOAST datums already be expanded in those slots?
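
A rough sketch of how that could look for a slot (this only measures the data area, not the tuple header, and assumes the slot can be deformed):

/* Approximate a slot's tuple size from its deformed columns. */
static Size
slot_data_size(TupleTableSlot *slot)
{
    /* make sure tts_values/tts_isnull are populated */
    slot_getallattrs(slot);

    return heap_compute_data_size(slot->tts_tupleDescriptor,
                                  slot->tts_values,
                                  slot->tts_isnull);
}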

Asim

Re: Batch insert in CTAS/MatView code

Andres Freund
In reply to this post by Alvaro Herrera-9
Hi,

On 2019-09-26 10:43:27 -0300, Alvaro Herrera wrote:
> On 2019-Sep-25, Asim R P wrote:
>
> > I reviewed your patch today.  It looks good overall.  My concern is that
> > the ExecFetchSlotHeapTuple call does not seem appropriate.  In a generic
> > place such as createas.c, we should be using generic tableam API
> > only.

Indeed.


> > However, I can also see that there is no better alternative.  We need to
> > compute the size of accumulated tuples so far, in order to decide whether
> > to stop accumulating tuples.  There is no convenient way to obtain the
> > length of the tuple, given a slot.  How about making that decision solely
> > based on number of tuples, so that we can avoid ExecFetchSlotHeapTuple call
> > altogether?
>
> ... maybe we should add a new operation to slots, that returns the
> (approximate?) size of a tuple?

Hm, I'm not convinced that it's worth adding that as a dedicated
operation. It's not that clear what it'd exactly mean anyway - what
would it measure? As referenced in the slot? As if it were stored on
disk? etc?

I wonder if the right answer wouldn't be to just measure the size of a
memory context containing the batch slots, or something like that.
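
E.g., roughly like this (a sketch; it assumes a helper such as MemoryContextMemAllocated() is available to read a context's allocated size, and the flush_batch() helper, field names and limit are illustrative):

/*
 * Sketch: keep the batch slots (and their copied tuples) in a dedicated
 * memory context and flush once it grows past a limit, or once the slot
 * array is full.
 */
if (myState->mi_nslots >= MAX_MULTI_INSERT_TUPLES ||
    MemoryContextMemAllocated(myState->mi_context, true) >= 65536)
    flush_batch(myState);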


> That would make this easy.  (I'm not sure however what to do about
> TOAST considerations -- is it size in memory that we're worried
> about?)

The in-memory size is probably fine, because in all likelihood the
toasted cols are just going to point to on-disk datums, no?


> Also:
>
> +       myState->mi_slots_size >= 65535)
>
> This magic number should be placed in a define next to the other one,
> but I'm not sure that heapam.h is a good location, since surely this
> applies to matviews in other table AMs too.

Right. I think it'd be better to move this into an AM independent place.


Greetings,

Andres Freund



Re: Batch insert in CTAS/MatView code

Andres Freund
In reply to this post by Paul Guo-2
Hi,

On 2019-09-09 18:31:54 +0800, Paul Guo wrote:

> diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
> index e9544822bf..8a844b3b5f 100644
> --- a/src/backend/access/heap/heapam.c
> +++ b/src/backend/access/heap/heapam.c
> @@ -2106,7 +2106,6 @@ heap_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples,
>    CommandId cid, int options, BulkInsertState bistate)
>  {
>   TransactionId xid = GetCurrentTransactionId();
> - HeapTuple  *heaptuples;
>   int i;
>   int ndone;
>   PGAlignedBlock scratch;
> @@ -2115,6 +2114,10 @@ heap_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples,
>   Size saveFreeSpace;
>   bool need_tuple_data = RelationIsLogicallyLogged(relation);
>   bool need_cids = RelationIsAccessibleInLogicalDecoding(relation);
> + /* Declare it as static to let this memory be not on stack. */
> + static HeapTuple heaptuples[MAX_MULTI_INSERT_TUPLES];
> +
> + Assert(ntuples <= MAX_MULTI_INSERT_TUPLES);
>  
>   /* currently not needed (thus unsupported) for heap_multi_insert() */
>   AssertArg(!(options & HEAP_INSERT_NO_LOGICAL));
> @@ -2124,7 +2127,6 @@ heap_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples,
>     HEAP_DEFAULT_FILLFACTOR);
>  
>   /* Toast and set header data in all the slots */
> - heaptuples = palloc(ntuples * sizeof(HeapTuple));
>   for (i = 0; i < ntuples; i++)
>   {
>   HeapTuple tuple;

I don't think this is a good idea. We shouldn't unnecessarily allocate
8KB on the stack. Is there any actual evidence this is a performance
benefit? To me this just seems like it'll reduce the flexibility of the
API, without any benefit.  I'll also note that you've apparently not
updated tableam.h to document this new restriction.


Greetings,

Andres Freund



Re: Batch insert in CTAS/MatView code

Paul Guo-2


On Sat, Sep 28, 2019 at 5:49 AM Andres Freund <[hidden email]> wrote:
Hi,

On 2019-09-09 18:31:54 +0800, Paul Guo wrote:
> diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
> index e9544822bf..8a844b3b5f 100644
> --- a/src/backend/access/heap/heapam.c
> +++ b/src/backend/access/heap/heapam.c
> @@ -2106,7 +2106,6 @@ heap_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples,
>                                 CommandId cid, int options, BulkInsertState bistate)
>  {
>       TransactionId xid = GetCurrentTransactionId();
> -     HeapTuple  *heaptuples;
>       int                     i;
>       int                     ndone;
>       PGAlignedBlock scratch;
> @@ -2115,6 +2114,10 @@ heap_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples,
>       Size            saveFreeSpace;
>       bool            need_tuple_data = RelationIsLogicallyLogged(relation);
>       bool            need_cids = RelationIsAccessibleInLogicalDecoding(relation);
> +     /* Declare it as static to let this memory be not on stack. */
> +     static HeapTuple        heaptuples[MAX_MULTI_INSERT_TUPLES];
> +
> +     Assert(ntuples <= MAX_MULTI_INSERT_TUPLES);

>       /* currently not needed (thus unsupported) for heap_multi_insert() */
>       AssertArg(!(options & HEAP_INSERT_NO_LOGICAL));
> @@ -2124,7 +2127,6 @@ heap_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples,
>                                                                                                  HEAP_DEFAULT_FILLFACTOR);

>       /* Toast and set header data in all the slots */
> -     heaptuples = palloc(ntuples * sizeof(HeapTuple));
>       for (i = 0; i < ntuples; i++)
>       {
>               HeapTuple       tuple;

I don't think this is a good idea. We shouldn't unnecessarily allocate
8KB on the stack. Is there any actual evidence this is a performance
benefit? To me this just seems like it'll reduce the flexibility of the

Previously, heaptuples was palloc'd for each batch, which in theory should be slower than
pre-allocating and reusing the memory.

API, without any benefit.  I'll also note that you've apparently not
updated tableam.h to document this new restriction.
 
Yes, it should be moved from heapam.h to that file, along with the 65535 definition.

Re: Batch insert in CTAS/MatView code

Paul Guo-2
In reply to this post by Andres Freund

> > However, I can also see that there is no better alternative.  We need to
> > compute the size of accumulated tuples so far, in order to decide whether
> > to stop accumulating tuples.  There is no convenient way to obtain the
> > length of the tuple, given a slot.  How about making that decision solely
> > based on number of tuples, so that we can avoid ExecFetchSlotHeapTuple call
> > altogether?
>
> ... maybe we should add a new operation to slots, that returns the
> (approximate?) size of a tuple?

Hm, I'm not convinced that it's worth adding that as a dedicated
operation. It's not that clear what it'd exactly mean anyway - what
would it measure? As referenced in the slot? As if it were stored on
disk? etc?

I wonder if the right answer wouldn't be to just measure the size of a
memory context containing the batch slots, or something like that.


Probably a better way is to move that logic (appending a slot to the slot array, deciding
when to flush, flushing, cleaning up the slots) into table_multi_insert()? Generally,
the final implementation of table_multi_insert() should be able to determine
the sizes easily. One concern is that currently only COPY in the tree uses
multi-insert, so I'm not sure whether future callers would want their own
logic (we could add a flag to allow customization, but that seems a bit over-designed?).

Re: Batch insert in CTAS/MatView code

Andres Freund
Hi,

On 2019-09-30 12:12:31 +0800, Paul Guo wrote:

> > > > However, I can also see that there is no better alternative.  We need to
> > > > compute the size of accumulated tuples so far, in order to decide whether
> > > > to stop accumulating tuples.  There is no convenient way to obtain the
> > > > length of the tuple, given a slot.  How about making that decision solely
> > > > based on number of tuples, so that we can avoid ExecFetchSlotHeapTuple call
> > > > altogether?
> > >
> > > ... maybe we should add a new operation to slots, that returns the
> > > (approximate?) size of a tuple?
> >
> > Hm, I'm not convinced that it's worth adding that as a dedicated
> > operation. It's not that clear what it'd exactly mean anyway - what
> > would it measure? As referenced in the slot? As if it were stored on
> > disk? etc?
> >
> > I wonder if the right answer wouldn't be to just measure the size of a
> > memory context containing the batch slots, or something like that.
> >
> >
> Probably a better way is to move those logic (append slot to slots, judge
> when to flush, flush, clean up slots) into table_multi_insert()?

That does not strike me as a good idea. The upper layer is going to need
to manage some resources (e.g. it's the only bit that knows about how to
manage lifetime of the incoming data), and by exposing it to each AM
we're going to duplicate the necessary code too.


> Generally the final implementation of table_multi_insert() should be
> able to know the sizes easily. One concern is that currently just COPY
> in the repo uses multi insert, so not sure if other callers in the
> future want their own logic (or set up a flag to allow customization
> but seems a bit over-designed?).

And that is also a concern, it seems unlikely that we'll get the
interface good.


Greetings,

Andres Freund

