Parallel INSERT (INTO ... SELECT ...)


Re: Parallel INSERT (INTO ... SELECT ...)

akapila
On Wed, Feb 24, 2021 at 6:21 PM Greg Nancarrow <[hidden email]> wrote:

>
> On Wed, Feb 24, 2021 at 10:38 PM Amit Kapila <[hidden email]> wrote:
> > >
> > > Thanks, I'll try it.
> > > I did, however, notice a few concerns with your suggested alternative fix:
> > > - It is not restricted to INSERT (as current fix is).
> > >
> >
> > So what? The Select also has a similar problem.
> >
>
> Yes, but you're potentially now adding overhead to every
> SELECT/UPDATE/DELETE with a subquery, through the added recursive checking
> and walking done by the new call to max_parallel_hazard_walker() and the
> code block looking for a modifying CTE.
>

Can you please share an example where it has added an overhead?

> And anyway, I'm not sure it's really right to put a fix for SELECT
> with a modifying CTE into a patch that adds parallel INSERT
> functionality - in any case you'd need to really spell this out in
> code comments, as this is at best a temporary fix that would need to
> be removed whenever the query rewriter is fixed to set hasModifyingCTE
> correctly.
>
> > > - It does not set parse->hasModifyingCTE (as current fix does), so the
> > > return value (PlannedStmt) from standard_planner() won't have
> > > hasModifyingCTE set correctly in the cases where the rewriter doesn't
> > > set it correctly (and I'm not sure what strange side effects that
> > > might have).
> >
> > The end goal here is not to set hasModifyingCTE, but do let me know if
> > you see any problem or impact.
> >
>
> parse->hasModifyingCTE is not just used in the shortcut-test for
> parallel-safety, its value is subsequently copied into the PlannedStmt
> returned by standard_planner.
> It's inconsistent to leave hasModifyingCTE FALSE when the fix has
> found a modifying CTE.
> Even if no existing tests detect an issue with this, PlannedStmt is
> left with a bad hasModifyingCTE value in this case, so there is the
> potential for something to go wrong.
>
> > > - Although the invocation of max_parallel_hazard_walker() on an RTE
> > > subquery will "work" in finally locating your fix's added
> > > "CommonTableExpr" parallel-safety disabling block for commandType !=
> > > CMD_SELECT, it looks like it potentially results in checking and
> > > walking over a lot of other stuff within the subquery not related to
> > > CTEs. The current fix does a more specific and efficient search for a
> > > modifying CTE.
> > >
> >
> > I find the current fix proposed by you quite ad-hoc and don't think we
> > can go that way.
> >
>
> At least my current fix is very specific, efficient and clear in its
> purpose, and suitably documented, so it is very clear when and how it
> is to be removed once the issue is fixed in the query rewriter.
> Another concern with the alternative fix is that it always searches
> for a modifying CTE, even when parse->hasModifyingCTE is true after
> the query rewriter processing.
>

There is a check in standard_planner such that if
parse->hasModifyingCTE is true then we won't try checking
parallel-safety.
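
For illustration, that gate can be modeled roughly as follows. This is a simplified Python sketch, not the actual standard_planner() code (the real test involves more conditions, such as cursor options and max_parallel_workers_per_gather); it just shows that a query with a modifying CTE never reaches the parallel-safety walk at all:

```python
# Hypothetical model of the standard_planner() gate: when the query has a
# modifying CTE, parallel-safety checking is skipped entirely, so the
# walker-based search for modifying CTEs never runs in that case.
def parallel_mode_ok(command_type, has_modifying_cte, parallel_ok_cursor=True):
    """Return True only when the planner may even consider parallelism."""
    if not parallel_ok_cursor:
        return False
    if command_type != "SELECT":   # the restriction the parallel-INSERT patch relaxes
        return False
    if has_modifying_cte:          # the short-circuit discussed in this thread
        return False
    return True

print(parallel_mode_ok("SELECT", has_modifying_cte=True))   # False: safety checks skipped
print(parallel_mode_ok("SELECT", has_modifying_cte=False))  # True: walker would run
```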


--
With Regards,
Amit Kapila.



Re: Parallel INSERT (INTO ... SELECT ...)

Greg Nancarrow
On Thu, Feb 25, 2021 at 12:19 AM Amit Kapila <[hidden email]> wrote:

>
> [...]
>
> There is a check in standard_planner such that if
> parse->hasModifyingCTE is true then we won't try checking
> parallel-safety.
>

OK, I retract that last concern; parallel-safety checks are skipped
when parse->hasModifyingCTE is true.
Examples of overhead will need to wait until tomorrow (and would need
testing), but it seems fairly clear that max_parallel_hazard_walker()
first checks for parallel-unsafe functions in the node, then does
numerous node-type checks before getting to CommonTableExpr - exactly
how much extra work it does would depend on the SQL.
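
To illustrate the kind of cost involved, here is a toy walker in Python. It is purely illustrative - the node kinds and the ordering of checks are made up, not the real max_parallel_hazard_walker() - but it shows the point: every node visited is run through a series of checks in order, so how much work happens before an unsafe CTE is found depends entirely on the shape of the query tree:

```python
# Toy parallel-safety walker over a nested node tree.  Like the real
# walker, it applies its checks to every node it visits and recurses
# depth-first, stopping as soon as something unsafe is found.
def max_hazard_walk(node, stats):
    if node is None:
        return False
    stats["visited"] += 1
    # check 1: parallel-unsafe function call?
    if node.get("kind") == "FuncExpr" and node.get("proparallel") == "u":
        return True
    # ... many other node-type checks would sit here ...
    # check N: modifying CTE?
    if node.get("kind") == "CommonTableExpr" and node.get("cmd") != "SELECT":
        return True
    return any(max_hazard_walk(c, stats) for c in node.get("args", []))

query = {"kind": "Query", "args": [
    {"kind": "FuncExpr", "proparallel": "s"},
    {"kind": "CommonTableExpr", "cmd": "INSERT"},  # found only after walking siblings
]}
stats = {"visited": 0}
assert max_hazard_walk(query, stats)
print(stats["visited"])  # prints 3: nodes examined before the unsafe CTE was found
```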

Regards,
Greg Nancarrow
Fujitsu Australia



RE: Parallel INSERT (INTO ... SELECT ...)

houzj.fnst@fujitsu.com
In reply to this post by tsunakawa.takay@fujitsu.com
> > It is quite possible what you are saying is correct but I feel that is
> > not this patch's fault. So, won't it be better to discuss this in a
> > separate thread?
> >
> > Good use case but again, I think this can be done as a separate patch.
>
> Agreed.
> I think even the current patch offers great benefits and can be committed in PG
> 14, even if all my four feedback comments are left unaddressed.  I just touched
> on them for completeness in terms of typically expected use cases.  They will
> probably be able to be implemented on top of the current design.
>
>
>
> > I think here you are talking about the third patch (Parallel Inserts).
> > I guess if one has run inserts parallelly from psql then also similar
> > behavior would have been observed. For tables, it might lead to better
> > results once we have the patch discussed at [1]. Actually, this needs
> > more investigation.
> >
> > [1] -
> >
> https://www.postgresql.org/message-id/20200508072545.GA9701%40telsas
> o
> > ft.com
>
> That looks interesting and worth a try.

Hi,

I tested the bitmapscan with both the multi-insert patch and the parallel insert patch applied,
but the performance degradation and table size increase still happened on my machine.

To better analyze this issue, I did some more research on it (with only the parallel insert patch applied).

I added some code to track the time spent in index operations.
From the results[1], we can see that more workers bring more cost in _bt_search_insert() in each worker.
After debugging, the most costly part is the following:
-----
                /* drop the read lock on the page, then acquire one on its child */
                *bufP = _bt_relandgetbuf(rel, *bufP, child, page_access);
-----

It seems the ordering of the parallel bitmap scan's results leads to more lock time during parallel insert.
[1]---------------parallel bitmap scan------------------
worker 0:
psql:test.sql:10: INFO:  insert index _bt_search_insert time:834735
psql:test.sql:10: INFO:  insert index total time:1895330
psql:test.sql:10: INFO:  insert tuple time:628064

worker 2:
psql:test.sql:10: INFO:  insert index _bt_search_insert time:1552242
psql:test.sql:10: INFO:  insert index total time:2374741
psql:test.sql:10: INFO:  insert tuple time:314571

worker 4:
psql:test.sql:10: INFO:  insert index _bt_search_insert time:2496424
psql:test.sql:10: INFO:  insert index total time:3016150
psql:test.sql:10: INFO:  insert tuple time:211741
----------------------------

Based on the above, I tried to change the order of the results that the bitmapscan returns.
In the original test, we prepare the data in order (like: generate_series(1,10000,1)).
If we change the order in which we insert the data into the source table, the performance degradation does not always happen[2],
and the table size difference is small.

-------------------out of order source table-----------------------------
insert into x(a,b,c) select i,i+1,i+2 from generate_series(1,600000000) as t(i) order by random();
----------------------------------------------------------------------------

Test results when source table out of order(using bitmap heap scan):
[2]--------------------------------------------------------
Worker 0:
Execution Time: 37028.006 ms
Worker 2:
Execution Time: 11355.153 ms
Worker 4:
Execution Time: 9273.398 ms
--------------------------------------------------------

So, this performance degradation issue seems related to the order of the data in the source table.
It does not always happen. Do we need a specific fix for it?
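
The effect described above can be modeled very roughly in Python. This is only an illustration (page capacity and worker count are arbitrary values, and real btree descent is far more involved): when source rows arrive in key order, concurrent workers keep landing on the same rightmost leaf page and contend for its lock, whereas shuffled input spreads them across the tree:

```python
# Rough model of why ordered input hurts parallel B-tree insertion:
# consecutive keys map to the same leaf "page", so in every round all
# workers target one page; shuffled keys rarely collide.
import random

def same_page_collisions(keys, nworkers=4, keys_per_page=100):
    """Count rounds in which two or more workers target the same leaf page."""
    collisions = 0
    for i in range(0, len(keys) - nworkers + 1, nworkers):
        pages = {k // keys_per_page for k in keys[i:i + nworkers]}
        if len(pages) < nworkers:   # at least two workers hit one page
            collisions += 1
    return collisions

random.seed(0)
ordered = list(range(100_000))
shuffled = ordered[:]
random.shuffle(shuffled)
print(same_page_collisions(ordered))   # 25000: every round contends
print(same_page_collisions(shuffled))  # far fewer collisions
```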

For multi-insert, I guess the reason it does not solve the performance problem is that we do not actually have an API for multi-index insert
(like the table AM API rd_tableam->multi_insert()), so we still execute ExecInsertIndexTuples in a loop for each index insert.
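
A small sketch of that asymmetry (all names here are illustrative stand-ins, not the real PostgreSQL API): heap tuples can be written in one batched call, but each tuple still triggers one index insertion per index, so the per-tuple index descent cost remains:

```python
# Sketch of the asymmetry described above: a batched multi_insert() for
# heap tuples, but no batched equivalent for indexes, so index insertion
# happens once per tuple per index (the ExecInsertIndexTuples loop).
heap_writes = []
index_writes = []

def table_multi_insert(tuples):
    heap_writes.append(list(tuples))     # one batched heap operation

def exec_insert_index_tuples(tup, indexes):
    for idx in indexes:                  # one descent per index, per tuple
        index_writes.append((idx, tup))

batch = [(i, i + 1) for i in range(1000)]
table_multi_insert(batch)                # 1 heap call for 1000 tuples
for tup in batch:
    exec_insert_index_tuples(tup, ["pkey", "secondary"])

print(len(heap_writes), len(index_writes))  # 1 2000
```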

I plan to do some more tests of multi-insert and parallel insert with an out-of-order source table.

Best regards,
houzj

Re: Parallel INSERT (INTO ... SELECT ...)

Amit Langote
In reply to this post by akapila
On Wed, Feb 24, 2021 at 6:03 PM Amit Kapila <[hidden email]> wrote:

> On Wed, Feb 24, 2021 at 2:14 PM Greg Nancarrow <[hidden email]> wrote:
> > On Wed, Feb 24, 2021 at 3:12 PM Amit Kapila <[hidden email]> wrote:
> > > On Wed, Feb 24, 2021 at 8:41 AM Greg Nancarrow <[hidden email]> wrote:
> > > > On Tue, Feb 23, 2021 at 10:53 PM Amit Kapila <[hidden email]> wrote:
> > > > > > But the non-parallel plan was chosen (instead of a parallel plan)
> > > > > > because of parallel-safety checks on the partitions, which found
> > > > > > attributes of the partitions which weren't parallel-safe.
> > > > > > So it's not so clear to me that the dependency doesn't exist - the
> > > > > > non-parallel plan does in fact depend on the state of the partitions.
> > > > >
> > > > > Hmm, I think that is not what we can consider as a dependency.
> > > >
> > > > Then if it's not a dependency, then we shouldn't have to check the
> > > > attributes of the partitions for parallel-safety, to determine whether
> > > > we must use a non-parallel plan (or can use a parallel plan).
> > > > Except, of course, we do have to ...
> > >
> > > I don't think the plan-dependency and checking for parallel-safety are
> > > directly related.
> >
> > That is certainly not my understanding. Why do you think that they are
> > not directly related?
> > This whole issue came about because Amit L pointed out that there is a
> > need to add partition OIDs as plan-dependencies BECAUSE the checking
> > for parallel-safety and plan-dependency are related - since now, for
> > Parallel INSERT, we're executing extra parallel-safety checks that
> > check partition properties, so the resultant plan is dependent on the
> > partitions and their properties.
>
> He has pointed out an issue when the plan is parallel, and you can see
> in that example that it fails if we don't invalidate it. For
> non-parallel plans, there won't be any such issue.

Yes.  I checked around a bit (code and -hackers archive [1]) and came
away with the impression that there do not appear to be any set rules
for deciding which object changes to send an invalidation message for
(sending side: ddl, vacuum/analyze) and which items of a Query or a
Plan to track changes for (receiving side: planner, plancache).  One
could say the foremost rule is to avoid broken cached plans and only
in some really obvious cases do the thing that produces a better plan
[2].  While no compromises can be made for the former, whether or not
to go for the latter probably involves some cost-benefit analysis,
something we can probably revisit.

I don't think we're compromising by not adding the partition OIDs when
the insert plan is not parallel, and the benefits of adding them in
all cases are not so clear-cut, so maybe it's not worth it.

--
Amit Langote
EDB: http://www.enterprisedb.com

[1] Plan invalidation design:
https://www.postgresql.org/message-id/flat/20244.1171734513%40sss.pgh.pa.us

[2] ...contradicts what I said before, but I found this comment in
DefineIndex():

    /*
     * The pg_index update will cause backends (including this one) to update
     * relcache entries for the index itself, but we should also send a
     * relcache inval on the parent table to force replanning of cached plans.
     * Otherwise existing sessions might fail to use the new index where it
     * would be useful.  (Note that our earlier commits did not create reasons
     * to replan; so relcache flush on the index itself was sufficient.)
     */
    CacheInvalidateRelcacheByRelid(heaprelid.relId);

So this invalidates any plans referencing the index's parent relation
to trigger replanning so as to take the index into account.  The old
plans would not really be "unrunnable" without the index though.



Re: Parallel INSERT (INTO ... SELECT ...)

Greg Nancarrow
Posting a new version of the patches, with the following updates:

- Added parallel-safety check for index predicate and added extra
tests for this.
- Removed the fix for the query rewriter hasModifyingCTE bug and instead
handled this by detecting a modifying CTE within the parallel-safety
checks (and returning UNSAFE if one is found).
- Updated registration of parallel-safety-checked partition OIDs as
plan dependencies (only do it for parallel plans).
- Integrated parallel_dml patch.


Regards,
Greg Nancarrow
Fujitsu Australia

v19-0001-Enable-parallel-SELECT-for-INSERT-INTO-.-SELECT.patch (38K)
v19-0005-Parallel-INSERT-and-or-SELECT-for-INSERT-INTO-tests-and-doc.patch (30K)
v19-0003-Add-new-parallel-dml-GUC-and-table-options.patch (26K)
v19-0004-Enable-parallel-INSERT-and-or-SELECT-for-INSERT-INTO.patch (60K)
v19-0002-Parallel-SELECT-for-INSERT-INTO-.-SELECT-tests-and-doc.patch (95K)

Re: Parallel INSERT (INTO ... SELECT ...)

Amit Langote
In reply to this post by Amit Langote
Hi Greg,

Replying to an earlier email in the thread because I think I found a
problem relevant to the topic that was brought up.

On Wed, Feb 17, 2021 at 10:34 PM Amit Langote <[hidden email]> wrote:
> On Wed, Feb 17, 2021 at 10:44 AM Greg Nancarrow <[hidden email]> wrote:
> > Is the use of "table_close(rel, NoLock)'' intentional? That will keep
> > the lock (lockmode) until end-of-transaction.
>
> I think we always keep any locks on relations that are involved in a
> plan until end-of-transaction.  What if a partition is changed in an
> unsafe manner between being considered safe for parallel insertion and
> actually performing the parallel insert?

I realized that there is a race condition that will allow a concurrent
backend to invalidate a parallel plan (for insert into a partitioned
table) at a point in time when it's too late for plancache.c to detect
it.  It has to do with how plancache.c locks the relations affected by
a cached query and its cached plan.  Specifically,
AcquireExecutorLocks(), which locks the relations that need to be
locked before the plan could be considered safe to execute, does not
notice the partitions that would have been checked for parallel safety
when the plan was made.  Given that AcquireExecutorLocks() only loops
over PlannedStmt.rtable and locks exactly the relations found there,
partitions are not locked.  This means that a concurrent backend can,
for example, add an unsafe trigger to a partition before a parallel
worker inserts into it only to fail when it does.
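
A simplified model of the hazard (hypothetical Python names; the real logic lives in plancache.c's AcquireExecutorLocks(), and the `include_partitions` flag stands in for what the fix patch adds):

```python
# AcquireExecutorLocks() walks PlannedStmt.rtable and locks exactly those
# relations; partitions that were only parallel-safety-checked at plan
# time (tracked in partitionOids by the patch) are not in rtable, so
# without the fix they stay unlocked at execution time.
class PlannedStmt:
    def __init__(self, rtable, partition_oids):
        self.rtable = rtable
        self.partitionOids = partition_oids

def acquire_executor_locks(stmt, include_partitions=False):
    locked = set(stmt.rtable)
    if include_partitions:             # the behaviour the fix adds
        locked |= set(stmt.partitionOids)
    return locked

plan = PlannedStmt(rtable=["rp", "foo"], partition_oids=["rp1", "rp2"])

before = acquire_executor_locks(plan)
after = acquire_executor_locks(plan, include_partitions=True)
print("rp2" in before)  # False: a concurrent CREATE TRIGGER can sneak in
print("rp2" in after)   # True: partition locked before execution
```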

Steps to reproduce (tried with v19 set of patches):

drop table if exists rp, foo;
create table rp (a int) partition by range (a);
create table rp1 partition of rp for values from (minvalue) to (0);
create table rp2 partition of rp for values from (0) to (maxvalue);
create table foo (a) as select generate_series(1, 1000000);
set plan_cache_mode to force_generic_plan;
set enable_parallel_dml to on;
deallocate all;
prepare q as insert into rp select * from foo where a%2 = 0;
explain analyze execute q;

At this point, attach a debugger to this backend and set a breakpoint
in AcquireExecutorLocks() and execute q again:

-- will execute the cached plan
explain analyze execute q;

Breakpoint will be hit.  Continue till return from the function and
stop. Start another backend and execute this:

-- will go through, because no lock still taken on the partition
create or replace function make_table () returns trigger language
plpgsql as $$ begin create table bar(); return null; end; $$ parallel
unsafe;
create trigger ai_rp2 after insert on rp2 for each row execute
function make_table();

Back to the debugger, hit continue to let the plan execute.  You
should see this error:

ERROR:  cannot start commands during a parallel operation
CONTEXT:  SQL statement "create table bar()"
PL/pgSQL function make_table() line 1 at SQL statement parallel worker

The attached patch fixes this, although I am starting to have second
thoughts about how we're tracking partitions in this patch.  Wondering
if we should bite the bullet and add partitions into the main range
table instead of tracking them separately in partitionOids, which
might result in a cleaner patch overall.

--
Amit Langote
EDB: http://www.enterprisedb.com

AcquireExecutorLocks-partition-insert.patch (6K)

RE: Parallel INSERT (INTO ... SELECT ...)

houzj.fnst@fujitsu.com
In reply to this post by houzj.fnst@fujitsu.com
> [...]
>
> So, this performance degradation issue seems related to the order of the data
> in the source table.
> It does not always happen. Do we need a specific fix for it?

After doing some more tests (the performance degradation does not happen when the source table is out of order),
I think we can say the performance degradation is related to the order of the data in the source table.
Also, on the master branch, I found that the order of data in the source table does not affect the planner when generating the plan (for the SELECT part).

As we can see from [1][2], if the source table's data is in order and max_parallel_workers_per_gather is set to 4,
the planner will choose a parallel seqscan here, but it is actually slower than a serial bitmapscan.
If the data in the source table is out of order, the performance degradation does not happen[3][4].

So, the influence of the data ordering seems to be a normal phenomenon, and I think we do not need to do anything about it (currently).
It seems better to mark it as a TODO which we can improve in the future.

(Since the performance degradation in parallel bitmap scan is because of the lock in _bt_search, it will not always happen
when the target table already has data; fewer race conditions will occur when inserting in parallel into an evenly distributed btree.)



Test result with sql: "postgres=# explain analyze verbose select a from x where a<80000 or (a%2=0 and a>199000000);"

[1]-----------Source table data in order and max_parallel_workers_per_gather = 0-----------
Bitmap Heap Scan on public.x  (cost=22999.40..4991850.30 rows=91002 width=4) (actual time=45.445..201.322 rows=579999 loops=1)
   Output: a
   Recheck Cond: ((x.a < 80000) OR (x.a > 199000000))
   Filter: ((x.a < 80000) OR (((x.a % 2) = 0) AND (x.a > 199000000)))
   Rows Removed by Filter: 500000
   Heap Blocks: exact=5840
   ->  BitmapOr  (cost=22999.40..22999.40 rows=1242768 width=0) (actual time=44.622..44.637 rows=0 loops=1)
         ->  Bitmap Index Scan on x_a_idx  (cost=0.00..1575.70 rows=85217 width=0) (actual time=3.622..3.634 rows=79999 loops=1)
               Index Cond: (x.a < 80000)
         ->  Bitmap Index Scan on x_a_idx  (cost=0.00..21378.20 rows=1157551 width=0) (actual time=40.998..40.998 rows=1000000 loops=1)
               Index Cond: (x.a > 199000000)
 Planning Time: 0.199 ms
 Execution Time: 232.327 ms

[2]-----------Source table data in order and max_parallel_workers_per_gather = 4-----------
Gather  (cost=1000.00..2091183.08 rows=91002 width=4) (actual time=0.594..4216.197 rows=579999 loops=1)
   Output: a
   Workers Planned: 4
   Workers Launched: 4
   ->  Parallel Seq Scan on public.x  (cost=0.00..2081082.88 rows=22750 width=4) (actual time=0.087..4197.570 rows=116000 loops=5)
         Output: a
         Filter: ((x.a < 80000) OR (((x.a % 2) = 0) AND (x.a > 199000000)))
         Rows Removed by Filter: 39884000
         Worker 0:  actual time=0.083..4219.087 rows=96574 loops=1
         Worker 1:  actual time=0.076..4201.339 rows=195354 loops=1
         Worker 2:  actual time=0.096..4218.637 rows=96474 loops=1
         Worker 3:  actual time=0.118..4205.825 rows=96847 loops=1
 Planning Time: 0.175 ms
 Execution Time: 4243.089 ms
(14 rows)


[3]-----------Source table data out of order and max_parallel_workers_per_gather = 0-----------
 Bitmap Heap Scan on public.x2  (cost=19815.15..4953678.20 rows=81178 width=4) (actual time=263.640..15653.251 rows=579999 loops=1)
   Output: a
   Recheck Cond: ((x2.a < 80000) OR (x2.a > 199000000))
   Rows Removed by Index Recheck: 115394602
   Filter: ((x2.a < 80000) OR (((x2.a % 2) = 0) AND (x2.a > 199000000)))
   Rows Removed by Filter: 500000
   Heap Blocks: exact=55343 lossy=629270
   ->  BitmapOr  (cost=19815.15..19815.15 rows=1070588 width=0) (actual time=248.701..248.715 rows=0 loops=1)
         ->  Bitmap Index Scan on x2_a_idx  (cost=0.00..1408.13 rows=76208 width=0) (actual time=18.116..18.117 rows=79999 loops=1)
               Index Cond: (x2.a < 80000)
         ->  Bitmap Index Scan on x2_a_idx  (cost=0.00..18366.43 rows=994381 width=0) (actual time=230.581..230.582 rows=1000000 loops=1)
               Index Cond: (x2.a > 199000000)
 Planning Time: 0.530 ms
 Execution Time: 15700.488 ms

[4]-----------Source table data out of order and max_parallel_workers_per_gather = 4-----------
Gather  (cost=1000.00..2090199.80 rows=81178 width=4) (actual time=0.674..5154.622 rows=579999 loops=1)
   Output: a
   Workers Planned: 4
   Workers Launched: 4
   ->  Parallel Seq Scan on public.x2  (cost=0.00..2081082.00 rows=20294 width=4) (actual time=0.124..5112.635 rows=116000 loops=5)
         Output: a
         Filter: ((x2.a < 80000) OR (((x2.a % 2) = 0) AND (x2.a > 199000000)))
         Rows Removed by Filter: 39884000
         Worker 0:  actual time=0.220..5136.219 rows=107646 loops=1
         Worker 1:  actual time=0.170..5133.046 rows=114824 loops=1
         Worker 2:  actual time=0.127..5128.256 rows=115010 loops=1
         Worker 3:  actual time=0.066..5133.022 rows=120061 loops=1
 Planning Time: 0.194 ms
 Execution Time: 5187.682 ms

Best regards,
houzj







Re: Parallel INSERT (INTO ... SELECT ...)

Greg Nancarrow
In reply to this post by Amit Langote
On Fri, Feb 26, 2021 at 4:07 PM Amit Langote <[hidden email]> wrote:

>
> [...]
>
> The attached patch fixes this, although I am starting to have second
> thoughts about how we're tracking partitions in this patch.  Wondering
> if we should bite the bullet and add partitions into the main range
> table instead of tracking them separately in partitionOids, which
> might result in a cleaner patch overall.
>

Thanks Amit,

I was able to reproduce the problem using your instructions (though I
found I had to run that explain an extra time, in order to hit the
breakpoint).
Also, I can confirm that the problem doesn't occur after application
of your patch.
I'll leave it to your better judgement as to what to do next, if you
feel the current tracking method is not sufficient.
Regards,
Greg Nancarrow
Fujitsu Australia



Re: Parallel INSERT (INTO ... SELECT ...)

Amit Langote
On Fri, Feb 26, 2021 at 3:35 PM Greg Nancarrow <[hidden email]> wrote:

> On Fri, Feb 26, 2021 at 4:07 PM Amit Langote <[hidden email]> wrote:
> > The attached patch fixes this, although I am starting to have second
> > thoughts about how we're tracking partitions in this patch.  Wondering
> > if we should bite the bullet and add partitions into the main range
> > table instead of tracking them separately in partitionOids, which
> > might result in a cleaner patch overall.
>
> Thanks Amit,
>
> I was able to reproduce the problem using your instructions (though I
> found I had to run that explain an extra time, in order to hit the
> breakpoint).
> Also, I can confirm that the problem doesn't occur after application
> of your patch.
> I'll leave it to your better judgement as to what to do next  - if you
> feel the current tracking method is not sufficient

Just to be clear, I think the tracking method added by the patch is
sufficient AFAICS for the problems we were able to discover.  The
concern I was trying to express is that we seem to be duct-taping
holes in our earlier chosen design to track partitions separately from
the range table.  If we had decided to add partitions to the range
table as "extra" target relations from the get-go, both the issues I
mentioned with cached plans -- partitions not being counted as a
dependency and partitions not being locked before execution -- would
have been prevented.  I haven't fully grasped how invasive that design
would be, but it sure sounds like it would be a bit more robust.
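To illustrate the cached-plan concern with a toy model (purely illustrative; the `PlanCache` class and OIDs here are hypothetical and unrelated to the actual plancache.c code): a cached plan is dropped only when an invalidation arrives for a relation OID it registered as a dependency, so a parallel plan that never registered its partitions would survive DDL on those partitions.

```python
class PlanCache:
    """Toy plan cache: a plan is invalidated only if the changed
    relation's OID was registered as one of its dependencies."""

    def __init__(self):
        self.plans = {}  # query text -> set of dependency OIDs

    def store(self, query, dependency_oids):
        self.plans[query] = set(dependency_oids)

    def on_relation_changed(self, oid):
        stale = [q for q, deps in self.plans.items() if oid in deps]
        for q in stale:
            del self.plans[q]
        return stale


PARENT_OID, PARTITION_OID = 1000, 1001

# Plan cached with only the parent table as a dependency: DDL on a
# partition (e.g. adding a parallel-unsafe trigger) goes unnoticed.
cache = PlanCache()
cache.store("insert into parent select ...", [PARENT_OID])
assert cache.on_relation_changed(PARTITION_OID) == []  # stale plan survives

# Registering the checked partitions too (what the patch does via
# glob->partitionOids) lets the same DDL invalidate the plan.
cache.store("insert into parent select ...", [PARENT_OID, PARTITION_OID])
assert cache.on_relation_changed(PARTITION_OID) == ["insert into parent select ..."]
```

The same reasoning applies to locking: only relations known to the plan machinery get locked before execution of a cached plan.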

--
Amit Langote
EDB: http://www.enterprisedb.com



RE: Parallel INSERT (INTO ... SELECT ...)

tsunakawa.takay@fujitsu.com
In reply to this post by houzj.fnst@fujitsu.com
From: Hou, Zhijie/侯 志杰 <[hidden email]>
> After doing some more tests on it (performance degradation will not happen
> when source table is out of order).
> I think we can say the performance degradation is related to the order of the
> data in source table.
...
> So, the influence of the data's order seems to be a normal phenomenon; I think
> we do not need to do anything about it (currently).
> It seems better to mark it as a todo that we can improve in the future.
>
> (Since the performance degradation in parallel bitmap is because of the lock in
> _bt_search, it will not always happen when the target table already has data;
> less race condition will happen when inserting in parallel into an evenly
> distributed btree).

I think so, too.  The slowness of the parallel insert operation due to index page contention, and the resulting index bloat, would occur depending on the order of the index key values of the source records.

I guess other DBMSs exhibit a similar phenomenon, but I couldn't find such a description in the manuals, whitepapers, or several books on Oracle.  One relevant excerpt is the following.  This is about parallel index build; Oracle tries to minimize page contention and index bloat.  This is completely my guess, but they may do similar things in parallel INSERT SELECT, because Oracle holds an exclusive lock on the target table.  SQL Server also acquires an exclusive lock.  Maybe we can provide an option to do so in the future.
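As a toy model of why insert order matters (an assumed simplification, not PostgreSQL's actual nbtree behavior; the page-capacity constant is made up): when source rows arrive in key order, consecutive inserts from all workers target the same rightmost leaf page and contend for its lock, while randomly ordered rows spread the inserts across the key space.

```python
import random

PAGE_CAPACITY = 100  # hypothetical number of index entries per leaf page


def contention_rate(keys):
    """Fraction of consecutive inserts landing on the same leaf page.

    Toy model: keys within one PAGE_CAPACITY window share a leaf page,
    so back-to-back inserts into the same window would contend for
    that page's lock.
    """
    pages = [k // PAGE_CAPACITY for k in keys]
    same = sum(1 for a, b in zip(pages, pages[1:]) if a == b)
    return same / (len(keys) - 1)


ordered = list(range(10_000))   # source table already sorted on the key
shuffled = ordered[:]           # same rows, random physical order
random.seed(42)
random.shuffle(shuffled)

print(contention_rate(ordered))   # close to 1.0: everyone hits the rightmost leaf
print(contention_rate(shuffled))  # much lower: inserts spread across the key space
```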

https://docs.oracle.com/en/database/oracle/oracle-database/21/vldbg/parallel-exec-tips.html#GUID-08A08783-C243-4872-AFFA-56B603F1F0F5
--------------------------------------------------
Optimizing Performance by Creating Indexes in Parallel
...
Multiple processes can work simultaneously to create an index. By dividing the work necessary to create an index among multiple server processes, Oracle Database can create the index more quickly than if a single server process created the index serially.

Parallel index creation works in much the same way as a table scan with an ORDER BY clause. The table is randomly sampled and a set of index keys is found that equally divides the index into the same number of pieces as the DOP. A first set of query processes scans the table, extracts key-rowid pairs, and sends each pair to a process in a second set of query processes based on a key. Each process in the second set sorts the keys and builds an index in the usual fashion. After all index pieces are built, the parallel execution coordinator simply concatenates the pieces (which are ordered) to form the final index.
...
When creating an index in parallel, the STORAGE clause refers to the storage of each of the subindexes created by the query server processes. Therefore, an index created with an INITIAL value of 5 MB and a DOP of 12 consumes at least 60 MB of storage during index creation because each process starts with an extent of 5 MB. When the query coordinator process combines the sorted subindexes, some extents might be trimmed, and the resulting index might be smaller than the requested 60 MB.
--------------------------------------------------


IIRC, the current patch showed a nice performance improvement in some (many?) patterns.  So, I think it can be committed in PG 14, once it has addressed the plan cache issue that Amit Langote-san posed.  I remember the following issues/comments are pending, but they are not blockers:

1. Insert operation is run serially when the target table has a foreign key, sequence or identity column.
This can be added later based on the current design without requiring rework.  That is, the current patch leaves no debt.  (Personally, foreign key and sequence support will also be wanted in PG 14.  We may try them in the last CF once the current patch is likely to be committable.)

2. There's a plausible reason for the performance variation and index bloat with the bitmap scan case.
Ideally, we want to come up with a solution that can be incorporated in PG 15.

Or, it may be one conclusion that we can't prevent performance degradation in all cases.  That may be one hidden reason why Oracle and SQL Server don't enable parallel DML by default.

We can advise the user in the manual that parallel DML is not always faster than serial operation, so they should test performance with parallel DML enabled and disabled.  Also, maybe we should honestly state that indexes can get a bit bigger after a parallel insert than after a serial insert, and advise the user to do REINDEX CONCURRENTLY if necessary.

3. The total time of parallel execution can get longer because of unbalanced work distribution among parallel workers.
This seems to be an existing problem, so we can pursue the improvement later, hopefully before the release of PG 14.


Does anyone see any problem with committing the current patch (after polishing it)?


Regards
Takayuki Tsunakawa



RE: Parallel INSERT (INTO ... SELECT ...)

tanghy.fnst@fujitsu.com
From: Tsunakawa, Takayuki/綱川 貴之 <[hidden email]>
>the current patch showed a nice performance improvement in some (many?) patterns.
>So, I think it can be committed in PG 14, when it has addressed the plan cache issue that Amit Langote-san posed.

Agreed.
I summarized my test results for the current patch (V18) in the attached file (please use VSCode or Notepad++ to open it, as the content is rather long).
As you can see, the performance is good in many patterns (please refer to my test script; the test numbers in the text correspond to the numbers in the SQL file).
If you have any question on my test, please feel free to ask.

Regards,
Tang


performance test result_20210226.txt (6K) Download Attachment
performance_test.sql (28K) Download Attachment

Re: Parallel INSERT (INTO ... SELECT ...)

Greg Nancarrow
In reply to this post by Amit Langote
On Fri, Feb 26, 2021 at 5:50 PM Amit Langote <[hidden email]> wrote:

>
> On Fri, Feb 26, 2021 at 3:35 PM Greg Nancarrow <[hidden email]> wrote:
> > On Fri, Feb 26, 2021 at 4:07 PM Amit Langote <[hidden email]> wrote:
> > > The attached patch fixes this, although I am starting to have second
> > > thoughts about how we're tracking partitions in this patch.  Wondering
> > > if we should bite the bullet and add partitions into the main range
> > > table instead of tracking them separately in partitionOids, which
> > > might result in a cleaner patch overall.
> >
> > Thanks Amit,
> >
> > I was able to reproduce the problem using your instructions (though I
> > found I had to run that explain an extra time, in order to hit the
> > breakpoint).
> > Also, I can confirm that the problem doesn't occur after application
> > of your patch.
> > I'll leave it to your better judgement as to what to do next  - if you
> > feel the current tracking method is not sufficient
>
> Just to be clear, I think the tracking method added by the patch is
> sufficient AFAICS for the problems we were able to discover.  The
> concern I was trying to express is that we seem to be duct-taping
> holes in our earlier chosen design to track partitions separately from
> the range table.  If we had decided to add partitions to the range
> table as "extra" target relations from the get-go, both the issues I
> mentioned with cached plans -- partitions not being counted as a
> dependency and partitions not being locked before execution -- would
> have been prevented.  I haven't fully grasped how invasive that design
> would be, but it sure sounds like it would be a bit more robust.
>
Posting an updated set of patches that includes Amit Langote's patch
to the partition tracking scheme...
(the alternative of adding partitions to the range table needs further
investigation)

Regards,
Greg Nancarrow
Fujitsu Australia

v20-0001-Enable-parallel-SELECT-for-INSERT-INTO-.-SELECT.patch (45K) Download Attachment
v20-0002-Parallel-SELECT-for-INSERT-INTO-.-SELECT-tests-and-doc.patch (95K) Download Attachment
v20-0003-Add-new-parallel-dml-GUC-and-table-options.patch (26K) Download Attachment
v20-0004-Enable-parallel-INSERT-and-or-SELECT-for-INSERT-INTO.patch (60K) Download Attachment
v20-0005-Parallel-INSERT-and-or-SELECT-for-INSERT-INTO-tests-and-doc.patch (30K) Download Attachment

Re: Parallel INSERT (INTO ... SELECT ...)

Amit Langote
On Mon, Mar 1, 2021 at 12:38 PM Greg Nancarrow <[hidden email]> wrote:

> On Fri, Feb 26, 2021 at 5:50 PM Amit Langote <[hidden email]> wrote:
> >
> > On Fri, Feb 26, 2021 at 3:35 PM Greg Nancarrow <[hidden email]> wrote:
> > > On Fri, Feb 26, 2021 at 4:07 PM Amit Langote <[hidden email]> wrote:
> > > > The attached patch fixes this, although I am starting to have second
> > > > thoughts about how we're tracking partitions in this patch.  Wondering
> > > > if we should bite the bullet and add partitions into the main range
> > > > table instead of tracking them separately in partitionOids, which
> > > > might result in a cleaner patch overall.
> > >
> > > Thanks Amit,
> > >
> > > I was able to reproduce the problem using your instructions (though I
> > > found I had to run that explain an extra time, in order to hit the
> > > breakpoint).
> > > Also, I can confirm that the problem doesn't occur after application
> > > of your patch.
> > > I'll leave it to your better judgement as to what to do next  - if you
> > > feel the current tracking method is not sufficient
> >
> > Just to be clear, I think the tracking method added by the patch is
> > sufficient AFAICS for the problems we were able to discover.  The
> > concern I was trying to express is that we seem to be duct-taping
> > holes in our earlier chosen design to track partitions separately from
> > the range table.  If we had decided to add partitions to the range
> > table as "extra" target relations from the get-go, both the issues I
> > mentioned with cached plans -- partitions not being counted as a
> > dependency and partitions not being locked before execution -- would
> > have been prevented.  I haven't fully grasped how invasive that design
> > would be, but it sure sounds like it would be a bit more robust.
>
> Posting an updated set of patches that includes Amit Langote's patch
> to the partition tracking scheme...

Thanks Greg.

> (the alternative of adding partitions to the range table needs further
> investigation)

I looked at this today with an intention to write and post a PoC
patch, but was quickly faced with the task of integrating that design
with how ModifyTable works for inserts.  Specifically if, in addition
to adding partitions to the range table, we also add their RT indexes to
ModifyTable.resultRelations, then maybe we'll need to rethink our
executor tuple routing code.  That code currently tracks the insert's
target partitions separately from the range table, exactly because
they are not there to begin with.  So if we change the latter as in
this hypothetical PoC, maybe we should also revisit the former.  Also,
it would not be great that the planner's arrays in PlannerInfo would
get longer as a result of Query.rtable getting longer by adding
partitions, thus making all the loops over those arrays slower for no
benefit.

So, let's do this in a follow-on patch, if at all, instead of
considering this topic a blocker for committing the current set of
patches.




--
Amit Langote
EDB: http://www.enterprisedb.com



Re: Parallel INSERT (INTO ... SELECT ...)

akapila
In reply to this post by Greg Nancarrow
On Mon, Mar 1, 2021 at 9:08 AM Greg Nancarrow <[hidden email]> wrote:
>
> Posting an updated set of patches that includes Amit Langote's patch
> to the partition tracking scheme...
>

Few comments:
==============
1.
"Prior to entering parallel-mode for execution of INSERT with parallel SELECT,
a TransactionId is acquired and assigned to the current transaction state which
is then serialized in the parallel DSM for the parallel workers to use."

This point in the commit message seems a bit misleading to me because
IIUC we need to use the transaction id only in the master backend for the
purpose of this patch. And, we are doing this at an early stage
because we don't allow allocating it in parallel-mode. I think here
we should mention in some way that this has a disadvantage that if the
underneath select doesn't return any row then this transaction id
allocation would not be of use. However, that shouldn't happen in
practice in many cases. But, if I am missing something and this is
required by parallel workers then we can ignore what I said.

2.
/*
+ * Prepare for entering parallel mode by assigning a
+ * FullTransactionId, to be included in the transaction state that is
+ * serialized in the parallel DSM.
+ */
+ (void) GetCurrentTransactionId();
+ }

Similar to the previous comment this also seems to indicate that we
require TransactionId for workers. If that is not the case then this
comment also needs to be modified.

3.
 static int common_prefix_cmp(const void *a, const void *b);

-
 /*****************************************************************************

Spurious line removal.

4.
 * Assess whether it's feasible to use parallel mode for this query. We
  * can't do this in a standalone backend, or if the command will try to
- * modify any data, or if this is a cursor operation, or if GUCs are set
- * to values that don't permit parallelism, or if parallel-unsafe
- * functions are present in the query tree.
+ * modify any data using a CTE, or if this is a cursor operation, or if
+ * GUCs are set to values that don't permit parallelism, or if
+ * parallel-unsafe functions are present in the query tree.

This comment change is not required because this is quite similar to
what we do for CTAS. Your further comment changes in this context are
sufficient.

5.
+ (IsModifySupportedInParallelMode(parse->commandType) &&
+ is_parallel_possible_for_modify(parse))) &&

I think it would be better if we move the check
IsModifySupportedInParallelMode inside
is_parallel_possible_for_modify. Also, it might be better to name this
function as is_parallel_allowed_for_modify.

6.
@@ -260,6 +265,21 @@ set_plan_references(PlannerInfo *root, Plan *plan)
  */
  add_rtes_to_flat_rtable(root, false);

+ /*
+ * If modifying a partitioned table, add its parallel-safety-checked
+ * partitions too to glob->relationOids, to register them as plan
+ * dependencies. This is only really needed in the case of a parallel
+ * plan, so that if parallel-unsafe properties are subsequently defined
+ * on the partitions, the cached parallel plan will be invalidated and
+ * a non-parallel plan will be generated.
+ */
+ if (IsModifySupportedInParallelMode(root->parse->commandType))
+ {
+ if (glob->partitionOids != NIL && glob->parallelModeNeeded)
+ glob->relationOids =
+ list_concat(glob->relationOids, glob->partitionOids);
+ }
+

Isn't it possible to add these partitionOids in set_plan_refs with the
T_Gather(Merge) node handling? That looks like a more natural place to
me, if that is feasible then we don't need parallelModeNeeded check
and maybe we don't need to even check IsModifySupportedInParallelMode
but I am less sure of the second check requirement.

7.
+#include "access/table.h"
+#include "access/xact.h"
 #include "access/transam.h"
+#include "catalog/pg_class.h"
 #include "catalog/pg_type.h"
 #include "nodes/makefuncs.h"
 #include "nodes/nodeFuncs.h"
@@ -24,6 +27,8 @@
 #include "optimizer/planmain.h"
 #include "optimizer/planner.h"
 #include "optimizer/tlist.h"
+#include "parser/parsetree.h"
+#include "partitioning/partdesc.h"

I think apart from xact.h, we don't need new additional includes.

8. I see that in function target_rel_max_parallel_hazard, we don't
release the lock on the target table after checking parallel-safety
but then in function target_rel_max_parallel_hazard_recurse, we do
release the lock on partition tables after checking their
parallel-safety. Can we add some comments to explain these two cases?

9. I noticed that the tests added for the first patch in
v18-0002-Parallel-SELECT-for-INSERT-INTO-.-SELECT-tests-and-doc take
even more time than select_parallel. I think we should see if we can
reduce the timing of this test. I haven't studied it in detail but
maybe some INSERT executions can be avoided. In some cases like the one
below, just checking the plan should be sufficient. I think you can try to
investigate and see how much we can reduce it without affecting the
code-coverage of newly added code.

+--
+-- Parallel unsafe column default, should not use a parallel plan
+--
+explain (costs off) insert into testdef(a,c,d) select a,a*4,a*8 from test_data;
+         QUERY PLAN
+-----------------------------
+ Insert on testdef
+   ->  Seq Scan on test_data
+(2 rows)
+
+insert into testdef(a,c,d) select a,a*4,a*8 from test_data;
+select * from testdef order by a;
+ a  | b | c  | d
+----+---+----+----
+  1 | 5 |  4 |  8
+  2 | 5 |  8 | 16
+  3 | 5 | 12 | 24
+  4 | 5 | 16 | 32
+  5 | 5 | 20 | 40
+  6 | 5 | 24 | 48
+  7 | 5 | 28 | 56
+  8 | 5 | 32 | 64
+  9 | 5 | 36 | 72
+ 10 | 5 | 40 | 80
+(10 rows)
+

--
With Regards,
Amit Kapila.



Re: Parallel INSERT (INTO ... SELECT ...)

Greg Nancarrow
On Tue, Mar 2, 2021 at 11:19 PM Amit Kapila <[hidden email]> wrote:

>
> On Mon, Mar 1, 2021 at 9:08 AM Greg Nancarrow <[hidden email]> wrote:
> >
> > Posting an updated set of patches that includes Amit Langote's patch
> > to the partition tracking scheme...
> >
>
> Few comments:
> ==============
> 1.
> "Prior to entering parallel-mode for execution of INSERT with parallel SELECT,
> a TransactionId is acquired and assigned to the current transaction state which
> is then serialized in the parallel DSM for the parallel workers to use."
>
> This point in the commit message seems a bit misleading to me because
> IIUC we need to use transaction id only in the master backend for the
> purpose of this patch. And, we are doing this at an early stage
> because we don't allow to allocate it in parallel-mode. I think here
> we should mention in some way that this has a disadvantage that if the
> underneath select doesn't return any row then this transaction id
> allocation would not be of use. However, that shouldn't happen in
> practice in many cases. But, if I am missing something and this is
> required by parallel workers then we can ignore what I said.
>

I'll remove the part that says "which is then serialized ...", as this
may imply that the patch is doing this (which of course it is not,
Postgres always serializes the transaction state in the parallel DSM
for the parallel workers to use).
The acquiring of the TransactionId up-front, prior to entering
parallel-mode, is absolutely required because heap_insert() lazily
tries to get the current transaction-id and attempts to assign one if
the current transaction state hasn't got one, and this assignment is
not allowed in parallel-mode ("ERROR:  cannot assign XIDs during a
parallel operation"), and this mode is required for use of parallel
SELECT.
I'll add the part about the disadvantage of the transaction-id not
being used if the underlying SELECT doesn't return any rows.


> 2.
> /*
> + * Prepare for entering parallel mode by assigning a
> + * FullTransactionId, to be included in the transaction state that is
> + * serialized in the parallel DSM.
> + */
> + (void) GetCurrentTransactionId();
> + }
>
> Similar to the previous comment this also seems to indicate that we
> require TransactionId for workers. If that is not the case then this
> comment also needs to be modified.
>

I'll update the comment to remove the part about the serialization (as
this always happens, and is not a function of the patch) and mention it is
needed to avoid attempting to assign a FullTransactionId in
parallel-mode.

> 3.
>  static int common_prefix_cmp(const void *a, const void *b);
>
> -
>  /*****************************************************************************
>
> Spurious line removal.
>

I will reinstate that empty line.

> 4.
>  * Assess whether it's feasible to use parallel mode for this query. We
>   * can't do this in a standalone backend, or if the command will try to
> - * modify any data, or if this is a cursor operation, or if GUCs are set
> - * to values that don't permit parallelism, or if parallel-unsafe
> - * functions are present in the query tree.
> + * modify any data using a CTE, or if this is a cursor operation, or if
> + * GUCs are set to values that don't permit parallelism, or if
> + * parallel-unsafe functions are present in the query tree.
>
> This comment change is not required because this is quite similar to
> what we do for CTAS. Your further comment changes in this context are
> sufficient.

An INSERT modifies data, so according to the original comment, then
it's not feasible to use parallel mode, because the command tries to
modify data ("We can't do this in a standalone backend, or if the
command will try to modify any data ...").
Except now we need to use parallel-mode for "INSERT with parallel
SELECT", and INSERT is a command that modifies data.
So isn't the comment change actually needed?


>
> 5.
> + (IsModifySupportedInParallelMode(parse->commandType) &&
> + is_parallel_possible_for_modify(parse))) &&
>
> I think it would be better if we move the check
> IsModifySupportedInParallelMode inside
> is_parallel_possible_for_modify.

The reason it is done outside of is_parallel_possible_for_modify() is
to avoid the function overhead of calling
is_parallel_possible_for_modify() for SELECT statements, only to
return without doing anything. Note also that
IsModifySupportedInParallelMode() is an inline function.

>Also, it might be better to name this
> function as is_parallel_allowed_for_modify.

I do tend to think that in this case "possible" is better than "allowed".
Only the "parallel_dml" GUC test is checking for something that is "allowed".
The other two checks are checking for things that determine whether
parallel-mode is even "possible".

>
> 6.
> @@ -260,6 +265,21 @@ set_plan_references(PlannerInfo *root, Plan *plan)
>   */
>   add_rtes_to_flat_rtable(root, false);
>
> + /*
> + * If modifying a partitioned table, add its parallel-safety-checked
> + * partitions too to glob->relationOids, to register them as plan
> + * dependencies. This is only really needed in the case of a parallel
> + * plan, so that if parallel-unsafe properties are subsequently defined
> + * on the partitions, the cached parallel plan will be invalidated and
> + * a non-parallel plan will be generated.
> + */
> + if (IsModifySupportedInParallelMode(root->parse->commandType))
> + {
> + if (glob->partitionOids != NIL && glob->parallelModeNeeded)
> + glob->relationOids =
> + list_concat(glob->relationOids, glob->partitionOids);
> + }
> +
>
> Isn't it possible to add these partitionOids in set_plan_refs with the
> T_Gather(Merge) node handling? That looks like a more natural place to
> me, if that is feasible then we don't need parallelModeNeeded check
> and maybe we don't need to even check IsModifySupportedInParallelMode
> but I am less sure of the second check requirement.
>

There may be multiple Gather/GatherMerge nodes in the plan (when they
are not top-level nodes), and I think by moving this code into
set_plan_refs() you risk adding those partitionOids multiple times to
glob->relationOids, when the Gather/GatherMerge nodes are traversed
(note that set_plan_refs() is called from set_plan_references() and is
recursive).
Leaving the code where it is in set_plan_references() guarantees that
the partitionOids can only be added ONCE.
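The duplication hazard can be sketched with a toy tree walk (illustrative only; this is not the actual set_plan_refs() code, and the node shapes and OIDs are made up): appending the partition OIDs at every Gather node visited by a recursive traversal adds them once per Gather, whereas appending once at the non-recursive entry point adds them exactly once.

```python
def walk(node, relation_oids, partition_oids):
    """Recursive traversal that appends the partition OIDs at each
    Gather/GatherMerge node, mimicking what moving the code into the
    recursive set_plan_refs() could do."""
    if node["type"] in ("Gather", "GatherMerge"):
        relation_oids.extend(partition_oids)
    for child in node.get("children", []):
        walk(child, relation_oids, partition_oids)


# A plan with two non-top-level Gather nodes (e.g. under an Append).
plan = {"type": "Append", "children": [
    {"type": "Gather", "children": []},
    {"type": "GatherMerge", "children": []},
]}

partition_oids = [2001, 2002]

oids = []
walk(plan, oids, partition_oids)
print(oids)  # [2001, 2002, 2001, 2002] -- duplicated, once per Gather

oids = list(partition_oids)  # appended once at the entry point instead
print(oids)  # [2001, 2002]
```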

> 7.
> +#include "access/table.h"
> +#include "access/xact.h"
>  #include "access/transam.h"
> +#include "catalog/pg_class.h"
>  #include "catalog/pg_type.h"
>  #include "nodes/makefuncs.h"
>  #include "nodes/nodeFuncs.h"
> @@ -24,6 +27,8 @@
>  #include "optimizer/planmain.h"
>  #include "optimizer/planner.h"
>  #include "optimizer/tlist.h"
> +#include "parser/parsetree.h"
> +#include "partitioning/partdesc.h"
>
> I think apart from xact.h, we don't need new additional includes.
>

OK, I'll remove, it seems those other headers are getting dragged in
from the existing headers.

> 8. I see that in function target_rel_max_parallel_hazard, we don't
> release the lock on the target table after checking parallel-safety
> but then in function target_rel_max_parallel_hazard_recurse, we do
> release the lock on partition tables after checking their
> parallel-safety. Can we add some comments to explain these two cases?
>

It looks like the comment on the first case was lost when the code was
integrated into max_parallel_hazard().
The target relation is always locked during the parse/analyze phase
and left locked until end-of-transaction. So the lock here is just
incrementing a reference count (AFAIK). Note that originally NoLock
was passed to table_open(), which would have a similar overall effect,
as the table has already been locked and will be unlocked at
end-of-transaction.
I'll add comments in both cases.



> 9. I noticed that the tests added for the first patch in
> v18-0002-Parallel-SELECT-for-INSERT-INTO-.-SELECT-tests-and-doc take
> even more time than select_parallel. I think we should see if we can
> reduce the timing of this test. I haven't studied it in detail but
> maybe some Inserts execution can be avoided. In some cases like below
> just checking the plan should be sufficient. I think you can try to
> investigate and see how much we can reduce it without affecting on
> code-coverage of newly added code.
>
> +--
> +-- Parallel unsafe column default, should not use a parallel plan
> +--
> +explain (costs off) insert into testdef(a,c,d) select a,a*4,a*8 from test_data;
> +         QUERY PLAN
> +-----------------------------
> + Insert on testdef
> +   ->  Seq Scan on test_data
> +(2 rows)
> +
> +insert into testdef(a,c,d) select a,a*4,a*8 from test_data;
> +select * from testdef order by a;
> + a  | b | c  | d
> +----+---+----+----
> +  1 | 5 |  4 |  8
> +  2 | 5 |  8 | 16
> +  3 | 5 | 12 | 24
> +  4 | 5 | 16 | 32
> +  5 | 5 | 20 | 40
> +  6 | 5 | 24 | 48
> +  7 | 5 | 28 | 56
> +  8 | 5 | 32 | 64
> +  9 | 5 | 36 | 72
> + 10 | 5 | 40 | 80
> +(10 rows)
> +
>

OK, I'll try to remove INSERT executions (and checking) for the cases
primarily checking the type of plan being generated, where the related
INSERT functionality has been previously tested.


Regards,
Greg Nancarrow
Fujitsu Australia



Re: Parallel INSERT (INTO ... SELECT ...)

akapila
On Wed, Mar 3, 2021 at 12:52 PM Greg Nancarrow <[hidden email]> wrote:

>
> On Tue, Mar 2, 2021 at 11:19 PM Amit Kapila <[hidden email]> wrote:
> >
> > On Mon, Mar 1, 2021 at 9:08 AM Greg Nancarrow <[hidden email]> wrote:
>
> > 2.
> > /*
> > + * Prepare for entering parallel mode by assigning a
> > + * FullTransactionId, to be included in the transaction state that is
> > + * serialized in the parallel DSM.
> > + */
> > + (void) GetCurrentTransactionId();
> > + }
> >
> > Similar to the previous comment this also seems to indicate that we
> > require TransactionId for workers. If that is not the case then this
> > comment also needs to be modified.
> >
>
> I'll update the comment to remove the part about the serialization (as
> this always happens, not a function of the patch) and mention it is
> needed to avoid attempting to assign a FullTransactionId in
> parallel-mode.
>

Okay, but just use TransactionId in comments if there is a need, we
still don't use FullTransactionId for the heap.

> > 4.
> >  * Assess whether it's feasible to use parallel mode for this query. We
> >   * can't do this in a standalone backend, or if the command will try to
> > - * modify any data, or if this is a cursor operation, or if GUCs are set
> > - * to values that don't permit parallelism, or if parallel-unsafe
> > - * functions are present in the query tree.
> > + * modify any data using a CTE, or if this is a cursor operation, or if
> > + * GUCs are set to values that don't permit parallelism, or if
> > + * parallel-unsafe functions are present in the query tree.
> >
> > This comment change is not required because this is quite similar to
> > what we do for CTAS. Your further comment changes in this context are
> > sufficient.
>
> An INSERT modifies data, so according to the original comment, then
> it's not feasible to use parallel mode, because the command tries to
> modify data ("We can't do this in a standalone backend, or if the
> command will try to modify any data ...").
> Except now we need to use parallel-mode for "INSERT with parallel
> SELECT", and INSERT is a command that modifies data.
> So isn't the comment change actually needed?
>

If we want to change, we can say "if the command will try to modify
any data except for inserts ..." or something like that but saying
only about CTE is not correct because then what about updates and
deletes.

>
> >
> > 5.
> > + (IsModifySupportedInParallelMode(parse->commandType) &&
> > + is_parallel_possible_for_modify(parse))) &&
> >
> > I think it would be better if we move the check
> > IsModifySupportedInParallelMode inside
> > is_parallel_possible_for_modify.
>
> The reason it is done outside of is_parallel_possible_for_modify() is
> to avoid the function overhead of calling
> is_parallel_possible_for_modify() for SELECT statements, only to
> return without doing anything. Note also that
> IsModifySupportedInParallelMode() is an inline function.
>

I don't see any reason to be worried about that here. It is more
important for code and checks to look simpler.

> >Also, it might be better to name this
> > function as is_parallel_allowed_for_modify.
>
> I do tend to think that in this case "possible" is better than "allowed".
> Only the "parallel_dml" GUC test is checking for something that is "allowed".
> The other two checks are checking for things that determine whether
> parallel-mode is even "possible".
>

I think I don't like this proposed name for the function. See, if you
can think of something better.

> >
> > 6.
> > @@ -260,6 +265,21 @@ set_plan_references(PlannerInfo *root, Plan *plan)
> >   */
> >   add_rtes_to_flat_rtable(root, false);
> >
> > + /*
> > + * If modifying a partitioned table, add its parallel-safety-checked
> > + * partitions too to glob->relationOids, to register them as plan
> > + * dependencies. This is only really needed in the case of a parallel
> > + * plan, so that if parallel-unsafe properties are subsequently defined
> > + * on the partitions, the cached parallel plan will be invalidated and
> > + * a non-parallel plan will be generated.
> > + */
> > + if (IsModifySupportedInParallelMode(root->parse->commandType))
> > + {
> > + if (glob->partitionOids != NIL && glob->parallelModeNeeded)
> > + glob->relationOids =
> > + list_concat(glob->relationOids, glob->partitionOids);
> > + }
> > +
> >
> > Isn't it possible to add these partitionOids in set_plan_refs with the
> > T_Gather(Merge) node handling? That looks like a more natural place to
> > me, if that is feasible then we don't need parallelModeNeeded check
> > and maybe we don't need to even check IsModifySupportedInParallelMode
> > but I am less sure of the second check requirement.
> >
>
> There may be multiple Gather/GatherMerge nodes in the plan (when they
> are not top-level nodes), and I think by moving this code into
> set_plan_refs() you risk adding those partitionOids multiple times to
> glob->relationOids, when the Gather/GatherMerge nodes are traversed
> (note that set_plan_refs() is called from set_plan_references() and is
> recursive).
> Leaving the code where it is in set_plan_references() guarantees that
> the partitionOids can only be added ONCE.
>

Okay, is there a reason to use IsModifySupportedInParallelMode? Isn't
the second check sufficient?

--
With Regards,
Amit Kapila.



Re: Parallel INSERT (INTO ... SELECT ...)

Dilip Kumar-2
In reply to this post by Greg Nancarrow
On Mon, Mar 1, 2021 at 9:08 AM Greg Nancarrow <[hidden email]> wrote:
>
> Posting an updated set of patches that includes Amit Langote's patch
> to the partition tracking scheme...
> (the alternative of adding partitions to the range table needs further
> investigation)

I was reviewing your latest patch and I have a few comments.

In patch 0001
1.
+static bool
+target_rel_max_parallel_hazard_recurse(Relation rel,
+   CmdType command_type,
+   max_parallel_hazard_context *context)
+{
+ TupleDesc tupdesc;
+ int attnum;
+
+ /* Currently only CMD_INSERT is supported */
+ Assert(command_type == CMD_INSERT);
…….
+ /*
+ * Column default expressions and check constraints are only applicable to
+ * INSERT and UPDATE, but since only parallel INSERT is currently supported,
+ * only command_type==CMD_INSERT is checked here.
+ */
+ if (command_type == CMD_INSERT)

If we have an assert at the beginning of the function, then why do we
want to put the if check here?

2.
In patch 0004,  We are still charging the parallel_tuple_cost for each
tuple, are we planning to do something about this?  I mean after this
patch tuple will not be transferred through the tuple queue, so we
should not add that cost.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: Parallel INSERT (INTO ... SELECT ...)

Greg Nancarrow
On Wed, Mar 3, 2021 at 10:45 PM Dilip Kumar <[hidden email]> wrote:

>
> On Mon, Mar 1, 2021 at 9:08 AM Greg Nancarrow <[hidden email]> wrote:
> >
> > Posting an updated set of patches that includes Amit Langote's patch
> > to the partition tracking scheme...
> > (the alternative of adding partitions to the range table needs further
> > investigation)
>
> I was reviewing your latest patch and I have a few comments.
>
> In patch 0001
> 1.
> +static bool
> +target_rel_max_parallel_hazard_recurse(Relation rel,
> +   CmdType command_type,
> +   max_parallel_hazard_context *context)
> +{
> + TupleDesc tupdesc;
> + int attnum;
> +
> + /* Currently only CMD_INSERT is supported */
> + Assert(command_type == CMD_INSERT);
> …….
> + /*
> + * Column default expressions and check constraints are only applicable to
> + * INSERT and UPDATE, but since only parallel INSERT is currently supported,
> + * only command_type==CMD_INSERT is checked here.
> + */
> + if (command_type == CMD_INSERT)
>
> If we have an assert at the beginning of the function, then why do we
> want to put the if check here?
>

Asserts are normally only enabled in a debug-build, so for a
release-build that Assert has no effect.
The Assert is being used as a sanity-check that the function is only
currently getting called for INSERT, because that's all it currently
supports.
Further code below specifically checks for INSERT because the block
contains code that is specific to INSERT (and actually DELETE too, but
that is not currently supported - code is commented accordingly).
In future, this function will support DELETE and UPDATE, and the
Assert serves to alert the developer ASAP that it needs updating to
support those.


> 2.
> In patch 0004,  We are still charging the parallel_tuple_cost for each
> tuple, are we planning to do something about this?  I mean after this
> patch tuple will not be transferred through the tuple queue, so we
> should not add that cost.
>

I believe that for Parallel INSERT, cost_modifytable() will set
path->path.rows to 0 (unless there is a RETURNING list), so, for
example, in cost_gather(), it will not add to the run_cost as
"run_cost += parallel_tuple_cost * path->path.rows;"


Regards,
Greg Nancarrow
Fujitsu Australia



Re: Parallel INSERT (INTO ... SELECT ...)

Dilip Kumar-2
On Wed, Mar 3, 2021 at 5:50 PM Greg Nancarrow <[hidden email]> wrote:
>
> Asserts are normally only enabled in a debug-build, so for a
> release-build that Assert has no effect.
> The Assert is being used as a sanity-check that the function is only
> currently getting called for INSERT, because that's all it currently
> supports.

I agree that an Assert is only for a debug build, but once we add an
Assert, that means we are sure the function should only be called for
INSERT, and if it is called for anything else then it is a programming
error on the caller's side.  So after the Assert, adding an if-check
for the same condition doesn't look like a good idea.  It implies we
think the code can hit the Assert in debug mode, so we need extra
protection in release mode.


>
> > 2.
> > In patch 0004,  We are still charging the parallel_tuple_cost for each
> > tuple, are we planning to do something about this?  I mean after this
> > patch tuple will not be transferred through the tuple queue, so we
> > should not add that cost.
> >
>
> I believe that for Parallel INSERT, cost_modifytable() will set
> path->path.rows to 0 (unless there is a RETURNING list), so, for
> example, in cost_gather(), it will not add to the run_cost as
> "run_cost += parallel_tuple_cost * path->path.rows;"
>

But cost_modifytable() is setting the number of rows to 0 in
ModifyTablePath, whereas cost_gather() will multiply by the rows from
the GatherPath.  I cannot see the rows from GatherPath ever being set
to 0.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: Parallel INSERT (INTO ... SELECT ...)

Greg Nancarrow
On Thu, Mar 4, 2021 at 1:07 AM Dilip Kumar <[hidden email]> wrote:

>
> On Wed, Mar 3, 2021 at 5:50 PM Greg Nancarrow <[hidden email]> wrote:
> >
> > Asserts are normally only enabled in a debug-build, so for a
> > release-build that Assert has no effect.
> > The Assert is being used as a sanity-check that the function is only
> > currently getting called for INSERT, because that's all it currently
> > supports.
>
> I agree that assert is only for debug build, but once we add and
> assert that means we are sure that it should only be called for insert
> and if it is called for anything else then it is a programming error
> from the caller's side.  So after the assert, adding if check for the
> same condition doesn't look like a good idea.  That means we think
> that the code can hit assert in the debug mode so we need an extra
> protection in the release mode.
>

The if-check isn't there for "extra protection".
It's to help with future changes; inside that if-block is code only
applicable to INSERT (and to UPDATE - sorry, before I said DELETE), as
the code-comment indicates, whereas the rest of the function is
generic to all command types. I don't see any problem with having this
if-block here, to help in this way, when other command types are
added.


>
> >
> > > 2.
> > > In patch 0004,  We are still charging the parallel_tuple_cost for each
> > > tuple, are we planning to do something about this?  I mean after this
> > > patch tuple will not be transferred through the tuple queue, so we
> > > should not add that cost.
> > >
> >
> > I believe that for Parallel INSERT, cost_modifytable() will set
> > path->path.rows to 0 (unless there is a RETURNING list), so, for
> > example, in cost_gather(), it will not add to the run_cost as
> > "run_cost += parallel_tuple_cost * path->path.rows;"
> >
>
> But the cost_modifytable is setting the number of rows to 0 in
> ModifyTablePath whereas the cost_gather will multiply the rows from
> the GatherPath.  I can not see the rows from GatherPath is ever set to
> 0.
>

OK, I see the problem now.
It works the way I described, but currently there's a problem with
where it's getting the rows for the GatherPath, so there's a
disconnect.
When generating the GatherPaths, it's currently always taking the
rel's (possibly estimated) row-count, rather than using the rows from
the cheapest_partial_path (the subpath: ModifyTablePath). See
generate_gather_paths().
So when generate_useful_gather_paths() is called from the planner, for
the added partial paths for Parallel INSERT, it should be passing
"true" for the "override_rows" parameter, not "false".

So I think that in the 0004 patch, the if-block:

+       if (parallel_modify_partial_path_added)
+       {
+               final_rel->rows = current_rel->rows;    /* ??? why hasn't this
+                                                        * been set above
+                                                        * somewhere ???? */
+               generate_useful_gather_paths(root, final_rel, false);
+       }
+

can be reduced to:

+       if (parallel_modify_partial_path_added)
+               generate_useful_gather_paths(root, final_rel, true);
+

Regards,
Greg Nancarrow
Fujitsu Australia

