Yet another vectorized engine


Yet another vectorized engine

Hubert Zhang
Hi hackers,

We just want to introduce another POC for a vectorized execution engine, https://github.com/zhangh43/vectorize_engine, and to get some feedback on the idea.

The basic idea is to extend TupleTableSlot and introduce a VectorTupleTableSlot, which is an array of datums organized by projected column. The datum array of each column is contiguous in memory, which makes expression evaluation cache friendly and lets SIMD be utilized. So far we have refactored SeqScanNode and AggNode to support VectorTupleTableSlot.
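To make the layout concrete, a minimal sketch of what such a slot could look like is below. The struct and field names are illustrative only (they are not copied from the extension), but they show the column-major organization:

    /* Illustrative only -- not the extension's actual definitions. */
    #include "postgres.h"
    #include "executor/tuptable.h"      /* TupleTableSlot */

    #define BATCH_SIZE 1024

    /* One projected column of a batch: a contiguous datum array plus nulls. */
    typedef struct vcolumn
    {
        Oid     elemtype;               /* element type of this column */
        int     dim;                    /* number of valid rows in the batch */
        bool    isnull[BATCH_SIZE];     /* per-row null flags */
        Datum   values[BATCH_SIZE];     /* contiguous per-column values */
    } vcolumn;

    /* A batch-oriented slot: one vcolumn per projected column. */
    typedef struct VectorTupleTableSlot
    {
        TupleTableSlot  base;           /* ordinary slot header */
        int             dim;            /* rows currently in the batch */
        vcolumn       **columns;        /* columns[attno], one per projected column */
    } VectorTupleTableSlot;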

Below are the main features of our design.
1. Pure extension. We don't hack any code into the Postgres kernel.

2. CustomScan node. We use the CustomScan framework to replace the original executor nodes such as SeqScan, Agg, etc. Based on CustomScan, we extend CustomScanState and the BeginCustomScan(), ExecCustomScan(), EndCustomScan() interfaces to implement the vectorized executor logic (a rough sketch of the registration is included after this list).

3. Post-planner hook. After the plan is generated, we use plan_tree_walker to traverse the plan tree and check whether it can be vectorized. If yes, the non-vectorized nodes (SeqScan, Agg, etc.) are replaced with vectorized nodes (in the form of CustomScan nodes) and the vectorized executor is used. If not, we revert to the original plan and use the non-vectorized executor (see the second sketch after this list). In the future this part could be enhanced: for example, instead of reverting to the original plan when some nodes cannot be vectorized, we could add Batch/UnBatch nodes to generate a plan containing both vectorized and non-vectorized nodes.

4. Support for implementing new vectorized executor nodes gradually. We have currently only vectorized SeqScan and Agg, but other queries, including those with Joins, can still run when the vectorize extension is enabled.

5. Inherit the original executor code. Instead of rewriting the whole executor, we chose a smoother approach: modify the current Postgres executor nodes to make them vectorized. We copy the current executor node's C file into our extension and add the vectorization logic on top of it. When Postgres enhances its executor, we can relatively easily merge the changes back. We would like to know whether this is a good way to write a vectorized executor extension.

6. Pluggable storage. Postgres now supports pluggable storage: TupleTableSlot has been refactored around the abstract struct TupleTableSlotOps. VectorTupleTableSlot could be implemented under this framework when we upgrade the extension to the latest PG.
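To make point 2 more concrete, here is a rough sketch of how a vectorized node could be hooked into the CustomScan framework. The VectorScan* names and empty callback bodies are placeholders of ours rather than the actual extension code; only the CustomScanMethods/CustomExecMethods structures and callback signatures are the stock PostgreSQL API (header locations quoted from memory and may vary by version).

    /* Sketch only: placeholder VectorScan* names, real CustomScan callback shapes. */
    #include "postgres.h"
    #include "nodes/extensible.h"       /* CustomScanMethods */
    #include "nodes/execnodes.h"        /* CustomScanState, CustomExecMethods */

    typedef struct VectorScanState
    {
        CustomScanState css;            /* must be the first field */
        /* batch state: current VectorTupleTableSlot, scan descriptor, ... */
    } VectorScanState;

    static void
    VectorScanBegin(CustomScanState *node, EState *estate, int eflags)
    {
        /* open the relation and allocate the batch slot here */
    }

    static TupleTableSlot *
    VectorScanExec(CustomScanState *node)
    {
        /* fill and return one batch (a VectorTupleTableSlot), or NULL at end */
        return NULL;
    }

    static void
    VectorScanEnd(CustomScanState *node)
    {
        /* release pinned buffers and close the relation here */
    }

    static void
    VectorScanReScan(CustomScanState *node)
    {
        /* reset the scan position here */
    }

    static CustomExecMethods vectorscan_exec_methods = {
        .CustomName = "VectorScan",
        .BeginCustomScan = VectorScanBegin,
        .ExecCustomScan = VectorScanExec,
        .EndCustomScan = VectorScanEnd,
        .ReScanCustomScan = VectorScanReScan,
        /* remaining optional callbacks stay NULL */
    };

    static Node *
    CreateVectorScanState(CustomScan *cscan)
    {
        VectorScanState *vss = (VectorScanState *)
            newNode(sizeof(VectorScanState), T_CustomScanState);

        vss->css.methods = &vectorscan_exec_methods;
        return (Node *) vss;
    }

    static CustomScanMethods vectorscan_plan_methods = {
        .CustomName = "VectorScan",
        .CreateCustomScanState = CreateVectorScanState,
    };

And for point 3, a minimal sketch of the post-planner hook with the fall-back behaviour. The planner_hook signature shown is the PG 9.6 one; enable_vectorize_engine and vectorize_plan_tree() are placeholders standing in for the extension's GUC and its plan_tree_walker based rewriting.

    /* Sketch only: PG 9.6 planner_hook with placeholder vectorization calls. */
    #include "postgres.h"
    #include "fmgr.h"
    #include "optimizer/planner.h"

    PG_MODULE_MAGIC;

    static planner_hook_type prev_planner_hook = NULL;

    static bool enable_vectorize_engine = true;     /* placeholder for a GUC */
    extern Plan *vectorize_plan_tree(Plan *plan);   /* placeholder declaration */

    static PlannedStmt *
    vector_post_planner(Query *parse, int cursorOptions, ParamListInfo boundParams)
    {
        PlannedStmt *stmt;

        if (prev_planner_hook)
            stmt = prev_planner_hook(parse, cursorOptions, boundParams);
        else
            stmt = standard_planner(parse, cursorOptions, boundParams);

        if (enable_vectorize_engine)
        {
            /* Returns a tree with SeqScan/Agg replaced by vectorized CustomScan
             * nodes, or NULL when some node cannot be vectorized. */
            Plan *vplan = vectorize_plan_tree(stmt->planTree);

            if (vplan != NULL)
                stmt->planTree = vplan;     /* otherwise keep the original plan */
        }

        return stmt;
    }

    void
    _PG_init(void)
    {
        prev_planner_hook = planner_hook;
        planner_hook = vector_post_planner;
    }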

We ran the TPC-H (10 GB) benchmark; the result for Q1 is 50 s (PG) vs. 28 s (vectorized PG). The performance gain can be improved further by addressing the following:
1. Heap tuple deforming occupies a lot of CPU. We will try zedstore in the future, since a vectorized executor is more compatible with a column store.

2. The vectorized Agg is not fully vectorized and there are many optimizations still to do, for example computing hash values in batches and optimizing the hash table for vectorized HashAgg.

3. The conversion cost from Datum to the actual type and vice versa is also high, for example DatumGetFloat4 and Float4GetDatum. One optimization may be to store the actual type in VectorTupleTableSlot directly, instead of an array of datums.
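To illustrate the conversion cost in point 3, compare the two loops below. This is a standalone illustration, not extension code: with an array of Datum every element goes through DatumGetFloat4(), while a native float4 array is a plain contiguous loop the compiler can auto-vectorize.

    /* Illustration of point 3: Datum unboxing vs. a native array. */
    #include "postgres.h"

    static float4
    sum_datum_column(const Datum *values, int n)
    {
        float4  sum = 0.0f;
        int     i;

        for (i = 0; i < n; i++)
            sum += DatumGetFloat4(values[i]);   /* per-element conversion */
        return sum;
    }

    static float4
    sum_native_column(const float4 *values, int n)
    {
        float4  sum = 0.0f;
        int     i;

        /* contiguous native array: cache friendly and SIMD friendly */
        for (i = 0; i < n; i++)
            sum += values[i];
        return sum;
    }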

Related works:
1. VOPS is a vectorized execution extension. Link: https://github.com/postgrespro/vops.
It doesn't use the custom scan framework; it uses UDFs to perform the vectorized operations, e.g. it changes the SQL syntax used to do aggregation.

2. Citus vectorized executor is another POC. Link: https://github.com/citusdata/postgres_vectorization_test.
It uses ExecutorRun_hook to run the vectorized executor and uses cstore fdw to support column storage.

Note that the vectorized executor engine is based on PG9.6 now, but it could be ported to master / zedstore with some effort.  We would appreciate some feedback before moving further in that direction.

Thanks, 
Hubert Zhang, Gang Xiong, Ning Yu, Asim Praveen

Re: Yet another vectorized engine

konstantin knizhnik


On 28.11.2019 12:23, Hubert Zhang wrote:
> We just want to introduce another POC for a vectorized execution engine, https://github.com/zhangh43/vectorize_engine, and to get some feedback on the idea. [...]

Hi,

I think that a vectorized executor is an absolutely necessary thing for Postgres, especially taking into account that we now have a columnar store prototype (zedstore).
To take full advantage of a columnar store we definitely need a vectorized executor.

But I do not completely understand why you are proposing to implement it as an extension.
Yes, custom nodes make it possible to provide vector execution without affecting the Postgres core.
But for efficient integration of zedstore and the vectorized executor we need to extend the table AM (VectorTupleTableSlot and the corresponding scan functions).
Certainly it is easier to contribute a vectorized executor as an extension, but sooner or later I think it should be added to the Postgres core.

As far as I understand, you already have some prototype implementation (otherwise how would you have got the performance results)?
If so, are you planning to publish it, or do you think the executor should be developed from scratch?

Some of my concerns, based on VOPS experience:

1. The vertical (columnar) model is preferable for some kinds of queries, but there are classes of queries for which it is less efficient.
Moreover, data is usually imported into the database in row format, and inserting it into a columnar store record by record is very inefficient.
So you need some kind of bulk loader which is able to buffer the input data before loading it into the columnar store.
Actually this problem is more related to the data model than to the vectorized executor. But what I want to express here is that it may be better to have both representations (horizontal and vertical)
and let the optimizer choose the most efficient one for a particular query.

2. A columnar store and vectorized executor are most efficient for queries like "select sum(x) from T where ...".
Unfortunately such simple queries are rarely used in real life. Usually analytic queries contain group-by and joins.
And here the vertical model is not always optimal (you have to reconstruct rows from columns to perform a join or grouping).
To provide efficient execution of such queries you may need to create multiple different projections of the same data (sorted by different subsets of attributes).
This is why Vertica (one of the most popular columnar store DBMSs) supports projections.
The same can be done in VOPS: using the create_projection function you can specify which attributes should be scalar (grouping attributes) and which vectorized.
In this case you can perform grouping and joins using the standard Postgres executor, while performing vectorized operations for filtering and accumulating aggregates.

This is why Q1 is 20 times faster in VOPS, and not 2 times as in your prototype.
So I think that a columnar store should make it possible to maintain several projections of a table, and the optimizer should be able to automatically choose one of them for a particular query.
Definitely, keeping projections synchronized is a challenging problem. Fortunately, OLAP usually does not require the most recent data.

3. I wonder whether the vectorized executor should support only built-in types and predefined operators, or whether it should be able to work with any user-defined types, operators and aggregates.
Certainly it is much easier to support only built-in scalar types, but that contradicts the open and extensible nature of Postgres.

4. Have you already thought about the format for storing data in VectorTupleTableSlot? Should it be an array of Datum, or do we need to represent the vector in a more low-level format (for example,
as an array of floats for the real4 type)?

-- 
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company 

Re: Yet another vectorized engine

Michael Paquier-2
In reply to this post by Hubert Zhang
On Thu, Nov 28, 2019 at 05:23:59PM +0800, Hubert Zhang wrote:
> Note that the vectorized executor engine is based on PG9.6 now, but it
> could be ported to master / zedstore with some effort.  We would appreciate
> some feedback before moving further in that direction.

There has been no feedback yet, unfortunately.  The patch does not
apply anymore, so a rebase is necessary.  For now I am moving the
patch to next CF, waiting on author.
--
Michael


Re: Yet another vectorized engine

Hubert Zhang
In reply to this post by konstantin knizhnik
Hi Konstantin,
Thanks for your reply.

On Fri, Nov 29, 2019 at 12:09 AM Konstantin Knizhnik <[hidden email]> wrote:
On 28.11.2019 12:23, Hubert Zhang wrote:
We just want to introduce another POC for a vectorized execution engine, https://github.com/zhangh43/vectorize_engine, and to get some feedback on the idea.
But I do not completely understand why you are proposing to implement it as an extension.
Yes, custom nodes make it possible to provide vector execution without affecting the Postgres core.
But for efficient integration of zedstore and the vectorized executor we need to extend the table AM (VectorTupleTableSlot and the corresponding scan functions).
Certainly it is easier to contribute a vectorized executor as an extension, but sooner or later I think it should be added to the Postgres core.

As far as I understand, you already have some prototype implementation (otherwise how would you have got the performance results)?
If so, are you planning to publish it, or do you think the executor should be developed from scratch?

The prototype extension is at https://github.com/zhangh43/vectorize_engine
I agree that a vectorized executor should be added to the Postgres core some day. But it is such a huge feature that it needs changes not only to the extended table AM you mentioned but also to every executor node, such as the Agg, Join and Sort nodes, and beyond that to the expression evaluation functions and the aggregates' transition functions, combine functions, etc. We would need to supply a vectorized version of all of them. Hence our plan is to implement it as an extension first, and if it becomes popular in the community and stable, it could be merged into the Postgres core whenever we want.

We do want some feedback from the community about CustomScan. CustomScan is just an abstract layer; it is typically used to support user-defined scan nodes, but some other PG extensions (e.g. PG-Strom) have already used it as a general custom node, e.g. for Agg, Join, etc. Since a vectorized engine needs to support vectorized processing in all executor nodes, following that idea our choice is to use CustomScan.
 
Some of my concerns, based on VOPS experience:

1. The vertical (columnar) model is preferable for some kinds of queries, but there are classes of queries for which it is less efficient.
Moreover, data is usually imported into the database in row format, and inserting it into a columnar store record by record is very inefficient.
So you need some kind of bulk loader which is able to buffer the input data before loading it into the columnar store.
Actually this problem is more related to the data model than to the vectorized executor. But what I want to express here is that it may be better to have both representations (horizontal and vertical)
and let the optimizer choose the most efficient one for a particular query.


Yes, in general, row format is better for OLTP queries and column format is better for OLAP queries.
As for the storage type (or data model), I think the DBA should choose whether to use a row or column store for a specific table.
As for the executor, it's a good idea to let the optimizer choose based on cost. That is a long-term goal; for now our extension falls back to the original row executor for Insert, Update and IndexScan cases in a rough way.
We want the extension to be enhanced gradually.
 
2. A columnar store and vectorized executor are most efficient for queries like "select sum(x) from T where ...".
Unfortunately such simple queries are rarely used in real life. Usually analytic queries contain group-by and joins.
And here the vertical model is not always optimal (you have to reconstruct rows from columns to perform a join or grouping).
To provide efficient execution of such queries you may need to create multiple different projections of the same data (sorted by different subsets of attributes).
This is why Vertica (one of the most popular columnar store DBMSs) supports projections.
The same can be done in VOPS: using the create_projection function you can specify which attributes should be scalar (grouping attributes) and which vectorized.
In this case you can perform grouping and joins using the standard Postgres executor, while performing vectorized operations for filtering and accumulating aggregates.

This is why Q1 is 20 times faster in VOPS, and not 2 times as in your prototype.
So I think that a columnar store should make it possible to maintain several projections of a table, and the optimizer should be able to automatically choose one of them for a particular query.
Definitely, keeping projections synchronized is a challenging problem. Fortunately, OLAP usually does not require the most recent data.


Projections in Vertica are useful. I tested VOPS and it is indeed faster; it would be nice if you could contribute it to the PG core. Our extension aims to change neither any Postgres code nor the user's SQL or existing tables.
We will continue to optimize our vectorized implementation. Vectorized HashAgg needs a vectorized hash table implementation, e.g. calculating hash keys in a batched way and probing the hash table in a batched way. The original hash table in PG is of course not a vectorized hash table.
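To sketch what "batched" could mean here (purely illustrative; the hash function below is a simple stand-in, not Postgres' hash_any()): first compute the hash values for the whole batch in one tight loop, then map the precomputed hashes to buckets.

    /* Illustrative sketch of batching HashAgg's hash computation. */
    #include "postgres.h"

    /* Simple integer mix as a stand-in for the real hash function. */
    static inline uint32
    hash32_mix(uint32 k)
    {
        k ^= k >> 16;
        k *= 0x85ebca6b;
        k ^= k >> 13;
        return k;
    }

    /* Step 1: hash the whole batch in one tight, easily pipelined loop. */
    static void
    hash_batch(const uint32 *keys, uint32 *hashes, int n)
    {
        int i;

        for (i = 0; i < n; i++)
            hashes[i] = hash32_mix(keys[i]);
    }

    /* Step 2: map the precomputed hashes to buckets (nbuckets is a power of two). */
    static void
    probe_batch(const uint32 *hashes, uint32 *bucketno, int n, uint32 nbuckets)
    {
        int i;

        for (i = 0; i < n; i++)
            bucketno[i] = hashes[i] & (nbuckets - 1);
    }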
 
3. I wonder whether the vectorized executor should support only built-in types and predefined operators, or whether it should be able to work with any user-defined types, operators and aggregates.
Certainly it is much easier to support only built-in scalar types, but that contradicts the open and extensible nature of Postgres.

Yes, we should support user-defined types. This could be done by introducing a registration layer which maps each row type to its vector type, e.g. int4 -> vint4, and likewise for each operator.
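A registration layer like that could start as a simple lookup table from the row type's OID to the vector type's OID. The sketch below is hypothetical (vint4 etc. follow the naming convention above, and the fixed-size table is just for illustration):

    /* Hypothetical sketch of a row-type -> vector-type registry. */
    #include "postgres.h"
    #include "catalog/pg_type.h"        /* INT4OID, FLOAT4OID, ... */

    typedef struct VecTypeMapEntry
    {
        Oid rowtype;    /* e.g. int4 */
        Oid vectype;    /* e.g. vint4, assigned when the extension is created */
    } VecTypeMapEntry;

    static VecTypeMapEntry vectype_map[64];
    static int  vectype_map_size = 0;

    static void
    register_vector_type(Oid rowtype, Oid vectype)
    {
        vectype_map[vectype_map_size].rowtype = rowtype;
        vectype_map[vectype_map_size].vectype = vectype;
        vectype_map_size++;
    }

    static Oid
    lookup_vector_type(Oid rowtype)
    {
        int i;

        for (i = 0; i < vectype_map_size; i++)
            if (vectype_map[i].rowtype == rowtype)
                return vectype_map[i].vectype;
        return InvalidOid;      /* no vectorized counterpart registered */
    }

The same kind of mapping would be needed for operators and aggregate support functions.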

4. Have you already thought about the format for storing data in VectorTupleTableSlot? Should it be an array of Datum, or do we need to represent the vector in a more low-level format (for example,
as an array of floats for the real4 type)?


Our perf results show that the datum conversion is costly, and we plan to implement the low-level native-array format you mentioned instead of an array of datums.
--
Thanks

Hubert Zhang

Re: Yet another vectorized engine

Hubert Zhang
In reply to this post by Michael Paquier-2
On Sun, Dec 1, 2019 at 10:05 AM Michael Paquier <[hidden email]> wrote:
On Thu, Nov 28, 2019 at 05:23:59PM +0800, Hubert Zhang wrote:
> Note that the vectorized executor engine is based on PG9.6 now, but it
> could be ported to master / zedstore with some effort.  We would appreciate
> some feedback before moving further in that direction.

There has been no feedback yet, unfortunately.  The patch does not
apply anymore, so a rebase is necessary.  For now I am moving the
patch to next CF, waiting on author.
--
Michael


Thanks, we'll rebase and resubmit the patch.
--
Thanks

Hubert Zhang

Re: Yet another vectorized engine

konstantin knizhnik
In reply to this post by Hubert Zhang


On 02.12.2019 4:15, Hubert Zhang wrote:

The prototype extension is at https://github.com/zhangh43/vectorize_engine

I am very sorry that I had not followed this link.
A few questions concerning your design decisions:

1. Would it be more efficient to use native arrays in vtype instead of an array of Datum? I think it would allow the compiler to generate more efficient code for operations on float4 and int32 types.
It is possible to use a union to keep the size of vtype fixed.
2. Why does VectorTupleSlot contain an array (batch) of heap tuples rather than vectors (an array of vtype)?
3. Why do you have to implement your own plan_tree_mutator instead of using expression_tree_mutator?
4. As far as I understand, you currently always try to replace SeqScan with your custom vectorized scan. But that makes sense only if there are quals for the scan or aggregation is performed.
In other cases batch+unbatch just adds extra overhead, doesn't it?
5. Throwing and catching an exception for queries which cannot be vectorized does not seem to be the safest or most efficient way of handling such cases.
Maybe it would be better to return an error code from plan_tree_mutator and propagate the error upstairs?
6. Have you experimented with different batch sizes? I have done similar experiments in VOPS and found that tile sizes larger than 128 do not provide a noticeable increase in performance.
You are currently using a batch size of 1024, which is significantly larger than the typical number of tuples on one page.
7. How can vectorized scan be combined with parallel execution (it is already supported in 9.6, isn't it?)

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company 

Re: Yet another vectorized engine

Hubert Zhang
Thanks Konstantin for your detailed review!

On Tue, Dec 3, 2019 at 5:58 PM Konstantin Knizhnik <[hidden email]> wrote:


On 02.12.2019 4:15, Hubert Zhang wrote:

The prototype extension is at https://github.com/zhangh43/vectorize_engine

I am very sorry that I had not followed this link.
A few questions concerning your design decisions:

1. Would it be more efficient to use native arrays in vtype instead of an array of Datum? I think it would allow the compiler to generate more efficient code for operations on float4 and int32 types.
It is possible to use a union to keep the size of vtype fixed.
 
Yes. I'm also considering that when scanning a column store, a column batch is loaded into a contiguous memory region. For int32 the size of this region is 4*BATCHSIZE, while for int16 it is 2*BATCHSIZE, so with native arrays a single memcpy could fill the vtype batch.
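A minimal sketch of that idea, with illustrative names rather than the extension's actual vtype: a union of native arrays keeps sizeof(vtype) fixed, and one memcpy fills an int32 column batch from a contiguous column-store region.

    /* Illustrative sketch: vtype with native arrays kept in a union. */
    #include "postgres.h"
    #include "catalog/pg_type.h"        /* INT4OID */
    #include <string.h>

    #define BATCH_SIZE 1024

    typedef struct vtype
    {
        Oid     elemtype;                   /* element type of the column */
        int     dim;                        /* valid rows in the batch */
        bool    isnull[BATCH_SIZE];
        union
        {
            Datum   datums[BATCH_SIZE];     /* generic representation */
            int16   i16[BATCH_SIZE];        /* native representations ... */
            int32   i32[BATCH_SIZE];
            float4  f4[BATCH_SIZE];
            float8  f8[BATCH_SIZE];
        }       values;                     /* union keeps sizeof(vtype) fixed */
    } vtype;

    /* Fill an int32 column batch from a contiguous column-store region with a
     * single memcpy instead of a per-row Int32GetDatum() loop. */
    static void
    fill_int32_batch(vtype *v, const int32 *column_region, int nrows)
    {
        v->elemtype = INT4OID;
        v->dim = nrows;
        memcpy(v->values.i32, column_region, sizeof(int32) * nrows);
    }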
 
2. Why does VectorTupleSlot contain an array (batch) of heap tuples rather than vectors (an array of vtype)?

a. VectorTupleSlot stores the array of vtype in the tts_values field, which reduces the code changes and lets us reuse functions like ExecProject. Of course we could use a separate field to store the vtypes.
b. VectorTupleSlot also contains an array of heap tuples, which is used for heap tuple deforming. In fact, the tuples in a batch may span many pages, so we also need to pin an array of related pages instead of just one page.
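As an illustration of point b (not the extension's code): since one batch can span several heap pages, the slot would keep all of their buffer pins and drop them only once the batch has been consumed.

    /* Illustrative sketch: keep the pins of every page a batch touches. */
    #include "postgres.h"
    #include "storage/bufmgr.h"

    #define MAX_BATCH_PAGES 64      /* illustrative upper bound */

    typedef struct BatchBufferPins
    {
        int     npinned;
        Buffer  buffers[MAX_BATCH_PAGES];
    } BatchBufferPins;

    static void
    release_batch_pins(BatchBufferPins *pins)
    {
        int i;

        for (i = 0; i < pins->npinned; i++)
            ReleaseBuffer(pins->buffers[i]);    /* drop each page's pin */
        pins->npinned = 0;
    }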

3. Why do you have to implement your own plan_tree_mutator instead of using expression_tree_mutator?

I also want to replace plan nodes, e.g. Agg -> CustomScan (with a VectorAgg implementation). expression_tree_mutator cannot be used to mutate plan nodes such as Agg, am I right?
 
4. As far as I understand, you currently always try to replace SeqScan with your custom vectorized scan. But that makes sense only if there are quals for the scan or aggregation is performed.
In other cases batch+unbatch just adds extra overhead, doesn't it?

There is probably extra overhead for the heap format and a query like 'select i from t;' without quals, projection or aggregation.
But with a column store, VectorScan could read batches directly, with no additional batching cost. A column store is the better choice for OLAP queries.
Can we conclude that it would be better to use the vector engine for OLAP queries and the row engine for OLTP queries?

5. Throwing and catching an exception for queries which cannot be vectorized does not seem to be the safest or most efficient way of handling such cases.
Maybe it would be better to return an error code from plan_tree_mutator and propagate the error upstairs?
 
Yes. As for efficiency, another way is to allow some plan nodes to be vectorized and leave other nodes non-vectorized, with a batch/unbatch layer added between them (is this what you meant by 'propagate this error upstairs'?). As you mentioned, this could introduce additional overhead. Is there any other good approach?
What do you mean by 'not safest'? PG_CATCH will receive the ERROR, and we fall back to the original non-vectorized plan.
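For reference, the fallback pattern under discussion looks roughly like the sketch below, assuming a plan_tree_mutator() that returns the rewritten tree and raises ERROR when it meets a node it cannot vectorize:

    /* Sketch of try-to-vectorize, then fall back to the original plan. */
    #include "postgres.h"
    #include "nodes/plannodes.h"

    extern Plan *plan_tree_mutator(Plan *plan);     /* assumed: ERROR on failure */

    static void
    try_vectorize(PlannedStmt *stmt)
    {
        Plan           *saved_plan = stmt->planTree;
        MemoryContext   oldcontext = CurrentMemoryContext;

        PG_TRY();
        {
            stmt->planTree = plan_tree_mutator(stmt->planTree);
        }
        PG_CATCH();
        {
            /* vectorization failed: clear the error and keep the original plan */
            MemoryContextSwitchTo(oldcontext);
            FlushErrorState();
            stmt->planTree = saved_plan;
        }
        PG_END_TRY();
    }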


6. Have you experimented with different batch sizes? I have done similar experiments in VOPS and found that tile sizes larger than 128 do not provide a noticeable increase in performance.
You are currently using a batch size of 1024, which is significantly larger than the typical number of tuples on one page.

Good point, we will run some experiments on it.

7. How can vectorized scan be combined with parallel execution (it is already supported in 9.6, isn't it?)

We haven't implemented it yet, but the idea is the same as in the non-parallel case: copy the current parallel scan and implement a vectorized Gather, keeping VectorTupleTableSlot as their interface.
Our basic idea is to reuse most of the current PG executor logic, make it vectorized, and then tune performance gradually.

--
Thanks

Hubert Zhang

Re: Yet another vectorized engine

konstantin knizhnik


On 04.12.2019 12:13, Hubert Zhang wrote:
3. Why do you have to implement your own plan_tree_mutator instead of using expression_tree_mutator?

I also want to replace plan nodes, e.g. Agg -> CustomScan (with a VectorAgg implementation). expression_tree_mutator cannot be used to mutate plan nodes such as Agg, am I right?

Oh, sorry, I see.
 
4. As far as I understand, you currently always try to replace SeqScan with your custom vectorized scan. But that makes sense only if there are quals for the scan or aggregation is performed.
In other cases batch+unbatch just adds extra overhead, doesn't it?

There is probably extra overhead for the heap format and a query like 'select i from t;' without quals, projection or aggregation.
But with a column store, VectorScan could read batches directly, with no additional batching cost. A column store is the better choice for OLAP queries.

Generally, yes.
But will it be true for a query with a lot of joins?

select * from T1 join T2 on (T1.pk=T2.fk) join T3 on (T2.pk=T3.fk) join T4 ...

How can batching improve performance in this case?
Also, if the query contains a LIMIT clause or cursors, then batching can cause useless records to be fetched (records which will never be requested by the client).

Can we conclude that it would be better to use the vector engine for OLAP queries and the row engine for OLTP queries?

5. Throwing and catching an exception for queries which cannot be vectorized does not seem to be the safest or most efficient way of handling such cases.
Maybe it would be better to return an error code from plan_tree_mutator and propagate the error upstairs?
 
Yes. As for efficiency, another way is to allow some plan nodes to be vectorized and leave other nodes non-vectorized, with a batch/unbatch layer added between them (is this what you meant by 'propagate this error upstairs'?). As you mentioned, this could introduce additional overhead. Is there any other good approach?
What do you mean by 'not safest'? PG_CATCH will receive the ERROR, and we fall back to the original non-vectorized plan.

The problem with catching and ignoring exceptions has been discussed many times in hackers.
Unfortunately the Postgres PG_TRY/PG_CATCH mechanism is not analogous to the exception mechanism of higher-level languages like C++ or Java:
it doesn't perform stack unwinding. If some resources (files, locks, memory, ...) were obtained before the error was thrown, they are not reclaimed.
Only a rollback of the transaction is guaranteed to release all resources, and that is what happens during normal error processing.
But if you catch and ignore the exception and try to continue execution, it can cause many problems.

Maybe in your case it is not a problem, because you know for sure where the error can happen: it is thrown by plan_tree_mutator,
and it looks like there are no resources obtained by this function. But in any case the overhead of setjmp is much higher than that of explicit checks of a return code.
So checking return codes will not add any noticeable overhead, apart from complicating the code with extra checks,
and those can be hidden in macros which are used in any case (like MUTATE).
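A return-code variant could hide the checks in such a macro. The sketch below is hypothetical: it assumes plan_tree_mutator were changed to return a status code and produce its result through an out parameter.

    /* Hypothetical sketch: return-code based mutation with a MUTATE macro. */
    #include "postgres.h"
    #include "nodes/plannodes.h"

    #define VEC_OK          0
    #define VEC_UNSUPPORTED 1

    /* Assumed new signature: status code returned, rewritten plan via out param. */
    extern int plan_tree_mutator(Plan *plan, Plan **result);

    /* Propagate failure upward instead of throwing an ERROR. */
    #define MUTATE(newplan, oldplan)                            \
        do {                                                    \
            int rc_ = plan_tree_mutator((oldplan), &(newplan)); \
            if (rc_ != VEC_OK)                                  \
                return rc_;                                     \
        } while (0)

    static int
    vectorize_agg(Agg *agg, Plan **result)
    {
        Plan *child;

        MUTATE(child, agg->plan.lefttree);  /* vectorize the input first */

        /* ... wrap the Agg in a vectorized CustomScan and set *result ... */
        *result = (Plan *) agg;             /* placeholder */
        return VEC_OK;
    }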

7. How can vectorized scan be combined with parallel execution (it is already supported in 9.6, isn't it?)

We haven't implemented it yet, but the idea is the same as in the non-parallel case: copy the current parallel scan and implement a vectorized Gather, keeping VectorTupleTableSlot as their interface.
Our basic idea is to reuse most of the current PG executor logic, make it vectorized, and then tune performance gradually.

Parallel scan scatters pages between the parallel workers.
To fill a VectorTupleSlot with data you may need more than one page (unless you decide that it can fetch tuples only from a single page),
so it should somehow take the specifics of parallel scan into account.
Also, there are special nodes for parallel execution, so if we want to provide parallel execution for vectorized operations we also need to substitute those nodes with
custom nodes.

-- 
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company 

Re: Yet another vectorized engine

Hubert Zhang
Thanks Konstantin,
Your suggestions are very helpful. I have added them as issues in the vectorize_engine repo.

On Wed, Dec 4, 2019 at 10:08 PM Konstantin Knizhnik <[hidden email]> wrote:


> [...]


--
Thanks

Hubert Zhang