Implementing Incremental View Maintenance


Implementing Incremental View Maintenance

Yugo Nagata
Hi,

I would like to implement Incremental View Maintenance (IVM) on PostgreSQL.
IVM is a technique to maintain materialized views which computes and applies
only the incremental changes to the materialized views rather than
recomputing the contents from scratch as the current REFRESH command does.

I gave a presentation on our PoC implementation of IVM at PGConf.eu 2018 [1].
Our implementation uses row OIDs to compute deltas for materialized views.
The basic idea is that if we have information about which rows in base tables
are contributing to generate a certain row in a matview, then we can identify
the affected rows when a base table is updated. This is based on an idea of
Dr. Masunaga [2], who is a member of our group, and was inspired by the
ID-based approach [3].

In our implementation, the mapping between the row OIDs of the materialized view
and the base tables is stored in an "OID map". When a base relation is modified,
an AFTER trigger is executed and the delta is recorded in delta tables using
the transition table feature. The actual update of the matview is triggered
by the REFRESH command with an INCREMENTALLY option.

However, we realized problems with our implementation. First, WITH OIDS will
be removed in PG12, so OIDs are no longer available. Besides this, it would
be hard to implement this since it needs many changes to executor nodes to
collect base tables' OIDs while executing a query. Also, the cost of maintaining
the OID map would be high.

For these reasons, we started to think about implementing IVM without relying
on OIDs and did some more surveys.

We also looked at Kevin Grittner's discussion [4] on incremental matview
maintenance.  In this discussion, Kevin proposed to use the counting algorithm [5]
to handle projection views (using DISTINCT) properly. This algorithm needs an
additional system column, count_t, in materialized views and in the delta tables
of base tables.
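To make the counting idea concrete, here is a minimal sketch in Python (purely illustrative; count_t is the only name taken from the discussion, and it assumes every deleted row is actually present in the view) of how a DISTINCT view with a hidden multiplicity column could absorb a delta:

```python
def apply_delta(view, inserted, deleted):
    """view: dict mapping tuple -> count_t; inserted/deleted: lists of tuples."""
    for row in inserted:
        view[row] = view.get(row, 0) + 1      # bump hidden multiplicity
    for row in deleted:
        view[row] -= 1
        if view[row] == 0:                    # last duplicate is gone:
            del view[row]                     # row leaves the DISTINCT view
    return view
```

A row stays visible in the view as long as at least one base-table duplicate of it survives, which is exactly what plain delete-and-insert cannot express.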

However, the discussion about IVM has now stopped, so we would like to restart
it and make progress.


Through our PoC implementation and surveys, I think we need to consider at
least the following points for implementing IVM.

1. How to extract changes on base tables

I think there would be at least two approaches for it.

 - Using transition table in AFTER triggers
 - Extracting changes from WAL using logical decoding

In our PoC implementation, we used AFTER triggers and transition tables, but using
logical decoding might be better from the point of view of base table
modification performance.

If we can represent a change of UPDATE on a base table in a query-like form
rather than as OLD and NEW rows, it may be possible to update the materialized
view directly instead of performing a delete & insert.


2. How to compute the delta to be applied to materialized views

Essentially, IVM is based on relational algebra. Theoretically, changes on base
tables are represented as deltas on this, like "R <- R + dR", and the delta on
the materialized view is computed using base table deltas based on "change
propagation equations".  For implementation, we have to derive the equations from
the view definition query (Query tree, or Plan tree?) and describe them as an SQL
query to compute the delta to be applied to the materialized view.

There could be several operations in a view definition: selection, projection,
join, aggregation, union, difference, intersection, etc.  If we can prepare a
module for each operation, it makes IVM extensible, so we can start with simple
view definitions and then support more complex views.
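As an illustration of such per-operation modules, a toy sketch (bag semantics over tuple lists; the equi-join predicate and all names here are assumptions for the example, not the proposed implementation) of the delta rules for selection and join:

```python
def delta_select(pred, dR):
    # delta(sigma_p(R)) = sigma_p(delta R): only the changed rows are filtered
    return [r for r in dR if pred(r)]

def join(R, S):
    # toy equi-join: last column of r matches first column of s
    return [r + s for r in R for s in S if r[-1] == s[0]]

def delta_join(R_old, dR, S_old, dS):
    # delta(R join S) = (dR join S_old) + (R_old join dS) + (dR join dS)
    return join(dR, S_old) + join(R_old, dS) + join(dR, dS)
```

Each module only needs the old relations and the deltas; stacking them (as denty notes later in the thread) composes the propagation equation for a whole view.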


3. How to identify rows to be modified in materialized views

When applying the delta to the materialized view, we have to identify which row
in the matview corresponds to a given row in the delta.  A naive method is matching
by using all columns in a tuple, but clearly this is inefficient. If the materialized
view has a unique index, we can use it. Maybe we have to force materialized views
to contain all primary key columns of their base tables.  In our PoC implementation,
we used OIDs to identify rows, but these will no longer be available as said above.
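For example, a sketch (hypothetical names, Python dicts standing in for tuples) of applying a delete delta by matching on key columns rather than comparing whole rows:

```python
def apply_deletes(matview, key_cols, delete_delta):
    """Remove matview rows whose key matches any row in the delete delta."""
    doomed = {tuple(d[c] for c in key_cols) for d in delete_delta}
    return [row for row in matview
            if tuple(row[c] for c in key_cols) not in doomed]
```

With a unique index on key_cols, the per-row lookup that this set membership simulates becomes an index probe instead of a full comparison.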


4. When to maintain materialized views

There are two candidates for the timing of maintenance: immediate (eager) or deferred.

In eager maintenance, the materialized view is updated in the same transaction
in which the base table is updated. In deferred maintenance, this is done after the
transaction is committed, for example, when the view is accessed, as a response to a
user request, etc.

In the previous discussion [4], it was planned to start from the "eager" approach.
In our PoC implementation, we used the other approach, that is, using a REFRESH
command to perform IVM. I am not sure which is better as a starting point, but I
am beginning to think that the eager approach may be simpler since we don't have
to maintain base table changes from past transactions.

In the eager maintenance approach, we have to consider a race condition where two
different transactions change base tables simultaneously, as discussed in [4].


[1] https://www.postgresql.eu/events/pgconfeu2018/schedule/session/2195-implementing-incremental-view-maintenance-on-postgresql/
[2] https://ipsj.ixsq.nii.ac.jp/ej/index.php?active_action=repository_view_main_item_detail&page_id=13&block_id=8&item_id=191254&item_no=1 (Japanese only)
[3] https://dl.acm.org/citation.cfm?id=2750546
[4] https://www.postgresql.org/message-id/flat/1368561126.64093.YahooMailNeo%40web162904.mail.bf1.yahoo.com
[5] https://dl.acm.org/citation.cfm?id=170066

Regards,
--
Yugo Nagata <[hidden email]>


Re: Implementing Incremental View Maintenance

denty
Hi Yugo.

> I would like to implement Incremental View Maintenance (IVM) on
> PostgreSQL.

Great. :-)

I think it would address an important gap in PostgreSQL’s feature set.

> 2. How to compute the delta to be applied to materialized views
>
> Essentially, IVM is based on relational algebra. Theoretically, changes on
> base tables are represented as deltas on this, like "R <- R + dR", and the
> delta on the materialized view is computed using base table deltas based on
> "change propagation equations".  For implementation, we have to derive the
> equations from the view definition query (Query tree, or Plan tree?) and
> describe them as an SQL query to compute the delta to be applied to the
> materialized view.

We had a similar discussion in this thread
https://www.postgresql.org/message-id/flat/FC784A9F-F599-4DCC-A45D-DBF6FA582D30%40QQdd.eu,
and I’m very much in agreement that the "change propagation equations”
approach can solve for a very substantial subset of common MV use cases.

> There could be several operations in a view definition: selection,
> projection, join, aggregation, union, difference, intersection, etc.  If we
> can prepare a module for each operation, it makes IVM extensible, so we can
> start with simple view definitions and then support more complex views.

Such a decomposition also allows ’stacking’, allowing complex MV definitions
to be attacked even with only a small handful of modules.

I did a bit of an experiment to see if "change propagation equations” could
be computed directly from the MV’s pg_node_tree representation in the
catalog in PlPgSQL. I found that pg_node_trees are not particularly friendly
to manipulation in PlPgSQL. Even with a more friendly-to-PlPgSQL
representation (I played with JSONB), then the next problem is making sense
of the structures, and unfortunately amongst the many plan/path/tree utility
functions in the code base, I figured only a very few could be sensibly
exposed to PlPgSQL. Ultimately, although I’m still attracted to the idea,
and I think it could be made to work, native code is the way to go at least
for now.

> 4. When to maintain materialized views
>
> [...]
>
> In the previous discussion [4], it was planned to start from the "eager"
> approach. In our PoC implementation, we used the other approach, that is,
> using a REFRESH command to perform IVM. I am not sure which is better as a
> starting point, but I am beginning to think that the eager approach may be
> simpler since we don't have to maintain base table changes from past
> transactions.

Certainly the eager approach allows progress to be made with less
infrastructure.

I am concerned that the eager approach only addresses a subset of the MV use
case space, though. For example, if we presume that an MV is present because
the underlying direct query would be non-performant, then we have to at
least question whether applying the delta-update would also be detrimental
to some use cases.

> In the eager maintenance approach, we have to consider a race condition
> where two different transactions change base tables simultaneously as
> discussed in [4].

I wonder if that nudges towards a logged approach. If the race is due to the
fact of JOIN-worthy tuples being made visible after a COMMIT, but not before,
then does it not follow that the eager approach has to fire some kind of
reconciliation work at COMMIT time? That seems to imply a persistent queue
of some kind, since we can’t assume transactions to be so small that we can
hold the queue in memory.

Hmm. I hadn’t really thought about that particular corner case. I guess a
‘catch’ could simply be to detect such a concurrent update and demote the
refresh approach by marking the MV stale, awaiting a full refresh.

denty.



--
Sent from: http://www.postgresql-archive.org/PostgreSQL-hackers-f1928748.html


Re: Implementing Incremental View Maintenance

Adam Brusselback
Hi all, just wanted to say I am very happy to see progress made on this. My codebase has multiple "materialized tables" which are maintained with statement triggers (transition tables) and custom functions. They are ugly and a pain to maintain, but they work because I have no other solution... for now at least.

> I am concerned that the eager approach only addresses a subset of the MV use
> case space, though. For example, if we presume that an MV is present because
> the underlying direct query would be non-performant, then we have to at
> least question whether applying the delta-update would also be detrimental
> to some use cases.

I will say that in my case, as long as my reads of the materialized view are always consistent with the underlying data, that's what's important.  I don't mind if it's eager, or lazy (as long as lazy still means it will refresh prior to reading).

Re: Implementing Incremental View Maintenance

Nguyễn Trần Quốc Vinh
Dear all,

We have some results on incremental update for MVs. We generate triggers in C to do the incremental maintenance. We posted the code to GitHub about a year ago, but unfortunately I posted an incorrect ctrigger.h header. The mistake was exposed when a person could not compile the generated triggers and reported it to me. I have now re-posted the correct ctrigger.h file.

You can find the code of the generator here: https://github.com/ntqvinh/PgMvIncrementalUpdate/commits/master. You can find how we did it here: https://link.springer.com/article/10.1134/S0361768816050066. The paper is about generating the code in PL/pgSQL. Anyway, I think it is useful for reading the code. I don't know if I can share the paper, so I don't publish it anywhere else. The text about how to generate triggers in C was published with open access, but unfortunately it is in Vietnamese.

We are happy if the code is useful to someone.

Thank you and best regards,

NTQ Vinh

TS. Nguyễn Trần Quốc Vinh
-----------------------------------------------
Chủ nhiệm khoa Tin học
Trường ĐH Sư phạm - ĐH Đà Nẵng
------------------------------------------------
Nguyen Tran Quoc Vinh, PhD
Dean
Faculty of Information Technology
Danang University of Education
Phone: (+84) 511.6-512-586
Mobile: (+84) 914.78-08-98

On Mon, Dec 31, 2018 at 11:20 PM Adam Brusselback <[hidden email]> wrote:
Hi all, just wanted to say  I am very happy to see progress made on this, my codebase has multiple "materialized tables" which are maintained with statement triggers (transition tables) and custom functions. They are ugly and a pain to maintain, but they work because I have no other solution...for now at least.

I am concerned that the eager approach only addresses a subset of the MV use
case space, though. For example, if we presume that an MV is present because
the underlying direct query would be non-performant, then we have to at
least question whether applying the delta-update would also be detrimental
to some use cases.

I will say that in my case, as long as my reads of the materialized view are always consistent with the underlying data, that's what's important.  I don't mind if it's eager, or lazy (as long as lazy still means it will refresh prior to reading).

Re: Implementing Incremental View Maintenance

Tatsuo Ishii-3
In reply to this post by Adam Brusselback
> Hi all, just wanted to say  I am very happy to see progress made on this,
> my codebase has multiple "materialized tables" which are maintained with
> statement triggers (transition tables) and custom functions. They are ugly
> and a pain to maintain, but they work because I have no other
> solution...for now at least.
>
> I am concerned that the eager approach only addresses a subset of the MV use
>> case space, though. For example, if we presume that an MV is present
>> because
>> the underlying direct query would be non-performant, then we have to at
>> least question whether applying the delta-update would also be detrimental
>> to some use cases.
>>
>
> I will say that in my case, as long as my reads of the materialized view
> are always consistent with the underlying data, that's what's important.  I
> don't mind if it's eager, or lazy (as long as lazy still means it will
> refresh prior to reading).

Assuming that we want to implement IVM incrementally (meaning, for
example, we implement DELETE for IVM in PostgreSQL XX, then INSERT for
IVM in PostgreSQL XX+1, etc.), I think it's hard to do with an
eager approach if we want the MV to always be consistent with base tables.

On the other hand, a lazy approach allows us to implement IVM
incrementally, because we could always rebuild the full MV from scratch
if operations on the MV include queries we do not support.

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp


Re: Implementing Incremental View Maintenance

Nguyễn Trần Quốc Vinh
Dear All,

The tool analyzes the input query and then generates triggers (trigger functions and PL/pgSQL scripts as well) for all manipulating events (insert/update/delete) on all underlying base tables. The triggers do incremental updates to the table that contains the query result (MV). You can build the tool, then see the provided example and try it. It is for synchronous maintenance. It was tested hard, but you use it at your own risk.

For asynchronous maintenance, we generate 1) triggers for all manipulating events on base tables to collect all the data changes and save them to 'special' tables; and then 2) a tool to do incremental updates of MVs.

Best regards,

Vinh

TS. Nguyễn Trần Quốc Vinh
-----------------------------------------------
Chủ nhiệm khoa Tin học
Trường ĐH Sư phạm - ĐH Đà Nẵng
------------------------------------------------
Nguyen Tran Quoc Vinh, PhD
Dean
Faculty of Information Technology
Danang University of Education
Phone: (+84) 511.6-512-586
Mobile: (+84) 914.78-08-98


On Mon, Jan 7, 2019 at 9:00 AM Tatsuo Ishii <[hidden email]> wrote:
> Hi all, just wanted to say  I am very happy to see progress made on this,
> my codebase has multiple "materialized tables" which are maintained with
> statement triggers (transition tables) and custom functions. They are ugly
> and a pain to maintain, but they work because I have no other
> solution...for now at least.
>
> I am concerned that the eager approach only addresses a subset of the MV use
>> case space, though. For example, if we presume that an MV is present
>> because
>> the underlying direct query would be non-performant, then we have to at
>> least question whether applying the delta-update would also be detrimental
>> to some use cases.
>>
>
> I will say that in my case, as long as my reads of the materialized view
> are always consistent with the underlying data, that's what's important.  I
> don't mind if it's eager, or lazy (as long as lazy still means it will
> refresh prior to reading).

Assuming that we want to implement IVM incrementally (meaning, for
example, we implement DELETE for IVM in PostgreSQL XX, then INSERT for
IVM in PostgreSQL XX+1, etc.), I think it's hard to do with an
eager approach if we want the MV to always be consistent with base tables.

On the other hand, a lazy approach allows us to implement IVM
incrementally, because we could always rebuild the full MV from scratch
if operations on the MV include queries we do not support.

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp


Re: Implementing Incremental View Maintenance

Mitar
In reply to this post by Yugo Nagata
Hi!

On Thu, Dec 27, 2018 at 4:57 AM Yugo Nagata <[hidden email]> wrote:
> I would like to implement Incremental View Maintenance (IVM) on PostgreSQL.
> IVM is a technique to maintain materialized views which computes and applies
> only the incremental changes to the materialized views rather than
> recomputing the contents from scratch as the current REFRESH command does.

That sounds great! I am interested in this topic because I am
interested in reactive/live queries and support for them in
PostgreSQL. [1]

In that context, the problem is very similar: based on some state of
query results and updated source tables, determine what updates should be
sent to the client describing changes to the query results.
So after computing those incremental changes, instead of applying them
to a materialized view, I would send them to the client. One could see
materialized views as only one type of consumer of such information about
incremental changes.

So I would like to ask if whatever is done in this setting is done in
a way that one could also use outside of the context of materialized views.
Not sure what the API would be, though.

From the perspective of reactive/live queries, this package [2] is
interesting. To my understanding, it adds two columns to all base tables,
one for a unique ID and one for the revision of the row, and then
rewrites queries so that this information is passed all the way to the
query results. In this way it can determine the mapping between
inputs and outputs. I am not sure if it then does an incremental update
or just uses that to determine if a view is invalidated. Not sure if
there is anything about such an approach in the literature, or why both
the ID and revision columns are needed.

> For these reasons, we started to think to implement IVM without relying on OIDs
> and made a bit more surveys.

I also do not see much difference between asking users to have a primary
key on base tables and asking them to have OIDs. Why do you think that
a requirement for primary keys is a hard one? I think we should first
focus on having IVM for base tables with primary keys. Maybe then
later on we could improve on that and make it also work without them.

To me personally, having a unique index on source tables and also on the
materialized view is a reasonable restriction for this feature,
especially for initial versions of it.

> However, the discussion about IVM has now stopped, so we would like to
> restart it and make progress.

What would be next steps in your view to move this further?

> If we can represent a change of UPDATE on a base table in a query-like form
> rather than as OLD and NEW rows, it may be possible to update the materialized
> view directly instead of performing a delete & insert.

Why do you need OLD and NEW? Don't you need just NEW and a list of
columns which changed from those in NEW? I use such a diffing query [4]
to represent changes: the first column has a flag telling if the row
represents an insert, update, or remove; the second column tells which
columns are being changed in the case of an update; and then the NEW
columns follow.
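A rough sketch of such a change format, following the description above (the (op, changed_columns, new_row) shape and all function and field names are made up for illustration):

```python
def diff_rows(old, new):
    """old/new: dicts of primary key -> row dict. Returns change tuples."""
    changes = []
    for k, row in new.items():
        if k not in old:
            changes.append(("insert", None, row))
        elif row != old[k]:
            changed = [c for c in row if row[c] != old[k][c]]
            changes.append(("update", changed, row))
    for k, row in old.items():
        if k not in new:
            changes.append(("delete", None, row))
    return changes
```

Any consumer, a matview maintainer or a live-query client, could subscribe to streams of such tuples.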

I think that maybe standardizing structure for representing those
changes would be a good step towards making this modular and reusable.
Because then we can have three parts:

* Recording and storing changes in a standard format.
* A function which given original data, stored changes, computes
updates needed, also in some standard format.
* A function which given original data and updates needed, applies them.

> In the previous discussion [4], it was planned to start from the "eager"
> approach. In our PoC implementation, we used the other approach, that is,
> using a REFRESH command to perform IVM. I am not sure which is better as a
> starting point, but I am beginning to think that the eager approach may be
> simpler since we don't have to maintain base table changes from past
> transactions.

I think if we split things into three parts as I described above, then
this is just a question of configuration. Or you call all three inside
one trigger to update in "eager" fashion. Or you store computed
updates somewhere and then on demand apply those in "lazy" fashion.

> In the eager maintenance approach, we have to consider a race condition where
> two different transactions change base tables simultaneously, as discussed in [4].

But in the case of "lazy" maintenance there is a mirror problem: what
if later changes to base tables invalidate some previous change to the
materialized view? Imagine that one cell in a base table is first
updated to "foo" and we compute an update for the materialized view
to set it to "foo". And then the same cell is updated to "bar" and we
compute an update for the materialized view again. If we have not
applied either of those updates (because we are "lazy"), the
previously computed update can now be discarded. We could still apply
both, but it would not be efficient.

[1] https://www.postgresql.org/message-id/flat/CAKLmikP%2BPPB49z8rEEvRjFOD0D2DV72KdqYN7s9fjh9sM_32ZA%40mail.gmail.com
[2] https://github.com/nothingisdead/pg-live-query
[3] https://www.postgresql.org/docs/devel/sql-createtable.html
[4] https://github.com/tozd/node-reactive-postgres/blob/eeda4f28d096b6e552d04c5ea138c258cb5b9389/index.js#L329-L340


Mitar

--
http://mitar.tnode.com/
https://twitter.com/mitar_m


Re: Implementing Incremental View Maintenance

Yugo Nagata
In reply to this post by Nguyễn Trần Quốc Vinh
On Tue, 1 Jan 2019 14:46:25 +0700
Nguyễn Trần Quốc Vinh <[hidden email]> wrote:

> We have some results on incremental update for MVs. We generate triggers in
> C to do the incremental maintenance. We posted the code to GitHub about a
> year ago, but unfortunately I posted an incorrect ctrigger.h header. The
> mistake was exposed when a person could not compile the generated triggers
> and reported it to me. I have now re-posted the correct ctrigger.h file.
>
> You can find the code of the generator here:
> https://github.com/ntqvinh/PgMvIncrementalUpdate/commits/master. You can
> find how we did it here:
> https://link.springer.com/article/10.1134/S0361768816050066. The paper is
> about generating the code in PL/pgSQL. Anyway, I think it is useful for
> reading the code. I don't know if I can share the paper, so I don't publish
> it anywhere else. The text about how to generate triggers in C was published
> with open access, but unfortunately it is in Vietnamese.
>
> We are happy if the code is useful to someone.

I have read your paper. It is interesting, and it is great that the algorithm
is described concretely.

After reading it, I have a few questions about your implementation.
Although I may be able to understand by reading your paper and code carefully,
I would appreciate it if you could answer these.

- It is said there are many limitations on the view definition query.
  How does this work when the query is not supported?

- Is it possible to support materialized views that have DISTINCT,
  OUTER JOIN, or sub-queries in your approach?

- It is said that AVG is split into SUM and COUNT. Are these new additional
  columns in the MV visible to users?

- Can this handle the race condition discussed in [1]? That is,
  if concurrent transactions update two different tables in the join
  view definition, is the MV updated successfully?

[1] https://www.postgresql.org/message-id/flat/1368561126.64093.YahooMailNeo%40web162904.mail.bf1.yahoo.com

Regards,
--
Yugo Nagata <[hidden email]>


Re: Implementing Incremental View Maintenance

Yugo Nagata
In reply to this post by Mitar
On Mon, 7 Jan 2019 00:39:00 -0800
Mitar <[hidden email]> wrote:

> That sounds great! I am interested in this topic because I am
> interested in reactive/live queries and support for them in
> PostgreSQL. [1]
>
> In that context, the problem is very similar: based on some state of
> query results and updated source tables, determine what should be new
> updates to send to the client describing changes to the query results.
> So after computing those incremental changes, instead of applying them
> to materialized view I would send them to the client. One could see
> materialized views only type of consumers of such information about
> incremental change.
>
> So I would like to ask if whatever is done in this setting is done in
> a way that one could also outside of the context of materialized view.
> Not sure what would API be thought.

I didn't know about reactive/live queries, but this seems to share part of the
problem with IVM, so we might have a common API.

BTW, what is the use case of reactive/live queries? (just curious)
 

> > For these reasons, we started to think about implementing IVM without
> > relying on OIDs and did some more surveys.
>
> I also do not see much difference between asking users to have primary
> key on base tables or asking them to have OIDs. Why do you think that
> a requirement for primary keys is a hard one? I think we should first
> focus on having IVM with base tables with primary keys. Maybe then
> later on we could improve on that and make it also work without.
>  
> To me personally, having unique index on source tables and also on
> materialized view is a reasonable restriction for this feature.
> Especially for initial versions of it.

Initially, I chose to use OIDs for a theoretical reason, that is, to handle
"bag semantics", which allows duplicate rows in tables. However, I agree
that we can start with the restriction of having unique indexes on base tables.
 

> > If we can represent a change of UPDATE on a base table in a query-like form
> > rather than as OLD and NEW rows, it may be possible to update the
> > materialized view directly instead of performing a delete & insert.
>
> Why do you need OLD and NEW? Don't you need just NEW and a list of
> columns which changed from those in NEW? I use such diffing query [4]
> to represent changes: first column has a flag telling if the row is
> representing insert, update, and remove, the second column tells which
> column are being changed in the case of the update, and then the NEW
> columns follow.

According to the change propagation equation approach, OLD is necessary
to calculate the tuples in the MV to be deleted or modified. However, if tables
have unique keys, such tuples can be identified using the keys, so
OLD may not be needed, at least in the eager approach.

In the lazy approach, the OLD contents of tables are useful. For example, with
a join view MV = R * S, when dR is inserted into R and dS is inserted
into S, the delta to be inserted into MV will be

 dMV = (R_old * dS) + (dR * S_new)
     = (R_old * dS) + (dR * S_old) + (dR * dS)

hence the old contents of tables R and S are needed.
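The identity can be checked mechanically on toy data; in this sketch a cartesian product stands in for the join, and every name is illustrative:

```python
def x(A, B):
    # cartesian product: a join with an always-true predicate
    return [(a, b) for a in A for b in B]

R_old, dR = [1, 2], [3]
S_old, dS = [10], [20]
S_new = S_old + dS

lhs = x(R_old, dS) + x(dR, S_new)
rhs = x(R_old, dS) + x(dR, S_old) + x(dR, dS)
assert sorted(lhs) == sorted(rhs)   # the two forms of dMV agree
# old MV plus the delta reproduces the full recomputation:
assert sorted(x(R_old, S_old) + lhs) == sorted(x(R_old + dR, S_new))
```

The second assertion is the point of the exercise: applying dMV to the old view matches rebuilding R_new * S_new from scratch.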
 
> I think that maybe standardizing structure for representing those
> changes would be a good step towards making this modular and reusable.
> Because then we can have three parts:
>
> * Recording and storing changes in a standard format.
> * A function which given original data, stored changes, computes
> updates needed, also in some standard format.
> * A function which given original data and updates needed, applies them.

> I think if we split things into three parts as I described above, then
> this is just a question of configuration. Or you call all three inside
> one trigger to update in "eager" fashion. Or you store computed
> updates somewhere and then on demand apply those in "lazy" fashion.

I agree that defining the format to represent changes is important. However,
I am not sure both eager and lazy can be handled in the same manner. I'll
think about this more.

> > In the eager maintenance approache, we have to consider a race condition where two
> > different transactions change base tables simultaneously as discussed in [4].
>
> But in the case of "lazy" maintenance there is a mirror problem: what
> if later changes to base tables invalidate some previous change to the
> materialized view. Imagine that one cell in a base table is first
> updated too "foo" and we compute an update for the materialized view
> to set it to "foo". And then the same cell is updated to "bar" and we
> compute an update for the materialized view again. If we have not
> applied any of those updates (because we are "lazy") now the
> previously computed update can be discarded. We could still apply
> both, but it would not be efficient.

In our PoC implementation, I handled this situation by removing
old contents from the NEW delta table. In your example, when the base
table is updated from "foo" to "bar", the "foo" tuple is removed
from the NEW delta and the "bar" tuple is inserted into it, and the delta
of the MV is computed using the final NEW delta.
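A minimal sketch of that compaction (illustrative names; the real PoC works on delta tables, not Python dicts):

```python
def record_update(new_delta, key, new_value):
    # Drop any previously recorded tuple for this key, keep only the latest;
    # the eventual refresh then applies just the final state.
    new_delta.pop(key, None)
    new_delta[key] = new_value

pending = {}
record_update(pending, 42, "foo")   # first update recorded
record_update(pending, 42, "bar")   # supersedes "foo" in the NEW delta
```

Only the "bar" tuple survives to be applied, so the intermediate "foo" state never reaches the materialized view.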

Regards,
--
Yugo Nagata <[hidden email]>


Re: Implementing Incremental View Maintenance

Mitar
Hi!

On Thu, Jan 31, 2019 at 6:20 AM Yugo Nagata <[hidden email]> wrote:
> BTW, what is the use case of reactive/live queries? (just curious)

It allows syncing state between client and server. The client can then
have a subset of the data, and the server can push changes to the client
as they happen. The client can render that reactively in the UI for the
user. So you can easily create a reactive UI which always shows up-to-date
data without having to poll or anything similar.

How are things progressing? Any news on this topic?


Mitar

--
http://mitar.tnode.com/
https://twitter.com/mitar_m


Re: Implementing Incremental View Maintenance

Yugo Nagata
In reply to this post by Yugo Nagata
On Thu, 27 Dec 2018 21:57:26 +0900
Yugo Nagata <[hidden email]> wrote:

> Hi,
>
> I would like to implement Incremental View Maintenance (IVM) on PostgreSQL.  

I am now working on an initial patch for implementing IVM on PostgreSQL.
This enables materialized views to be updated incrementally after one
of their base tables is modified.

For the first patch, I want to start with very simple features.

Firstly, this will handle simple views whose definitions include only
selection, projection, and join.  Standard aggregates (count, sum, avg,
min, max) are not planned to be implemented in the first patch, but these
are commonly used in materialized views, so I'll implement them later on.
Views which include sub-queries, outer joins, CTEs, or window functions are also
out of scope of the first patch. Also, views including self-joins or views
referencing other views in their definition are not considered well, either.
I need more investigation on these types of views, although I found some papers
explaining how to handle sub-queries and outer joins.

Next, this will handle materialized views with no duplicates in their
tuples. I am thinking of implementing an algorithm to handle duplicates
called "counting-algorithm" afterward, but I'll start from this
no-duplicates assumption in the first patch for simplicity.

In the first patch, I will implement only "immediate maintenance", that is,
materialized views are updated immediately in the same transaction where a
base table is modified.  On the other hand, in "deferred maintenance",
materialized views are updated after the transaction, for example, by a
user command like REFRESH. Although I plan to implement both eventually,
I'll start from "immediate" because it seems to need less code than
"deferred". Implementing "deferred" requires a mechanism to maintain logs
recording changes to base tables and an algorithm to compute the delta to
be applied to materialized views.
 
I plan to implement immediate maintenance using AFTER triggers created
automatically on a materialized view's base tables.  In an AFTER trigger
using the transition table feature, the changes that occurred on the base
tables are recorded in ephemeral relations. We can compute the delta to be
applied to the materialized view by using these ephemeral relations and the
view definition query, then update the view by applying this delta.
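The flow above can be illustrated with a toy Python simulation (purely
illustrative — the actual patch manipulates SQL query trees; the table,
predicate, and function names here are made up):

```python
# Toy simulation of immediate view maintenance: the view query is applied
# to the transition relation (the delta of the base table) instead of the
# whole table, and the resulting view delta is merged into the matview.

def view_query(rows):
    """View definition: SELECT a, b FROM base WHERE a > 0 (selection + projection)."""
    return [(a, b) for (a, b, _c) in rows if a > 0]

# Initial population: run the view query over the full base table.
base = [(1, "x", 10), (-1, "y", 20), (2, "z", 30)]
matview = view_query(base)           # [(1, "x"), (2, "z")]

# An INSERT fires the AFTER trigger; the transition table holds only the
# new rows, so the view delta is computed from those rows alone.
inserted = [(3, "w", 40), (-5, "v", 50)]
delta = view_query(inserted)         # [(3, "w")] -- (-5, ...) fails the predicate
matview.extend(delta)

print(matview)                       # [(1, 'x'), (2, 'z'), (3, 'w')]
```

The point is that the expensive full recomputation is replaced by running
the same query over a (usually much smaller) delta.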

--
Yugo Nagata <[hidden email]>


Reply | Threaded
Open this post in threaded view
|

Re: Implementing Incremental View Maintenance

Greg Stark
On Sun, 31 Mar 2019 at 23:22, Yugo Nagata <[hidden email]> wrote:
>
> Firstly, this will handle simple definition views which includes only
> selection, projection, and join.  Standard aggregations (count, sum, avg,
> min, max) are not planned to be implemented in the first patch, but these
> are commonly used in materialized views, so I'll implement them later on.

It's fine to not have all the features from day 1 of course. But I
just picked up this comment and the followup talking about splitting
AVG into SUM and COUNT and I had a comment. When you do look at
tackling aggregates I don't think you should restrict yourself to
these specific standard aggregations. We have all the necessary
abstractions to handle all aggregations that are feasible, see
https://www.postgresql.org/docs/devel/xaggr.html#XAGGR-MOVING-AGGREGATES

What you need to do -- I think -- is store the "moving aggregate
state" before the final function. Then whenever a row is inserted,
deleted, or updated (or whenever another column is updated in a way that
causes the row to enter or leave the aggregation), apply either
aggtransfn or agginvtransfn to the state. I'm not sure if you want to
apply the final function on every update or only lazily; either may be
better in some usages.
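As a rough sketch of this idea (an illustrative Python toy, not the actual
PostgreSQL aggregate machinery; the function names only loosely mirror the
pg_aggregate catalog columns), shown for avg, whose moving-aggregate state
is a (sum, count) pair:

```python
# Keep the moving-aggregate state in the matview; apply the transition
# function when a row enters the aggregation and the inverse transition
# function when it leaves, running the final function lazily on read.

def avg_transfn(state, value):
    s, n = state
    return (s + value, n + 1)

def avg_invtransfn(state, value):    # inverse transition: a row leaves
    s, n = state
    return (s - value, n - 1)

def avg_finalfn(state):
    s, n = state
    return s / n if n else None      # avg of zero rows is NULL

state = (0, 0)
for v in (10, 20, 30):               # three rows enter the aggregation
    state = avg_transfn(state, v)
state = avg_invtransfn(state, 30)    # one row is deleted
print(avg_finalfn(state))            # 15.0
```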


Reply | Threaded
Open this post in threaded view
|

Re: Implementing Incremental View Maintenance

Yugo Nagata
In reply to this post by Yugo Nagata
On Mon, 1 Apr 2019 12:11:22 +0900
Yugo Nagata <[hidden email]> wrote:

> On Thu, 27 Dec 2018 21:57:26 +0900
> Yugo Nagata <[hidden email]> wrote:
>
> > Hi,
> >
> > I would like to implement Incremental View Maintenance (IVM) on PostgreSQL.  
>
> I am now working on an initial patch for implementing IVM on PostgreSQL.
> This enables materialized views to be updated incrementally after one
> of their base tables is modified.
Attached is a WIP patch of Incremental View Maintenance (IVM).
The major part was written by me, and the changes in syntax and pg_class
are Hoshiai-san's work.

Although this is still a draft, work-in-progress patch, any
suggestions or thoughts would be appreciated.
 
* What it is

This allows a kind of immediate maintenance of materialized views. If a
materialized view is created by the CREATE INCREMENTAL MATERIALIZED VIEW
command, the contents of the matview are updated automatically and
incrementally after base tables are updated. Note that this syntax is just
tentative, so it may be changed.

====== Example 1 ======
postgres=# CREATE INCREMENTAL MATERIALIZED VIEW m AS SELECT * FROM t0;
SELECT 3
postgres=# SELECT * FROM m;
 i
---
 3
 2
 1
(3 rows)

postgres=# INSERT INTO t0 VALUES (4);
INSERT 0 1
postgres=# SELECt * FROM m; -- automatically updated
 i
---
 3
 2
 1
 4
(4 rows)
=============================

This implementation also supports matviews containing duplicate tuples or
a DISTINCT clause in their view definition query.  For example, even if a
matview is defined with DISTINCT to remove duplicate tuples from a base
table, this can perform incremental updates of the matview properly. That
is, the contents of the matview don't change when already-existing tuples
are inserted into the base tables, and a tuple in the matview is deleted
only when the multiplicity of the corresponding tuple in the base table
becomes zero.

This is due to the "counting algorithm", in which the multiplicity of each
tuple is stored in the matview as a special column value.
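A toy model of this counting behaviour (hypothetical Python, not the
patch's code) reproduces the DISTINCT-view case of Example 2 below:

```python
# Counting algorithm: the matview stores each distinct tuple with its
# multiplicity (__ivm_count__). An insert increments the count, a delete
# decrements it, and the tuple disappears only when the count reaches
# zero -- exactly the behaviour needed for DISTINCT views.
from collections import Counter

matview = Counter()

def apply_delta(inserted=(), deleted=()):
    for t in inserted:
        matview[t] += 1
    for t in deleted:
        matview[t] -= 1
        if matview[t] == 0:
            del matview[t]           # multiplicity hit zero: drop the tuple

apply_delta(inserted=["A", "B", "C", "A"])       # initial rows of t1
apply_delta(inserted=["B"], deleted=["A", "C"])  # INSERT (5,'B'); DELETE id 1,3
print(dict(matview))                 # {'A': 1, 'B': 2}
```

Note how "A" survives the delete (one duplicate remains) while "C" is
removed, matching the m2 output in Example 2.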

====== Example 2 ======
postgres=# SELECT * FROM t1;
 id | t
----+---
  1 | A
  2 | B
  3 | C
  4 | A
(4 rows)

postgres=# CREATE INCREMENTAL MATERIALIZED VIEW m1 AS SELECT t FROM t1;
SELECT 3
postgres=# CREATE INCREMENTAL MATERIALIZED VIEW m2 AS SELECT DISTINCT t FROM t1;
SELECT 3
postgres=# SELECT * FROM m1; -- with duplicity
 t
---
 A
 A
 C
 B
(4 rows)

postgres=# SELECT * FROM m2;
 t
---
 A
 B
 C
(3 rows)

postgres=# INSERT INTO t1 VALUES (5, 'B');
INSERT 0 1
postgres=# DELETE FROM t1 WHERE id IN (1,3); -- delete (1,A),(3,C)
DELETE 2
postgres=# SELECT * FROM m1; -- one A left and one more B
 t
---
 B
 B
 A
(3 rows)

postgres=# SELECT * FROM m2; -- only C is removed
 t
---
 B
 A
(2 rows)
=============================

* How it works

1. Creating matview

When a matview is created, AFTER triggers are internally created
on its base tables. When a base table is modified (INSERT, DELETE,
UPDATE), the matview is updated incrementally in the trigger function.

When populating the matview, GROUP BY and count(*) are added to the
view definition query before it is executed, in order to count the
multiplicity of tuples in the matview. The result of count is stored in
the matview as a special column named "__ivm_count__".

2. Maintenance of matview

When base tables are modified, the change set of each table can be
referred to as an Ephemeral Named Relation (ENR) thanks to transition
tables (a trigger feature available since PG10). We can calculate the diff
set of the matview by replacing the base table in the view definition
query with the ENR (at least for selection-projection-join views).
As at view definition time, GROUP BY and count(*) are added in order
to count the multiplicity of tuples in the diff set. As a result, two diff
sets (to be deleted from and to be inserted into the matview) are
calculated, and the results are stored into temporary tables respectively.

The matview is updated by merging these change sets. Instead of simply
executing DELETE or INSERT, the values of the __ivm_count__ column in the
matview are decreased or increased. When the value becomes zero, the
corresponding tuple is deleted from the matview.
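A rough illustration of this merge step (a Python toy, not the actual
implementation — the real patch does this with SQL over temporary tables):

```python
# Merge the two diff sets into the matview: each diff set maps a tuple to
# its multiplicity, and the merge adjusts __ivm_count__ rather than
# blindly deleting and inserting rows.

def merge(matview, diff_delete, diff_insert):
    """matview and both diffs are dicts: tuple -> __ivm_count__."""
    for t, n in diff_delete.items():
        matview[t] -= n
        if matview[t] <= 0:          # count reached zero: drop the tuple
            del matview[t]
    for t, n in diff_insert.items():
        matview[t] = matview.get(t, 0) + n
    return matview

mv = {("A",): 2, ("B",): 1, ("C",): 1}
merge(mv, diff_delete={("A",): 1, ("C",): 1}, diff_insert={("B",): 1})
print(mv)                            # {('A',): 1, ('B',): 2}
```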

3. Access to matview

When SELECT is issued against an IVM matview defined with DISTINCT, all
columns except __ivm_count__ of each tuple in the matview are returned.
This is correct because duplicate tuples have already been eliminated by
GROUP BY.

When DISTINCT is not used, SELECT on the IVM matview returns each tuple
__ivm_count__ times. Currently, this is implemented by rewriting the SELECT
query to replace the matview RTE with a subquery which joins the matview
and the generate_series function as below.

 SELECT mv.* FROM mv, generate_series(1, mv.__ivm_count__);

The __ivm_count__ column is invisible to users when "SELECT * FROM ..." is
issued, but users can see the value by specifying it in the target list
explicitly.
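What that rewrite achieves can be shown with a small Python toy (purely
illustrative; the real mechanism is the lateral generate_series join in
the rewritten SQL above):

```python
# Return each stored tuple __ivm_count__ times, i.e. re-expand the
# count-compressed matview back into a multiset of rows.

def expand(matview):
    """matview rows are (payload..., __ivm_count__) tuples."""
    out = []
    for *payload, count in matview:
        out.extend([tuple(payload)] * count)
    return out

print(expand([("B", 2), ("A", 1)]))  # [('B',), ('B',), ('A',)]
```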

====== Example 3 ======
postgres=# \d+ m1
                                  Materialized view "public.m1"
    Column     |  Type  | Collation | Nullable | Default | Storage  | Stats target | Description
---------------+--------+-----------+----------+---------+----------+--------------+-------------
 t             | text   |           |          |         | extended |              |
 __ivm_count__ | bigint |           |          |         | plain    |              |
View definition:
 SELECT t1.t
   FROM t1;
Access method: heap

postgres=# \d+ m2
                                  Materialized view "public.m2"
    Column     |  Type  | Collation | Nullable | Default | Storage  | Stats target | Description
---------------+--------+-----------+----------+---------+----------+--------------+-------------
 t             | text   |           |          |         | extended |              |
 __ivm_count__ | bigint |           |          |         | plain    |              |
View definition:
 SELECT DISTINCT t1.t
   FROM t1;
Access method: heap

postgres=# SELECT *, __ivm_count__ FROM m1;
 t | __ivm_count__
---+---------------
 B |             2
 B |             2
 A |             1
(3 rows)

postgres=# SELECT *, __ivm_count__ FROM m2;
 t | __ivm_count__
---+---------------
 B |             2
 A |             1
(2 rows)

postgres=# EXPLAIN SELECT * FROM m1;
                                  QUERY PLAN                                  
------------------------------------------------------------------------------
 Nested Loop  (cost=0.00..61.03 rows=3000 width=2)
   ->  Seq Scan on m1 mv  (cost=0.00..1.03 rows=3 width=10)
   ->  Function Scan on generate_series  (cost=0.00..10.00 rows=1000 width=0)
(3 rows)
=============================

* Simple Performance Evaluation

I confirmed that "incremental" update of matviews is more effective
than the standard REFRESH by using a simple example.  I used the
pgbench tables (scale factor 100) here.

Create two matviews, that is, without and with IVM.

test=# CREATE MATERIALIZED VIEW bench1 AS
        SELECT aid, bid, abalance, bbalance
        FROM pgbench_accounts JOIN pgbench_branches USING (bid)
        WHERE abalance > 0 OR bbalance > 0;
SELECT 5001054
test=# CREATE INCREMENTAL MATERIALIZED VIEW bench2 AS
        SELECT aid, bid, abalance, bbalance
        FROM pgbench_accounts JOIN pgbench_branches USING (bid)
        WHERE abalance > 0 OR bbalance > 0;
SELECT 5001054

The standard REFRESH of bench1 took more than 10 seconds.

test=# \timing
Timing is on.
test=# REFRESH MATERIALIZED VIEW bench1 ;
REFRESH MATERIALIZED VIEW
Time: 11210.563 ms (00:11.211)

Create an index on the IVM matview (bench2).

test=# CREATE INDEX on bench2(aid,bid);
CREATE INDEX

Updating a tuple in pgbench_accounts took 18ms. After this, bench2
was updated automatically and correctly.

test=# SELECT * FROM bench2 WHERE aid = 1;
 aid | bid | abalance | bbalance
-----+-----+----------+----------
   1 |   1 |       10 |       10
(1 row)

Time: 2.498 ms
test=# UPDATE pgbench_accounts SET abalance = 1000 WHERE aid = 1;
UPDATE 1
Time: 18.634 ms
test=# SELECT * FROM bench2 WHERE aid = 1;
 aid | bid | abalance | bbalance
-----+-----+----------+----------
   1 |   1 |     1000 |       10
(1 row)

However, without the index on bench2, the same update took about 4
seconds, so appropriate indexes are needed on IVM matviews.

test=# DROP INDEX bench2_aid_bid_idx ;
DROP INDEX
Time: 10.613 ms
test=# UPDATE pgbench_accounts SET abalance = 2000 WHERE aid = 1;
UPDATE 1
Time: 3931.274 ms (00:03.931)

* Restrictions on view definition

This patch is still a work in progress and there are many restrictions
on the view definition query of matviews.

The current implementation supports views including selection, projection,
and inner joins, with or without DISTINCT. Aggregates and GROUP BY are not
supported yet, but I plan to deal with them by the first release.
Self-joins, subqueries, OUTER JOIN, CTEs, and window functions are not
considered well, either. I need more investigation on these types of
views, although I found some papers explaining how to handle sub-queries
and outer joins.

These unsupported views should be rejected when a matview is created, but
this check is not implemented yet.  Hoshiai-san is working on this.

* Timing of view maintenance

This patch implements a kind of immediate maintenance, that is, a matview
is updated immediately when a base table is modified.  On the other hand,
in "deferred maintenance", matviews are updated after the transaction, for
example, by a user command like REFRESH.

Implementing "deferred" requires a mechanism to maintain logs recording
changes to base tables and an algorithm to compute the delta to be applied
to matviews.
 
In addition, there could be another implementation of immediate
maintenance in which the matview is updated at the end of the transaction
that modified a base table, rather than in an AFTER trigger. Oracle
supports this type of IVM. To implement this, we would need a mechanism to
maintain change logs on base tables, as in deferred maintenance.

* Counting algorithm implementation

There will also be discussions on the counting-algorithm implementation.
Firstly, the current patch treats "__ivm_count__" as a special column name
in a somewhat ad hoc way. It is used when maintaining and accessing
matviews, and when "SELECT * FROM ..." is issued, the __ivm_count__ column
is invisible to users.  Maybe this name has to be prohibited in user
tables. Is it acceptable to use such columns for IVM, and if not, is there
a better way?

Secondly, a matview with duplicate tuples is replaced with a subquery
which uses the generate_series function. It does not have to be
generate_series; we could create a new set-returning function for this.
Either way, this internal behaviour is visible in EXPLAIN output, as shown
in Example 3. There is also a performance impact because the estimated row
count is wrong, and what is worse, the cost of the join is not small when
the matview is large. Therefore, we might have to add a new plan node for
selecting from matviews rather than using such a special set-returning
function.


Regards,
--
Yugo Nagata <[hidden email]>

WIP_immediate_IVM.patch (45K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Implementing Incremental View Maintenance

Yugo Nagata
Hi hackers,

Thank you for your many questions and feedbacks at PGCon 2019.
Attached is the patch rebased for the current master branch.

Regards,
Yugo Nagata

On Tue, 14 May 2019 15:46:48 +0900
Yugo Nagata <[hidden email]> wrote:

> [...]

--
Yugo Nagata <[hidden email]>

WIP_immediate_IVM_v2.patch (45K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Implementing Incremental View Maintenance

Jim Finnerty
Hi Yugo,

    I'd like to compare the performance of your MV refresh algorithm versus
an approach that logs changes into an mv log table, and can then apply the
changes at some later point in time.  I'd like to handle the materialized
join view (mjv) case first, specifically a 2-way left outer join, with a UDF
in the SELECT list of the mjv.

    Does your refresh algorithm handle mjv's with connected join graphs that
consist entirely of inner and left outer joins?

    If so, I'd like to measure the overhead of your refresh algorithm on
pgbench, modified to include an mjv, versus a (hand coded) incremental
maintenance algorithm that uses mv log tables populated by ordinary
triggers.  We may also want to look at capturing the deltas using logical
replication, which ought to be faster than a trigger-based solution.

    I have someone available to do the performance testing for another 2
months, so if you can connect with me off-list to coordinate, we can set up
the performance experiments and run them on our AWS clusters.

best regards,

    /Jim F



-----
Jim Finnerty, AWS, Amazon Aurora PostgreSQL
--
Sent from: http://www.postgresql-archive.org/PostgreSQL-hackers-f1928748.html


Jim Finnerty, AWS, Amazon Aurora PostgreSQL
Reply | Threaded
Open this post in threaded view
|

Re: Implementing Incremental View Maintenance

Yugo Nagata
Hi Jim,

On Fri, 21 Jun 2019 08:41:11 -0700 (MST)
Jim Finnerty <[hidden email]> wrote:

> Hi Yugo,
>
>     I'd like to compare the performance of your MV refresh algorithm versus
> an approach that logs changes into an mv log table, and can then apply the
> changes at some later point in time.  I'd like to handle the materialized
> join view (mjv) case first, specifically a 2-way left outer join, with a UDF
> in the SELECT list of the mjv.

Do you mean you have your own implementation of IVM that uses log tables?
I'm very interested in this, and I would appreciate it if you could
explain the details.
 
>     Does your refresh algorithm handle mjv's with connected join graphs that
> consist entirely of inner and left outer joins?

>     If so, I'd like to measure the overhead of your refresh algorithm on
> pgbench, modified to include an mjv, versus a (hand coded) incremental
> maintenance algorithm that uses mv log tables populated by ordinary
> triggers.  We may also want to look at capturing the deltas using logical
> replication, which ought to be faster than a trigger-based solution.

In our current implementation, outer joins are not yet supported, though
we plan to handle them in the future. So, we would not be able to compare
these directly on the same workload at the moment.

However, our current implementation only supports updating materialized
views in a trigger, and the performance of modifying base tables will be
lower than with the approach that uses log tables. This is because the
queries to update materialized views are issued inside the trigger. This
is not only an overhead in itself, but it also takes a lock on the
materialized view, which has an impact on concurrent execution
performance.

In our previous PoC, we implemented IVM using log tables, in which logs
are captured by triggers and materialized views are updated incrementally
by a user command [1]. However, to implement the log-table approach, we
need an infrastructure to maintain these logs: for example, to decide
which logs are necessary and which logs can be discarded, etc. We thought
this is not trivial work, so we decided to start from the current
approach, which doesn't use log tables. We are now preparing to implement
this as the next step, because it is also needed to support deferred
maintenance of views.

[1] https://www.postgresql.eu/events/pgconfeu2018/schedule/session/2195-implementing-incremental-view-maintenance-on-postgresql/

I agree that capturing the deltas using logical decoding would be faster
than using a trigger, although we haven't considered this well yet.

Best regards,
Yugo Nagata


--
Yugo Nagata <[hidden email]>


Reply | Threaded
Open this post in threaded view
|

Re: Implementing Incremental View Maintenance

Yugo Nagata
In reply to this post by Yugo Nagata
Hi,

Attached is a WIP patch of IVM which supports some aggregate functions.

Currently, only count and sum are supported. Avg, min, and max are not
supported yet, although I think supporting them would not be so hard.

As a restriction, expressions specified in GROUP BY must appear in the
target list of the view, because tuples to be updated in the MV are
identified using these group keys.


In the case of views without aggregate functions, only the number of tuple
duplicates (__ivm_count__) is updated at incremental maintenance. On the
other hand, in the case of views with aggregates, the aggregated values
are also updated. The way they are updated depends on the kind of
aggregate function.

In the case of sum (and aggregate functions other than count), NULL input
values are ignored, and the function returns NULL when no rows are
selected. To support this behaviour, the number of non-NULL input values
is counted and stored in the MV as a hidden column whose name is like
__ivm_count_sum__, for example.

In the case of count, it returns zero when no rows are selected, and
count(*) doesn't ignore NULL input. These behaviours are also supported.

Tuples to be updated in the MV are identified using the keys specified by
the GROUP BY clause. However, in the case of aggregation without GROUP BY,
there is only one tuple in the view, so keys are not used to identify
tuples.
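To illustrate, here is a toy Python model (not the patch's code; all names
are invented) of maintaining a view like SELECT i, sum(j), count(i) ...
GROUP BY i, including the hidden non-NULL input counter for sum:

```python
# Per group we keep sum(j), count(i), and a counter of non-NULL j inputs
# (playing the role of the patch's hidden __ivm_count_sum__ column), so
# that sum can revert to NULL when no non-NULL inputs remain.

mv = {}  # group key i -> [sum_j, count_i, nonnull_j]

def apply(row, sign):
    """sign = +1 for an inserted base row, -1 for a deleted one."""
    i, j = row
    s = mv.setdefault(i, [0, 0, 0])
    s[1] += sign                     # count(i): the group key itself
    if j is not None:                # sum ignores NULL inputs
        s[0] += sign * j
        s[2] += sign
    if s[1] == 0:                    # group's multiplicity reached zero
        del mv[i]

for r in [(1, 10), (2, 20)]:
    apply(r, +1)
apply((2, 100), +1)                  # INSERT (2,100)
apply((2, 20), -1)                   # DELETE (2,20)
print({k: (s if n else None, c) for k, (s, c, n) in mv.items()})
# {1: (10, 1), 2: (100, 1)}
```

An UPDATE is handled as a delete of the old row followed by an insert of
the new one, mirroring the OLD/NEW transition tables.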


In addition, a race condition which occurred in the previous version is
prevented in this patch. In the previous version, when two transactions
changed a base table concurrently, an anomalous update of the MV was
possible, because a change made in one transaction was not visible to the
other transaction, even at the READ COMMITTED level.

To prevent this, the patch now takes a lock at an early stage of view
maintenance to wait until concurrent transactions updating the same MV end.
Also, we have to get the latest snapshot before computing delta tables because
any changes that occur in other transactions during the lock wait are not
visible even in READ COMMITTED level.

In REPEATABLE READ or SERIALIZABLE level, we don't wait for a lock but raise an
error immediately to prevent anomalous updates. These solutions might be ugly,
but something to prevent anomalous updates is necessary anyway. There may be a
better way.


Moreover, some regression tests for aggregate function support are added.
This is Hoshiai-san's work.

Although the code is not refined yet and will need a good deal of refactoring
and reorganizing, I am submitting this to share the current status.


* Example (from regression tests)

=======================================================================
(1) creating tables

CREATE TABLE mv_base_a (i int, j int);
INSERT INTO mv_base_a VALUES
  (1,10),
  (2,20),
  (3,30),
  (4,40),
  (5,50);


(2) Views with SUM() and COUNT() aggregate functions

BEGIN;
CREATE INCREMENTAL MATERIALIZED VIEW mv_ivm_agg AS SELECT i, SUM(j), COUNT(i)  FROM mv_base_a GROUP BY i;
SELECT * FROM mv_ivm_agg ORDER BY 1,2,3;
 i | sum | count
---+-----+-------
 1 |  10 |     1
 2 |  20 |     1
 3 |  30 |     1
 4 |  40 |     1
 5 |  50 |     1
(5 rows)

INSERT INTO mv_base_a VALUES(2,100);
SELECT * FROM mv_ivm_agg ORDER BY 1,2,3;
 i | sum | count
---+-----+-------
 1 |  10 |     1
 2 | 120 |     2
 3 |  30 |     1
 4 |  40 |     1
 5 |  50 |     1
(5 rows)

UPDATE mv_base_a SET j = 200 WHERE (i,j) = (2,100);
SELECT * FROM mv_ivm_agg ORDER BY 1,2,3;
 i | sum | count
---+-----+-------
 1 |  10 |     1
 2 | 220 |     2
 3 |  30 |     1
 4 |  40 |     1
 5 |  50 |     1
(5 rows)

DELETE FROM mv_base_a WHERE (i,j) = (2,200);
SELECT * FROM mv_ivm_agg ORDER BY 1,2,3;
 i | sum | count
---+-----+-------
 1 |  10 |     1
 2 |  20 |     1
 3 |  30 |     1
 4 |  40 |     1
 5 |  50 |     1
(5 rows)

ROLLBACK;


(3) Views with the COUNT(*) aggregate function

BEGIN;
CREATE INCREMENTAL MATERIALIZED VIEW mv_ivm_agg AS SELECT i, SUM(j),COUNT(*)  FROM mv_base_a GROUP BY i;
SELECT * FROM mv_ivm_agg ORDER BY 1,2,3;
 i | sum | count
---+-----+-------
 1 |  10 |     1
 2 |  20 |     1
 3 |  30 |     1
 4 |  40 |     1
 5 |  50 |     1
(5 rows)

INSERT INTO mv_base_a VALUES(2,100);
SELECT * FROM mv_ivm_agg ORDER BY 1,2,3;
 i | sum | count
---+-----+-------
 1 |  10 |     1
 2 | 120 |     2
 3 |  30 |     1
 4 |  40 |     1
 5 |  50 |     1
(5 rows)

ROLLBACK;

(4) Views with an aggregate function but no GROUP BY clause

BEGIN;
CREATE INCREMENTAL MATERIALIZED VIEW mv_ivm_group AS SELECT SUM(j) FROM mv_base_a;
SELECT * FROM mv_ivm_group ORDER BY 1;
 sum
-----
 150
(1 row)

INSERT INTO mv_base_a VALUES(6,20);
SELECT * FROM mv_ivm_group ORDER BY 1;
 sum
-----
 170
(1 row)
=======================================================================



On Thu, 20 Jun 2019 16:44:10 +0900
Yugo Nagata <[hidden email]> wrote:

> Hi hackers,
>
> Thank you for your many questions and feedbacks at PGCon 2019.
> Attached is the patch rebased for the current master branch.
>
> Regards,
> Yugo Nagata
>
> On Tue, 14 May 2019 15:46:48 +0900
> Yugo Nagata <[hidden email]> wrote:
>
> > On Mon, 1 Apr 2019 12:11:22 +0900
> > Yugo Nagata <[hidden email]> wrote:
> >
> > > On Thu, 27 Dec 2018 21:57:26 +0900
> > > Yugo Nagata <[hidden email]> wrote:
> > >
> > > > Hi,
> > > >
> > > > I would like to implement Incremental View Maintenance (IVM) on PostgreSQL.  
> > >
> > > I am now working on an initial patch for implementing IVM on PostgreSQL.
> > > This enables materialized views to be updated incrementally after one
> > > of their base tables is modified.
> >
> > Attached is a WIP patch of Incremental View Maintenance (IVM).
> > Major part is written by me, and changes in syntax and pg_class
> > are Hoshiai-san's work.
> >
> > Although this is still a draft patch, work in progress, any
> > suggestions or thoughts would be appreciated.
> >  
> > * What it is
> >
> > This allows a kind of Immediate Maintenance of materialized views. If a
> > materialized view is created by the CREATE INCREMENTAL MATERIALIZED VIEW
> > command, the contents of the matview are updated automatically and
> > incrementally after base tables are updated. Note this syntax is just
> > tentative, so it may be changed.
> >
> > ====== Example 1 ======
> > postgres=# CREATE INCREMENTAL MATERIALIZED VIEW m AS SELECT * FROM t0;
> > SELECT 3
> > postgres=# SELECT * FROM m;
> >  i
> > ---
> >  3
> >  2
> >  1
> > (3 rows)
> >
> > postgres=# INSERT INTO t0 VALUES (4);
> > INSERT 0 1
> > postgres=# SELECT * FROM m; -- automatically updated
> >  i
> > ---
> >  3
> >  2
> >  1
> >  4
> > (4 rows)
> > =============================
> >
> > This implementation also supports matviews including duplicate tuples or a
> > DISTINCT clause in the view definition query.  For example, even if a matview
> > is defined with DISTINCT to remove duplication of tuples in a base table, this
> > can perform incremental updates of the matview properly. That is, the contents
> > of the matview don't change when existing tuples are inserted into the base
> > tables, and a tuple in the matview is deleted only when the duplicity of the
> > corresponding tuple in the base table becomes zero.
> >
> > This is due to the "counting algorithm", in which the count of each tuple is
> > stored in matviews as a special column value.
> >
> > ====== Example 2 ======
> > postgres=# SELECT * FROM t1;
> >  id | t
> > ----+---
> >   1 | A
> >   2 | B
> >   3 | C
> >   4 | A
> > (4 rows)
> >
> > postgres=# CREATE INCREMENTAL MATERIALIZED VIEW m1 AS SELECT t FROM t1;
> > SELECT 3
> > postgres=# CREATE INCREMENTAL MATERIALIZED VIEW m2 AS SELECT DISTINCT t FROM t1;
> > SELECT 3
> > postgres=# SELECT * FROM m1; -- with duplicity
> >  t
> > ---
> >  A
> >  A
> >  C
> >  B
> > (4 rows)
> >
> > postgres=# SELECT * FROM m2;
> >  t
> > ---
> >  A
> >  B
> >  C
> > (3 rows)
> >
> > postgres=# INSERT INTO t1 VALUES (5, 'B');
> > INSERT 0 1
> > postgres=# DELETE FROM t1 WHERE id IN (1,3); -- delete (1,A),(3,C)
> > DELETE 2
> > postgres=# SELECT * FROM m1; -- one A left and one more B
> >  t
> > ---
> >  B
> >  B
> >  A
> > (3 rows)
> >
> > postgres=# SELECT * FROM m2; -- only C is removed
> >  t
> > ---
> >  B
> >  A
> > (2 rows)
> > =============================
> >
> > * How it works
> >
> > 1. Creating matview
> >
> > When a matview is created, AFTER triggers are internally created
> > on its base tables. When a base table is modified (INSERT, DELETE,
> > UPDATE), the matview is updated incrementally in the trigger function.
> >
> > When populating the matview, GROUP BY and count(*) are added to the
> > view definition query before it is executed, in order to count the duplicity
> > of tuples in the matview. The result of count is stored in the matview
> > as a special column named "__ivm_count__".
> >
> > 2. Maintenance of matview
> >
> > When base tables are modified, the change set of the table can be
> > referred to as Ephemeral Named Relations (ENRs) thanks to transition tables
> > (a trigger feature implemented since PG10). We can calculate the diff
> > set of the matview by replacing the base table in the view definition
> > query with the ENR (at least if it is a Selection-Projection-Join view).
> > As at view definition time, GROUP BY and count(*) are added in order
> > to count the duplicity of tuples in the diff set. As a result, two diff
> > sets (to be deleted from and to be inserted into the matview) are
> > calculated, and the results are stored into temporary tables respectively.
> >
> > The matview is updated by merging these change sets. Instead of simply
> > executing DELETE or INSERT, the value of the __ivm_count__ column in the
> > matview is decreased or increased. When the value becomes zero, the
> > corresponding tuple is deleted from the matview.
> >
> > 3. Access to matview
> >
> > When SELECT is issued for IVM matviews defined with DISTINCT, all columns
> > except __ivm_count__ of each tuple in the matview are returned.  This is
> > correct because the duplicity of tuples is eliminated by GROUP BY.
> >
> > When DISTINCT is not used, SELECT on an IVM matview returns each tuple
> > __ivm_count__ times. Currently, this is implemented by rewriting the SELECT
> > query to replace the matview RTE with a subquery which joins the matview
> > and the generate_series function as below.
> >
> >  SELECT mv.* FROM mv, generate_series(1, mv.__ivm_count__);
> >
> > The __ivm_count__ column is invisible to users when "SELECT * FROM ..." is
> > issued, but users can see the value by specifying it in the target list explicitly.
> >
> > ====== Example 3 ======
> > postgres=# \d+ m1
> >                                   Materialized view "public.m1"
> >     Column     |  Type  | Collation | Nullable | Default | Storage  | Stats target | Description
> > ---------------+--------+-----------+----------+---------+----------+--------------+-------------
> >  t             | text   |           |          |         | extended |              |
> >  __ivm_count__ | bigint |           |          |         | plain    |              |
> > View definition:
> >  SELECT t1.t
> >    FROM t1;
> > Access method: heap
> >
> > postgres=# \d+ m2
> >                                   Materialized view "public.m2"
> >     Column     |  Type  | Collation | Nullable | Default | Storage  | Stats target | Description
> > ---------------+--------+-----------+----------+---------+----------+--------------+-------------
> >  t             | text   |           |          |         | extended |              |
> >  __ivm_count__ | bigint |           |          |         | plain    |              |
> > View definition:
> >  SELECT DISTINCT t1.t
> >    FROM t1;
> > Access method: heap
> >
> > postgres=# SELECT *, __ivm_count__ FROM m1;
> >  t | __ivm_count__
> > ---+---------------
> >  B |             2
> >  B |             2
> >  A |             1
> > (3 rows)
> >
> > postgres=# SELECT *, __ivm_count__ FROM m2;
> >  t | __ivm_count__
> > ---+---------------
> >  B |             2
> >  A |             1
> > (2 rows)
> >
> > postgres=# EXPLAIN SELECT * FROM m1;
> >                                   QUERY PLAN                                  
> > ------------------------------------------------------------------------------
> >  Nested Loop  (cost=0.00..61.03 rows=3000 width=2)
> >    ->  Seq Scan on m1 mv  (cost=0.00..1.03 rows=3 width=10)
> >    ->  Function Scan on generate_series  (cost=0.00..10.00 rows=1000 width=0)
> > (3 rows)
> > =============================
> >
> > * Simple Performance Evaluation
> >
> > I confirmed that "incremental" update of matviews is more effective
> > than the standard REFRESH by using a simple example.  I used the tables
> > of pgbench (SF=100) here.
> >
> > Create two matviews, that is, without and with IVM.
> >
> > test=# CREATE MATERIALIZED VIEW bench1 AS
> >         SELECT aid, bid, abalance, bbalance
> >         FROM pgbench_accounts JOIN pgbench_branches USING (bid)
> >         WHERE abalance > 0 OR bbalance > 0;
> > SELECT 5001054
> > test=# CREATE INCREMENTAL MATERIALIZED VIEW bench2 AS
> >         SELECT aid, bid, abalance, bbalance
> >         FROM pgbench_accounts JOIN pgbench_branches USING (bid)
> >         WHERE abalance > 0 OR bbalance > 0;
> > SELECT 5001054
> >
> > The standard REFRESH of bench1 took more than 10 seconds.
> >
> > test=# \timing
> > Timing is on.
> > test=# REFRESH MATERIALIZED VIEW bench1 ;
> > REFRESH MATERIALIZED VIEW
> > Time: 11210.563 ms (00:11.211)
> >
> > Create an index on the IVM matview (bench2).
> >
> > test=# CREATE INDEX on bench2(aid,bid);
> > CREATE INDEX
> >
> > Updating a tuple in pgbench_accounts took 18ms. After this, bench2
> > was updated automatically and correctly.
> >
> > test=# SELECT * FROM bench2 WHERE aid = 1;
> >  aid | bid | abalance | bbalance
> > -----+-----+----------+----------
> >    1 |   1 |       10 |       10
> > (1 row)
> >
> > Time: 2.498 ms
> > test=# UPDATE pgbench_accounts SET abalance = 1000 WHERE aid = 1;
> > UPDATE 1
> > Time: 18.634 ms
> > test=# SELECT * FROM bench2 WHERE aid = 1;
> >  aid | bid | abalance | bbalance
> > -----+-----+----------+----------
> >    1 |   1 |     1000 |       10
> > (1 row)
> >
> > However, without the index on bench2, it took about 4 seconds, so
> > appropriate indexes are needed on IVM matviews.
> >
> > test=# DROP INDEX bench2_aid_bid_idx ;
> > DROP INDEX
> > Time: 10.613 ms
> > test=# UPDATE pgbench_accounts SET abalance = 2000 WHERE aid = 1;
> > UPDATE 1
> > Time: 3931.274 ms (00:03.931)
> >
> > * Restrictions on view definition
> >
> > This patch is still in Work-in-Progress and there are many restrictions
> > on the view definition query of matviews.
> >
> > The current implementation supports views including selection, projection,
> > and inner joins, with or without DISTINCT. Aggregates and GROUP BY are not
> > supported yet, but I plan to deal with these by the first release.
> > Self-joins, subqueries, OUTER JOIN, CTEs, and window functions are not
> > considered well, either. I need more investigation into these types of
> > views, although I found some papers explaining how to handle subqueries
> > and outer joins.
> >
> > These unsupported views should be checked when a matview is created, but
> > this is not implemented yet.  Hoshiai-san is working on this.
> >
> > * Timing of view maintenance
> >
> > This patch implements a kind of Immediate Maintenance, that is, a matview
> > is updated immediately when a base table is modified.  On the other hand, in
> > "Deferred Maintenance", matviews are updated after the transaction, for
> > example, by a user command like REFRESH.
> >
> > To implement "deferred" maintenance, we need a mechanism to maintain
> > logs recording changes of base tables and an algorithm to compute the
> > delta to be applied to matviews.
> >  
> > In addition, there could be another implementation of Immediate Maintenance
> > in which the matview is updated at the end of a transaction that modified a
> > base table, rather than in an AFTER trigger. Oracle supports this type of
> > IVM. To implement this, we will need a mechanism to maintain change logs on
> > base tables, as for Deferred Maintenance.
> >
> > * Counting algorithm implementation
> >
> > There will also be discussions on the counting-algorithm implementation.
> > Firstly, the current patch treats "__ivm_count__" as a special column name
> > in a somewhat ad hoc way. This is used when maintaining and accessing matviews,
> > and when "SELECT * FROM ..." is issued, the __ivm_count__ column is invisible
> > to users.  Maybe this name has to be prohibited in user tables. Is it
> > acceptable to use such columns for IVM, and if not, is there a better way?
> >
> > Secondly, a matview with duplicate tuples is replaced with a subquery which
> > uses the generate_series function. It does not have to be generate_series,
> > and we could make a new set-returning function for this. Anyway, this
> > internal behaviour is visible in EXPLAIN results as shown in Example 3.
> > Also, there is a performance impact because the estimated row number is
> > wrong, and what is worse, the cost of the join is not small when the size
> > of the matview is large. Therefore, we might have to add a new plan node
> > for selecting from matviews rather than using such a special set-returning
> > function.
> >
> >
> > Regards,
> > --
> > Yugo Nagata <[hidden email]>
>
>
> --
> Yugo Nagata <[hidden email]>

--
Yugo Nagata <[hidden email]>

WIP_immediate_IVM_v3.patch (69K) Download Attachment

Re: Implementing Incremental View Maintenance

Yugo Nagata
In reply to this post by Greg Stark
Hi Greg,

On Wed, 3 Apr 2019 17:41:36 -0400
Greg Stark <[hidden email]> wrote:

> On Sun, 31 Mar 2019 at 23:22, Yugo Nagata <[hidden email]> wrote:
> >
> > Firstly, this will handle simple definition views which includes only
> > selection, projection, and join.  Standard aggregations (count, sum, avg,
> > min, max) are not planned to be implemented in the first patch, but these
> > are commonly used in materialized views, so I'll implement them later on.
>
> It's fine to not have all the features from day 1 of course. But I
> just picked up this comment and the followup talking about splitting
> AVG into SUM and COUNT and I had a comment. When you do look at
> tackling aggregates I don't think you should restrict yourself to
> these specific standard aggregations. We have all the necessary
> abstractions to handle all aggregations that are feasible, see
> https://www.postgresql.org/docs/devel/xaggr.html#XAGGR-MOVING-AGGREGATES
>
> What you need to do -- I think -- is store the "moving aggregate
> state" before the final function. Then whenever a row is inserted or
> deleted or updated (or whenever another column is updated which causes
> the value to row to enter or leave the aggregation) apply either
> aggtransfn or aggminvtransfn to the state. I'm not sure if you want to
> apply the final function on every update or only lazily either may be
> better in some usage.

Thank you for your suggestion! I submitted the latest patch just now supporting
some aggregate functions, but it supports only sum and count and lacks any
kind of generalization. However, I would like to refine this to support more
general aggregate functions, and I think your suggestions will be helpful for
doing so. Thank you!
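As an aside, the moving-aggregate idea Greg describes could look roughly like
this for avg (a Python sketch; transfn, inv_transfn, and finalfn are
illustrative stand-ins for aggtransfn, aggminvtransfn, and the final function,
not the actual catalog machinery):

```python
# Sketch of an invertible moving aggregate for avg: the matview would store
# the transition state (sum, count); inserts apply the forward transition
# function, deletes the inverse transition function, and the final function
# is applied only when the value is read.

def transfn(state, v):             # like aggtransfn
    s, n = state
    return (s + v, n + 1)

def inv_transfn(state, v):         # like aggminvtransfn
    s, n = state
    return (s - v, n - 1)

def finalfn(state):                # like the final function; NULL over no rows
    s, n = state
    return None if n == 0 else s / n

state = (0, 0)
for v in (10, 20, 30):
    state = transfn(state, v)
assert finalfn(state) == 20        # avg of 10, 20, 30
state = inv_transfn(state, 30)     # a row leaves the aggregation
assert finalfn(state) == 15        # avg of 10, 20
```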

Best regards,
Yugo Nagata


--
Yugo Nagata <[hidden email]>



Re: Implementing Incremental View Maintenance

Thomas Munro-5
In reply to this post by Yugo Nagata
On Fri, Jun 28, 2019 at 10:56 PM Yugo Nagata <[hidden email]> wrote:
> Attached is a WIP patch of IVM which supports some aggregate functions.

Hi Nagata-san and Hoshiai-san,

Thank you for working on this.  I enjoyed your talk at PGCon.  I've
added Kevin Grittner just in case he missed this thread; he has talked
often about implementing the counting algorithm, and he wrote the
"trigger transition tables" feature to support exactly this.  While
integrating trigger transition tables with the new partition features,
we had to make a number of decisions about how that should work, and
we tried to come up with answers that would work for IVM, and I hope
we made the right choices!

I am quite interested to learn how IVM interacts with SERIALIZABLE.

A couple of superficial review comments:

+            const char *aggname = get_func_name(aggref->aggfnoid);
...
+            else if (!strcmp(aggname, "sum"))

I guess you need a more robust way to detect the supported aggregates
than their name, or I guess some way for aggregates themselves to
specify that they support this and somehow supply the extra logic.
Perhaps I just said what Greg Stark already said, except not as well.

+                            elog(ERROR, "Aggrege function %s is not
supported", aggname);

s/Aggrege/aggregate/

Of course it is not helpful to comment on typos at this early stage,
it's just that this one appears many times in the test output :-)

+static bool
+isIvmColumn(const char *s)
+{
+    char pre[7];
+
+     strlcpy(pre, s, sizeof(pre));
+    return (strcmp(pre, "__ivm_") == 0);
+}

What about strncmp(s, "__ivm_", 6) == 0?  As for the question of how
to reserve a namespace for system columns that won't clash with user
columns, according to our manual the SQL standard doesn't allow $ in
identifier names, and according to my copy SQL92 "intermediate SQL"
doesn't allow identifiers that end in an underscore.  I don't know
what the best answer is, but we should probably decide on something
based on the standard.

As for how to make internal columns invisible to SELECT *, previously
there have been discussions about doing that using a new flag in
pg_attribute:

https://www.postgresql.org/message-id/flat/CAEepm%3D3ZHh%3Dp0nEEnVbs1Dig_UShPzHUcMNAqvDQUgYgcDo-pA%40mail.gmail.com

+                            "WITH t AS ("
+                            "  SELECT diff.__ivm_count__,
(diff.__ivm_count__ = mv.__ivm_count__) AS for_dlt, mv.ctid"
+                            ", %s"
+                            "  FROM %s AS mv, %s AS diff WHERE (%s) = (%s)"
+                            "), updt AS ("
+                            "  UPDATE %s AS mv SET __ivm_count__ =
mv.__ivm_count__ - t.__ivm_count__"
+                            ", %s "
+                            "  FROM t WHERE mv.ctid = t.ctid AND NOT for_dlt"
+                            ") DELETE FROM %s AS mv USING t WHERE
mv.ctid = t.ctid AND for_dlt;",

I fully understand that this is POC code, but I am curious about one
thing.  These queries that are executed by apply_delta() would need to
be converted to C, or at least use reusable plans, right?  Hmm,
creating and dropping temporary tables every time is a clue that the
ultimate form of this should be tuplestores and C code, I think,
right?

> Moreover, some regression test are added for aggregate functions support.
> This is Hoshiai-san's work.

Great.  Next time you post a WIP patch, could you please fix this
small compiler warning?

describe.c: In function ‘describeOneTableDetails’:
describe.c:3270:55: error: ‘*((void *)&tableinfo+48)’ may be used
uninitialized in this function [-Werror=maybe-uninitialized]
if (verbose && tableinfo.relkind == RELKIND_MATVIEW && tableinfo.isivm)
^
describe.c:1495:4: note: ‘*((void *)&tableinfo+48)’ was declared here
} tableinfo;
^

Then our unofficial automatic CI system[1] will run these tests every
day, which sometimes finds problems.

[1] cfbot.cputube.org

--
Thomas Munro
https://enterprisedb.com



Re: Implementing Incremental View Maintenance

Tatsuo Ishii-3
> As for how to make internal columns invisible to SELECT *, previously
> there have been discussions about doing that using a new flag in
> pg_attribute:
>
> https://www.postgresql.org/message-id/flat/CAEepm%3D3ZHh%3Dp0nEEnVbs1Dig_UShPzHUcMNAqvDQUgYgcDo-pA%40mail.gmail.com

Now that I have realized that there are several use cases for invisible
columns, I think this is the way we should go.

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp

