Surjective functional indexes

classic Classic list List threaded Threaded
99 messages Options
12345
Reply | Threaded
Open this post in threaded view
|

Re: Surjective functional indexes

konstantin knizhnik


On 13.09.2017 13:14, Christoph Berg wrote:

> Re: Konstantin Knizhnik 2017-09-13 <[hidden email]>
>>
>> On 13.09.2017 10:51, Christoph Berg wrote:
>>> Re: Konstantin Knizhnik 2017-09-01 <[hidden email]>
>>>> +       Functional index is based on on projection function: function which extract subset of its argument.
>>>> +       In mathematic such functions are called non-injective. For injective function if any attribute used in the indexed
>>>> +       expression is changed, then value of index expression is also changed.
>>> This is Just Wrong. I still think what you are doing here doesn't have
>>> anything to do with the function being injective or not.
>> Sorry, can you please explain what is wrong?
> I don't get why you are reasoning about "projection" ->
> "non-injective" -> "injective". Can't you try to explain what this
> functionality is about without abusing math terms that just mean
> something else in the rest of the world?

I tried to explain it in my previous e-mail. In most cases (it is just
my filling, may be it is wrong), functional indexes are built for some
complex types, like JSON, arrays, structs,...
and index expression extracts some components of this compound value. It
means that even if underlying column is changes, there is good chance
that value of index function is not changed. So there is no need to
update index and we can use HOT. It allows to several time increase
performance.

The only reason of all this discussion about terms is that I need to
choose name for correspondent index option.
Simon think that we do not need this option at all. In this case we
should not worry about right term.
 From my point of view, "projection" is quite clear notion and not only
for mathematics. It is also widely used in IT and especially in DBMSes.

--

Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|

Re: Surjective functional indexes

Simon Riggs
On 13 September 2017 at 11:30, Konstantin Knizhnik
<[hidden email]> wrote:

> The only reason of all this discussion about terms is that I need to choose
> name for correspondent index option.
> Simon think that we do not need this option at all. In this case we should
> not worry about right term.
> From my point of view, "projection" is quite clear notion and not only for
> mathematics. It is also widely used in IT and especially in DBMSes.

If we do have an option it won't be using fancy mathematical
terminology at all, it would be described in terms of its function,
e.g. recheck_on_update

Yes, I'd rather not have an option at all, just some simple code with
useful effect, like we have in many other places.

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|

Re: Surjective functional indexes

konstantin knizhnik


On 13.09.2017 14:00, Simon Riggs wrote:

> On 13 September 2017 at 11:30, Konstantin Knizhnik
> <[hidden email]> wrote:
>
>> The only reason of all this discussion about terms is that I need to choose
>> name for correspondent index option.
>> Simon think that we do not need this option at all. In this case we should
>> not worry about right term.
>>  From my point of view, "projection" is quite clear notion and not only for
>> mathematics. It is also widely used in IT and especially in DBMSes.
> If we do have an option it won't be using fancy mathematical
> terminology at all, it would be described in terms of its function,
> e.g. recheck_on_update
>
> Yes, I'd rather not have an option at all, just some simple code with
> useful effect, like we have in many other places.
>
Yehhh,
After more thinking I found out that my idea to use table/index
statistic (particularity number of distinct values) to determine
projection functions  was wrong.
Consider case column bookinfo of jsonb type and index expression
(bookinfo->'ISBN').
Both can be considered as unique. But it is an obvious example of
projection function, which value is  not changed if we update other
information related with this book.

So this approach doesn't work. Looks like the only thing we can do to
autotune is to collect own statistic: how frequently changing
attribute(s) doesn't affect result of the function.
By default we can considered function as projection and perform
comparison of old/new function results.
If after some number of comparisons  fraction of hits (when value of
function is not changed) is smaller than some threshold (0.5?, 0.9?,...)
then we can mark index as non-projective
and eliminate this checks in future. But it will require extending index
statistic. Do we really need/want it?

Despite to the possibility to implement autotune, I still think that we
should have manual switch, doesn't mater how it is named.

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|

Re: Surjective functional indexes

konstantin knizhnik
In reply to this post by Simon Riggs


On 13.09.2017 14:00, Simon Riggs wrote:

> On 13 September 2017 at 11:30, Konstantin Knizhnik
> <[hidden email]> wrote:
>
>> The only reason of all this discussion about terms is that I need to choose
>> name for correspondent index option.
>> Simon think that we do not need this option at all. In this case we should
>> not worry about right term.
>>  From my point of view, "projection" is quite clear notion and not only for
>> mathematics. It is also widely used in IT and especially in DBMSes.
> If we do have an option it won't be using fancy mathematical
> terminology at all, it would be described in terms of its function,
> e.g. recheck_on_update
>
> Yes, I'd rather not have an option at all, just some simple code with
> useful effect, like we have in many other places.
>
Attached please find new version of projection functional index
optimization patch.
I have implemented very simple autotune strategy: now I use table
statistic to compare total number of updates with number of hot updates.
If fraction of hot updates is relatively small, then there is no sense
to spend time performing extra evaluation of index expression and
comparing its old and new values.
Right now the formula is the following:

#define MIN_UPDATES_THRESHOLD 10
#define HOT_RATIO_THRESHOLD   2

         if (stat->tuples_updated > MIN_UPDATES_THRESHOLD
             && stat->tuples_updated >
stat->tuples_hot_updated*HOT_RATIO_THRESHOLD)
         {
             /* If percent of hot updates is small, then disable
projection index function
              * optimization to eliminate overhead of extra index
expression evaluations.
              */
             ii->ii_Projection = false;
         }

This threshold values are pulled out of a hat: I am not sure if this
heuristic is right.
I will be please to get feedback if such approach to autotune is promising.

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

projection-autotune.patch (19K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Surjective functional indexes

Simon Riggs
On 14 September 2017 at 10:42, Konstantin Knizhnik
<[hidden email]> wrote:

>
>
> On 13.09.2017 14:00, Simon Riggs wrote:
>>
>> On 13 September 2017 at 11:30, Konstantin Knizhnik
>> <[hidden email]> wrote:
>>
>>> The only reason of all this discussion about terms is that I need to
>>> choose
>>> name for correspondent index option.
>>> Simon think that we do not need this option at all. In this case we
>>> should
>>> not worry about right term.
>>>  From my point of view, "projection" is quite clear notion and not only
>>> for
>>> mathematics. It is also widely used in IT and especially in DBMSes.
>>
>> If we do have an option it won't be using fancy mathematical
>> terminology at all, it would be described in terms of its function,
>> e.g. recheck_on_update
>>
>> Yes, I'd rather not have an option at all, just some simple code with
>> useful effect, like we have in many other places.
>>
> Attached please find new version of projection functional index optimization
> patch.
> I have implemented very simple autotune strategy: now I use table statistic
> to compare total number of updates with number of hot updates.
> If fraction of hot updates is relatively small, then there is no sense to
> spend time performing extra evaluation of index expression and comparing its
> old and new values.
> Right now the formula is the following:
>
> #define MIN_UPDATES_THRESHOLD 10
> #define HOT_RATIO_THRESHOLD   2
>
>         if (stat->tuples_updated > MIN_UPDATES_THRESHOLD
>             && stat->tuples_updated >
> stat->tuples_hot_updated*HOT_RATIO_THRESHOLD)
>         {
>             /* If percent of hot updates is small, then disable projection
> index function
>              * optimization to eliminate overhead of extra index expression
> evaluations.
>              */
>             ii->ii_Projection = false;
>         }
>
> This threshold values are pulled out of a hat: I am not sure if this
> heuristic is right.
> I will be please to get feedback if such approach to autotune is promising.

Hmm, not really, but thanks for trying.

This works by looking at overall stats, and only looks at the overall
HOT %, so its too heavyweight and coarse.

I suggested storing stat info on the relcache and was expecting you
would look at how often the expression evaluates to new == old. If we
evaluate new against old many times, then if the success rate is low
we should stop attempting the comparison. (<10%?)

Another idea:
If we don't make a check when we should have done then we will get a
non-HOT update, so we waste time extra time difference between a HOT
and non-HOT update. If we check and fail we waste time take to perform
check. So the question is how expensive the check is against how
expensive a non-HOT update is. Could we simply say we don't bother to
check functions that have a cost higher than 10000? So if the user
doesn't want to perform the check they can just increase the cost of
the function above the check threshold?

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|

Re: Surjective functional indexes

konstantin knizhnik


On 14.09.2017 13:19, Simon Riggs wrote:

> On 14 September 2017 at 10:42, Konstantin Knizhnik
> <[hidden email]> wrote:
>>
>> On 13.09.2017 14:00, Simon Riggs wrote:
>>> On 13 September 2017 at 11:30, Konstantin Knizhnik
>>> <[hidden email]> wrote:
>>>
>>>> The only reason of all this discussion about terms is that I need to
>>>> choose
>>>> name for correspondent index option.
>>>> Simon think that we do not need this option at all. In this case we
>>>> should
>>>> not worry about right term.
>>>>   From my point of view, "projection" is quite clear notion and not only
>>>> for
>>>> mathematics. It is also widely used in IT and especially in DBMSes.
>>> If we do have an option it won't be using fancy mathematical
>>> terminology at all, it would be described in terms of its function,
>>> e.g. recheck_on_update
>>>
>>> Yes, I'd rather not have an option at all, just some simple code with
>>> useful effect, like we have in many other places.
>>>
>> Attached please find new version of projection functional index optimization
>> patch.
>> I have implemented very simple autotune strategy: now I use table statistic
>> to compare total number of updates with number of hot updates.
>> If fraction of hot updates is relatively small, then there is no sense to
>> spend time performing extra evaluation of index expression and comparing its
>> old and new values.
>> Right now the formula is the following:
>>
>> #define MIN_UPDATES_THRESHOLD 10
>> #define HOT_RATIO_THRESHOLD   2
>>
>>          if (stat->tuples_updated > MIN_UPDATES_THRESHOLD
>>              && stat->tuples_updated >
>> stat->tuples_hot_updated*HOT_RATIO_THRESHOLD)
>>          {
>>              /* If percent of hot updates is small, then disable projection
>> index function
>>               * optimization to eliminate overhead of extra index expression
>> evaluations.
>>               */
>>              ii->ii_Projection = false;
>>          }
>>
>> This threshold values are pulled out of a hat: I am not sure if this
>> heuristic is right.
>> I will be please to get feedback if such approach to autotune is promising.
> Hmm, not really, but thanks for trying.
>
> This works by looking at overall stats, and only looks at the overall
> HOT %, so its too heavyweight and coarse.
>
> I suggested storing stat info on the relcache and was expecting you
> would look at how often the expression evaluates to new == old. If we
> evaluate new against old many times, then if the success rate is low
> we should stop attempting the comparison. (<10%?)
>
> Another idea:
> If we don't make a check when we should have done then we will get a
> non-HOT update, so we waste time extra time difference between a HOT
> and non-HOT update. If we check and fail we waste time take to perform
> check. So the question is how expensive the check is against how
> expensive a non-HOT update is. Could we simply say we don't bother to
> check functions that have a cost higher than 10000? So if the user
> doesn't want to perform the check they can just increase the cost of
> the function above the check threshold?
>
Attached pleased find one more patch which calculates hot update check
hit rate more precisely: I have to extended PgStat_StatTabEntry with two
new fields:
hot_update_hits and hot_update_misses.

Concerning your idea to check cost of index function: it certainly makes
sense.
The only problems: I do not understand now how to calculate this cost.
It can be easily calculated by optimizer when it is building query
execution plan.
But inside BuildIndexInfo I have just reference to Relation and have no
idea how
I can propagate here information about index expression cost from optimizer.


--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

projection-autotune2.patch (22K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Surjective functional indexes

Simon Riggs
On 14 September 2017 at 16:37, Konstantin Knizhnik
<[hidden email]> wrote:
>
>
> On 14.09.2017 13:19, Simon Riggs wrote:

>> This works by looking at overall stats, and only looks at the overall
>> HOT %, so its too heavyweight and coarse.
>>
>> I suggested storing stat info on the relcache and was expecting you
>> would look at how often the expression evaluates to new == old. If we
>> evaluate new against old many times, then if the success rate is low
>> we should stop attempting the comparison. (<10%?)
>>
>> Another idea:
>> If we don't make a check when we should have done then we will get a
>> non-HOT update, so we waste time extra time difference between a HOT
>> and non-HOT update. If we check and fail we waste time take to perform
>> check. So the question is how expensive the check is against how
>> expensive a non-HOT update is. Could we simply say we don't bother to
>> check functions that have a cost higher than 10000? So if the user
>> doesn't want to perform the check they can just increase the cost of
>> the function above the check threshold?
>>
> Attached pleased find one more patch which calculates hot update check hit
> rate more precisely: I have to extended PgStat_StatTabEntry with two new
> fields:
> hot_update_hits and hot_update_misses.

It's not going to work, as already mentioned above. Those stats are at
table level and very little to do with this particular index.

But you've not commented on the design I mention that can work: index relcache.

> Concerning your idea to check cost of index function: it certainly makes
> sense.
> The only problems: I do not understand now how to calculate this cost.
> It can be easily calculated by optimizer when it is building query execution
> plan.
> But inside BuildIndexInfo I have just reference to Relation and have no idea
> how
> I can propagate here information about index expression cost from optimizer.

We could copy at create index, if we took that route. Or we can look
up the cost for the index expression and cache it.


Anyway, this is just jumping around because we still have a parameter
and the idea was to remove the parameter entirely by autotuning, which
I think is both useful and possible, just as HOT itself is autotuned.

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|

Re: Surjective functional indexes

konstantin knizhnik


On 14.09.2017 18:53, Simon Riggs wrote:

>
>>> This works by looking at overall stats, and only looks at the overall
>>> HOT %, so its too heavyweight and coarse.
>>>
>>> I suggested storing stat info on the relcache and was expecting you
>>> would look at how often the expression evaluates to new == old. If we
>>> evaluate new against old many times, then if the success rate is low
>>> we should stop attempting the comparison. (<10%?)
>>>
>>> Another idea:
>>> If we don't make a check when we should have done then we will get a
>>> non-HOT update, so we waste time extra time difference between a HOT
>>> and non-HOT update. If we check and fail we waste time take to perform
>>> check. So the question is how expensive the check is against how
>>> expensive a non-HOT update is. Could we simply say we don't bother to
>>> check functions that have a cost higher than 10000? So if the user
>>> doesn't want to perform the check they can just increase the cost of
>>> the function above the check threshold?
>>>
>> Attached pleased find one more patch which calculates hot update check hit
>> rate more precisely: I have to extended PgStat_StatTabEntry with two new
>> fields:
>> hot_update_hits and hot_update_misses.
> It's not going to work, as already mentioned above. Those stats are at
> table level and very little to do with this particular index.
>
> But you've not commented on the design I mention that can work: index relcache.
Sorry, I do not completely agree with you.
Yes, certainly whether functional index is projective or not is property
of the index, not of the table.
But the decision whether hot update is applicable or not is made for the
whole table - for all indexes.
If a value of just one indexed expressions is changed then we can not
use hot update and have to update all indexes.

Assume that we have table with "bookinfo" field of type JSONB.
And we create several functional indexes on this column:
(bookinfo->'isbn'), (bookinfo->'title'), (bookinfo->'author'),
(bookinfo->'rating').
Probability that indexed expression is changed is case of updating
"bookinfo" field my be different for all this three indexes.
But there is completely no sense to check if 'isbn' is changed or not,
if we already detect that most updates cause change of 'rating'
attribute and
so comparing old and new values of (bookinfo->'rating') is just waste of
time. In this case we should not also compare (bookinfo->'isbn') and
other indexed expressions because for already rejected possibility of
hot update.

So despite to the fact that this information depends on particular
index, it affects behavior of the whole table and it is reasonable (and
simpler) to collect it in table's statistic.

>> Concerning your idea to check cost of index function: it certainly makes
>> sense.
>> The only problems: I do not understand now how to calculate this cost.
>> It can be easily calculated by optimizer when it is building query execution
>> plan.
>> But inside BuildIndexInfo I have just reference to Relation and have no idea
>> how
>> I can propagate here information about index expression cost from optimizer.
> We could copy at create index, if we took that route. Or we can look
> up the cost for the index expression and cache it.
>
>
> Anyway, this is just jumping around because we still have a parameter
> and the idea was to remove the parameter entirely by autotuning, which
> I think is both useful and possible, just as HOT itself is autotuned.
>

Hot update in almost all cases is preferable to normal update, causing
update of indexes.
There are can be some scenarios when hot updates reduce speed of some
queries,
but it is very difficult to predict such cases user level.

But usually nature of index is well known by DBA or programmer. In
almost all cases it is clear for person creating functional index
whether it will perform projection or not
and whether comparing old/new expression value makes sense or is just
waste of time. We can guess it from autotune, but such decision may be
wrong (just because of application
business logic). Postgres indexes already have a lot of options. And I
think that "projection" option (or whatever we name it) is also needed.


--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|

Re: Surjective functional indexes

konstantin knizhnik
In reply to this post by Simon Riggs


On 14.09.2017 18:53, Simon Riggs wrote:

> It's not going to work, as already mentioned above. Those stats are at
> table level and very little to do with this particular index.
>
> But you've not commented on the design I mention that can work: index relcache.
>
>> Concerning your idea to check cost of index function: it certainly makes
>> sense.
>> The only problems: I do not understand now how to calculate this cost.
>> It can be easily calculated by optimizer when it is building query execution
>> plan.
>> But inside BuildIndexInfo I have just reference to Relation and have no idea
>> how
>> I can propagate here information about index expression cost from optimizer.
> We could copy at create index, if we took that route. Or we can look
> up the cost for the index expression and cache it.
>
>
> Anyway, this is just jumping around because we still have a parameter
> and the idea was to remove the parameter entirely by autotuning, which
> I think is both useful and possible, just as HOT itself is autotuned.
>
Attached please find yet another version of the patch.
I have to significantly rewrite it,  because my first attempts to add
auto-tune were not correct.
New patch does it in correct way (I hope) and more efficiently.
I moved auto-tune code from BuildIndexInfo, which is called many times,
including heap_update (so at least once per update tuple).
to RelationGetIndexAttrBitmap which is called only when cached
RelationData is filled by backend.
The problem with my original implementation of auto-tune was that
switching off "projection" property of index, it doesn't update
attribute masks,
calculated by RelationGetIndexAttrBitmap.

I have also added check for maximal cost of indexed expression.
So now decision whether to apply projection index optimization (compare
old and new values of indexed expression)
is based  on three sources:
  1. Calculated hot update statistic: we compare number of hot updates
which are performed
     because projection index check shows that index expression is not
changed with total
     number of updates affecting attributes used in projection indexes.
If it is smaller than
     some threshold (10%), then index is considered as non-projective.
  2. Calculated cost of index expression: if it is higher than some
threshold (1000) then
     extra comparison of index expression values is expected to be too
expensive.
  3. "projection" index option explicitly set by user. This setting
overrides 1) and 2)



--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

projection-autotune3.patch (31K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Surjective functional indexes

Simon Riggs
On 15 September 2017 at 16:34, Konstantin Knizhnik
<[hidden email]> wrote:

> Attached please find yet another version of the patch.

Thanks. I'm reviewing it.

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|

Re: Surjective functional indexes

Robert Haas
In reply to this post by Simon Riggs
On Wed, Sep 13, 2017 at 7:00 AM, Simon Riggs <[hidden email]> wrote:
> If we do have an option it won't be using fancy mathematical
> terminology at all, it would be described in terms of its function,
> e.g. recheck_on_update

+1.

> Yes, I'd rather not have an option at all, just some simple code with
> useful effect, like we have in many other places.

I think the question we need to be able to answer is: What is the
probability that an update that would otherwise be non-HOT can be made
into a HOT update by performing a recheck to see whether the value has
changed?  It doesn't seem easy to figure that out from any of the
statistics we have available today or could easily get, because it
depends not only on the behavior of the expression which appears in
the index definition but also on the application behavior.  For
example, consider a JSON blob representing a bank account.
b->'balance' is likely to change most of the time, but
b->'account_holder_name' only rarely.  That's going to be hard for an
automated system to determine.

We should clearly check as many of the other criteria for a HOT update
as possible before performing a recheck of this type, so that it only
gets performed when it might help.  For example, if column a is
indexed and b->'foo' is indexed, there's no point in checking whether
b->'foo' has changed if we know that a has changed.  I don't know
whether it would be feasible to postpone deciding whether to do a
recheck until after we've figured out whether the page seems to
contain enough free space to allow a HOT update.

Turning non-HOT updates into HOT updates is really good, so it seems
likely that the rechecks will often be worthwhile.  If we avoid a HOT
update in 25% of cases, that's probably easily worth the CPU overhead
of a recheck assuming the function isn't something ridiculously
expensive to compute; the extra CPU cost will be repaid by reduced
bloat.  However, if we avoid a HOT update only one time in a million,
it's probably not worth the cost of recomputing the expression the
other 999,999 times.  I wonder where the crossover point is -- it
seems like something that could be figured out by benchmarking.

While I agree that it would be nice to have this be a completely
automatic determination, I am not sure that will be practical.  I
oppose overloading some other marker (like function_cost>10000) for
this; that's too magical.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|

Re: Surjective functional indexes

Oleg Bartunov-2
In reply to this post by konstantin knizhnik
On Thu, May 25, 2017 at 7:30 PM, Konstantin Knizhnik
<[hidden email]> wrote:

> Right now Postgres determines whether update operation touch index or not
> based only on set of the affected columns.
> But in case of functional indexes such policy quite frequently leads to
> unnecessary index updates.
> For example, functional index are widely use for indexing JSON data:
> info->>'name'.
>
> JSON data may contain multiple attributes and only few of them may be
> affected by update.
> Moreover, index is used to build for immutable attributes (like "id",
> "isbn", "name",...).
>
> Functions like (info->>'name') are named "surjective" ni mathematics.
> I have strong feeling that most of functional indexes are based on
> surjective functions.
> For such indexes current Postgresql index update policy is very inefficient.
> It cause disabling of hot updates
> and so leads to significant degrade of performance.
>
> Without this patch Postgres is slower than Mongo on YCSB benchmark with (50%
> update,50 % select)  workload.
> And after applying this patch Postgres beats Mongo at all workloads.

I confirm that the patch helps for workload A of YCSB, but actually
just extends #clients, where postgres outperforms mongodb (see
attached picture).  If we increase #clients > 100 postgres quickly
degrades not only for workload A, but even for workload B (5%
updates), while mongodb and mysql behave much-much better, but this is
another problem, we will discuss in different thread.

>
> My proposal is to check value of function for functional indexes instead of
> just comparing set of effected attributes.
> Obviously, for some complex functions it may  have negative effect on update
> speed.
> This is why I have added "surjective" option to index. By default it is
> switched on for all functional indexes (based on my assumption
> that most functions used in functional indexes are surjective). But it is
> possible to explicitly disable it and make decision weather index
> needs to be updated or not only based on set of effected attributes.
>
>
> --
> Konstantin Knizhnik
> Postgres Professional: http://www.postgrespro.com
> The Russian Postgres Company
>
>
>
> --
> Sent via pgsql-hackers mailing list ([hidden email])
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers
>


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|

Re: Surjective functional indexes

Oleg Bartunov-2
On Thu, Sep 28, 2017 at 11:24 PM, Oleg Bartunov <[hidden email]> wrote:

> On Thu, May 25, 2017 at 7:30 PM, Konstantin Knizhnik
> <[hidden email]> wrote:
>> Right now Postgres determines whether update operation touch index or not
>> based only on set of the affected columns.
>> But in case of functional indexes such policy quite frequently leads to
>> unnecessary index updates.
>> For example, functional index are widely use for indexing JSON data:
>> info->>'name'.
>>
>> JSON data may contain multiple attributes and only few of them may be
>> affected by update.
>> Moreover, index is used to build for immutable attributes (like "id",
>> "isbn", "name",...).
>>
>> Functions like (info->>'name') are named "surjective" ni mathematics.
>> I have strong feeling that most of functional indexes are based on
>> surjective functions.
>> For such indexes current Postgresql index update policy is very inefficient.
>> It cause disabling of hot updates
>> and so leads to significant degrade of performance.
>>
>> Without this patch Postgres is slower than Mongo on YCSB benchmark with (50%
>> update,50 % select)  workload.
>> And after applying this patch Postgres beats Mongo at all workloads.
>
> I confirm that the patch helps for workload A of YCSB, but actually
> just extends #clients, where postgres outperforms mongodb (see
> attached picture).  If we increase #clients > 100 postgres quickly
> degrades not only for workload A, but even for workload B (5%
> updates), while mongodb and mysql behave much-much better, but this is
> another problem, we will discuss in different thread.
sorry, now with picture attached.


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Screen Shot 2017-09-28 at 22.58.42.png (356K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Surjective functional indexes

konstantin knizhnik
In reply to this post by Robert Haas
On 09/28/2017 10:10 PM, Robert Haas wrote:
> On Wed, Sep 13, 2017 at 7:00 AM, Simon Riggs <[hidden email]> wrote:
>> If we do have an option it won't be using fancy mathematical
>> terminology at all, it would be described in terms of its function,
>> e.g. recheck_on_update
> +1.

I have nothing against renaming "projection" option to "recheck_on_update" or whatever else is suggested.
Just let me know the best version. From my point of view "recheck_on_update" is too verbose and still not self-explained (to understand the meaning of this option it is necessary to uunderstand how heap_update works). "projection"/"non-injective"/... are
more declarative notions, explaining the characteristic of the index, while "recheck_on_update"  is more procedural notion, explaining behavior of heap_update.

>
>> Yes, I'd rather not have an option at all, just some simple code with
>> useful effect, like we have in many other places.
> I think the question we need to be able to answer is: What is the
> probability that an update that would otherwise be non-HOT can be made
> into a HOT update by performing a recheck to see whether the value has
> changed?  It doesn't seem easy to figure that out from any of the
> statistics we have available today or could easily get, because it
> depends not only on the behavior of the expression which appears in
> the index definition but also on the application behavior.  For
> example, consider a JSON blob representing a bank account.
> b->'balance' is likely to change most of the time, but
> b->'account_holder_name' only rarely.  That's going to be hard for an
> automated system to determine.
>
> We should clearly check as many of the other criteria for a HOT update
> as possible before performing a recheck of this type, so that it only
> gets performed when it might help.  For example, if column a is
> indexed and b->'foo' is indexed, there's no point in checking whether
> b->'foo' has changed if we know that a has changed.  I don't know
> whether it would be feasible to postpone deciding whether to do a
> recheck until after we've figured out whether the page seems to
> contain enough free space to allow a HOT update.
>
> Turning non-HOT updates into HOT updates is really good, so it seems
> likely that the rechecks will often be worthwhile.  If we avoid a HOT
> update in 25% of cases, that's probably easily worth the CPU overhead
> of a recheck assuming the function isn't something ridiculously
> expensive to compute; the extra CPU cost will be repaid by reduced
> bloat.  However, if we avoid a HOT update only one time in a million,
> it's probably not worth the cost of recomputing the expression the
> other 999,999 times.  I wonder where the crossover point is -- it
> seems like something that could be figured out by benchmarking.
>
> While I agree that it would be nice to have this be a completely
> automatic determination, I am not sure that will be practical.  I
> oppose overloading some other marker (like function_cost>10000) for
> this; that's too magical.
>
I almost agree with you.
Just few remarks: indexes are rarely created for frequently changed attributes, like b->'balance'.
So in case of proper database schema design it is possible to expect that most of updates are hot updates: do not actually affect any index.
But certainly different attributes may have different probability of been updated.
Unfortunately we do not know before check which attribute of JSON field (or any other fields used in indexed expression) is changed.

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|

Re: Surjective functional indexes

Michael Paquier
In reply to this post by Simon Riggs
On Wed, Sep 27, 2017 at 4:07 PM, Simon Riggs <[hidden email]> wrote:
> On 15 September 2017 at 16:34, Konstantin Knizhnik
> <[hidden email]> wrote:
>
>> Attached please find yet another version of the patch.
>
> Thanks. I'm reviewing it.

Two months later, this patch is still waiting for a review (you are
listed as well as a reviewer of this patch). The documentation of the
patch has conflicts, please provide a rebased version. I am moving
this patch to next CF with waiting on author as status.
--
Michael

Reply | Threaded
Open this post in threaded view
|

Re: Surjective functional indexes

konstantin knizhnik
On 30.11.2017 05:02, Michael Paquier wrote:

> On Wed, Sep 27, 2017 at 4:07 PM, Simon Riggs <[hidden email]> wrote:
>> On 15 September 2017 at 16:34, Konstantin Knizhnik
>> <[hidden email]> wrote:
>>
>>> Attached please find yet another version of the patch.
>> Thanks. I'm reviewing it.
> Two months later, this patch is still waiting for a review (you are
> listed as well as a reviewer of this patch). The documentation of the
> patch has conflicts, please provide a rebased version. I am moving
> this patch to next CF with waiting on author as status.
Attached please find new patch with resolved documentation conflict.


--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


projection-autotune4.patch (31K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Surjective functional indexes

Simon Riggs
On 4 December 2017 at 15:35, Konstantin Knizhnik
<[hidden email]> wrote:

> On 30.11.2017 05:02, Michael Paquier wrote:
>>
>> On Wed, Sep 27, 2017 at 4:07 PM, Simon Riggs <[hidden email]>
>> wrote:
>>>
>>> On 15 September 2017 at 16:34, Konstantin Knizhnik
>>> <[hidden email]> wrote:
>>>
>>>> Attached please find yet another version of the patch.
>>>
>>> Thanks. I'm reviewing it.
>>
>> Two months later, this patch is still waiting for a review (you are
>> listed as well as a reviewer of this patch). The documentation of the
>> patch has conflicts, please provide a rebased version. I am moving
>> this patch to next CF with waiting on author as status.
>
> Attached please find new patch with resolved documentation conflict.

I’ve not posted a review because it was my hope that I could fix up
the problems with this clever and useful patch and just commit it. I
spent 2 days trying to do that but some problems remain and I’ve run
out of time.

Patch 3 has got some additional features that on balance I don’t think
we need. Patch 1 didn’t actually work which was confusing when I tried
to go back to that.

Patch 3 Issues

* Auto tuning added 2 extra items to Stats for each relation.  That is
too high a price to pay, so we should remove that. If we can’t do
autotuning without that, please remove it.

* Patch needs a proper test case added. We shouldn’t just have a
DEBUG1 statement in there for testing - very ugly. Please rewrite
tests using functions that show how many changes have occurred during
the current  statement/transaction.

* Parameter coding was specific to that section of code. We need a
capability to parse that parameter from other locations, just as we do
with all other reloptions. It looks like it was coded that way because
of difficulties with code placement. Please solve this with generic
code, not just code that solves only the current issue. I’d personally
be happier without any option at all, but if y’all want it, then
please add the code in the right place.

* The new coding for heap_update() uses the phrase “warm”, which is
already taken by another patch. Please avoid confusing things by
re-using the same term for different things.
The use of these technical terms like projection and surjective
doesn’t seem to add anything useful to the patch

* Rename parameter to recheck_on_update

* Remove random whitespace changes in the patch

So at this stage the concept is good and works, but the patch is only
just at the prototype stage and nowhere near committable. I hope you
can correct these issues and if you do I will be happy to review and
perhaps commit.

Thanks

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Reply | Threaded
Open this post in threaded view
|

Re: Surjective functional indexes

konstantin knizhnik
Thank you for review.

On 13.12.2017 14:29, Simon Riggs wrote:

> On 4 December 2017 at 15:35, Konstantin Knizhnik
> <[hidden email]> wrote:
>> On 30.11.2017 05:02, Michael Paquier wrote:
>>> On Wed, Sep 27, 2017 at 4:07 PM, Simon Riggs <[hidden email]>
>>> wrote:
>>>> On 15 September 2017 at 16:34, Konstantin Knizhnik
>>>> <[hidden email]> wrote:
>>>>
>>>>> Attached please find yet another version of the patch.
>>>> Thanks. I'm reviewing it.
>>> Two months later, this patch is still waiting for a review (you are
>>> listed as well as a reviewer of this patch). The documentation of the
>>> patch has conflicts, please provide a rebased version. I am moving
>>> this patch to next CF with waiting on author as status.
>> Attached please find new patch with resolved documentation conflict.
> I’ve not posted a review because it was my hope that I could fix up
> the problems with this clever and useful patch and just commit it. I
> spent 2 days trying to do that but some problems remain and I’ve run
> out of time.
>
> Patch 3 has got some additional features that on balance I don’t think
> we need. Patch 1 didn’t actually work which was confusing when I tried
> to go back to that.
>
> Patch 3 Issues
>
> * Auto tuning added 2 extra items to Stats for each relation.  That is
> too high a price to pay, so we should remove that. If we can’t do
> autotuning without that, please remove it.

Right now size of this structure is 168 bytes. Adding two extra fields
increase it to 184 bytes - 9%. Is it really so critical?
What is the typical/maximal number of relation in a database?
I can not believe that there can be more than thousand non-temporary
relations in any database.
Certainly application can generate arbitrary number of temporary tables.
But millions of such tables cause catalog bloating and will have awful
impact on performance in any case.
And for any reasonable number of relations (< million), extra memory
used by this statistic is negligible.
Also I think that these two fields can be useful not only for projection
indexes.
It can be also used by DBA and optimizer because them more precisely
explain relation update pattern.

I really love you idea with autotune and it will be pity to reject it
just because of extra 16 bytes per relation.

> * Patch needs a proper test case added. We shouldn’t just have a
> DEBUG1 statement in there for testing - very ugly. Please rewrite
> tests using functions that show how many changes have occurred during
> the current  statement/transaction.
>
> * Parameter coding was specific to that section of code. We need a
> capability to parse that parameter from other locations, just as we do
> with all other reloptions. It looks like it was coded that way because
> of difficulties with code placement. Please solve this with generic
> code, not just code that solves only the current issue. I’d personally
> be happier without any option at all, but if y’all want it, then
> please add the code in the right place.
Sorry, I do not completely understand what you mean by "parameter coding".
Do you mean IsProjectionFunctionalIndex function and filling relation
options in it?
It is the only reported issue which I do not understand how to fix.

> * The new coding for heap_update() uses the phrase “warm”, which is
> already taken by another patch. Please avoid confusing things by
> re-using the same term for different things.
> The use of these technical terms like projection and surjective
> doesn’t seem to add anything useful to the patch
>
> * Rename parameter to recheck_on_update
>
> * Remove random whitespace changes in the patch
>
> So at this stage the concept is good and works, but the patch is only
> just at the prototype stage and nowhere near committable. I hope you
> can correct these issues and if you do I will be happy to review and
> perhaps commit.

Thank you once again for your efforts in improving this patch.
I will fix all reported issues.

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Reply | Threaded
Open this post in threaded view
|

Re: Surjective functional indexes

konstantin knizhnik
In reply to this post by Simon Riggs


On 13.12.2017 14:29, Simon Riggs wrote:

> On 4 December 2017 at 15:35, Konstantin Knizhnik
> <[hidden email]> wrote:
>> On 30.11.2017 05:02, Michael Paquier wrote:
>>> On Wed, Sep 27, 2017 at 4:07 PM, Simon Riggs <[hidden email]>
>>> wrote:
>>>> On 15 September 2017 at 16:34, Konstantin Knizhnik
>>>> <[hidden email]> wrote:
>>>>
>>>>> Attached please find yet another version of the patch.
>>>> Thanks. I'm reviewing it.
>>> Two months later, this patch is still waiting for a review (you are
>>> listed as well as a reviewer of this patch). The documentation of the
>>> patch has conflicts, please provide a rebased version. I am moving
>>> this patch to next CF with waiting on author as status.
>> Attached please find new patch with resolved documentation conflict.
> I’ve not posted a review because it was my hope that I could fix up
> the problems with this clever and useful patch and just commit it. I
> spent 2 days trying to do that but some problems remain and I’ve run
> out of time.
>
> Patch 3 has got some additional features that on balance I don’t think
> we need. Patch 1 didn’t actually work which was confusing when I tried
> to go back to that.
>
> Patch 3 Issues
>
> * Auto tuning added 2 extra items to Stats for each relation.  That is
> too high a price to pay, so we should remove that. If we can’t do
> autotuning without that, please remove it.
>
> * Patch needs a proper test case added. We shouldn’t just have a
> DEBUG1 statement in there for testing - very ugly. Please rewrite
> tests using functions that show how many changes have occurred during
> the current  statement/transaction.
>
> * Parameter coding was specific to that section of code. We need a
> capability to parse that parameter from other locations, just as we do
> with all other reloptions. It looks like it was coded that way because
> of difficulties with code placement. Please solve this with generic
> code, not just code that solves only the current issue. I’d personally
> be happier without any option at all, but if y’all want it, then
> please add the code in the right place.
>
> * The new coding for heap_update() uses the phrase “warm”, which is
> already taken by another patch. Please avoid confusing things by
> re-using the same term for different things.
> The use of these technical terms like projection and surjective
> doesn’t seem to add anything useful to the patch
>
> * Rename parameter to recheck_on_update
>
> * Remove random whitespace changes in the patch
>
> So at this stage the concept is good and works, but the patch is only
> just at the prototype stage and nowhere near committable. I hope you
> can correct these issues and if you do I will be happy to review and
> perhaps commit.
>
> Thanks
>
Attached please find new patch which fix all the reported issues except
disabling autotune.
If you still thing that additional 16 bytes per relation in statistic is
too high overhead, then I will also remove autotune.


--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


projection-autotune5.patch (31K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Surjective functional indexes

Robert Haas
In reply to this post by konstantin knizhnik
On Wed, Dec 13, 2017 at 12:32 PM, Konstantin Knizhnik
<[hidden email]> wrote:
> I can not believe that there can be more than thousand non-temporary
> relations in any database.

I ran across a cluster with more than 5 million non-temporary
relations just this week.  That's extreme, but having more than a
thousand non-temporary tables is not particularly uncommon.  I would
guess that the percentage of EnterpriseDB's customers who have more
than a thousand non-temporary tables in a database is in the double
digits.  That number is only going to go up as our partitioning
capabilities improve.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

12345