Binary search in ScalarArrayOpExpr for OR'd constant arrays

Binary search in ScalarArrayOpExpr for OR'd constant arrays

James Coleman
Over in "execExprInterp() questions / How to improve scalar array op
expr eval?" [1] I'd mused about how we might be able to optimize
scalar array ops with OR'd semantics.

This patch implements a binary search for such expressions when the
array argument is a constant so that we can avoid needing to teach
expression execution to cache stable values or know when a param has
changed.
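
(For illustration only: a minimal standalone sketch of the core idea,
sorting the constant array once and then binary-searching it per lookup.
This is plain C using qsort/bsearch, not the patch's actual executor code.)

    #include <stdio.h>
    #include <stdlib.h>

    static int cmp_int(const void *a, const void *b)
    {
        int x = *(const int *) a, y = *(const int *) b;
        return (x > y) - (x < y);
    }

    int main(void)
    {
        int keys[] = {42, 7, 19, 3, 88, 61, 25, 50, 11}; /* the IN (...) list */
        int nkeys = sizeof(keys) / sizeof(keys[0]);
        int probes[] = {19, 20, 88};

        /* one-time setup: sort the constant array */
        qsort(keys, nkeys, sizeof(int), cmp_int);

        /* per-tuple work: O(log n) lookup instead of a linear scan */
        for (int i = 0; i < 3; i++)
        {
            int *hit = bsearch(&probes[i], keys, nkeys, sizeof(int), cmp_int);
            printf("%d -> %s\n", probes[i], hit ? "match" : "no match");
        }
        return 0;
    }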

The speed-up for the target case can be pretty impressive: in my
admittedly contrived and relatively unscientific test with a query of
the form:

select count(*) from generate_series(1,100000) n(i) where i in (<1000
random integers in the series>)

shows ~30ms for the patch versus ~640ms on master.

James

[1]: https://www.postgresql.org/message-id/flat/CAAaqYe-UQBba7sScrucDOyHb7cDoNbWf_rcLrOWeD4ikP3_qTQ%40mail.gmail.com

Attachment: v1-0001-Binary-search-const-arrays-in-OR-d-ScalarArrayOps.patch (21K)

Re: Binary search in ScalarArrayOpExpr for OR'd constant arrays

Tomas Vondra-4
On Mon, Apr 20, 2020 at 09:27:34PM -0400, James Coleman wrote:

>Over in "execExprInterp() questions / How to improve scalar array op
>expr eval?" [1] I'd mused about how we might be able to optimized
>scalar array ops with OR'd semantics.
>
>This patch implements a binary search for such expressions when the
>array argument is a constant so that we can avoid needing to teach
>expression execution to cache stable values or know when a param has
>changed.
>
>The speed-up for the target case can pretty impressive: in my
>admittedly contrived and relatively unscientific test with a query in
>the form:
>
>select count(*) from generate_series(1,100000) n(i) where i in (<1000
>random integers in the series>)
>
>shows ~30ms for the patch versus ~640ms on master.
>

Nice improvement, although 1000 items is probably a bit unusual. The
threshold used in the patch (9 elements) seems a bit too low - what
results have you seen with smaller arrays?

Another idea - would a bloom filter be useful here, as a second
optimization? That is, for large arrays build a small bloom filter,
allowing us to skip even the binary search.
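
(A hypothetical sketch of that idea in plain C: a tiny fixed-size bloom
filter consulted before the binary search. The hash functions and sizing
here are arbitrary choices for illustration, not from any actual patch.)

    #include <stdbool.h>
    #include <stdint.h>

    #define BLOOM_BITS 1024                    /* must be a power of two */

    typedef struct { uint8_t bits[BLOOM_BITS / 8]; } bloom;

    static uint32_t hash1(uint32_t x) { x *= 0x85ebca6b; x ^= x >> 13; return x; }
    static uint32_t hash2(uint32_t x) { x *= 0xc2b2ae35; x ^= x >> 16; return x; }

    static void bloom_add(bloom *b, uint32_t v)
    {
        uint32_t h1 = hash1(v) & (BLOOM_BITS - 1);
        uint32_t h2 = hash2(v) & (BLOOM_BITS - 1);
        b->bits[h1 / 8] |= 1 << (h1 % 8);
        b->bits[h2 / 8] |= 1 << (h2 % 8);
    }

    static bool bloom_maybe_contains(const bloom *b, uint32_t v)
    {
        uint32_t h1 = hash1(v) & (BLOOM_BITS - 1);
        uint32_t h2 = hash2(v) & (BLOOM_BITS - 1);
        return (b->bits[h1 / 8] & (1 << (h1 % 8))) &&
               (b->bits[h2 / 8] & (1 << (h2 % 8)));
    }

    /* usage: bloom_add() each array element once at setup, then per lookup:
     *   if (!bloom_maybe_contains(&f, value))
     *       return false;              // definite miss, skip the bsearch
     *   return bsearch(...) != NULL;   // possible hit, confirm it
     */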


regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Binary search in ScalarArrayOpExpr for OR'd constant arrays

James Coleman
On Thu, Apr 23, 2020 at 8:47 AM Tomas Vondra
<[hidden email]> wrote:

>
> On Mon, Apr 20, 2020 at 09:27:34PM -0400, James Coleman wrote:
> >Over in "execExprInterp() questions / How to improve scalar array op
> >expr eval?" [1] I'd mused about how we might be able to optimized
> >scalar array ops with OR'd semantics.
> >
> >This patch implements a binary search for such expressions when the
> >array argument is a constant so that we can avoid needing to teach
> >expression execution to cache stable values or know when a param has
> >changed.
> >
> >The speed-up for the target case can pretty impressive: in my
> >admittedly contrived and relatively unscientific test with a query in
> >the form:
> >
> >select count(*) from generate_series(1,100000) n(i) where i in (<1000
> >random integers in the series>)
> >
> >shows ~30ms for the patch versus ~640ms on master.
> >
>
> Nice improvement, although 1000 items is probably a bit unusual. The
> threshold used in the patch (9 elements) seems a bit too low - what
> results have you seen with smaller arrays?

At least in our systems we regularly work with batches of 1000 items,
which means you get IN clauses of identifiers of that size. Admittedly
the most common case sees those IN clauses as simple index scans
(e.g., WHERE <primary key> IN (...)), but it's also common to have a
broader query that merely filters additionally on something like "...
AND <some foreign key> IN (...)" where it makes sense for the rest of
the quals to take precedence in generating a reasonable plan. In that
case, the IN becomes a regular filter, hence the idea behind the
patch.

Side note: I'd love for us to be able to treat "IN (VALUES)" the same
way...but as noted in the other thread that's an extremely large
amount of work, I think. But similarly you could use a hash here
instead of a binary search...but this seems quite good.

As to the choice of 9 elements: I just picked that as a starting
point; Andres had previously commented offhand that at 8 elements
serial scanning was faster, so I figured this was a reasonable
starting point for discussion.

Perhaps it would make sense to determine that minimum not as a pure
constant but scaled based on how many rows the planner expects us to
see? Of course that'd be a more invasive patch...so may or may not be
as feasible as a reasonable default.

> Another idea - would a bloom filter be useful here, as a second
> optimization? That is, for large arrays build s small bloom filter,
> allowing us to skip even the binary search.

That's an interesting idea. I actually haven't personally worked with
bloom filters, so didn't think about that.

Are you thinking that you'd also build the filter *and* presort the
array? Or try to get away with using only the bloom filter and not
expanding and sorting the array at all?

Thanks,
James



Re: Binary search in ScalarArrayOpExpr for OR'd constant arrays

Tomas Vondra-4
On Thu, Apr 23, 2020 at 09:02:26AM -0400, James Coleman wrote:

>On Thu, Apr 23, 2020 at 8:47 AM Tomas Vondra
><[hidden email]> wrote:
>>
>> On Mon, Apr 20, 2020 at 09:27:34PM -0400, James Coleman wrote:
>> >Over in "execExprInterp() questions / How to improve scalar array op
>> >expr eval?" [1] I'd mused about how we might be able to optimized
>> >scalar array ops with OR'd semantics.
>> >
>> >This patch implements a binary search for such expressions when the
>> >array argument is a constant so that we can avoid needing to teach
>> >expression execution to cache stable values or know when a param has
>> >changed.
>> >
>> >The speed-up for the target case can pretty impressive: in my
>> >admittedly contrived and relatively unscientific test with a query in
>> >the form:
>> >
>> >select count(*) from generate_series(1,100000) n(i) where i in (<1000
>> >random integers in the series>)
>> >
>> >shows ~30ms for the patch versus ~640ms on master.
>> >
>>
>> Nice improvement, although 1000 items is probably a bit unusual. The
>> threshold used in the patch (9 elements) seems a bit too low - what
>> results have you seen with smaller arrays?
>
>At least in our systems we regularly work with 1000 batches of items,
>which means you get IN clauses of identifiers of that size. Admittedly
>the most common case sees those IN clauses as simple index scans
>(e.g., WHERE <primary key> IN (...)), but it's also common to have a
>broader query that merely filters additionally on something like "...
>AND <some foreign key> IN (...)" where it makes sense for the rest of
>the quals to take precedence in generating a reasonable plan. In that
>case, the IN becomes a regular filter, hence the idea behind the
>patch.
>
>Side note: I'd love for us to be able to treat "IN (VALUES)" the same
>way...but as noted in the other thread that's an extremely large
>amount of work, I think. But similarly you could use a hash here
>instead of a binary search...but this seems quite good.
>
>As to the choice of 9 elements: I just picked that as a starting
>point; Andres had previously commented off hand that at 8 elements
>serial scanning was faster, so I figured this was a reasonable
>starting point for discussion.
>
>Perhaps it would make sense to determine that minimum not as a pure
>constant but scaled based on how many rows the planner expects us to
>see? Of course that'd be a more invasive patch...so may or may not be
>as feasible as a reasonable default.
>

Not sure. That seems a bit overcomplicated, and I don't think it depends
very much on the number of rows the planner expects to see. I think we
usually assume the linear search is cheaper for small arrays and then at
some point the binary search starts winning. The question is where this
"break even" point is.

I think we usually use something like 64 or so in other places, but
maybe I'm wrong. The current limit of 9 seems a bit too low, but I may
be wrong about that. Let's not obsess about this too much; let's do some
experiments and pick a value based on that.


>> Another idea - would a bloom filter be useful here, as a second
>> optimization? That is, for large arrays build s small bloom filter,
>> allowing us to skip even the binary search.
>
>That's an interesting idea. I actually haven't personally worked with
>bloom filters, so didn't think about that.
>
>Are you thinking that you'd also build the filter *and* presort the
>array? Or try to get away with using only the bloom filter and not
>expanding and sorting the array at all?
>

Yeah, something like that. My intuition is the bloom filter is useful
only above some number of items, and the number is higher than for the
binary search. So we'd end up with two thresholds: the first one enabling
the binary search, the second one enabling the bloom filter.

Of course, the "unknown" variable here is how often we actually find the
value in the array. If 100% of the queries have a match, then the bloom
filter is a waste of time. If there are no matches, it can make a
significant difference.


regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Binary search in ScalarArrayOpExpr for OR'd constant arrays

David Rowley
On Fri, 24 Apr 2020 at 02:56, Tomas Vondra <[hidden email]> wrote:

>
> On Thu, Apr 23, 2020 at 09:02:26AM -0400, James Coleman wrote:
> >On Thu, Apr 23, 2020 at 8:47 AM Tomas Vondra
> ><[hidden email]> wrote:
> >As to the choice of 9 elements: I just picked that as a starting
> >point; Andres had previously commented off hand that at 8 elements
> >serial scanning was faster, so I figured this was a reasonable
> >starting point for discussion.
> >
> >Perhaps it would make sense to determine that minimum not as a pure
> >constant but scaled based on how many rows the planner expects us to
> >see? Of course that'd be a more invasive patch...so may or may not be
> >as feasible as a reasonable default.
> >
>
> Not sure. That seems a bit overcomplicated and I don't think it depends
> on the number of rows the planner expects to see very much. I think we
> usually assume the linear search is cheaper for small arrays and then at
> some point the binary search starts winning The question is where this
> "break even" point is.
>
> I think we usually use something like 64 or so in other places, but
> maybe I'm wrong. The current limit 9 seems a bit too low, but I may be
> wrong. Let's not obsess about this too much, let's do some experiments
> and pick a value based on that.

If a single comparison for a binary search costs about the same as an
equality check, then wouldn't the crossover point be much lower than
64? The binary search should find or not find the target in log2(N)
rather than N.  ceil(log2(9)) is 4, which is of course less than 9.
For 64, it's 6, so aren't you possibly doing 58 more equality checks
than necessary?  Of course, it's a bit more complex, as for values that
*are* in the array, the linear search will, on average, only check half
the values. Assuming that, then 9 does not seem too far off.  I guess
benchmarks at various crossover points would speak a thousand words.
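
(Along those lines, a rough standalone micro-benchmark sketch one could
use to eyeball the crossover. Probes are random ints, so nearly every
lookup is a miss, the worst case for the linear scan; the crossover inside
the executor will of course differ, per the cache-effect caveats Andres
raises below.)

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define NPROBES 1000000

    static int cmp_int(const void *a, const void *b)
    {
        int x = *(const int *) a, y = *(const int *) b;
        return (x > y) - (x < y);
    }

    static int linear_contains(const int *a, int n, int key)
    {
        for (int i = 0; i < n; i++)
            if (a[i] == key)
                return 1;
        return 0;
    }

    int main(void)
    {
        int sizes[] = {4, 8, 16, 32, 64, 128};

        for (int s = 0; s < 6; s++)
        {
            int n = sizes[s];
            int *a = malloc(n * sizeof(int));
            long hits = 0;

            for (int i = 0; i < n; i++)
                a[i] = rand();
            qsort(a, n, sizeof(int), cmp_int);

            clock_t t0 = clock();
            for (int i = 0; i < NPROBES; i++)
                hits += linear_contains(a, n, rand());
            clock_t t1 = clock();
            for (int i = 0; i < NPROBES; i++)
            {
                int key = rand();
                hits += bsearch(&key, a, n, sizeof(int), cmp_int) != NULL;
            }
            clock_t t2 = clock();

            printf("n=%3d  linear %4.0f ms  bsearch %4.0f ms  (hits=%ld)\n", n,
                   1000.0 * (t1 - t0) / CLOCKS_PER_SEC,
                   1000.0 * (t2 - t1) / CLOCKS_PER_SEC, hits);
            free(a);
        }
        return 0;
    }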

David



Re: Binary search in ScalarArrayOpExpr for OR'd constant arrays

Andres Freund
Hi,

On 2020-04-24 10:09:36 +1200, David Rowley wrote:
> If single comparison for a binary search costs about the same as an
> equality check, then wouldn't the crossover point be much lower than
> 64?

The costs aren't quite as simple as that though. Binary search usually
has issues with cache misses: in contrast to linear accesses, each step
will be a cache miss, as the address is not predictable; and even if the
CPU couldn't predict accesses in the linear search case, often multiple
entries fit on a single cache line. Additionally, out-of-order execution
is usually a lot more effective for linear searches (e.g. the next
elements can be compared before the current one is finished, if that's
what the branch predictor says is likely).

Greetings,

Andres Freund



Re: Binary search in ScalarArrayOpExpr for OR'd constant arrays

James Coleman
In reply to this post by Tomas Vondra-4
On Thu, Apr 23, 2020 at 10:55 AM Tomas Vondra
<[hidden email]> wrote:

>
> On Thu, Apr 23, 2020 at 09:02:26AM -0400, James Coleman wrote:
> >On Thu, Apr 23, 2020 at 8:47 AM Tomas Vondra
> ><[hidden email]> wrote:
> >>
> >> On Mon, Apr 20, 2020 at 09:27:34PM -0400, James Coleman wrote:
> >> >Over in "execExprInterp() questions / How to improve scalar array op
> >> >expr eval?" [1] I'd mused about how we might be able to optimized
> >> >scalar array ops with OR'd semantics.
> >> >
> >> >This patch implements a binary search for such expressions when the
> >> >array argument is a constant so that we can avoid needing to teach
> >> >expression execution to cache stable values or know when a param has
> >> >changed.
> >> >
> >> >The speed-up for the target case can pretty impressive: in my
> >> >admittedly contrived and relatively unscientific test with a query in
> >> >the form:
> >> >
> >> >select count(*) from generate_series(1,100000) n(i) where i in (<1000
> >> >random integers in the series>)
> >> >
> >> >shows ~30ms for the patch versus ~640ms on master.
> >> >
> >>
> >> Nice improvement, although 1000 items is probably a bit unusual. The
> >> threshold used in the patch (9 elements) seems a bit too low - what
> >> results have you seen with smaller arrays?
> >
> >At least in our systems we regularly work with 1000 batches of items,
> >which means you get IN clauses of identifiers of that size. Admittedly
> >the most common case sees those IN clauses as simple index scans
> >(e.g., WHERE <primary key> IN (...)), but it's also common to have a
> >broader query that merely filters additionally on something like "...
> >AND <some foreign key> IN (...)" where it makes sense for the rest of
> >the quals to take precedence in generating a reasonable plan. In that
> >case, the IN becomes a regular filter, hence the idea behind the
> >patch.
> >
> >Side note: I'd love for us to be able to treat "IN (VALUES)" the same
> >way...but as noted in the other thread that's an extremely large
> >amount of work, I think. But similarly you could use a hash here
> >instead of a binary search...but this seems quite good.
> >
> >As to the choice of 9 elements: I just picked that as a starting
> >point; Andres had previously commented off hand that at 8 elements
> >serial scanning was faster, so I figured this was a reasonable
> >starting point for discussion.
> >
> >Perhaps it would make sense to determine that minimum not as a pure
> >constant but scaled based on how many rows the planner expects us to
> >see? Of course that'd be a more invasive patch...so may or may not be
> >as feasible as a reasonable default.
> >
>
> Not sure. That seems a bit overcomplicated and I don't think it depends
> on the number of rows the planner expects to see very much. I think we
> usually assume the linear search is cheaper for small arrays and then at
> some point the binary search starts winning The question is where this
> "break even" point is.

Well, since it has to do preprocessing work (expanding the array and
then sorting it), the number of rows processed matters, right? For
example, doing a linear search on 1000 items only once is going to be
cheaper than preprocessing the array and then doing a binary search,
but only at a very large row count would the binary search +
preprocessing win out for a mere 10-element array.

I'm not trying to argue for more work for myself here: I think the
optimization is worth it on its own, and something like this could be
a further improvement on its own. But it is interesting to think
about.

James



Re: Binary search in ScalarArrayOpExpr for OR'd constant arrays

Tomas Vondra-4
On Fri, Apr 24, 2020 at 09:38:54AM -0400, James Coleman wrote:

>On Thu, Apr 23, 2020 at 10:55 AM Tomas Vondra
><[hidden email]> wrote:
>>
>> On Thu, Apr 23, 2020 at 09:02:26AM -0400, James Coleman wrote:
>> >On Thu, Apr 23, 2020 at 8:47 AM Tomas Vondra
>> ><[hidden email]> wrote:
>> >>
>> >> On Mon, Apr 20, 2020 at 09:27:34PM -0400, James Coleman wrote:
>> >> >Over in "execExprInterp() questions / How to improve scalar array op
>> >> >expr eval?" [1] I'd mused about how we might be able to optimized
>> >> >scalar array ops with OR'd semantics.
>> >> >
>> >> >This patch implements a binary search for such expressions when the
>> >> >array argument is a constant so that we can avoid needing to teach
>> >> >expression execution to cache stable values or know when a param has
>> >> >changed.
>> >> >
>> >> >The speed-up for the target case can pretty impressive: in my
>> >> >admittedly contrived and relatively unscientific test with a query in
>> >> >the form:
>> >> >
>> >> >select count(*) from generate_series(1,100000) n(i) where i in (<1000
>> >> >random integers in the series>)
>> >> >
>> >> >shows ~30ms for the patch versus ~640ms on master.
>> >> >
>> >>
>> >> Nice improvement, although 1000 items is probably a bit unusual. The
>> >> threshold used in the patch (9 elements) seems a bit too low - what
>> >> results have you seen with smaller arrays?
>> >
>> >At least in our systems we regularly work with 1000 batches of items,
>> >which means you get IN clauses of identifiers of that size. Admittedly
>> >the most common case sees those IN clauses as simple index scans
>> >(e.g., WHERE <primary key> IN (...)), but it's also common to have a
>> >broader query that merely filters additionally on something like "...
>> >AND <some foreign key> IN (...)" where it makes sense for the rest of
>> >the quals to take precedence in generating a reasonable plan. In that
>> >case, the IN becomes a regular filter, hence the idea behind the
>> >patch.
>> >
>> >Side note: I'd love for us to be able to treat "IN (VALUES)" the same
>> >way...but as noted in the other thread that's an extremely large
>> >amount of work, I think. But similarly you could use a hash here
>> >instead of a binary search...but this seems quite good.
>> >
>> >As to the choice of 9 elements: I just picked that as a starting
>> >point; Andres had previously commented off hand that at 8 elements
>> >serial scanning was faster, so I figured this was a reasonable
>> >starting point for discussion.
>> >
>> >Perhaps it would make sense to determine that minimum not as a pure
>> >constant but scaled based on how many rows the planner expects us to
>> >see? Of course that'd be a more invasive patch...so may or may not be
>> >as feasible as a reasonable default.
>> >
>>
>> Not sure. That seems a bit overcomplicated and I don't think it depends
>> on the number of rows the planner expects to see very much. I think we
>> usually assume the linear search is cheaper for small arrays and then at
>> some point the binary search starts winning The question is where this
>> "break even" point is.
>
>Well since it has to do preprocessing work (expanding the array and
>then sorting it), then the number of rows processed matters, right?
>For example, doing a linear search on 1000 items only once is going to
>be cheaper than preprocessing the array and then doing a binary
>search, but only a very large row count the binary search +
>preprocessing might very well win out for only a 10 element array.
>

Hmmm, good point. Essentially the initialization (sorting of the array)
has some cost, and the question is how much extra per-tuple cost this
adds. It's probably not worth it for a single lookup, but for many
lookups it's probably OK. Let's see if I can do the math right:

   N - number of lookups
   K - number of array elements

Cost to sort the array is

    O(K * log(K)) = C1 * K * log(K)

and the cost of a lookup is C2 * log(K), so with the extra cost amortized
for N lookups, the total "per lookup" cost is

    C1 * K * log(K) / N + C2 * log(K) = log(K) * (C1 * K / N + C2)

We need to compare this to the O(K) cost of simple linear search, and
the question is at which point the linear search gets more expensive:

    C3 * K = log(K) * (C1 * K / N + C2)

I think we can assume that C3 is somewhere in between 0.5 and 1, i.e. if
there's a matching item we find it half-way through on average, and if
there is not we have to walk the whole array. So let's say it's 1.

C1 and C2 are probably fairly low, I think - C1 is typically ~1.4 for
random pivot choice IIRC, and C2 is probably similar. With both values
being ~1.5 we get this:

    K = log(K) * (1.5 * K/N + 1.5)

for a fixed K, we get this formula for N:

    N = log(K) * 1.5 * K / (K - 1.5 * log(K))

and for a bunch of K values the results look like this:

         K |     N
    -------|-------
         1 |     0
        10 |  5.27
       100 |  7.42
      1000 | 10.47
     10000 | 13.83
    100000 | 17.27

i.e. the binary search with 10k values starts winning over linear search
with just ~13 lookups.

(Assuming I haven't made some silly mistake in the math ...)
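
(For anyone who wants to re-check the arithmetic, a throwaway C snippet
that reproduces the table above, assuming C1 = C2 = 1.5 and natural log;
compile with -lm.)

    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        double ks[] = {1, 10, 100, 1000, 10000, 100000};

        for (int i = 0; i < 6; i++)
        {
            double k = ks[i];
            /* break-even N = log(K) * 1.5 * K / (K - 1.5 * log(K)) */
            double n = log(k) * 1.5 * k / (k - 1.5 * log(k));
            printf("%8.0f | %6.2f\n", k, n);
        }
        return 0;
    }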

Obviously, this only accounts for cost of comparisons and neglects e.g.
the indirect costs for less predictable memory access patterns mentioned
by Andres in his response.

But I think it still shows the number of lookups needed for the binary
search to be a win is pretty low - at least for reasonable number of
values in the array. Maybe it's 20 and not 10, but I don't think that
changes much.

The other question is if we can get N at all and how reliable the value
is. We can probably get the number of rows, but that will ignore other
conditions that may eliminate the row before the binary search.

>I'm not trying to argue for more work for myself here: I think the
>optimization is worth it on its own, and something like this could be
>a further improvement on its own. But it is interesting to think
>about.
>

I don't know. Clearly, if the user sends a query with 10k values and
only does a single lookup, that won't win. And if we can reasonably and
reliably protect against that, I wouldn't mind doing that, although it
means a risk of not using the bin search in case of underestimates etc.

I don't have any hard data about this, but I think we can assume the
number of rows processed by the clause is (much) higher than the number
of keys in it. If you have a clause with 10k values, then you probably
expect it to be applied to many rows, far more than the "break even"
point of about 10-20 rows ...

So I wouldn't worry about this too much.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Binary search in ScalarArrayOpExpr for OR'd constant arrays

Tomas Vondra-4
In reply to this post by Tomas Vondra-4
On Thu, Apr 23, 2020 at 04:55:51PM +0200, Tomas Vondra wrote:

>On Thu, Apr 23, 2020 at 09:02:26AM -0400, James Coleman wrote:
>>On Thu, Apr 23, 2020 at 8:47 AM Tomas Vondra
>><[hidden email]> wrote:
>>>
>>>On Mon, Apr 20, 2020 at 09:27:34PM -0400, James Coleman wrote:
>>>>Over in "execExprInterp() questions / How to improve scalar array op
>>>>expr eval?" [1] I'd mused about how we might be able to optimized
>>>>scalar array ops with OR'd semantics.
>>>>
>>>>This patch implements a binary search for such expressions when the
>>>>array argument is a constant so that we can avoid needing to teach
>>>>expression execution to cache stable values or know when a param has
>>>>changed.
>>>>
>>>>The speed-up for the target case can pretty impressive: in my
>>>>admittedly contrived and relatively unscientific test with a query in
>>>>the form:
>>>>
>>>>select count(*) from generate_series(1,100000) n(i) where i in (<1000
>>>>random integers in the series>)
>>>>
>>>>shows ~30ms for the patch versus ~640ms on master.
>>>>
>>>
>>>Nice improvement, although 1000 items is probably a bit unusual. The
>>>threshold used in the patch (9 elements) seems a bit too low - what
>>>results have you seen with smaller arrays?
>>
>>At least in our systems we regularly work with 1000 batches of items,
>>which means you get IN clauses of identifiers of that size. Admittedly
>>the most common case sees those IN clauses as simple index scans
>>(e.g., WHERE <primary key> IN (...)), but it's also common to have a
>>broader query that merely filters additionally on something like "...
>>AND <some foreign key> IN (...)" where it makes sense for the rest of
>>the quals to take precedence in generating a reasonable plan. In that
>>case, the IN becomes a regular filter, hence the idea behind the
>>patch.
>>
>>Side note: I'd love for us to be able to treat "IN (VALUES)" the same
>>way...but as noted in the other thread that's an extremely large
>>amount of work, I think. But similarly you could use a hash here
>>instead of a binary search...but this seems quite good.
>>
>>As to the choice of 9 elements: I just picked that as a starting
>>point; Andres had previously commented off hand that at 8 elements
>>serial scanning was faster, so I figured this was a reasonable
>>starting point for discussion.
>>
>>Perhaps it would make sense to determine that minimum not as a pure
>>constant but scaled based on how many rows the planner expects us to
>>see? Of course that'd be a more invasive patch...so may or may not be
>>as feasible as a reasonable default.
>>
>
>Not sure. That seems a bit overcomplicated and I don't think it depends
>on the number of rows the planner expects to see very much. I think we
>usually assume the linear search is cheaper for small arrays and then at
>some point the binary search starts winning The question is where this
>"break even" point is.
>
>I think we usually use something like 64 or so in other places, but
>maybe I'm wrong. The current limit 9 seems a bit too low, but I may be
>wrong. Let's not obsess about this too much, let's do some experiments
>and pick a value based on that.
>
>
>>>Another idea - would a bloom filter be useful here, as a second
>>>optimization? That is, for large arrays build s small bloom filter,
>>>allowing us to skip even the binary search.
>>
>>That's an interesting idea. I actually haven't personally worked with
>>bloom filters, so didn't think about that.
>>
>>Are you thinking that you'd also build the filter *and* presort the
>>array? Or try to get away with using only the bloom filter and not
>>expanding and sorting the array at all?
>>
>
>Yeah, something like that. My intuition is the bloom filter is useful
>only above some number of items, and the number is higher than for the
>binary search. So we'd end up with two thresholds, first one enabling
>binary search, the second one enabling bloom filter.
>
>Of course, the "unknown" variable here is how often we actually find the
>value in the array. If 100% of the queries has a match, then the bloom
>filter is a waste of time. If there are no matches, it can make a
>significant difference.
>
I did experiment with this a bit, both to get a bit more familiar
with this code and to see if the bloom filter might help. The short
answer is the bloom filter does not seem to help at all, so I wouldn't
bother about it too much.

Attached is an updated patch series, a script I used to collect some
performance measurements, and a spreadsheet with results. The patch
series is broken into four parts:

   0001 - the original patch with binary search
   0002 - adds GUCs to enable bin search / tweak threshold
   0003 - allows to use bloom filter + binary search
   0004 - try using murmurhash

The test script runs a wide range of queries with different numbers
of lookups, keys in the array, and match probabilities (i.e. the fraction
of lookups that find a match), ranging from 1% to 100%. And of course, it
runs this with the binsearch/bloom either enabled or disabled (the
threshold was lowered to 1, so it's the on/off GUCs that determine
whether the binsearch/bloom is used).

The results are summarized in the spreadsheet, demonstrating how useless
the bloom filter is. There's not a single case where it would beat the
binary search. I believe this is because the access to the bloom filter is
random (determined by the hash function) and we don't save much compared
to the log(K) lookups in the sorted array.

That makes sense, I think the bloom filters are meant to be used in
cases when the main data don't fit into memory - which is not the case
here. But I wonder how this would change for cases with more expensive
comparisons - this was using just integers, so maybe strings would
result in different behavior.


regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments:
  saop-results.ods (278K)
  test-saop.sh (5K)
  0001-binary-search.patch (15K)
  0002-add-GUC-to-enable-binary-search.patch (3K)
  0003-bloom-filter.patch (22K)
  0004-try-using-murmuhash.patch (3K)

Re: Binary search in ScalarArrayOpExpr for OR'd constant arrays

James Coleman
In reply to this post by Tomas Vondra-4
On Fri, Apr 24, 2020 at 5:55 PM Tomas Vondra
<[hidden email]> wrote:

>
> On Fri, Apr 24, 2020 at 09:38:54AM -0400, James Coleman wrote:
> >On Thu, Apr 23, 2020 at 10:55 AM Tomas Vondra
> ><[hidden email]> wrote:
> >>
> >> On Thu, Apr 23, 2020 at 09:02:26AM -0400, James Coleman wrote:
> >> >On Thu, Apr 23, 2020 at 8:47 AM Tomas Vondra
> >> ><[hidden email]> wrote:
> >> >>
> >> >> On Mon, Apr 20, 2020 at 09:27:34PM -0400, James Coleman wrote:
> >> >> >Over in "execExprInterp() questions / How to improve scalar array op
> >> >> >expr eval?" [1] I'd mused about how we might be able to optimized
> >> >> >scalar array ops with OR'd semantics.
> >> >> >
> >> >> >This patch implements a binary search for such expressions when the
> >> >> >array argument is a constant so that we can avoid needing to teach
> >> >> >expression execution to cache stable values or know when a param has
> >> >> >changed.
> >> >> >
> >> >> >The speed-up for the target case can pretty impressive: in my
> >> >> >admittedly contrived and relatively unscientific test with a query in
> >> >> >the form:
> >> >> >
> >> >> >select count(*) from generate_series(1,100000) n(i) where i in (<1000
> >> >> >random integers in the series>)
> >> >> >
> >> >> >shows ~30ms for the patch versus ~640ms on master.
> >> >> >
> >> >>
> >> >> Nice improvement, although 1000 items is probably a bit unusual. The
> >> >> threshold used in the patch (9 elements) seems a bit too low - what
> >> >> results have you seen with smaller arrays?
> >> >
> >> >At least in our systems we regularly work with 1000 batches of items,
> >> >which means you get IN clauses of identifiers of that size. Admittedly
> >> >the most common case sees those IN clauses as simple index scans
> >> >(e.g., WHERE <primary key> IN (...)), but it's also common to have a
> >> >broader query that merely filters additionally on something like "...
> >> >AND <some foreign key> IN (...)" where it makes sense for the rest of
> >> >the quals to take precedence in generating a reasonable plan. In that
> >> >case, the IN becomes a regular filter, hence the idea behind the
> >> >patch.
> >> >
> >> >Side note: I'd love for us to be able to treat "IN (VALUES)" the same
> >> >way...but as noted in the other thread that's an extremely large
> >> >amount of work, I think. But similarly you could use a hash here
> >> >instead of a binary search...but this seems quite good.
> >> >
> >> >As to the choice of 9 elements: I just picked that as a starting
> >> >point; Andres had previously commented off hand that at 8 elements
> >> >serial scanning was faster, so I figured this was a reasonable
> >> >starting point for discussion.
> >> >
> >> >Perhaps it would make sense to determine that minimum not as a pure
> >> >constant but scaled based on how many rows the planner expects us to
> >> >see? Of course that'd be a more invasive patch...so may or may not be
> >> >as feasible as a reasonable default.
> >> >
> >>
> >> Not sure. That seems a bit overcomplicated and I don't think it depends
> >> on the number of rows the planner expects to see very much. I think we
> >> usually assume the linear search is cheaper for small arrays and then at
> >> some point the binary search starts winning The question is where this
> >> "break even" point is.
> >
> >Well since it has to do preprocessing work (expanding the array and
> >then sorting it), then the number of rows processed matters, right?
> >For example, doing a linear search on 1000 items only once is going to
> >be cheaper than preprocessing the array and then doing a binary
> >search, but only a very large row count the binary search +
> >preprocessing might very well win out for only a 10 element array.
> >
>
> Hmmm, good point. Essentially the initialization (sorting of the array)
> has some cost, and the question is how much extra per-tuple cost this
> adds. It's probably not worth it for a single lookup, but for many
> lookups it's probably OK. Let's see if I can do the math right:
>
>    N - number of lookups
>    K - number of array elements
>
> Cost to sort the array is
>
>     O(K * log(K)) = C1 * K * log(K)
>
> and the cost of a lookup is C2 * log(K), so with the extra cost amortized
> for N lookups, the total "per lookup" cost is
>
>     C1 * K * log(K) / N + C2 * log(K) = log(K) * (C1 * K / N + C2)
>
> We need to compare this to the O(K) cost of simple linear search, and
> the question is at which point the linear search gets more expensive:
>
>     C3 * K = log(K) * (C1 * K / N + C2)
>
> I think we can assume that C3 is somewhere in between 0.5 and 1, i.e. if
> there's a matching item we find it half-way through on average, and if
> there is not we have to walk the whole array. So let's say it's 1.
>
> C1 and C2 are probably fairly low, I think - C1 is typically ~1.4 for
> random pivot choice IIRC, and C2 is probably similar. With both values
> being ~1.5 we get this:
>
>     K = log(K) * (1.5 * K/N + 1.5)
>
> for a fixed K, we get this formula for N:
>
>     N = log(K) * 1.5 * K / (K - 1.5 * log(K))
>
> and for a bunch of K values the results look like this:
>
>          K |     N
>     -------|-------
>          1 |     0
>         10 |  5.27
>        100 |  7.42
>       1000 | 10.47
>      10000 | 13.83
>     100000 | 17.27
>
> i.e. the binary search with 10k values starts winning over linear search
> with just ~13 lookups.
>
> (Assuming I haven't made some silly mistake in the math ...)
>
> Obviously, this only accounts for cost of comparisons and neglects e.g.
> the indirect costs for less predictable memory access patterns mentioned
> by Andres in his response.
>
> But I think it still shows the number of lookups needed for the binary
> search to be a win is pretty low - at least for reasonable number of
> values in the array. Maybe it's 20 and not 10, but I don't think that
> changes much.
>
> The other question is if we can get N at all and how reliable the value
> is. We can probably get the number of rows, but that will ignore other
> conditions that may eliminate the row before the binary search.
>
> >I'm not trying to argue for more work for myself here: I think the
> >optimization is worth it on its own, and something like this could be
> >a further improvement on its own. But it is interesting to think
> >about.
> >
>
> I don't know. Clearly, if the user sends a query with 10k values and
> only does a single lookup, that won't win. And if we can reasonably and
> reliably protect against that, I wouldn't mind doing that, although it
> means a risk of not using the bin search in case of underestimates etc.
>
> I don't have any hard data about this, but I think we can assume the
> number of rows processed by the clause is (much) higher than the number
> of keys in it. If you have a clause with 10k values, then you probably
> expect it to be applied to many rows, far more than the "beak even"
> point of about 10-20 rows ...
>
> So I wouldn't worry about this too much.

Yeah. I think it becomes a lot more interesting in the future if/when
we end up with a way to use this with params and not just constant
arrays. Then the "group" size would matter a whole lot more.

For now, the constant amount of overhead is quite small, so even if we
only execute it once we won't make the query that much worse (or, at
least, the total query time will still be very small). Also, because
it's only applied to constants, there's a natural limit to how much
overhead we're likely to introduce into a query.

James



Re: Binary search in ScalarArrayOpExpr for OR'd constant arrays

Tomas Vondra-4
In reply to this post by Tomas Vondra-4
On Sat, Apr 25, 2020 at 12:21:06AM +0200, Tomas Vondra wrote:

>On Thu, Apr 23, 2020 at 04:55:51PM +0200, Tomas Vondra wrote:
>>On Thu, Apr 23, 2020 at 09:02:26AM -0400, James Coleman wrote:
>>>On Thu, Apr 23, 2020 at 8:47 AM Tomas Vondra
>>><[hidden email]> wrote:
>>>>
>>>>On Mon, Apr 20, 2020 at 09:27:34PM -0400, James Coleman wrote:
>>>>>Over in "execExprInterp() questions / How to improve scalar array op
>>>>>expr eval?" [1] I'd mused about how we might be able to optimized
>>>>>scalar array ops with OR'd semantics.
>>>>>
>>>>>This patch implements a binary search for such expressions when the
>>>>>array argument is a constant so that we can avoid needing to teach
>>>>>expression execution to cache stable values or know when a param has
>>>>>changed.
>>>>>
>>>>>The speed-up for the target case can pretty impressive: in my
>>>>>admittedly contrived and relatively unscientific test with a query in
>>>>>the form:
>>>>>
>>>>>select count(*) from generate_series(1,100000) n(i) where i in (<1000
>>>>>random integers in the series>)
>>>>>
>>>>>shows ~30ms for the patch versus ~640ms on master.
>>>>>
>>>>
>>>>Nice improvement, although 1000 items is probably a bit unusual. The
>>>>threshold used in the patch (9 elements) seems a bit too low - what
>>>>results have you seen with smaller arrays?
>>>
>>>At least in our systems we regularly work with 1000 batches of items,
>>>which means you get IN clauses of identifiers of that size. Admittedly
>>>the most common case sees those IN clauses as simple index scans
>>>(e.g., WHERE <primary key> IN (...)), but it's also common to have a
>>>broader query that merely filters additionally on something like "...
>>>AND <some foreign key> IN (...)" where it makes sense for the rest of
>>>the quals to take precedence in generating a reasonable plan. In that
>>>case, the IN becomes a regular filter, hence the idea behind the
>>>patch.
>>>
>>>Side note: I'd love for us to be able to treat "IN (VALUES)" the same
>>>way...but as noted in the other thread that's an extremely large
>>>amount of work, I think. But similarly you could use a hash here
>>>instead of a binary search...but this seems quite good.
>>>
>>>As to the choice of 9 elements: I just picked that as a starting
>>>point; Andres had previously commented off hand that at 8 elements
>>>serial scanning was faster, so I figured this was a reasonable
>>>starting point for discussion.
>>>
>>>Perhaps it would make sense to determine that minimum not as a pure
>>>constant but scaled based on how many rows the planner expects us to
>>>see? Of course that'd be a more invasive patch...so may or may not be
>>>as feasible as a reasonable default.
>>>
>>
>>Not sure. That seems a bit overcomplicated and I don't think it depends
>>on the number of rows the planner expects to see very much. I think we
>>usually assume the linear search is cheaper for small arrays and then at
>>some point the binary search starts winning The question is where this
>>"break even" point is.
>>
>>I think we usually use something like 64 or so in other places, but
>>maybe I'm wrong. The current limit 9 seems a bit too low, but I may be
>>wrong. Let's not obsess about this too much, let's do some experiments
>>and pick a value based on that.
>>
>>
>>>>Another idea - would a bloom filter be useful here, as a second
>>>>optimization? That is, for large arrays build s small bloom filter,
>>>>allowing us to skip even the binary search.
>>>
>>>That's an interesting idea. I actually haven't personally worked with
>>>bloom filters, so didn't think about that.
>>>
>>>Are you thinking that you'd also build the filter *and* presort the
>>>array? Or try to get away with using only the bloom filter and not
>>>expanding and sorting the array at all?
>>>
>>
>>Yeah, something like that. My intuition is the bloom filter is useful
>>only above some number of items, and the number is higher than for the
>>binary search. So we'd end up with two thresholds, first one enabling
>>binary search, the second one enabling bloom filter.
>>
>>Of course, the "unknown" variable here is how often we actually find the
>>value in the array. If 100% of the queries has a match, then the bloom
>>filter is a waste of time. If there are no matches, it can make a
>>significant difference.
>>
>
>I did experiment with this is a bit, both to get a bit more familiar
>with this code and to see if the bloom filter might help. The short
>answer is the bloom filter does not seem to help at all, so I wouldn't
>bother about it too much.
>
>Attacched is an updated patch series and, script I used to collect some
>performance measurements, and a spreadsheet with results. The patch
>series is broken into four parts:
>
>  0001 - the original patch with binary search
>  0002 - adds GUCs to enable bin search / tweak threshold
>  0003 - allows to use bloom filter + binary search
>  0004 - try using murmurhash
>
>The test script runs a wide range of queries with different number
>of lookups, keys in the array, match probability (i.e. fraction of
>lookups that find a match) ranging from 1% to 100%. And of course, it
>runs this with the binsearch/bloom either enabled or disabled (the
>threshold was lowered to 1, so it's the on/off GUCs that determine
>whether the binsearch/bloom is used).
>
>The results are summarized in the spreadsheet, demonstrating how useless
>the bloom filter is. There's not a single case where it would beat the
>binary search. I believe this is because theaccess to bloom filter is
>random (determined by the hash function) and we don't save much compared
>to the log(K) lookups in the sorted array.
>
>That makes sense, I think the bloom filters are meant to be used in
>cases when the main data don't fit into memory - which is not the case
>here. But I wonder how would this change for cases with more expensive
>comparisons - this was using just integers, so maybe strings would
>result in different behavior.
OK, I tried the same test with text columns (with md5 strings), and the
results are about as I predicted - the bloom filter actually makes a
difference in this case. Depending on the number of lookups and
selectivity (i.e. how many lookups have a match in the array) it can
mean an additional speedup of up to ~5x compared to binary search alone.

For the case with 100% selectivity (i.e. all rows have a match) this
can't really save any time - it's usually still much faster than master,
but it's a bit slower than binary search.

So I think this might be worth investigating further, once the simple
binary search gets committed. We'll probably need to factor in the cost
of the comparison (higher cost -> BF more useful), selectivity of the
filter (fewer matches -> BF more useful) and number of lookups.
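
(One hypothetical shape such a decision could take, just to make the
trade-off concrete; every name and the simplistic cost model below are
made up for illustration, with false positives and build-cost details
hand-waved away.)

    #include <math.h>
    #include <stdbool.h>

    static bool
    bloom_filter_worthwhile(double nlookups, double match_rate,
                            double cmp_cost, double bloom_build_cost,
                            double bloom_probe_cost, double nkeys)
    {
        /* binary search costs roughly log2(nkeys) comparisons per lookup */
        double bsearch_per_lookup = log2(nkeys) * cmp_cost;

        /* with the filter, misses skip the binary search entirely */
        double bloom_per_lookup = bloom_probe_cost +
            match_rate * bsearch_per_lookup;

        return bloom_build_cost + nlookups * bloom_per_lookup <
               nlookups * bsearch_per_lookup;
    }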

This reminds me of our attempts to add bloom filters to hash joins, which I
think ran into mostly the same challenge of deciding when the bloom
filter can be useful and is worth the extra work.


regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments:
  test-saop.sh (11K)
  saop-expensive-cmp.ods (143K)

Re: Binary search in ScalarArrayOpExpr for OR'd constant arrays

Tomas Vondra-4
In reply to this post by James Coleman
On Fri, Apr 24, 2020 at 09:22:34PM -0400, James Coleman wrote:

>On Fri, Apr 24, 2020 at 5:55 PM Tomas Vondra
><[hidden email]> wrote:
>>
>> On Fri, Apr 24, 2020 at 09:38:54AM -0400, James Coleman wrote:
>> >On Thu, Apr 23, 2020 at 10:55 AM Tomas Vondra
>> ><[hidden email]> wrote:
>> >>
>> >> On Thu, Apr 23, 2020 at 09:02:26AM -0400, James Coleman wrote:
>> >> >On Thu, Apr 23, 2020 at 8:47 AM Tomas Vondra
>> >> ><[hidden email]> wrote:
>> >> >>
>> >> >> On Mon, Apr 20, 2020 at 09:27:34PM -0400, James Coleman wrote:
>> >> >> >Over in "execExprInterp() questions / How to improve scalar array op
>> >> >> >expr eval?" [1] I'd mused about how we might be able to optimized
>> >> >> >scalar array ops with OR'd semantics.
>> >> >> >
>> >> >> >This patch implements a binary search for such expressions when the
>> >> >> >array argument is a constant so that we can avoid needing to teach
>> >> >> >expression execution to cache stable values or know when a param has
>> >> >> >changed.
>> >> >> >
>> >> >> >The speed-up for the target case can pretty impressive: in my
>> >> >> >admittedly contrived and relatively unscientific test with a query in
>> >> >> >the form:
>> >> >> >
>> >> >> >select count(*) from generate_series(1,100000) n(i) where i in (<1000
>> >> >> >random integers in the series>)
>> >> >> >
>> >> >> >shows ~30ms for the patch versus ~640ms on master.
>> >> >> >
>> >> >>
>> >> >> Nice improvement, although 1000 items is probably a bit unusual. The
>> >> >> threshold used in the patch (9 elements) seems a bit too low - what
>> >> >> results have you seen with smaller arrays?
>> >> >
>> >> >At least in our systems we regularly work with 1000 batches of items,
>> >> >which means you get IN clauses of identifiers of that size. Admittedly
>> >> >the most common case sees those IN clauses as simple index scans
>> >> >(e.g., WHERE <primary key> IN (...)), but it's also common to have a
>> >> >broader query that merely filters additionally on something like "...
>> >> >AND <some foreign key> IN (...)" where it makes sense for the rest of
>> >> >the quals to take precedence in generating a reasonable plan. In that
>> >> >case, the IN becomes a regular filter, hence the idea behind the
>> >> >patch.
>> >> >
>> >> >Side note: I'd love for us to be able to treat "IN (VALUES)" the same
>> >> >way...but as noted in the other thread that's an extremely large
>> >> >amount of work, I think. But similarly you could use a hash here
>> >> >instead of a binary search...but this seems quite good.
>> >> >
>> >> >As to the choice of 9 elements: I just picked that as a starting
>> >> >point; Andres had previously commented off hand that at 8 elements
>> >> >serial scanning was faster, so I figured this was a reasonable
>> >> >starting point for discussion.
>> >> >
>> >> >Perhaps it would make sense to determine that minimum not as a pure
>> >> >constant but scaled based on how many rows the planner expects us to
>> >> >see? Of course that'd be a more invasive patch...so may or may not be
>> >> >as feasible as a reasonable default.
>> >> >
>> >>
>> >> Not sure. That seems a bit overcomplicated and I don't think it depends
>> >> on the number of rows the planner expects to see very much. I think we
>> >> usually assume the linear search is cheaper for small arrays and then at
>> >> some point the binary search starts winning The question is where this
>> >> "break even" point is.
>> >
>> >Well since it has to do preprocessing work (expanding the array and
>> >then sorting it), then the number of rows processed matters, right?
>> >For example, doing a linear search on 1000 items only once is going to
>> >be cheaper than preprocessing the array and then doing a binary
>> >search, but only a very large row count the binary search +
>> >preprocessing might very well win out for only a 10 element array.
>> >
>>
>> Hmmm, good point. Essentially the initialization (sorting of the array)
>> has some cost, and the question is how much extra per-tuple cost this
>> adds. It's probably not worth it for a single lookup, but for many
>> lookups it's probably OK. Let's see if I can do the math right:
>>
>>    N - number of lookups
>>    K - number of array elements
>>
>> Cost to sort the array is
>>
>>     O(K * log(K)) = C1 * K * log(K)
>>
>> and the cost of a lookup is C2 * log(K), so with the extra cost amortized
>> for N lookups, the total "per lookup" cost is
>>
>>     C1 * K * log(K) / N + C2 * log(K) = log(K) * (C1 * K / N + C2)
>>
>> We need to compare this to the O(K) cost of simple linear search, and
>> the question is at which point the linear search gets more expensive:
>>
>>     C3 * K = log(K) * (C1 * K / N + C2)
>>
>> I think we can assume that C3 is somewhere in between 0.5 and 1, i.e. if
>> there's a matching item we find it half-way through on average, and if
>> there is not we have to walk the whole array. So let's say it's 1.
>>
>> C1 and C2 are probably fairly low, I think - C1 is typically ~1.4 for
>> random pivot choice IIRC, and C2 is probably similar. With both values
>> being ~1.5 we get this:
>>
>>     K = log(K) * (1.5 * K/N + 1.5)
>>
>> for a fixed K, we get this formula for N:
>>
>>     N = log(K) * 1.5 * K / (K - 1.5 * log(K))
>>
>> and for a bunch of K values the results look like this:
>>
>>          K |     N
>>     -------|-------
>>          1 |     0
>>         10 |  5.27
>>        100 |  7.42
>>       1000 | 10.47
>>      10000 | 13.83
>>     100000 | 17.27
>>
>> i.e. the binary search with 10k values starts winning over linear search
>> with just ~13 lookups.
>>
>> (Assuming I haven't made some silly mistake in the math ...)
>>
>> Obviously, this only accounts for cost of comparisons and neglects e.g.
>> the indirect costs for less predictable memory access patterns mentioned
>> by Andres in his response.
>>
>> But I think it still shows the number of lookups needed for the binary
>> search to be a win is pretty low - at least for reasonable number of
>> values in the array. Maybe it's 20 and not 10, but I don't think that
>> changes much.
>>
>> The other question is if we can get N at all and how reliable the value
>> is. We can probably get the number of rows, but that will ignore other
>> conditions that may eliminate the row before the binary search.
>>
>> >I'm not trying to argue for more work for myself here: I think the
>> >optimization is worth it on its own, and something like this could be
>> >a further improvement on its own. But it is interesting to think
>> >about.
>> >
>>
>> I don't know. Clearly, if the user sends a query with 10k values and
>> only does a single lookup, that won't win. And if we can reasonably and
>> reliably protect against that, I wouldn't mind doing that, although it
>> means a risk of not using the bin search in case of underestimates etc.
>>
>> I don't have any hard data about this, but I think we can assume the
>> number of rows processed by the clause is (much) higher than the number
>> of keys in it. If you have a clause with 10k values, then you probably
>> expect it to be applied to many rows, far more than the "beak even"
>> point of about 10-20 rows ...
>>
>> So I wouldn't worry about this too much.
>
>Yeah. I think it becomes a lot more interesting in the future if/when
>we end up with a way to use this with params and not just constant
>arrays. Then the "group" size would matter a whole lot more.
>

True. That probably changes the calculations quite a bit.

>For now, the constant amount of overhead is quite small, so even if we
>only execute it once we won't make the query that much worse (or, at
>least, the total query time will still be very small). Also, because
>it's only applied to constants, there's a natural limit to how much
>overhead we're likely to introduce into a query.
>

FWIW the results from the repeated test with both int and text columns that
I shared in [1] also have data for smaller numbers of rows. I haven't
tried very much to minimize noise (the original goal was to test speedup
for large numbers of rows and large arrays, where this is not an issue).
But I think it still shows that the threshold of ~10 elements is in the
right ballpark. We might use a higher value to be a bit more defensive,
but it's never going to be perfect for types with both cheap and more
expensive comparisons.

One more note - shouldn't this also tweak cost_qual_eval_walker, which
computes the cost for ScalarArrayOpExpr? At the moment it does this:

   /*
    * Estimate that the operator will be applied to about half of the
    * array elements before the answer is determined.
    */

but that's only appropriate for a linear search.
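
(A sketch of the shape such a tweak might take, hedged heavily: the
function name, parameter names, and threshold below are invented for
illustration and are not from the patch or from actual PostgreSQL code.)

    #include <math.h>

    static double
    saop_per_row_cost(int nelems, double cmp_cost, int binsearch_threshold)
    {
        if (nelems >= binsearch_threshold)
            return ceil(log2(nelems)) * cmp_cost;   /* sorted-array lookup */

        return (nelems / 2.0) * cmp_cost;   /* linear scan: half on average */
    }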


[1] https://www.postgresql.org/message-id/20200425124024.hsv7z6bia752uymz%40development


regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Binary search in ScalarArrayOpExpr for OR'd constant arrays

David Rowley
In reply to this post by Tomas Vondra-4
On Sun, 26 Apr 2020 at 00:40, Tomas Vondra <[hidden email]> wrote:
> This reminds me our attempts to add bloom filters to hash joins, which I
> think ran into mostly the same challenge of deciding when the bloom
> filter can be useful and is worth the extra work.

Speaking of that, it would be interesting to see how a test where you
write the query as IN(VALUES(...)) instead of IN() compares. It would
be interesting to know if the planner is able to make a more suitable
choice and also to see how all the work over the years to improve Hash
Joins compares to the bsearch with and without the bloom filter.

David



Re: Binary search in ScalarArrayOpExpr for OR'd constant arrays

James Coleman
On Sat, Apr 25, 2020 at 5:41 PM David Rowley <[hidden email]> wrote:

>
> On Sun, 26 Apr 2020 at 00:40, Tomas Vondra <[hidden email]> wrote:
> > This reminds me our attempts to add bloom filters to hash joins, which I
> > think ran into mostly the same challenge of deciding when the bloom
> > filter can be useful and is worth the extra work.
>
> Speaking of that, it would be interesting to see how a test where you
> write the query as IN(VALUES(...)) instead of IN() compares. It would
> be interesting to know if the planner is able to make a more suitable
> choice and also to see how all the work over the years to improve Hash
> Joins compares to the bsearch with and without the bloom filter.

It would be interesting.

It also makes one wonder about optimizing these into hash
joins...which I'd thought about over at [1]. I think it'd be a very
significant effort though.

James

[1]: https://www.postgresql.org/message-id/CAAaqYe_zVVOURfdPbAhssijw7yV0uKi350gQ%3D_QGDz7R%3DHpGGQ%40mail.gmail.com



Re: Binary search in ScalarArrayOpExpr for OR'd constant arrays

Tomas Vondra-4
On Sat, Apr 25, 2020 at 06:47:41PM -0400, James Coleman wrote:

>On Sat, Apr 25, 2020 at 5:41 PM David Rowley <[hidden email]> wrote:
>>
>> On Sun, 26 Apr 2020 at 00:40, Tomas Vondra <[hidden email]> wrote:
>> > This reminds me our attempts to add bloom filters to hash joins, which I
>> > think ran into mostly the same challenge of deciding when the bloom
>> > filter can be useful and is worth the extra work.
>>
>> Speaking of that, it would be interesting to see how a test where you
>> write the query as IN(VALUES(...)) instead of IN() compares. It would
>> be interesting to know if the planner is able to make a more suitable
>> choice and also to see how all the work over the years to improve Hash
>> Joins compares to the bsearch with and without the bloom filter.
>
>It would be interesting.
>
>It also makes one wonder about optimizing these into to hash
>joins...which I'd thought about over at [1]. I think it'd be a very
>significant effort though.
>
I modified the script to also do the join version of the query. I can
only run it on my laptop at the moment, so the results may be a bit
different from those I shared before, but it's interesting I think.

In most cases it's comparable to the binsearch/bloom approach, and in
some cases it actually beats them quite significantly. It seems to
depend on how expensive the comparison is - for "int" the comparison is
very cheap and there's almost no difference. For "text" the comparison
is much more expensive, and there are significant speedups.

For example, in the test with 100k lookups in an array of 10k elements
and 10% match probability, the timings are these:

   master:  62362 ms
   binsearch: 201 ms
   bloom:      65 ms
   hashjoin:   36 ms

I do think the explanation is fairly simple - the bloom filter
eliminates about 90% of the expensive comparisons, so it's 20ms plus
some overhead to build and check the bits. The hash join probably
eliminates a lot of the remaining comparisons, because the hash table
is sized to have one tuple per bucket.
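
As a back-of-the-envelope check of that explanation (a tiny standalone
sketch, just plugging in the numbers quoted above):

#include <stdio.h>

int
main(void)
{
    double binsearch_ms = 201;       /* measured: binary search, no filter */
    double match_probability = 0.10; /* so ~90% of lookups are misses */

    /*
     * If the bloom filter rejects nearly all of the misses before the
     * expensive text comparison, only the ~10% of lookups that (probably)
     * match fall through to the binary search.
     */
    printf("~%.0f ms of comparisons remain, plus filter build/probe cost\n",
           binsearch_ms * match_probability);    /* ~20 ms */
    return 0;
}

which lines up with the ~65 ms bloom timing once the overhead of
building and probing the filter is added.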

Note: I also don't claim the PoC has the most efficient bloom filter
implementation possible. I'm sure it could be made faster.

Anyway, I'm not sure transforming this to a hash join is worth the
effort - I agree that seems quite complex. But perhaps this suggests we
should not be doing binary search and instead just build a simple hash
table - that seems much simpler, and it'll probably give us about the
same benefits.
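
To make the idea concrete, here is a minimal standalone sketch of the
"build once, probe per row" approach (ordinary C with a toy
open-addressing set, purely for illustration; the real thing would
presumably use one of our existing hash table implementations and the
type's hash function, and error handling / freeing is omitted):

#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* toy open-addressing hash set of int64 keys */
typedef struct IntSet
{
    int64_t    *slots;
    bool       *used;
    size_t      nslots;         /* always a power of two */
} IntSet;

static size_t
intset_pos(const IntSet *set, int64_t key)
{
    /* multiplicative hash; good enough for a sketch */
    uint64_t    h = (uint64_t) key * UINT64_C(0x9E3779B97F4A7C15);

    return (size_t) (h & (set->nslots - 1));
}

/* built once per query from the constant array's elements */
static IntSet *
intset_build(const int64_t *elems, size_t nelems)
{
    IntSet     *set = malloc(sizeof(IntSet));

    set->nslots = 1;
    while (set->nslots < nelems * 2)    /* aim for <= 50% load factor */
        set->nslots <<= 1;
    set->slots = malloc(set->nslots * sizeof(int64_t));
    set->used = calloc(set->nslots, sizeof(bool));

    for (size_t i = 0; i < nelems; i++)
    {
        size_t      pos = intset_pos(set, elems[i]);

        while (set->used[pos] && set->slots[pos] != elems[i])
            pos = (pos + 1) & (set->nslots - 1);    /* linear probing */
        set->used[pos] = true;
        set->slots[pos] = elems[i];
    }
    return set;
}

/* called once per outer row, O(1) expected instead of O(log n) */
static bool
intset_probe(const IntSet *set, int64_t key)
{
    size_t      pos = intset_pos(set, key);

    while (set->used[pos])
    {
        if (set->slots[pos] == key)
            return true;
        pos = (pos + 1) & (set->nslots - 1);
    }
    return false;
}

int
main(void)
{
    int64_t     elems[] = {2, 3, 5, 7, 11};
    IntSet     *set = intset_build(elems, 5);

    return (intset_probe(set, 7) && !intset_probe(set, 8)) ? 0 : 1;
}

The per-row cost then stops depending on the array size entirely, which
is where the hashjoin-like numbers above come from.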


regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

test-saop-hashjoin.sh (18K) Download Attachment
saop-hashjoin.ods (50K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Binary search in ScalarArrayOpExpr for OR'd constant arrays

James Coleman
On Sat, Apr 25, 2020 at 8:31 PM Tomas Vondra
<[hidden email]> wrote:

>
> On Sat, Apr 25, 2020 at 06:47:41PM -0400, James Coleman wrote:
> >On Sat, Apr 25, 2020 at 5:41 PM David Rowley <[hidden email]> wrote:
> >>
> >> On Sun, 26 Apr 2020 at 00:40, Tomas Vondra <[hidden email]> wrote:
> >> > This reminds me our attempts to add bloom filters to hash joins, which I
> >> > think ran into mostly the same challenge of deciding when the bloom
> >> > filter can be useful and is worth the extra work.
> >>
> >> Speaking of that, it would be interesting to see how a test where you
> >> write the query as IN(VALUES(...)) instead of IN() compares. It would
> >> be interesting to know if the planner is able to make a more suitable
> >> choice and also to see how all the work over the years to improve Hash
> >> Joins compares to the bsearch with and without the bloom filter.
> >
> >It would be interesting.
> >
> >It also makes one wonder about optimizing these into to hash
> >joins...which I'd thought about over at [1]. I think it'd be a very
> >significant effort though.
> >
>
> I modified the script to also do the join version of the query. I can
> only run it on my laptop at the moment, so the results may be a bit
> different from those I shared before, but it's interesting I think.
>
> In most cases it's comparable to the binsearch/bloom approach, and in
> some cases it actually beats them quite significantly. It seems to
> depend on how expensive the comparison is - for "int" the comparison is
> very cheap and there's almost no difference. For "text" the comparison
> is much more expensive, and there are significant speedups.
>
> For example the test with 100k lookups in array of 10k elements and 10%
> match probability, the timings are these
>
>    master:  62362 ms
>    binsearch: 201 ms
>    bloom:      65 ms
>    hashjoin:   36 ms
>
> I do think the explanation is fairly simple - the bloom filter
> eliminates about 90% of the expensive comparisons, so it's 20ms plus
> some overhead to build and check the bits. The hash join probably
> eliminates a lot of the remaining comparisons, because the hash table
> is sized to have one tuple per bucket.
>
> Note: I also don't claim the PoC has the most efficient bloom filter
> implementation possible. I'm sure it could be made faster.
>
> Anyway, I'm not sure transforming this to a hash join is worth the
> effort - I agree that seems quite complex. But perhaps this suggest we
> should not be doing binary search and instead just build a simple hash
> table - that seems much simpler, and it'll probably give us about the
> same benefits.

That's actually what I originally thought about doing, but I chose
binary search since it seemed a lot easier to get off the ground.

If we instead build a hash, is there anything else we need to be
concerned about? For example, work_mem? I suppose for the binary
search we already have to expand the array, so perhaps it's not all
that meaningful relative to that...

I was looking earlier at what our standard hash implementation was,
and it seemed less obvious what was needed to set that up (so binary
search seemed a faster proof of concept). If you happen to have any
pointers to similar usages I should look at, please let me know.

James


Reply | Threaded
Open this post in threaded view
|

Re: Binary search in ScalarArrayOpExpr for OR'd constant arrays

Tomas Vondra-4
On Sun, Apr 26, 2020 at 02:46:19PM -0400, James Coleman wrote:

>On Sat, Apr 25, 2020 at 8:31 PM Tomas Vondra
><[hidden email]> wrote:
>>
>> On Sat, Apr 25, 2020 at 06:47:41PM -0400, James Coleman wrote:
>> >On Sat, Apr 25, 2020 at 5:41 PM David Rowley <[hidden email]> wrote:
>> >>
>> >> On Sun, 26 Apr 2020 at 00:40, Tomas Vondra <[hidden email]> wrote:
>> >> > This reminds me our attempts to add bloom filters to hash joins, which I
>> >> > think ran into mostly the same challenge of deciding when the bloom
>> >> > filter can be useful and is worth the extra work.
>> >>
>> >> Speaking of that, it would be interesting to see how a test where you
>> >> write the query as IN(VALUES(...)) instead of IN() compares. It would
>> >> be interesting to know if the planner is able to make a more suitable
>> >> choice and also to see how all the work over the years to improve Hash
>> >> Joins compares to the bsearch with and without the bloom filter.
>> >
>> >It would be interesting.
>> >
>> >It also makes one wonder about optimizing these into to hash
>> >joins...which I'd thought about over at [1]. I think it'd be a very
>> >significant effort though.
>> >
>>
>> I modified the script to also do the join version of the query. I can
>> only run it on my laptop at the moment, so the results may be a bit
>> different from those I shared before, but it's interesting I think.
>>
>> In most cases it's comparable to the binsearch/bloom approach, and in
>> some cases it actually beats them quite significantly. It seems to
>> depend on how expensive the comparison is - for "int" the comparison is
>> very cheap and there's almost no difference. For "text" the comparison
>> is much more expensive, and there are significant speedups.
>>
>> For example the test with 100k lookups in array of 10k elements and 10%
>> match probability, the timings are these
>>
>>    master:  62362 ms
>>    binsearch: 201 ms
>>    bloom:      65 ms
>>    hashjoin:   36 ms
>>
>> I do think the explanation is fairly simple - the bloom filter
>> eliminates about 90% of the expensive comparisons, so it's 20ms plus
>> some overhead to build and check the bits. The hash join probably
>> eliminates a lot of the remaining comparisons, because the hash table
>> is sized to have one tuple per bucket.
>>
>> Note: I also don't claim the PoC has the most efficient bloom filter
>> implementation possible. I'm sure it could be made faster.
>>
>> Anyway, I'm not sure transforming this to a hash join is worth the
>> effort - I agree that seems quite complex. But perhaps this suggest we
>> should not be doing binary search and instead just build a simple hash
>> table - that seems much simpler, and it'll probably give us about the
>> same benefits.
>
>That's actually what I originally thought about doing, but I chose
>binary search since it seemed a lot easier to get off the ground.
>

OK, that makes perfect sense.

>If we instead build a hash is there anything else we need to be
>concerned about? For example, work mem? I suppose for the binary
>search we already have to expand the array, so perhaps it's not all
>that meaningful relative to that...
>

I don't think we need to be particularly concerned about work_mem. We
don't care about it now, and it's not clear to me what we could do about
it - we already have the array in memory anyway, so it's a bit futile.
Furthermore, if we need to care about it, it probably applies to the
binary search too.

>I was looking earlier at what our standard hash implementation was,
>and it seemed less obvious what was needed to set that up (so binary
>search seemed a faster proof of concept). If you happen to have any
>pointers to similar usages I should look at, please let me know.
>

I think the hash join implementation is far too complicated. It has to
care about work_mem, so it implements batching, etc. That's a lot of
complexity we don't need here. IMO we could use either the usual
dynahash, or maybe even the simpler simplehash.

FWIW it'd be good to verify the numbers I shared, i.e. checking that the
benchmark makes sense and running it independently. I'm not aware of
any issues, but it was done late at night and only ran on my laptop.


regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Reply | Threaded
Open this post in threaded view
|

Re: Binary search in ScalarArrayOpExpr for OR'd constant arrays

James Coleman
On Sun, Apr 26, 2020 at 4:49 PM Tomas Vondra
<[hidden email]> wrote:

>
> On Sun, Apr 26, 2020 at 02:46:19PM -0400, James Coleman wrote:
> >On Sat, Apr 25, 2020 at 8:31 PM Tomas Vondra
> ><[hidden email]> wrote:
> >>
> >> On Sat, Apr 25, 2020 at 06:47:41PM -0400, James Coleman wrote:
> >> >On Sat, Apr 25, 2020 at 5:41 PM David Rowley <[hidden email]> wrote:
> >> >>
> >> >> On Sun, 26 Apr 2020 at 00:40, Tomas Vondra <[hidden email]> wrote:
> >> >> > This reminds me our attempts to add bloom filters to hash joins, which I
> >> >> > think ran into mostly the same challenge of deciding when the bloom
> >> >> > filter can be useful and is worth the extra work.
> >> >>
> >> >> Speaking of that, it would be interesting to see how a test where you
> >> >> write the query as IN(VALUES(...)) instead of IN() compares. It would
> >> >> be interesting to know if the planner is able to make a more suitable
> >> >> choice and also to see how all the work over the years to improve Hash
> >> >> Joins compares to the bsearch with and without the bloom filter.
> >> >
> >> >It would be interesting.
> >> >
> >> >It also makes one wonder about optimizing these into to hash
> >> >joins...which I'd thought about over at [1]. I think it'd be a very
> >> >significant effort though.
> >> >
> >>
> >> I modified the script to also do the join version of the query. I can
> >> only run it on my laptop at the moment, so the results may be a bit
> >> different from those I shared before, but it's interesting I think.
> >>
> >> In most cases it's comparable to the binsearch/bloom approach, and in
> >> some cases it actually beats them quite significantly. It seems to
> >> depend on how expensive the comparison is - for "int" the comparison is
> >> very cheap and there's almost no difference. For "text" the comparison
> >> is much more expensive, and there are significant speedups.
> >>
> >> For example the test with 100k lookups in array of 10k elements and 10%
> >> match probability, the timings are these
> >>
> >>    master:  62362 ms
> >>    binsearch: 201 ms
> >>    bloom:      65 ms
> >>    hashjoin:   36 ms
> >>
> >> I do think the explanation is fairly simple - the bloom filter
> >> eliminates about 90% of the expensive comparisons, so it's 20ms plus
> >> some overhead to build and check the bits. The hash join probably
> >> eliminates a lot of the remaining comparisons, because the hash table
> >> is sized to have one tuple per bucket.
> >>
> >> Note: I also don't claim the PoC has the most efficient bloom filter
> >> implementation possible. I'm sure it could be made faster.
> >>
> >> Anyway, I'm not sure transforming this to a hash join is worth the
> >> effort - I agree that seems quite complex. But perhaps this suggest we
> >> should not be doing binary search and instead just build a simple hash
> >> table - that seems much simpler, and it'll probably give us about the
> >> same benefits.
> >
> >That's actually what I originally thought about doing, but I chose
> >binary search since it seemed a lot easier to get off the ground.
> >
>
> OK, that makes perfect sense.
>
> >If we instead build a hash is there anything else we need to be
> >concerned about? For example, work mem? I suppose for the binary
> >search we already have to expand the array, so perhaps it's not all
> >that meaningful relative to that...
> >
>
> I don't think we need to be particularly concerned about work_mem. We
> don't care about it now, and it's not clear to me what we could do about
> it - we already have the array in memory anyway, so it's a bit futile.
> Furthermore, if we need to care about it, it probably applies to the
> binary search too.
>
> >I was looking earlier at what our standard hash implementation was,
> >and it seemed less obvious what was needed to set that up (so binary
> >search seemed a faster proof of concept). If you happen to have any
> >pointers to similar usages I should look at, please let me know.
> >
>
> I think the hash join implementation is far too complicated. It has to
> care about work_mem, so it implements batching, etc. That's a lot of
> complexity we don't need here. IMO we could use either the usual
> dynahash, or maybe even the simpler simplehash.
>
> FWIW it'd be good to verify the numbers I shared, i.e. checking that the
> benchmarks makes sense and running it independently. I'm not aware of
> any issues but it was done late at night and only ran on my laptop.

Some quick calculations (I don't have the scripting in a form I can
attach yet; I'm using this as an opportunity to hack on a genericized
performance testing framework of sorts) suggest your results are
correct. I was also testing on my laptop, and I saw roughly equivalent
results for IN (VALUES ...) and IN (<list>) for integers, but when I
switched to (short; average 3 characters long) text values the hash
join on VALUES was twice as fast as the binary search.

Given that, I'm planning to implement this as a hash lookup and share
a revised patch.

James


Reply | Threaded
Open this post in threaded view
|

Binary search in ScalarArrayOpExpr for OR'd constant arrays

James Coleman
On Sun, Apr 26, 2020 at 7:41 PM James Coleman <[hidden email]> wrote:
>
> On Sun, Apr 26, 2020 at 4:49 PM Tomas Vondra
> <[hidden email]> wrote:
> >
> > On Sun, Apr 26, 2020 at 02:46:19PM -0400, James Coleman wrote:
> > >On Sat, Apr 25, 2020 at 8:31 PM Tomas Vondra
> > ><[hidden email]> wrote:
> > >>
> > >> On Sat, Apr 25, 2020 at 06:47:41PM -0400, James Coleman wrote:
> > >> >On Sat, Apr 25, 2020 at 5:41 PM David Rowley <[hidden email]> wrote:
> > >> >>
> > >> >> On Sun, 26 Apr 2020 at 00:40, Tomas Vondra <[hidden email]> wrote:
> > >> >> > This reminds me our attempts to add bloom filters to hash joins, which I
> > >> >> > think ran into mostly the same challenge of deciding when the bloom
> > >> >> > filter can be useful and is worth the extra work.
> > >> >>
> > >> >> Speaking of that, it would be interesting to see how a test where you
> > >> >> write the query as IN(VALUES(...)) instead of IN() compares. It would
> > >> >> be interesting to know if the planner is able to make a more suitable
> > >> >> choice and also to see how all the work over the years to improve Hash
> > >> >> Joins compares to the bsearch with and without the bloom filter.
> > >> >
> > >> >It would be interesting.
> > >> >
> > >> >It also makes one wonder about optimizing these into to hash
> > >> >joins...which I'd thought about over at [1]. I think it'd be a very
> > >> >significant effort though.
> > >> >
> > >>
> > >> I modified the script to also do the join version of the query. I can
> > >> only run it on my laptop at the moment, so the results may be a bit
> > >> different from those I shared before, but it's interesting I think.
> > >>
> > >> In most cases it's comparable to the binsearch/bloom approach, and in
> > >> some cases it actually beats them quite significantly. It seems to
> > >> depend on how expensive the comparison is - for "int" the comparison is
> > >> very cheap and there's almost no difference. For "text" the comparison
> > >> is much more expensive, and there are significant speedups.
> > >>
> > >> For example the test with 100k lookups in array of 10k elements and 10%
> > >> match probability, the timings are these
> > >>
> > >>    master:  62362 ms
> > >>    binsearch: 201 ms
> > >>    bloom:      65 ms
> > >>    hashjoin:   36 ms
> > >>
> > >> I do think the explanation is fairly simple - the bloom filter
> > >> eliminates about 90% of the expensive comparisons, so it's 20ms plus
> > >> some overhead to build and check the bits. The hash join probably
> > >> eliminates a lot of the remaining comparisons, because the hash table
> > >> is sized to have one tuple per bucket.
> > >>
> > >> Note: I also don't claim the PoC has the most efficient bloom filter
> > >> implementation possible. I'm sure it could be made faster.
> > >>
> > >> Anyway, I'm not sure transforming this to a hash join is worth the
> > >> effort - I agree that seems quite complex. But perhaps this suggest we
> > >> should not be doing binary search and instead just build a simple hash
> > >> table - that seems much simpler, and it'll probably give us about the
> > >> same benefits.
> > >
> > >That's actually what I originally thought about doing, but I chose
> > >binary search since it seemed a lot easier to get off the ground.
> > >
> >
> > OK, that makes perfect sense.
> >
> > >If we instead build a hash is there anything else we need to be
> > >concerned about? For example, work mem? I suppose for the binary
> > >search we already have to expand the array, so perhaps it's not all
> > >that meaningful relative to that...
> > >
> >
> > I don't think we need to be particularly concerned about work_mem. We
> > don't care about it now, and it's not clear to me what we could do about
> > it - we already have the array in memory anyway, so it's a bit futile.
> > Furthermore, if we need to care about it, it probably applies to the
> > binary search too.
> >
> > >I was looking earlier at what our standard hash implementation was,
> > >and it seemed less obvious what was needed to set that up (so binary
> > >search seemed a faster proof of concept). If you happen to have any
> > >pointers to similar usages I should look at, please let me know.
> > >
> >
> > I think the hash join implementation is far too complicated. It has to
> > care about work_mem, so it implements batching, etc. That's a lot of
> > complexity we don't need here. IMO we could use either the usual
> > dynahash, or maybe even the simpler simplehash.
> >
> > FWIW it'd be good to verify the numbers I shared, i.e. checking that the
> > benchmarks makes sense and running it independently. I'm not aware of
> > any issues but it was done late at night and only ran on my laptop.
>
> Some quick calculations (don't have the scripting in a form I can
> attach yet; using this as an opportunity to hack on a genericized
> performance testing framework of sorts) suggest your results are
> correct. I was also testing on my laptop, but I showed 1.) roughly
> equivalent results for IN (VALUES ...) and IN (<list>) for integers,
> but when I switch to (short; average 3 characters long) text values I
> show the hash join on VALUES is twice as fast as the binary search.
>
> Given that, I'm planning to implement this as a hash lookup and share
> a revised patch.

While working on this I noticed that dynahash.c line 499 has this assertion:

Assert(info->entrysize >= info->keysize);

Do you by any chance know why the entry would need to be larger than
the key? In this case I'm really treating the hash like a set (if
there's a hash set implementation that doesn't store a value, then I'd
be happy to use that instead), so I've configured the entry as
sizeof(bool), which is obviously smaller than the key.

If it helps, that line was added by Tom in fba2a104c6d.

Thanks,
James
Reply | Threaded
Open this post in threaded view
|

Re: Binary search in ScalarArrayOpExpr for OR'd constant arrays

David Rowley
On Mon, 27 Apr 2020 at 15:12, James Coleman <[hidden email]> wrote:
> While working on this I noticed that dynahash.c line 499 has this assertion:
>
> Assert(info->entrysize >= info->keysize);
>
> Do you by any chance know why the entry would need to be larger than the key?

Larger or equal. They'd be equal if the key was the data, since
you do need to store at least the key.  Looking at the code for
examples where dynahash is used in that situation, I see
_hash_finish_split().
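
In case a concrete example helps, here's a minimal sketch of that
set-style usage (my own illustration, not code from the patch; it
assumes a by-value key, since HASH_BLOBS just hashes the raw key bytes.
A by-reference type such as text would need its own hash and match
functions via HASH_FUNCTION / HASH_COMPARE instead):

#include "postgres.h"

#include "utils/hsearch.h"

/*
 * Hypothetical helper: dynahash used as a set of by-value Datum keys.
 * The entry stores nothing beyond the key itself, so entrysize ==
 * keysize, which satisfies the Assert in hash_create().
 */
static HTAB *
build_datum_set(long nelems)
{
    HASHCTL     ctl;

    MemSet(&ctl, 0, sizeof(ctl));
    ctl.keysize = sizeof(Datum);
    ctl.entrysize = sizeof(Datum);      /* no payload: the key is the entry */
    ctl.hcxt = CurrentMemoryContext;

    return hash_create("ScalarArrayOpExpr elements", nelems, &ctl,
                       HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
}

/*
 * Build phase, once per constant array:
 *     hash_search(set, &datum, HASH_ENTER, NULL);
 * Probe phase, once per outer row:
 *     found = hash_search(set, &datum, HASH_FIND, NULL) != NULL;
 */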

David

