Using Btree to Provide Sorting on Suffix Keys with LIMIT

classic Classic list List threaded Threaded
5 messages Options
Reply | Threaded
Open this post in threaded view
|

Using Btree to Provide Sorting on Suffix Keys with LIMIT

James Coleman
tl;dr: I'd like to teach btrees to support returning ordered results
where the ordering is only a suffix of the index keys and the query
includes a scalar array op qual on the prefix key of the index.

Suppose we have the following schema:

CREATE TABLE foos(bar_fk integer, created_at timestamp);
CREATE INDEX index_foos_on_bar_fk_and_created_at ON foos(bar_fk, created_at);

and seed it with the following:

INSERT INTO foos (bar_fk, created_at)
SELECT i % 1000, now() - (random() * '5 years'::interval)
FROM generate_series(1, 500000) t(i);

then execute the following query:

SELECT *
FROM foos
WHERE bar_fk IN (1, 2, 3)
ORDER BY created_at
LIMIT 50;

currently we get a query plan (I've disabled bitmap scans for this test) like:

 Limit  (cost=5197.26..5197.38 rows=50 width=12) (actual
time=2.212..2.222 rows=50 loops=1)
   ->  Sort  (cost=5197.26..5201.40 rows=1657 width=12) (actual
time=2.211..2.217 rows=50 loops=1)
         Sort Key: created_at
         Sort Method: top-N heapsort  Memory: 27kB
         ->  Index Only Scan using index_foos_on_bar_fk_and_created_at
on foos  (cost=0.42..5142.21 rows=1657 width=12) (actual
time=0.025..1.736 rows=1500 loops=1)
               Index Cond: (bar_fk = ANY ('{1,2,3}'::integer[]))
               Heap Fetches: 1500
 Planning time: 0.137 ms
 Execution time: 2.255 ms

Note that the index scan (or bitmap scan) nodes return all 1500 rows
matching `bar_fk IN (1,2,3)`. After all rows are returned, that total
set is ordered, and finally the LIMIT is applied. While only 50 rows
were requested, 30x that were fetched from the heap.

I believe it is possible to use the index
`index_foos_on_bar_fk_and_created_at` to fulfill both the `bar_fk IN
(1,2,3)` qualifier and (at least partially; more on that later) the
ordering `ORDER BY created_at` while fetching fewer than all rows
matching the qualifier.

Areas of extension: (given index `(a, b, c)`) include `a = 1 and b in
(...) order by c` and `a in (...) and b = 1 order by c` (and further
similar derivations with increasing numbers of equality quals).

Note: Another (loosely) related problem is that sorting can't
currently take advantage of cases where an index provides a partial
(prefix of requested pathkeys) ordering.

Proposal 1:

General idea: teach btrees to apply LIMIT internally so that we bound
the number of rows fetched and sorted to the <number of array op
elements> * <limit>.

Pros:

- Presumably simpler to implement.
- Ought to be faster (then master) in virtually all cases (or at least
penalty in worst case is extremely small).

Cons:

- Doesn't capture all of the theoretical value. For example, for array
of size 100, limit 25, and 100 values per array key, the current code
fetches 10,000 values, this proposal would fetch 2,500, and the best
case is 25. In this sample scenario we fetch 25% of the original
tuples, but we also fetch 100x the theoretical minimum required.
- Index scan node has to learn about limit (this feels pretty dirty).
- Still needs a sort node.

Proposal 2:

General idea: find the "first" (or last depending on sort order) index
tuple for each array op element. Sort those index tuples by the suffix
keys. Return tuples from the first of these sorted index tuple
locations until we find one that's past the next ordered index tuple.
Continue this round-robin approach until we exhaust all tuples.

Pros:

- Best case is significantly improved over proposal 1.
- Always fetches the minimal number of tuples required by the limit.
- When ordered values are not evenly distributed should be even faster
(just continue to pull tuples for array value with lowest order
value).
- Index scan node remains unaware of limit.
- Doesn't need a sort node.

Cons:

- Presumably more difficult to implement.
- Degenerate cases: when ordered values are evenly distributed we may
have to re-search from the top of the tree for each tuple (so worst
case is when few tuples match within each prefix).

---

Questions:

- Do we need a new index access method for this? Or do we shoehorn it
into the existing index scan. (Perhaps answer depends on which
strategy chosen?)
- Similarly do we want a new scan node type? (This brings up the same
kinds of questions in the skip scan discussion about multiplying node
types.)
- Is holding a pin on multiple index pages acceptable?
- Or do we avoid that at the cost of most frequent searches from the
top of the tree?
- If we have to search from the top of the tree, then does the "many
duplicates" case become degenerate (to put it another way, how do we
search to proper next tuple in a non-unique index)?

Status:
I've begun work on proposal 2 as I believe it shows the most potential
gain (though it also has higher complexity). I don't have a patch in a
state worth showing yet, but I wanted to start the discussion on the
design considerations while I continue work on the patch.

- James Coleman

Reply | Threaded
Open this post in threaded view
|

Re: Using Btree to Provide Sorting on Suffix Keys with LIMIT

David Rowley-3
On Sun, 30 Dec 2018 at 13:00, James Coleman <[hidden email]> wrote:

> Note that the index scan (or bitmap scan) nodes return all 1500 rows
> matching `bar_fk IN (1,2,3)`. After all rows are returned, that total
> set is ordered, and finally the LIMIT is applied. While only 50 rows
> were requested, 30x that were fetched from the heap.
>
> I believe it is possible to use the index
> `index_foos_on_bar_fk_and_created_at` to fulfill both the `bar_fk IN
> (1,2,3)` qualifier and (at least partially; more on that later) the
> ordering `ORDER BY created_at` while fetching fewer than all rows
> matching the qualifier.
>
> Areas of extension: (given index `(a, b, c)`) include `a = 1 and b in
> (...) order by c` and `a in (...) and b = 1 order by c` (and further
> similar derivations with increasing numbers of equality quals).

I don't quite understand this the above paragraph, but I assume this
would be possible to do with some new index am routine which allowed
multiple different values for the initial key.  Probably execution for
something like this could be handled by having 1 IndexScanDesc per
initial key. A scan would have to scan all of those for the first
tuple and return the lowest order key. Subsequent scans would fetch
the next tuple for the IndexScanDesc used previously then again,
return the lowest order tuple.  There's some binary heap code and
examples in nodeMergeAppend.c about how that could be done fairly
efficiently.

The hard part about that would be knowing when to draw the line. If
there was 1000's of initial keys then some other method might be
better. Probably the planner would have to estimate which method was
best. There are also issues if there are multiple prefix keys as you'd
need to scan a cartesian product of the keys, which would likely get
out of hand quickly with 2 or more initial keys.

There were discussions and a patch for a planner-level implementation
of which could likely assist with this in [1]. I'm not sure if this
particular case was handled in the patch. I believe it was more
intended for queries such as: SELECT ... FROM t WHERE a = 1 OR b = 2
and could transform this into something more along the lines of:
SELECT .. FROM t WHERE a = 1 UNION SELECT ... FROM t WHERE b = 1, and
using the table's ctid to uniquify the rows. You could get away with
something similar but use UNION ALL instead. You don't need UNION
since your "OR" is on the same column, meaning there can be no
overlapping rows.

Something like:

(SELECT * FROM foos WHERE bar_fk = 1 LIMIT 50)
UNION ALL
(SELECT * FROM foos WHERE bar_fk = 2 LIMIT 50)
UNION ALL
(SELECT * FROM foos WHERE bar_fk = 3 LIMIT 50)
ORDER BY created_at LIMIT 50;

> Note: Another (loosely) related problem is that sorting can't
> currently take advantage of cases where an index provides a partial
> (prefix of requested pathkeys) ordering.

There has been a patch [2] around for about 4 years now that does
this.  I'm unsure of the current status, other than not yet committed.

[1] https://www.postgresql.org/message-id/flat/7f70bd5a-5d16-e05c-f0b4-2fdfc8873489%40BlueTreble.com
[2] https://www.postgresql.org/message-id/flat/CAPpHfds1waRZ=NOmueYq0sx1ZSCnt+5QJvizT8ndT2=etZEeAQ@...

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Reply | Threaded
Open this post in threaded view
|

Re: Using Btree to Provide Sorting on Suffix Keys with LIMIT

James Coleman
On Sat, Dec 29, 2018 at 9:50 PM David Rowley
<[hidden email]> wrote:

>
> On Sun, 30 Dec 2018 at 13:00, James Coleman <[hidden email]> wrote:
> > Note that the index scan (or bitmap scan) nodes return all 1500 rows
> > matching `bar_fk IN (1,2,3)`. After all rows are returned, that total
> > set is ordered, and finally the LIMIT is applied. While only 50 rows
> > were requested, 30x that were fetched from the heap.
> >
> > I believe it is possible to use the index
> > `index_foos_on_bar_fk_and_created_at` to fulfill both the `bar_fk IN
> > (1,2,3)` qualifier and (at least partially; more on that later) the
> > ordering `ORDER BY created_at` while fetching fewer than all rows
> > matching the qualifier.
> >
> > Areas of extension: (given index `(a, b, c)`) include `a = 1 and b in
> > (...) order by c` and `a in (...) and b = 1 order by c` (and further
> > similar derivations with increasing numbers of equality quals).
>
> I don't quite understand this the above paragraph, but I assume this
> would be possible to do with some new index am routine which allowed
> multiple different values for the initial key.  Probably execution for
> something like this could be handled by having 1 IndexScanDesc per
> initial key. A scan would have to scan all of those for the first
> tuple and return the lowest order key. Subsequent scans would fetch
> the next tuple for the IndexScanDesc used previously then again,
> return the lowest order tuple.  There's some binary heap code and
> examples in nodeMergeAppend.c about how that could be done fairly
> efficiently.

Mostly I was pointing out that the simple case (scalar array op qual
on first index column and order by second index column) isn't the only
potentially interesting one; we'd also want to handle, for example, a
3 column index with a equality qual on the first, an array op on the
second, and order by the third. And then as you note later we could
also theoretically do this for multiple array op quals.

Thanks for the pointer to nodeMergeAppend.c; I'll look at that to see
if it sparks any ideas.

I'm intrigued by the idea of having multiple IndexScanDesc in the
node. My current route had been to include an array of BTScanPos in
BTScanOpaqueData and work within the same scan. Do you think that a
new index am targeting multiple initial values would be a better route
than improving the existing native array handling in
nbtree.c/nbtutil.c? It seems to me that fitting it into the existing
code gives us the greater potential usefulness but with more effort to
maintain the existing efficiencies there.

It sounds like multiple pinned pages (if you suggest multiple
IndexScanDesc) for the same index in the same node is acceptable? We
should still not be locking multiple pages at once, so I don't think
there's risk of deadlock, but I wasn't sure if there were specific
expectations about the number of pinned pages in a single relation at
a given time.

> The hard part about that would be knowing when to draw the line. If
> there was 1000's of initial keys then some other method might be
> better. Probably the planner would have to estimate which method was
> best. There are also issues if there are multiple prefix keys as you'd
> need to scan a cartesian product of the keys, which would likely get
> out of hand quickly with 2 or more initial keys.

Agreed. I expect the costing here to both report a higher startup cost
(since it has to look at one index tuple per array element up front)
and higher per-tuple cost (since we might have to re-search), but if
there are a very large (e.g., millions) number of rows and a small
LIMIT then it's hard to imagine this not being the better option even
up to a large number of keys (though at some point memory becomes a
concern also).

> There were discussions and a patch for a planner-level implementation
> of which could likely assist with this in [1]. I'm not sure if this
> particular case was handled in the patch. I believe it was more
> intended for queries such as: SELECT ... FROM t WHERE a = 1 OR b = 2
> and could transform this into something more along the lines of:
> SELECT .. FROM t WHERE a = 1 UNION SELECT ... FROM t WHERE b = 1, and
> using the table's ctid to uniquify the rows. You could get away with
> something similar but use UNION ALL instead. You don't need UNION
> since your "OR" is on the same column, meaning there can be no
> overlapping rows.
>
> Something like:
>
> (SELECT * FROM foos WHERE bar_fk = 1 LIMIT 50)
> UNION ALL
> (SELECT * FROM foos WHERE bar_fk = 2 LIMIT 50)
> UNION ALL
> (SELECT * FROM foos WHERE bar_fk = 3 LIMIT 50)
> ORDER BY created_at LIMIT 50;

This sounds effectively like a way to do my first proposal. In theory
I think both are valuable and potentially complementary, so I'll read
up on that one also.

> > Note: Another (loosely) related problem is that sorting can't
> > currently take advantage of cases where an index provides a partial
> > (prefix of requested pathkeys) ordering.
>
> There has been a patch [2] around for about 4 years now that does
> this.  I'm unsure of the current status, other than not yet committed.

Doh, I should have linked to that; I've been following the incremental
sort patch for a while (and submitted a test-case review) since it
solves some significant problems for us.

- James Coleman

Reply | Threaded
Open this post in threaded view
|

Re: Using Btree to Provide Sorting on Suffix Keys with LIMIT

Peter Geoghegan-4
In reply to this post by David Rowley-3
On Sat, Dec 29, 2018 at 6:50 PM David Rowley
<[hidden email]> wrote:
> > Areas of extension: (given index `(a, b, c)`) include `a = 1 and b in
> > (...) order by c` and `a in (...) and b = 1 order by c` (and further
> > similar derivations with increasing numbers of equality quals).
>
> I don't quite understand this the above paragraph, but I assume this
> would be possible to do with some new index am routine which allowed
> multiple different values for the initial key.

I'm confused about James' use of the term "new index am". I guess he
just meant new support function or something?

> > Note: Another (loosely) related problem is that sorting can't
> > currently take advantage of cases where an index provides a partial
> > (prefix of requested pathkeys) ordering.
>
> There has been a patch [2] around for about 4 years now that does
> this.  I'm unsure of the current status, other than not yet committed.
>
> [1] https://www.postgresql.org/message-id/flat/7f70bd5a-5d16-e05c-f0b4-2fdfc8873489%40BlueTreble.com
> [2] https://www.postgresql.org/message-id/flat/CAPpHfds1waRZ=NOmueYq0sx1ZSCnt+5QJvizT8ndT2=etZEeAQ@...

I can see why you'd mention these two, but I also expected you to
mention the skip scan project, since that involves pushing down
knowledge about how the index is to be accessed to the index am (at
least, I assume that it does), and skipping leading attributes to use
the sort order from a suffix attribute. Actually, the partial sort
idea that you linked to is more or less a dual of skip scan, at least
to my mind (the former *extends* the sort order by adding a suffix
tie-breaker, while the latter *skips* a leading attribute to get to an
interesting suffix attribute).

The way James constructed his example suggested that there'd be some
kind of natural locality, that we'd expect to be able to take
advantage of at execution time when the new strategy is favorable. I'm
not sure if that was intended -- James? I think it might help James to
construct a more obviously realistic/practical motivating example. I'm
perfectly willing to believe that this idea would help his real world
queries, and having an example that can easily be played with is
helpful in other ways. But I'd like to know why this idea is important
is in detail, since I think that it would help me to place it in the
wider landscape of ideas that are like this. Placing it in that wider
landscape, and figuring out next steps at a high level seem to be the
problem right now.


--
Peter Geoghegan

Reply | Threaded
Open this post in threaded view
|

Re: Using Btree to Provide Sorting on Suffix Keys with LIMIT

James Coleman
On Thu, Jan 10, 2019 at 6:52 PM Peter Geoghegan <[hidden email]> wrote:

>
> On Sat, Dec 29, 2018 at 6:50 PM David Rowley
> <[hidden email]> wrote:
> > > Areas of extension: (given index `(a, b, c)`) include `a = 1 and b in
> > > (...) order by c` and `a in (...) and b = 1 order by c` (and further
> > > similar derivations with increasing numbers of equality quals).
> >
> > I don't quite understand this the above paragraph, but I assume this
> > would be possible to do with some new index am routine which allowed
> > multiple different values for the initial key.
>
> I'm confused about James' use of the term "new index am". I guess he
> just meant new support function or something?

Thanks for responding.

I was wondering if a new index access method would be the preferable
way to do this such as how the skip scan patch does (I was taking some
ideas from that one).

> > > Note: Another (loosely) related problem is that sorting can't
> > > currently take advantage of cases where an index provides a partial
> > > (prefix of requested pathkeys) ordering.
> >
> > There has been a patch [2] around for about 4 years now that does
> > this.  I'm unsure of the current status, other than not yet committed.
> >
> > [1] https://www.postgresql.org/message-id/flat/7f70bd5a-5d16-e05c-f0b4-2fdfc8873489%40BlueTreble.com
> > [2] https://www.postgresql.org/message-id/flat/CAPpHfds1waRZ=NOmueYq0sx1ZSCnt+5QJvizT8ndT2=etZEeAQ@...
>
> I can see why you'd mention these two, but I also expected you to
> mention the skip scan project, since that involves pushing down
> knowledge about how the index is to be accessed to the index am (at
> least, I assume that it does), and skipping leading attributes to use
> the sort order from a suffix attribute. Actually, the partial sort
> idea that you linked to is more or less a dual of skip scan, at least
> to my mind (the former *extends* the sort order by adding a suffix
> tie-breaker, while the latter *skips* a leading attribute to get to an
> interesting suffix attribute).

Yes, I'd been looking at the skip scan patch but didn't mention it. A
lot of my initial email was from my initial brainstorming notes on the
topic, and I should have cleaned it up a bit better before sending it.

> The way James constructed his example suggested that there'd be some
> kind of natural locality, that we'd expect to be able to take
> advantage of at execution time when the new strategy is favorable. I'm
> not sure if that was intended -- James? ...

I'm not sure what you mean by "natural locality"; I believe that the
artificial data I've constructed is actually somewhat close to worst
case for what I'm proposing. Evenly distributed (this is random, but
in this case I think that's close enough) data will realize the
smallest possible gains (and be the most likely to represent a
regression with few enough rows in each group) because it is the case
where we'd have have the most overhead of rotating among scan keys.

> ... I think it might help James to
> construct a more obviously realistic/practical motivating example. I'm
> perfectly willing to believe that this idea would help his real world
> queries, and having an example that can easily be played with is
> helpful in other ways. But I'd like to know why this idea is important
> is in detail, since I think that it would help me to place it in the
> wider landscape of ideas that are like this. Placing it in that wider
> landscape, and figuring out next steps at a high level seem to be the
> problem right now.

I'll attempt to describe a more real world scenario: suppose we have a
schema like:

users(id serial primary key)
orders(id serial primary key, user_id integer, created_at timestamp)

And wanted to find the most recent N orders for a specific group of
users (e.g., in a report or search). Your query might look like:

SELECT *
FROM orders
WHERE orders.user_id IN (1, 2, 3)
ORDER BY orders.created_at DESC
LIMIT 25

Currently an index on orders(user_id, created_at) will be used for
this query, but only to satisfy the scalar array op qual. Then all
matching orders (say, years worth) will be fetched, a sort node will
sort all of those results, and then a limit node will take the top N.

Generalized the problem is something like "find the top N rows across
a group of foreign keys" (though saying foreign keys probably is too
specific).

But under the scheme I'm proposing that same index would be able to
provide both the filter and guarantee ordering as well.

Does that more real-world-ish example help place the usefulness of this?

I think this goes beyond increasing the usefulness of indexes by
requiring less specific indexes (incremental sort does this), but
rather allows the index to support a kind of query you can't currently
(as far as I'm aware) can't express in a performant way at call
currently (other than a complex recursive cte or in some subset of
cases a bunch of union statements -- one per array entry).

James Coleman