Yet another fast GiST build

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
133 messages Options
12345 ... 7
Reply | Threaded
Open this post in threaded view
|

Re: Yet another fast GiST build (typo)

Andrey Borodin-2


> On 29 февр. 2020 г., at 17:20, Erik Rijkers <[hidden email]> wrote:
>
> Small typo alert:
> In v5-0002-Implement-GiST-build-using-sort-support.patch there is:
>
> +   method is also optional and is used diring fast GiST build.
>
> 'diring' should be 'during'
>

Thanks!

I've fixed this and added patch with GiST reloption to provide sort support function.

Best regards, Andrey Borodin.


v6-0001-Add-sort-support-for-point-gist_point_sortsupport.patch (3K) Download Attachment
v6-0002-Implement-GiST-build-using-sort-support.patch (31K) Download Attachment
v6-0003-Function-relopt-for-gist-build.patch (7K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Yet another fast GiST build

Darafei "Komяpa" Praliaskouski
In reply to this post by Andrey Borodin-2
Hello,

Thanks for the patch and working on GiST infrastructure, it's really
valuable for PostGIS use cases and I wait to see this improvement in
PG13.

On Sat, Feb 29, 2020 at 3:13 PM Andrey M. Borodin <[hidden email]> wrote:

> Thomas, I've used your wording almost exactly with explanation how
> point_zorder_internal() works. It has more explanation power than my attempts
> to compose good comment.

PostGIS uses this trick to ensure locality. In PostGIS 3 we enhanced
that trick to have the Hilbert curve instead of Z Order curve.

For visual representation have a look at these links:
 - http://blog.cleverelephant.ca/2019/08/postgis-3-sorting.html - as
it's implemented in PostGIS btree sorting opclass
 - https://observablehq.com/@mourner/hilbert-curve-packing - to
explore general approach.

Indeed if it feels insecure to work with bit magic that implementation
can be left out to extensions.

> There is one design decision that worries me most:
> should we use opclass function or index option to provide this sorting information?
> It is needed only during index creation, actually. And having extra i-class only for fast build
> seems excessive.
> I think we can provide both ways and let opclass developers decide?

Reloption variant looks dirty. It won't cover an index on (id uuid,
geom geometry) where id is duplicated (say, tracked car identifier)
but present in every query - no way to pass such thing as reloption.
I'm also concerned about security of passing a sortsupport function
manually during index creation (what if that's not from the same
extension or even (wrong-)user defined something).

We know for sure it's a good idea for all btree_gist types and
geometry and I don't see a case where user would want to disable it.
Just make it part of operator class, and that would also allow fast
creation of multi-column index.

Thanks.

--
Darafei Praliaskouski
Support me: http://patreon.com/komzpa


Reply | Threaded
Open this post in threaded view
|

Re: Yet another fast GiST build

Yuri Astrakhan
Awesome addition!  Would it make sense to use x86's BMI2's PDEP instruction, or is the interleave computation too small of a percentage to introduce not-so-easy-to-port code?  Also, I think it needs a bit more documentation to explain the logic, i.e. a link to https://stackoverflow.com/questions/39490345/interleave-bits-efficiently ?  Thx for making it faster :)
Reply | Threaded
Open this post in threaded view
|

Re: Yet another fast GiST build

Darafei "Komяpa" Praliaskouski
Hello Yuri,

PDEP is indeed first thing that comes up when you start googling
z-curve and bit interleaving :)
We had the code with z-curve generating PDEP instruction in PostGIS,
and dropped it since. In sorting, we now utilize sort support / prefix
search, and key generated as Hilbert curve, with fine tuning it for
different projections' geometric properties.

From this patch the most valuable thing for us is the sorting build
infrastructure itself. Maybe to get it a bit more understandable for
people not deep in geometry it makes sense to first expose the
btree_gist datatypes to this thing? So that btree_gist index on
integer will be built exactly the same way the btree index on integer
is built. This will also get everyone a reference point on the
bottlenecks and optimality of patch.

On Fri, Apr 3, 2020 at 10:56 AM Yuri Astrakhan <[hidden email]> wrote:
>
> Awesome addition!  Would it make sense to use x86's BMI2's PDEP instruction, or is the interleave computation too small of a percentage to introduce not-so-easy-to-port code?  Also, I think it needs a bit more documentation to explain the logic, i.e. a link to https://stackoverflow.com/questions/39490345/interleave-bits-efficiently ?  Thx for making it faster :)



--
Darafei Praliaskouski
Support me: http://patreon.com/komzpa


Reply | Threaded
Open this post in threaded view
|

Re: Yet another fast GiST build (typo)

Daniel Gustafsson
In reply to this post by Andrey Borodin-2
> On 29 Feb 2020, at 18:24, Andrey M. Borodin <[hidden email]> wrote:

> I've fixed this and added patch with GiST reloption to provide sort support function.

0002 in this patchset fails to apply, please submit a rebased version.  I've
marked the entry Waiting for Author in the meantime.

cheers ./daniel

Reply | Threaded
Open this post in threaded view
|

Re: Yet another fast GiST build (typo)

Andrey Borodin-2


> 1 июля 2020 г., в 17:05, Daniel Gustafsson <[hidden email]> написал(а):
>
>> On 29 Feb 2020, at 18:24, Andrey M. Borodin <[hidden email]> wrote:
>
>> I've fixed this and added patch with GiST reloption to provide sort support function.
>
> 0002 in this patchset fails to apply, please submit a rebased version.  I've
> marked the entry Waiting for Author in the meantime.

Thanks, Daniel!

PFA v7.

Best regards, Andrey Borodin.

v7-0001-Add-sort-support-for-point-gist_point_sortsupport.patch (3K) Download Attachment
v7-0002-Implement-GiST-build-using-sort-support.patch (30K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Yet another fast GiST build (typo)

Andrey Borodin-2


> 6 июля 2020 г., в 17:55, Andrey M. Borodin <[hidden email]> написал(а):
>
>
>
>> 1 июля 2020 г., в 17:05, Daniel Gustafsson <[hidden email]> написал(а):
>>
>>> On 29 Feb 2020, at 18:24, Andrey M. Borodin <[hidden email]> wrote:
>>
>>> I've fixed this and added patch with GiST reloption to provide sort support function.
>>
>> 0002 in this patchset fails to apply, please submit a rebased version.  I've
>> marked the entry Waiting for Author in the meantime.
>
> Thanks, Daniel!
>
> PFA v7.
Oops. I've mismerged docs and did not notice this with check world. PFA v8 with fixed docs.

Best regards, Andrey Borodin.

v8-0001-Add-sort-support-for-point-gist_point_sortsupport.patch (3K) Download Attachment
v8-0002-Implement-GiST-build-using-sort-support.patch (30K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Yet another fast GiST build (typo)

Thomas Munro-5
On Tue, Jul 7, 2020 at 7:03 PM Andrey M. Borodin <[hidden email]> wrote:
> Oops. I've mismerged docs and did not notice this with check world. PFA v8 with fixed docs.

It looks like point_zorder_internal() has the check for NaN in the wrong place.


Reply | Threaded
Open this post in threaded view
|

Re: Yet another fast GiST build (typo)

Andrey Borodin-2


> 10 июля 2020 г., в 10:53, Thomas Munro <[hidden email]> написал(а):
>
> On Tue, Jul 7, 2020 at 7:03 PM Andrey M. Borodin <[hidden email]> wrote:
>> Oops. I've mismerged docs and did not notice this with check world. PFA v8 with fixed docs.
>
> It looks like point_zorder_internal() has the check for NaN in the wrong place.
Thanks! Fixed.

Best regards, Andrey Borodin.

v9-0001-Add-sort-support-for-point-gist_point_sortsupport.patch (3K) Download Attachment
v9-0002-Implement-GiST-build-using-sort-support.patch (30K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Yet another fast GiST build (typo)

Thomas Munro-5
On Fri, Jul 10, 2020 at 6:55 PM Andrey M. Borodin <[hidden email]> wrote:
> Thanks! Fixed.

It's not a bug, but I think those 64 bit constants should be wrapped
in UINT64CONST(), following our convention.

I'm confused about these two patches: 0001 introduces
gist_point_fastcmp(), but then 0002 changes it to gist_bbox_fastcmp().
Maybe you intended to keep both of them?  Also 0002 seems to have
fixups for 0001 squashed into it.


Reply | Threaded
Open this post in threaded view
|

Re: Yet another fast GiST build (typo)

Andrey Borodin-2


> 30 июля 2020 г., в 06:26, Thomas Munro <[hidden email]> написал(а):
>
> On Fri, Jul 10, 2020 at 6:55 PM Andrey M. Borodin <[hidden email]> wrote:
>> Thanks! Fixed.
>
> It's not a bug, but I think those 64 bit constants should be wrapped
> in UINT64CONST(), following our convention.
Thanks, fixed!

> I'm confused about these two patches: 0001 introduces
> gist_point_fastcmp(), but then 0002 changes it to gist_bbox_fastcmp().
> Maybe you intended to keep both of them?  Also 0002 seems to have
> fixups for 0001 squashed into it.
Indeed, that were fixups: point converted to GiST representation is a bbox already, and the function expects only bboxes.

Also I've fixed some mismerges in documentation.

Thanks!

Best regards, Andrey Borodin.


v10-0001-Add-sort-support-for-point-gist_point_sortsuppor.patch (5K) Download Attachment
v10-0002-Implement-GiST-build-using-sort-support.patch (26K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Yet another fast GiST build (typo)

Pavel Borisov
I see this feature quite useful in concept and decided to test it.

On a real database of 7 million rows I observed speedup of 4 times in case of single column index on points only and 2.5 times speedup in case of index on points with several included columns. Standard deviation between in series of measurements being of 10%. Index size saving was of 1.7-1.5 times respectively. Points were in all four quadrants.

On random points same as query in the original message it was observer 3 times speedup with the patch. Then I generated same points set but so they will get into one quadrant didn't observe a difference with the previous case. So probably anomaly in Morton curve not so big to induce noticeable slowdown in a whole random set. But as the ordering is done only for index and not used outside index it seems to me possible to introduce shifting floating point coordinates respective to leftmost-bottom corner point and thus make all of them positive to avoid anomaly of Morton curve near quadrants transitions.

Of course speed measurements depend on machine and configuration a lot, but I am sure anyway there is a noticeable difference in index build time and this is quite valuable for end-user who build GiSt index on point type of data. Furthermore same speedup is also for REBUILD INDEX CONCURRENTLY. There short rebuild time also mean fewer user modified table rows during rebuild which should be integrated in a newer index after rebuild.

This patch can be also seen as a step to futher introduce the other ordering algoritms e.g. Gilbert curve and I consider this feature is useful and is worth to be committed.

Both patches 0001 and 0002 when applied on version 14dev compile and work cleanly. Regression tests are passed.
Code seems clean and legible for me.

In declaration I see little bit different style in similar argument pointers/arrays:
extern IndexTuple gistFormTuple(GISTSTATE *giststate, Relation r, Datum *attdata, bool *isnull, bool isleaf);
extern IndexTuple gistCompressValusAndFormTuple(GISTSTATE *giststate, Relation r, Datum attdata[], bool isnull[], bool isleaf, Datum compatt[]);
I suppose this is because gistFormTuple previously had different style in declaration and definition. Maybe it would be nice to change them to one style in all code, I propose pointers instead of [].

In a big comment 
/*
+        * In this function we need to compute Morton codes for non-integral
+        * components p->x and p->y. But Morton codes are defined only for
+        * integral values.
i don't quite caught meaning of "non-integral" and "integral" and propose to replace it to "float" and "integers".

Also there are some extra spaces before line
"prev_level_start = level_start;"
and after
"The argument is a pointer to a <structname>SortSupport</structname> struct."        

Overall I see the patch useful and almost ready for commit.

вт, 4 авг. 2020 г. в 21:28, Andrey M. Borodin <[hidden email]>:


> 30 июля 2020 г., в 06:26, Thomas Munro <[hidden email]> написал(а):
>
> On Fri, Jul 10, 2020 at 6:55 PM Andrey M. Borodin <[hidden email]> wrote:
>> Thanks! Fixed.
>
> It's not a bug, but I think those 64 bit constants should be wrapped
> in UINT64CONST(), following our convention.
Thanks, fixed!

> I'm confused about these two patches: 0001 introduces
> gist_point_fastcmp(), but then 0002 changes it to gist_bbox_fastcmp().
> Maybe you intended to keep both of them?  Also 0002 seems to have
> fixups for 0001 squashed into it.
Indeed, that were fixups: point converted to GiST representation is a bbox already, and the function expects only bboxes.

Also I've fixed some mismerges in documentation.

Thanks!

Best regards, Andrey Borodin.



--
Best regards,
Pavel Borisov

Postgres Professional: http://postgrespro.com
Reply | Threaded
Open this post in threaded view
|

Re: Yet another fast GiST build

Pavel Borisov
The following review has been posted through the commitfest application:
make installcheck-world:  tested, passed
Implements feature:       tested, passed
Spec compliant:           not tested
Documentation:            not tested

I consider this patch almost ready for commit with minor corrections (see previous message)

The new status of this patch is: Waiting on Author
Reply | Threaded
Open this post in threaded view
|

Re: Yet another fast GiST build (typo)

Andrey Borodin-2
In reply to this post by Pavel Borisov


> 14 авг. 2020 г., в 14:21, Pavel Borisov <[hidden email]> написал(а):
>
> I see this feature quite useful in concept and decided to test it.

Thanks for reviewing and benchmarking, Pavel!

I agree with your suggestions. PFA patch with relevant changes.

Thanks a lot.

Best regards, Andrey Borodin.


v11-0001-Add-sort-support-for-point-gist_point_sortsuppor.patch (5K) Download Attachment
v11-0002-Implement-GiST-build-using-sort-support.patch (26K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Yet another fast GiST build (typo)

Andrey Borodin-2


> 23 авг. 2020 г., в 14:39, Andrey M. Borodin <[hidden email]> написал(а):
>
> Thanks for reviewing and benchmarking, Pavel!

Pavel sent me few typos offlist. PFA v12 fixing these typos.
Thanks!

Best regards, Andrey Borodin.

v12-0001-Add-sort-support-for-point-gist_point_sortsuppor.patch (5K) Download Attachment
v12-0002-Implement-GiST-build-using-sort-support.patch (26K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Yet another fast GiST build (typo)

Pavel Borisov
Pavel sent me few typos offlist. PFA v12 fixing these typos.
Thanks!
 
Now I consider the patch ready to be committed and mark it so on CF. 
Thank you!

--
Best regards,
Pavel Borisov

Postgres Professional: http://postgrespro.com
Reply | Threaded
Open this post in threaded view
|

Re: Yet another fast GiST build (typo)

Heikki Linnakangas
In reply to this post by Andrey Borodin-2
On 30/08/2020 15:04, Andrey M. Borodin wrote:
>> 23 авг. 2020 г., в 14:39, Andrey M. Borodin <[hidden email]> написал(а):
>>
>> Thanks for reviewing and benchmarking, Pavel!
>
> Pavel sent me few typos offlist. PFA v12 fixing these typos.

In gist_indexsortbuild(), you first build all the leaf pages. Then, you
read through all the index pages you just built, to form the tuples for
the next level, and repeat for all the upper levels. That seems
inefficient, it would be more better to form the tuples for the
downlinks as you go, when you build the leaf pages in the first place.
That's how nbtsort.c works. Also, you could WAL-log the pages as you go.

In gist_indexsortbuild_flush(), can't you just memcpy() the page from
memory to the buffer?

- Heikki


Reply | Threaded
Open this post in threaded view
|

Re: Yet another fast GiST build (typo)

Andrey Borodin-2


> 3 сент. 2020 г., в 23:40, Heikki Linnakangas <[hidden email]> написал(а):
>
> On 30/08/2020 15:04, Andrey M. Borodin wrote:
>>> 23 авг. 2020 г., в 14:39, Andrey M. Borodin <[hidden email]> написал(а):
>>>
>>> Thanks for reviewing and benchmarking, Pavel!
>> Pavel sent me few typos offlist. PFA v12 fixing these typos.
>
> In gist_indexsortbuild(), you first build all the leaf pages. Then, you read through all the index pages you just built, to form the tuples for the next level, and repeat for all the upper levels. That seems inefficient, it would be more better to form the tuples for the downlinks as you go, when you build the leaf pages in the first place. That's how nbtsort.c works. Also, you could WAL-log the pages as you go.
>
> In gist_indexsortbuild_flush(), can't you just memcpy() the page from
> memory to the buffer?
>
> - Heikki
Thanks for ideas, Heikki. Please see v13 with proposed changes.
But I've found out that logging page-by-page slows down GiST build by approximately 15% (when CPU constrained).
Though In think that this is IO-wise.

Thanks!

Best regards, Andrey Borodin.

v13-0001-Add-sort-support-for-point-gist_point_sortsuppor.patch (5K) Download Attachment
v13-0002-Implement-GiST-build-using-sort-support.patch (27K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Yet another fast GiST build (typo)

Heikki Linnakangas
On 05/09/2020 14:53, Andrey M. Borodin wrote:
> Thanks for ideas, Heikki. Please see v13 with proposed changes.

Thanks, that was quick!

> But I've found out that logging page-by-page slows down GiST build by
> approximately 15% (when CPU constrained). Though In think that this
> is IO-wise.
Hmm, any ideas why that is? log_newpage_range() writes one WAL record
for 32 pages, while now you're writing one record per page, so you'll
have a little bit more overhead from that. But 15% seems like a lot.

- Heikki


Reply | Threaded
Open this post in threaded view
|

Re: Yet another fast GiST build (typo)

Andrey Borodin-2


> 6 сент. 2020 г., в 18:26, Heikki Linnakangas <[hidden email]> написал(а):
>
> On 05/09/2020 14:53, Andrey M. Borodin wrote:
>> Thanks for ideas, Heikki. Please see v13 with proposed changes.
>
> Thanks, that was quick!
>
>> But I've found out that logging page-by-page slows down GiST build by
>> approximately 15% (when CPU constrained). Though In think that this
>> is IO-wise.
> Hmm, any ideas why that is? log_newpage_range() writes one WAL record for 32 pages, while now you're writing one record per page, so you'll have a little bit more overhead from that. But 15% seems like a lot.
I do not know. I guess this can be some effect of pglz compression during cold stage. It can be slower and less compressive than pglz with cache table? But this is pointing into the sky.
Nevertheless, here's the patch identical to v13, but with 3rd part: log flushed pages with bunches of 32.
This brings CPU performance back and slightly better than before page-by-page logging.

Some details about test:
MacOS, 6-core i7
psql -c '\timing' -c "create table x as select point (random(),random()) from generate_series(1,10000000,1);" -c "create index on x using gist (point);"

With patch v13 this takes 20,567 seconds, with v14 18,149 seconds, v12 ~18,3s (which is closer to 10% btw, sorry for miscomputation). This was not statistically significant testing, just a quick laptop benchmark with 2-3 tests to verify stability.

Best regards, Andrey Borodin.

v14-0001-Add-sort-support-for-point-gist_point_sortsuppor.patch (5K) Download Attachment
v14-0002-Implement-GiST-build-using-sort-support.patch (27K) Download Attachment
v14-0003-Log-GiST-build-with-packs-of-32-pages.patch (5K) Download Attachment
12345 ... 7