UUID v1 optimizations...


Ancoron Luciferis
Hi all,

Some time ago, I ran into trouble with a rather high-load OLTP
application (in Java, but that doesn't really matter) that was using v1
UUIDs for primary keys; after some time, certain indexes had become
heavily bloated.

So I investigated the PostgreSQL code to see how it handles UUIDs with
respect to storage, sorting, and so on, but all I could find was that it
basically falls back to treating them as opaque 16-byte values.

After struggling to find a way to optimize things inside the database, I
resorted to a hack in the application: not shuffling the timestamp bytes
of the UUIDs, which makes them look quite serial in terms of byte order.
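For illustration, here is a minimal Python sketch of that application-side hack (hypothetical helper name; only the byte positions follow the RFC 4122 v1 layout): it moves the shuffled timestamp fields to the front, most significant bytes first, so byte-wise comparison follows generation time.

```python
import uuid

def to_serial_bytes(u: uuid.UUID) -> bytes:
    """Rearrange a v1 UUID so its 60-bit timestamp leads, most
    significant bytes first (illustrative sketch, not PostgreSQL code)."""
    b = u.bytes
    # RFC 4122 v1 layout: time_low = b[0:4], time_mid = b[4:6],
    # time_hi_and_version = b[6:8], clock_seq + node = b[8:16]
    return b[6:8] + b[4:6] + b[0:4] + b[8:]
```

With this order, newer UUIDs compare greater than older ones, so btree inserts land at the right edge of the index instead of at random leaf pages.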

With that, we were able to reduce bloat by orders of magnitude, and
VACUUM was finally able to reclaim index space.

So, my question now is: would it make sense for you to handle these
time-based UUIDs differently internally? Specifically, un-shuffling the
timestamp before it goes to storage?

A second question is whether native support functions could be
introduced. Most interesting for me would be:
* timestamp
* version
* variant
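In the absence of native functions, the extraction is straightforward in principle. Here is a Python sketch of what such support functions could return (the helper name and the NULL-for-other-versions behavior are my assumptions; `version` and `variant` are already exposed by Python's uuid module):

```python
import uuid
from datetime import datetime, timezone

# 100-ns ticks between the UUID epoch (1582-10-15) and the Unix epoch
UUID_EPOCH_OFFSET = 0x01B21DD213814000

def uuid_v1_timestamp(u: uuid.UUID):
    """Embedded timestamp of a version-1 UUID; None for other versions."""
    if u.version != 1:
        return None
    return datetime.fromtimestamp(
        (u.time - UUID_EPOCH_OFFSET) / 10_000_000, tz=timezone.utc)

u = uuid.uuid1()
print(u.version, u.variant, uuid_v1_timestamp(u))
```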


Here are some numbers from the tests I have run (against a PostgreSQL 10
server):

1. insert 5,000,000 rows
2. delete 2,500,000 rows

Index Bloat:
       idxname       | bloat_ratio | bloat_size | real_size
---------------------+-------------+------------+-----------
 uuid_v1_pkey        |        23.6 | 46 MB      | 195 MB
 uuid_serial_pkey    |        50.4 | 76 MB      | 150 MB

Higher ratio for "serial", but still lower total index size. :)

Now, the performance of VACUUM is also very interesting here:

# vacuum (verbose, analyze, freeze) uuid_serial;
INFO:  vacuuming "public.uuid_serial"
INFO:  index "uuid_serial_pkey" now contains 2500001 row versions in
19253 pages
DETAIL:  0 index row versions were removed.
9624 index pages have been deleted, 9624 are currently reusable.
CPU: user: 0.03 s, system: 0.01 s, elapsed: 0.05 s.
INFO:  "uuid_serial": found 0 removable, 2500001 nonremovable row
versions in 13515 out of 27028 pages
DETAIL:  0 dead row versions cannot be removed yet, oldest xmin: 270712
There were 94 unused item pointers.
Skipped 0 pages due to buffer pins, 13513 frozen pages.
0 pages are entirely empty.
CPU: user: 0.37 s, system: 0.16 s, elapsed: 2.83 s.
INFO:  analyzing "public.uuid_serial"
INFO:  "uuid_serial": scanned 27028 of 27028 pages, containing 2500001
live rows and 0 dead rows; 30000 rows in sample, 2500001 estimated total
rows
VACUUM
Time: 3969.812 ms (00:03.970)

# vacuum (verbose, analyze, freeze) uuid_v1;
INFO:  vacuuming "public.uuid_v1"
INFO:  scanned index "uuid_v1_pkey" to remove 2499999 row versions
DETAIL:  CPU: user: 1.95 s, system: 0.13 s, elapsed: 5.09 s
INFO:  "uuid_v1": removed 2499999 row versions in 27028 pages
DETAIL:  CPU: user: 0.22 s, system: 0.26 s, elapsed: 3.93 s
INFO:  index "uuid_v1_pkey" now contains 2500001 row versions in 24991 pages
DETAIL:  2499999 index row versions were removed.
12111 index pages have been deleted, 0 are currently reusable.
CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s.
INFO:  "uuid_v1": found 1791266 removable, 2500001 nonremovable row
versions in 27028 out of 27028 pages
DETAIL:  0 dead row versions cannot be removed yet, oldest xmin: 270716
There were 0 unused item pointers.
Skipped 0 pages due to buffer pins, 0 frozen pages.
0 pages are entirely empty.
CPU: user: 2.90 s, system: 0.71 s, elapsed: 14.54 s.
INFO:  analyzing "public.uuid_v1"
INFO:  "uuid_v1": scanned 27028 of 27028 pages, containing 2500001 live
rows and 0 dead rows; 30000 rows in sample, 2500001 estimated total rows
VACUUM
Time: 15702.803 ms (00:15.703)

...almost 4x faster!

Now insert another 20 million:

COPY uuid_serial FROM '...' WITH ( FORMAT text );
COPY 20000000
Time: 76249.142 ms (01:16.249)

COPY uuid_v1 FROM '...' WITH ( FORMAT text );
COPY 20000000
Time: 804291.611 ms (13:24.292)

...more than 10x faster!

...and the resulting bloat (no VACUUM in between):
       idxname       | bloat_ratio | bloat_size | real_size
---------------------+-------------+------------+-----------
 uuid_v1_pkey        |        30.5 | 295 MB     | 966 MB
 uuid_serial_pkey    |         0.9 | 6056 kB    | 677 MB

...still 30% savings in space.


Cheers,

        Ancoron




Re: UUID v1 optimizations...

Peter Eisentraut-6
On 2019-05-25 15:45, Ancoron Luciferis wrote:
> So, my question now is: Would it make sense for you to handle these
> time-based UUID's differently internally? Specifically un-shuffling the
> timestamp before they are going to storage?

It seems unlikely that we would do that, because that would break
existing stored UUIDs, and we'd also need to figure out a way to store
non-v1 UUIDs under this different scheme.  The use case is pretty narrow
compared to the enormous effort.

It is well understood that UUIDs and other somewhat-random keys perform
worse than serial-like keys.

Btw., it might be nice to rerun your tests with PostgreSQL 12beta1.  The
btree storage has gotten some improvements.  I don't think it's going to
fundamentally solve your problem, but it would be useful feedback.

--
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: UUID v1 optimizations...

Tom Lane-2
In reply to this post by Ancoron Luciferis
Ancoron Luciferis <[hidden email]> writes:
> So I investigated the PostgreSQL code to see how it is handling UUID's
> with respect to storage, sorting, aso. but all I could find was that it
> basically falls back to the 16-byte.

Yup, they're just blobs to us.

> After struggling to find a way to optimize things inside the database, I
> reverted to introduce a hack into the application by not shuffling the
> timestamp bytes for the UUID's, which makes it look quite serial in
> terms of byte order.

> So, my question now is: Would it make sense for you to handle these
> time-based UUID's differently internally? Specifically un-shuffling the
> timestamp before they are going to storage?

No, because

(1) UUID layout is standardized;

(2) such a change would break on-disk compatibility for existing
databases;

(3) particularly for the case of application-generated UUIDs, we do
not have enough information to know that this would actually do anything
useful;

(4) it in fact *wouldn't* do anything useful, because we'd still have
to sort UUIDs in the same order as today, meaning that btree index behavior
would remain the same as before.  Plus UUID comparison would get a lot
more complicated and slower than it is now.

(5) even if we ignored all that and did it anyway, it would only help
for version-1 UUIDs.  The index performance issue would still remain for
version-4 (random) UUIDs, which are if anything more common than v1.


FWIW, I don't know what tool you're using to get those "bloat" numbers,
but just because somebody calls it bloat doesn't mean that it is.
The normal, steady-state load factor for a btree index is generally
understood to be about 2/3rds, and that looks to be about what
you're getting for the regular-UUID-format index.  The fact that the
serially-loaded index has nearly no dead space is because we hack the
page split logic to make that happen --- but that is a hack, and it's
not without downsides.  It should *not* be taken to be an indication
of what you can expect for any other insertion pattern.

The insertion-speed aspect is a real problem, but the core of that problem
is that use of any sort of standard-format UUID turns applications that
might have had considerable locality of reference into applications that
have none.  If you can manage to keep your whole index in RAM that would
not hurt too much, but as soon as it doesn't fit you have a problem.
When your app has more or less predictable reference patterns it's best
to design a unique key that matches that, instead of expecting that
essentially-random keys will work well.

Or in short, hacking up the way your app generates UUIDs is exactly
the right way to proceed here.

                        regards, tom lane



Re: UUID v1 optimizations...

Ancoron Luciferis
In reply to this post by Peter Eisentraut-6
On 25/05/2019 16:19, Peter Eisentraut wrote:
> On 2019-05-25 15:45, Ancoron Luciferis wrote:
>> So, my question now is: Would it make sense for you to handle these
>> time-based UUID's differently internally? Specifically un-shuffling the
>> timestamp before they are going to storage?
>
> It seems unlikely that we would do that, because that would break
> existing stored UUIDs, and we'd also need to figure out a way to store
> non-v1 UUIDs under this different scheme.  The use case is pretty narrow
> compared to the enormous effort.

I agree, backwards compatibility really would be a show-stopper here.

Regarding the "enormous effort": I was thinking the simplest "solution"
would be to detect the version at the time the internal byte array is
created from the provided representation, which could then directly
produce the unshuffled byte order.

>
> It is well understood that using UUIDs or other somewhat-random keys
> perform worse than serial-like keys.

Yes, I know. We're using these UUIDs for more than just a primary key:
they also tell us the creation time of the entry and which "node" of the
highly distributed application generated it. It is like an audit log for
us, without the need for extra columns, and of fixed size, which also
helps performance at the application level.

>
> Btw., it might be nice to rerun your tests with PostgreSQL 12beta1.  The
> btree storage has gotten some improvements.  I don't think it's going to
> fundamentally solve your problem, but it would be useful feedback.
>

Thank you for the pointer to 12beta1. I've just read about it, and it
might help a bit. I'll give it a try for sure and report back.

I also have to rerun those tests against PG 11 anyway.

Cheers,

        Ancoron



Re: UUID v1 optimizations...

Ancoron Luciferis
In reply to this post by Tom Lane-2
On 25/05/2019 16:57, Tom Lane wrote:

> Ancoron Luciferis <[hidden email]> writes:
>> So I investigated the PostgreSQL code to see how it is handling UUID's
>> with respect to storage, sorting, aso. but all I could find was that it
>> basically falls back to the 16-byte.
>
> Yup, they're just blobs to us.
>
>> After struggling to find a way to optimize things inside the database, I
>> reverted to introduce a hack into the application by not shuffling the
>> timestamp bytes for the UUID's, which makes it look quite serial in
>> terms of byte order.
>
>> So, my question now is: Would it make sense for you to handle these
>> time-based UUID's differently internally? Specifically un-shuffling the
>> timestamp before they are going to storage?
>
> No, because
>
> (1) UUID layout is standardized;

You mean the presentation at the byte-level is. ;)

>
> (2) such a change would break on-disk compatibility for existing
> databases;

Yes, that certainly is a show-stopper.

>
> (3) particularly for the case of application-generated UUIDs, we do
> not have enough information to know that this would actually do anything
> useful;

Well, not only is the layout standardized, but there is also a semantic
to it depending on the version. Specifically, version 1 contains:
1. a timestamp
2. a clock sequence
3. a node id

And as PostgreSQL already provides this pretty concrete data type, it
would be a natural consequence to also support its semantics.

E.g. the network types also come with a lot of additional operators and
functions, so I don't see a reason not to expose the additional
capabilities of a UUID.

For other versions of UUIDs, functions like timestamp would simply not
be applicable (return NULL?), respecting their concrete semantics.

>
> (4) it in fact *wouldn't* do anything useful, because we'd still have
> to sort UUIDs in the same order as today, meaning that btree index behavior
> would remain the same as before.  Plus UUID comparison would get a lot
> more complicated and slower than it is now.

I get your first sentence, but not your second. I know that when
changing the internal byte order we'd have to completely re-compute
everything on disk (from table to index data), but why would the sorting
in the index have to stay the same?

And actually, the comparison logic wouldn't need to be changed at all if
the byte order is "changed" at the point where the textual
representation is read into the internal UUID byte array.

Function:
string_to_uuid(const char *source, pg_uuid_t *uuid);

^^ here I would apply the change. And of course, reverse it for
generating the textual representation.

That would slow down writes a bit, but it should pay off because index
insertions are sped up even more.

But still, the on-disk change is a show-stopper, I guess.
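A sketch of that boundary transform and its inverse (Python, byte moves only; the real change would live in the C input/output functions, and the storage order shown is my assumption):

```python
def unshuffle(b: bytes) -> bytes:
    """Applied when parsing text into the internal byte array:
    timestamp bytes first, most significant leading (assumed scheme)."""
    return b[6:8] + b[4:6] + b[0:4] + b[8:]

def reshuffle(s: bytes) -> bytes:
    """Inverse, applied when formatting back to the RFC 4122 text form."""
    return s[4:8] + s[2:4] + s[0:2] + s[8:]
```

Since the transform is a bijection applied only at the I/O boundary, a memcmp-style comparison of the stored bytes needs no change.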

>
> (5) even if we ignored all that and did it anyway, it would only help
> for version-1 UUIDs.  The index performance issue would still remain for
> version-4 (random) UUIDs, which are if anything more common than v1.

Yes, I am aware that the changes might be of very limited gain. V4
UUIDs are usually used for external identifiers. For internal ones,
they don't make sense to me (too long, issues with randomness/entropy
under high load, ...). ;)

I just recently used these UUID's also together with a function for
TimescaleDB auto-partitioning to provide the timestamp for the
partitioning logic instead of the need for a separate timestamp column.

This is also one of the reasons why I was asking for native support
functions to extract the timestamp. I am apparently not very good at C,
so I am currently using Python and/or PL/pgSQL for it, which is pretty slow.

>
>
> FWIW, I don't know what tool you're using to get those "bloat" numbers,
> but just because somebody calls it bloat doesn't mean that it is.
> The normal, steady-state load factor for a btree index is generally
> understood to be about 2/3rds, and that looks to be about what
> you're getting for the regular-UUID-format index.  The fact that the
> serially-loaded index has nearly no dead space is because we hack the
> page split logic to make that happen --- but that is a hack, and it's
> not without downsides.  It should *not* be taken to be an indication
> of what you can expect for any other insertion pattern.

OK, understood. I was actually a bit surprised by those numbers myself,
as these "serial" UUIDs still only have the timestamp bytes in
ascending order; the clock sequence and node are still pretty random
(though not inside a single transaction, which might help the hack).

>
> The insertion-speed aspect is a real problem, but the core of that problem
> is that use of any sort of standard-format UUID turns applications that
> might have had considerable locality of reference into applications that
> have none.  If you can manage to keep your whole index in RAM that would
> not hurt too much, but as soon as it doesn't fit you have a problem.
> When your app has more or less predictable reference patterns it's best
> to design a unique key that matches that, instead of expecting that
> essentially-random keys will work well.

The system was configured with more than enough memory for the index
and table data to fit, but I am not sure it all stayed cached. How can I
verify that? An EXPLAIN on the INSERT apparently doesn't include index
insertion.

>
> Or in short, hacking up the way your app generates UUIDs is exactly
> the right way to proceed here.

OK. Glad to hear that.

One last question, though: would it make sense to create a specialized
UUID v1 type (e.g. via an extension) that does the transformation and
delegates everything else to the existing UUID type support?

>
> regards, tom lane
>




Re: UUID v1 optimizations...

Vitalii Tymchyshyn-2
I am not sure why you want to change the on-disk storage format. If we
are talking about indexes, it's more about the comparison function
(opclass) that is used in an index.
Am I wrong?
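A sort-key sketch of that idea (Python for illustration; a real opclass would be C): the comparison changes, the stored bytes do not.

```python
import uuid

def v1_time_key(u: uuid.UUID):
    """Sort key an alternative opclass could implement for v1 UUIDs:
    embedded timestamp first, remaining bytes as tie-breaker."""
    return (u.time, u.bytes[8:])
```

`sorted(ids, key=v1_time_key)` then yields timestamp order even though the raw byte order of v1 UUIDs is unrelated to it.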

On Sat, 25 May 2019 at 11:21, Ancoron Luciferis <[hidden email]> wrote:


Re: UUID v1 optimizations...

Ancoron Luciferis
On 25/05/2019 21:00, Vitalii Tymchyshyn wrote:
> I am not sure why do you want to change on-disk storage format? If we
> are talking about indexes, it's more about comparison function (opclass)
> that is used in an index. 
> Am I wrong?

I don't "want" to change the on-disk format of v1 UUIDs, but that seems
to be the most efficient approach to me (if anyone ever wanted to
implement it), because not changing it would mean introducing the
un-shuffling of the timestamp bytes in a lot of other places, where it
would decrease performance instead of increasing it, not to speak of
VACUUM, REINDEX, CLUSTER, ....

One of my use cases would also benefit a lot here: finding entries that
have been created within a given time range. To do that (without an
additional timestamp column), I currently end up with a full index scan
to determine the rows to read, as there is no correlation between the
timestamp and the value sort order.

For example I would like to query the DB with:

WHERE id >= '7d9c4835-5378-11ec-0000-000000000000' AND id <
'5cdcac78-5a9a-11ec-0000-000000000000'

...which is impossible at the moment. I have written myself a helper
function that extracts the timestamp, but it is (of course) executed for
each and every entry while filtering, or (if used in an index
expression) increases the write round-trip time by quite a lot (if
you're dealing with peak-load style applications like we have):

WHERE time_part_uuidv1(id) >= '2021-12-02 14:02:31.021778+00' AND
time_part_uuidv1(id) < '2021-12-11 15:52:37.107212+00'

So, although possible, it is not really an option for an application
like ours (and we don't just want to throw hardware at it).

I don't know if all that makes sense to you, or if I'm just missing a
piece of the puzzle inside the PostgreSQL code base. If I am, please
tell me. :)

Cheers,

        Ancoron



Reply | Threaded
Open this post in threaded view
|

Re: UUID v1 optimizations...

Ancoron Luciferis
In reply to this post by Vitalii Tymchyshyn-2
On 25/05/2019 21:00, Vitalii Tymchyshyn wrote:
> I am not sure why do you want to change on-disk storage format? If we
> are talking about indexes, it's more about comparison function (opclass)
> that is used in an index. 
> Am I wrong?

I don't "want" to change the on-disk format of the v1 UUID's but that
seems to be the most efficient approach to me (if anyone would ever want
to implement that) as not changing it would mean to introduce the
unshuffling of the timestamp bytes in a lot of other places where it
would decrease the performance instead of increasing it, not to speak of
VACUUM, REINDEX, CLUSTER, ....

One of my use-cases would also benefit a lot here when trying to find
entries that have been created within a given time range. To do that
(without an additional timestamp column), I'd end up with a full index
scan to determine the rows to read, as there is no correlation between
the timestamp and the value sort order.

For example I would like to query the DB with:

WHERE id >= '7d9c4835-5378-11ec-0000-000000000000' AND id <
'5cdcac78-5a9a-11ec-0000-000000000000'

...which is impossible atm. I have written myself a helper function
that extracts the timestamp, but it is (of course) executed for each
and every entry while filtering, or (if used as an index expression) it
increases the write RTT by quite a lot (if you're dealing with
peak-load style applications like ours):

WHERE time_part_uuidv1(id) >= '2021-12-02 14:02:31.021778+00' AND
time_part_uuidv1(id) < '2021-12-11 15:52:37.107212+00'
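
For reference, a client-side equivalent of that helper can be sketched
in Python (the function name mirrors the one above; the stdlib exposes
the 60-bit v1 timestamp as UUID.time, counted in 100ns ticks since the
Gregorian epoch 1582-10-15):

```python
import uuid
from datetime import datetime, timezone

# 100ns ticks between the UUID epoch (1582-10-15) and the Unix epoch (1970-01-01)
GREGORIAN_UNIX_OFFSET = 0x01B21DD213814000

def time_part_uuidv1(u: uuid.UUID) -> datetime:
    """Extract the timestamp embedded in a version-1 UUID as a UTC datetime."""
    if u.version != 1:
        raise ValueError("not a version-1 UUID")
    return datetime.fromtimestamp((u.time - GREGORIAN_UNIX_OFFSET) / 1e7,
                                  tz=timezone.utc)
```

A server-side IMMUTABLE function doing the same arithmetic is what the
index expression above would be built on.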

So, although possible, it is not really an option for an application
like ours (and we don't just want to throw hardware at it).

I don't know if all that makes any sense to you or if I'm just missing a
piece of the puzzle inside the PostgreSQL code base. If I do, please
tell me. :)

Cheers,

        Ancoron

>
> сб, 25 трав. 2019 о 11:21 Ancoron Luciferis
> <[hidden email]
> <mailto:[hidden email]>> пише:
>
>     On 25/05/2019 16:57, Tom Lane wrote:
>     > Ancoron Luciferis <[hidden email]
>     <mailto:[hidden email]>> writes:
>     >> So I investigated the PostgreSQL code to see how it is handling
>     UUID's
>     >> with respect to storage, sorting, aso. but all I could find was
>     that it
>     >> basically falls back to the 16-byte.
>     >
>     > Yup, they're just blobs to us.
>     >
>     >> After struggling to find a way to optimize things inside the
>     database, I
>     >> reverted to introduce a hack into the application by not
>     shuffling the
>     >> timestamp bytes for the UUID's, which makes it look quite serial in
>     >> terms of byte order.
>     >
>     >> So, my question now is: Would it make sense for you to handle these
>     >> time-based UUID's differently internally? Specifically
>     un-shuffling the
>     >> timestamp before they are going to storage?
>     >
>     > No, because
>     >
>     > (1) UUID layout is standardized;
>
>     You mean the presentation at the byte-level is. ;)
>
>     >
>     > (2) such a change would break on-disk compatibility for existing
>     > databases;
>
>     Yes, that certainly is a show-stopper.
>
>     >
>     > (3) particularly for the case of application-generated UUIDs, we do
>     > not have enough information to know that this would actually do
>     anything
>     > useful;
>
>     Well, not only the layout is standardized, but also there is a semantic
>     to it depending on the version. Specifically for version 1, it has:
>     1. a timestamp
>     2. a clock sequence
>     3. a node id
>
>     And as PostgreSQL already provides this pretty concrete data type, it
>     could be a natural consequence to also support the semantic of it.
>
>     E.g. the network types also come with a lot of additional operators and
>     functions. So I don't see a reason not to respect the additional
>     capabilities of a UUID.
>
>     For other versions of UUID's, functions like timestamp would certainly
>     not be available (return NULL?), respecting the concrete semantic.
>
>     >
>     > (4) it in fact *wouldn't* do anything useful, because we'd still have
>     > to sort UUIDs in the same order as today, meaning that btree index
>     behavior
>     > would remain the same as before.  Plus UUID comparison would get a lot
>     > more complicated and slower than it is now.
>
>     I get your first sentence, but not your second. I know that when
>     changing the internal byte order we'd have to completely re-compute
>     everything on-disk (from table to index data), but why would the sorting
>     in the index have to be the same?
>
>     And actually, comparison logic wouldn't need to be changed at all if
>     the byte order is "changed" when the textual representation is read
>     into the internal UUID's byte array.
>
>     Function:
>     string_to_uuid(const char *source, pg_uuid_t *uuid);
>
>     ^^ here I would apply the change. And of course, reverse it for
>     generating the textual representation.
>
>     That would slow down writes a bit, but that shouldn't matter because
>     index insertions are sped up even more.
>
>     But still, on-disk change is still a show-stopper, I guess.
>
>     >
>     > (5) even if we ignored all that and did it anyway, it would only help
>     > for version-1 UUIDs.  The index performance issue would still
>     remain for
>     > version-4 (random) UUIDs, which are if anything more common than v1.
>
>     Yes, I am aware that the changes might be of very limited gain. V4
>     UUID's are usually used for external identifiers. For internal ones,
>     they don't make sense to me (too long, issues with randomness/entropy
>     under high load, ...). ;)
>
>     I just recently used these UUID's also together with a function for
>     TimescaleDB auto-partitioning to provide the timestamp for the
>     partitioning logic instead of the need for a separate timestamp column.
>
>     This is also one of the reasons why I was also asking for native support
>     functions to extract the timestamp. I am apparently not very good at C
>     so I am currently using Python and/or PgPLSQL for it, which is
>     pretty slow.
>
>     >
>     >
>     > FWIW, I don't know what tool you're using to get those "bloat"
>     numbers,
>     > but just because somebody calls it bloat doesn't mean that it is.
>     > The normal, steady-state load factor for a btree index is generally
>     > understood to be about 2/3rds, and that looks to be about what
>     > you're getting for the regular-UUID-format index.  The fact that the
>     > serially-loaded index has nearly no dead space is because we hack the
>     > page split logic to make that happen --- but that is a hack, and it's
>     > not without downsides.  It should *not* be taken to be an indication
>     > of what you can expect for any other insertion pattern.
>
>     OK, understood. I was actually a bit surprised by those numbers myself
>     as these "serial" UUID's still only have the timestamp bytes in
>     ascending order, the clock sequence and node are still pretty random (but
>     not inside a single transaction, which might help the hack).
>
>     >
>     > The insertion-speed aspect is a real problem, but the core of that
>     problem
>     > is that use of any sort of standard-format UUID turns applications
>     that
>     > might have had considerable locality of reference into
>     applications that
>     > have none.  If you can manage to keep your whole index in RAM that
>     would
>     > not hurt too much, but as soon as it doesn't fit you have a problem.
>     > When your app has more or less predictable reference patterns it's
>     best
>     > to design a unique key that matches that, instead of expecting that
>     > essentially-random keys will work well.
>
>     The system was configured to have more than enough space for the index
>     and table data to fit into memory, but I am not sure. How can I verify
>     that? An EXPLAIN on the INSERT apparently doesn't include index
>     insertion.
>
>     >
>     > Or in short, hacking up the way your app generates UUIDs is exactly
>     > the right way to proceed here.
>
>     OK. Glad to hear that.
>
>     One last question, though:
>     Would it make sense to create a specialized UUID v1 type (e.g. with an
>     extension) that does the transformation and delegates for all other
>     things to the existing UUID type support?
>
>     >
>     >                       regards, tom lane
>     >
>
>




Re: UUID v1 optimizations...

Tom Lane-2
In reply to this post by Ancoron Luciferis
Ancoron Luciferis <[hidden email]> writes:
> On 25/05/2019 16:57, Tom Lane wrote:
>> (4) it in fact *wouldn't* do anything useful, because we'd still have
>> to sort UUIDs in the same order as today, meaning that btree index behavior
>> would remain the same as before.  Plus UUID comparison would get a lot
>> more complicated and slower than it is now.

> I get your first sentence, but not your second. I know that when
> changing the internal byte order we'd have to completely re-compute
> everything on-disk (from table to index data), but why would the sorting
> in the index have to be the same?

Because we aren't going to change the existing sort order of UUIDs.
We have no idea what applications might be dependent on that.

As Vitalii correctly pointed out, your beef is not with the physical
storage of UUIDs anyway: you just wish they'd sort differently, since
that is what determines the behavior of a btree index.  But we aren't
going to change the sort ordering because that's an even bigger
compatibility break than changing the physical storage; it'd affect
application-visible semantics.

What you might want to think about is creating a function that maps
UUIDs into an ordering that makes sense to you, and then creating
a unique index over that function instead of the raw UUIDs.  That
would give the results you want without having to negotiate with the
rest of the world about whether it's okay to change the semantics
of type uuid.

                        regards, tom lane
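
To make that concrete, here is a hypothetical sketch (in Python, names
mine) of such a mapping: a key that is monotone in the embedded
timestamp, applied to a pair of v1 UUIDs whose raw byte order disagrees
with their timestamp order. Server-side, the equivalent would be an
IMMUTABLE function used in an expression index.

```python
import uuid

def v1_sort_key(u: uuid.UUID) -> bytes:
    # 60-bit timestamp MSB-first, then the remaining 8 bytes
    # (clock_seq + node), so byte-wise key comparison follows creation time.
    return u.time.to_bytes(8, "big") + u.bytes[8:]

# time_low wraps between these two UUIDs, so their raw byte order inverts:
a = uuid.UUID("ffffffff-0000-11ec-8000-000000000000")
b = uuid.UUID("00000000-0001-11ec-8000-000000000000")  # timestamp 100ns later

assert a.bytes > b.bytes                # native uuid ordering puts a after b
assert v1_sort_key(a) < v1_sort_key(b)  # mapped ordering puts a first
```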



Re: UUID v1 optimizations...

Tomas Vondra-4
On Sat, May 25, 2019 at 05:54:15PM -0400, Tom Lane wrote:

>Ancoron Luciferis <[hidden email]> writes:
>> On 25/05/2019 16:57, Tom Lane wrote:
>>> (4) it in fact *wouldn't* do anything useful, because we'd still have
>>> to sort UUIDs in the same order as today, meaning that btree index behavior
>>> would remain the same as before.  Plus UUID comparison would get a lot
>>> more complicated and slower than it is now.
>
>> I get your first sentence, but not your second. I know that when
>> changing the internal byte order we'd have to completely re-compute
>> everything on-disk (from table to index data), but why would the sorting
>> in the index have to be the same?
>
>Because we aren't going to change the existing sort order of UUIDs.
>We have no idea what applications might be dependent on that.
>
>As Vitalii correctly pointed out, your beef is not with the physical
>storage of UUIDs anyway: you just wish they'd sort differently, since
>that is what determines the behavior of a btree index.  But we aren't
>going to change the sort ordering because that's an even bigger
>compatibility break than changing the physical storage; it'd affect
>application-visible semantics.
>
>What you might want to think about is creating a function that maps
>UUIDs into an ordering that makes sense to you, and then creating
>a unique index over that function instead of the raw UUIDs.  That
>would give the results you want without having to negotiate with the
>rest of the world about whether it's okay to change the semantics
>of type uuid.
>

FWIW that's essentially what I implemented as an extension some time
ago. See [1] for a more detailed explanation and some benchmarks.

The thing is - it's not really desirable to get perfectly sequential
ordering, because that would mean we never get back to older parts of
the index (so if you delete data, we'd never fill that space).

[1] https://www.2ndquadrant.com/en/blog/sequential-uuid-generators/
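
The approach from [1] can be sketched roughly like this (parameter
names and the fixed 16-bit prefix are my simplification, not the
extension's actual code): a coarse timestamp bucket forms a small
prefix that wraps around, so inserts eventually revisit older index
pages, while the rest of the value stays random.

```python
import os
import time
import uuid

def uuid_time_nextval(interval_sec=60, now=None):
    """Random UUID whose 16-bit prefix follows (and eventually wraps with) time."""
    if now is None:
        now = time.time()
    bucket = int(now // interval_sec) % (1 << 16)  # wraps after 65536 intervals
    rand = int.from_bytes(os.urandom(14), "big")   # remaining 112 bits random
    return uuid.UUID(int=(bucket << 112) | rand)
```

Values generated within the same interval share a prefix (good
locality), and once the bucket counter wraps, new inserts land back in
the "old", partially reclaimed part of the index.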


regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: UUID v1 optimizations...

Tom Lane-2
Tomas Vondra <[hidden email]> writes:
> On Sat, May 25, 2019 at 05:54:15PM -0400, Tom Lane wrote:
>> What you might want to think about is creating a function that maps
>> UUIDs into an ordering that makes sense to you, and then creating
>> a unique index over that function instead of the raw UUIDs.  That
>> would give the results you want without having to negotiate with the
>> rest of the world about whether it's okay to change the semantics
>> of type uuid.

> FWIW that's essentially what I implemented as an extension some time
> ago. See [1] for a more detailed explanation and some benchmarks.

Also, another way to attack this is to create a new set of ordering
operators for UUID and an associated non-default btree opclass.
Then you declare your index using that opclass and you're done.
The key advantage of this way is that the regular UUID equality
operator can still be a member of that opclass, meaning that searches
of the form "uuidcol = constant" can still use this index, so you
don't have to change your queries (at least not for that common case).
Look at the interrelationship of the regular text btree operators and
the "pattern_ops" btree operators for a precedent.

                        regards, tom lane



Re: UUID v1 optimizations...

Ancoron Luciferis
In reply to this post by Tom Lane-2
On 25/05/2019 23:54, Tom Lane wrote:

> Ancoron Luciferis <[hidden email]> writes:
>> On 25/05/2019 16:57, Tom Lane wrote:
>>> (4) it in fact *wouldn't* do anything useful, because we'd still have
>>> to sort UUIDs in the same order as today, meaning that btree index behavior
>>> would remain the same as before.  Plus UUID comparison would get a lot
>>> more complicated and slower than it is now.
>
>> I get your first sentence, but not your second. I know that when
>> changing the internal byte order we'd have to completely re-compute
>> everything on-disk (from table to index data), but why would the sorting
>> in the index have to be the same?
>
> Because we aren't going to change the existing sort order of UUIDs.
> We have no idea what applications might be dependent on that.
>
> As Vitalii correctly pointed out, your beef is not with the physical
> storage of UUIDs anyway: you just wish they'd sort differently, since
> that is what determines the behavior of a btree index.  But we aren't
> going to change the sort ordering because that's an even bigger
> compatibility break than changing the physical storage; it'd affect
> application-visible semantics.
>
> What you might want to think about is creating a function that maps
> UUIDs into an ordering that makes sense to you, and then creating
> a unique index over that function instead of the raw UUIDs.  That
> would give the results you want without having to negotiate with the
> rest of the world about whether it's okay to change the semantics
> of type uuid.
>
> regards, tom lane
>

I understand. Point taken; I really didn't think that someone could
depend on the index order of a (pretty random) UUID.

The whole point of me starting this discussion was performance in
multiple areas, but INSERT performance was really becoming an issue for
us, apart from the index bloat, which on several occasions was way
higher than just 30% (including out-of-disk-space in the early days),
and apart from the fact that the index was regularly dismissed due to
not being in memory. In that sense, just creating additional
function-based indexes doesn't really solve the core issues we had.

Not to mention the performance of VACUUM, among other things.

So, even though we currently "solved" a lot of these issues at the
application level, we now have UUID's that look like v1 UUID's but in
fact will not be readable (in the representation as returned by
PostgreSQL) by any other application that doesn't know our specific
implementation. This forces us to hack other tools written in other
languages that would otherwise understand and handle regular v1 UUID's
as well.

I should add that the tests I made were all running on dedicated SSD's,
one for the table data, one for the indexes and one for the WAL. If
those were running against the same disks, the difference would
probably be much higher during writes.

I'll think about creating an extension to provide a custom data type
instead. So nobody would be at risk and anyone would decide explicitly
for it with all consequences.

Thank you for your time and precious input. :)


Cheers,

        Ancoron



Re: UUID v1 optimizations...

Tomas Vondra-4
In reply to this post by Tom Lane-2
On Sat, May 25, 2019 at 06:38:08PM -0400, Tom Lane wrote:

>Tomas Vondra <[hidden email]> writes:
>> On Sat, May 25, 2019 at 05:54:15PM -0400, Tom Lane wrote:
>>> What you might want to think about is creating a function that maps
>>> UUIDs into an ordering that makes sense to you, and then creating
>>> a unique index over that function instead of the raw UUIDs.  That
>>> would give the results you want without having to negotiate with the
>>> rest of the world about whether it's okay to change the semantics
>>> of type uuid.
>
>> FWIW that's essentially what I implemented as an extension some time
>> ago. See [1] for a more detailed explanation and some benchmarks.
>
>Also, another way to attack this is to create a new set of ordering
>operators for UUID and an associated non-default btree opclass.
>Then you declare your index using that opclass and you're done.
>The key advantage of this way is that the regular UUID equality
>operator can still be a member of that opclass, meaning that searches
>of the form "uuidcol = constant" can still use this index, so you
>don't have to change your queries (at least not for that common case).
>Look at the interrelationship of the regular text btree operators and
>the "pattern_ops" btree operators for a precedent.
>

Perhaps. But it does not allow tuning how often the values "wrap",
which I think is a useful capability.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: UUID v1 optimizations...

Ancoron Luciferis
In reply to this post by Tomas Vondra-4
On 26/05/2019 00:14, Tomas Vondra wrote:

> On Sat, May 25, 2019 at 05:54:15PM -0400, Tom Lane wrote:
>> Ancoron Luciferis <[hidden email]> writes:
>>> On 25/05/2019 16:57, Tom Lane wrote:
>>>> (4) it in fact *wouldn't* do anything useful, because we'd still have
>>>> to sort UUIDs in the same order as today, meaning that btree index
>>>> behavior
>>>> would remain the same as before.  Plus UUID comparison would get a lot
>>>> more complicated and slower than it is now.
>>
>>> I get your first sentence, but not your second. I know that when
>>> changing the internal byte order we'd have to completely re-compute
>>> everything on-disk (from table to index data), but why would the sorting
>>> in the index have to be the same?
>>
>> Because we aren't going to change the existing sort order of UUIDs.
>> We have no idea what applications might be dependent on that.
>>
>> As Vitalii correctly pointed out, your beef is not with the physical
>> storage of UUIDs anyway: you just wish they'd sort differently, since
>> that is what determines the behavior of a btree index.  But we aren't
>> going to change the sort ordering because that's an even bigger
>> compatibility break than changing the physical storage; it'd affect
>> application-visible semantics.
>>
>> What you might want to think about is creating a function that maps
>> UUIDs into an ordering that makes sense to you, and then creating
>> a unique index over that function instead of the raw UUIDs.  That
>> would give the results you want without having to negotiate with the
>> rest of the world about whether it's okay to change the semantics
>> of type uuid.
>>
>
> FWIW that's essentially what I implemented as an extension some time
> ago. See [1] for a more detailed explanation and some benchmarks.

Yes, I've seen that before. Pretty nice work you put together there and
I'll surely have a look at it, but we certainly need the node id in
compliance with v1 UUID's, which is why we've been generating UUID's at
the application side from day 1.

>
> The thing is - it's not really desirable to get perfectly ordered
> ordering, because that would mean we never get back to older parts of
> the index (so if you delete data, we'd never fill that space).

Wouldn't this also apply to any sequential-looking index (e.g. on
serial)? The main issue with the UUID's is that they almost instantly
consume a big part of the total value space (e.g. the first value is
'01...' and the second comes as 'f3...'), which I would assume is not
very efficient with btrees (space reclaim? - bloat).

One of our major concerns is to keep index size small (VACUUM can't be
run every minute) to fit into memory next to a lot of others.

I've experimented with the rollover "prefix" myself but found that it
makes the index too big (same or larger index size than standard v1
UUIDs) and VACUUM too slow (almost as slow as with a standard v1 UUID),
although INSERT performance wasn't that bad. Our sequential UUID's were
way faster (at least pre-generated and imported with COPY to eliminate
any value-generation impact).

Cheers,

        Ancoron

>
> [1] https://www.2ndquadrant.com/en/blog/sequential-uuid-generators/
>
>
> regards
>




Re: UUID v1 optimizations...

Tomas Vondra-4
On Sun, May 26, 2019 at 01:49:30AM +0200, Ancoron Luciferis wrote:

>On 26/05/2019 00:14, Tomas Vondra wrote:
>> On Sat, May 25, 2019 at 05:54:15PM -0400, Tom Lane wrote:
>>> Ancoron Luciferis <[hidden email]> writes:
>>>> On 25/05/2019 16:57, Tom Lane wrote:
>>>>> (4) it in fact *wouldn't* do anything useful, because we'd still have
>>>>> to sort UUIDs in the same order as today, meaning that btree index
>>>>> behavior
>>>>> would remain the same as before.  Plus UUID comparison would get a lot
>>>>> more complicated and slower than it is now.
>>>
>>>> I get your first sentence, but not your second. I know that when
>>>> changing the internal byte order we'd have to completely re-compute
>>>> everything on-disk (from table to index data), but why would the sorting
>>>> in the index have to be the same?
>>>
>>> Because we aren't going to change the existing sort order of UUIDs.
>>> We have no idea what applications might be dependent on that.
>>>
>>> As Vitalii correctly pointed out, your beef is not with the physical
>>> storage of UUIDs anyway: you just wish they'd sort differently, since
>>> that is what determines the behavior of a btree index.  But we aren't
>>> going to change the sort ordering because that's an even bigger
>>> compatibility break than changing the physical storage; it'd affect
>>> application-visible semantics.
>>>
>>> What you might want to think about is creating a function that maps
>>> UUIDs into an ordering that makes sense to you, and then creating
>>> a unique index over that function instead of the raw UUIDs.  That
>>> would give the results you want without having to negotiate with the
>>> rest of the world about whether it's okay to change the semantics
>>> of type uuid.
>>>
>>
>> FWIW that's essentially what I implemented as an extension some time
>> ago. See [1] for a more detailed explanation and some benchmarks.
>
>Yes, I've seen that before. Pretty nice work you but together there and
>I'll surely have a look at it but we certainly need the node id in
>compliance with v1 UUID's so that's why we've been generating UUID's at
>the application side from day 1.
>
>>
>> The thing is - it's not really desirable to get perfectly ordered
>> ordering, because that would mean we never get back to older parts of
>> the index (so if you delete data, we'd never fill that space).
>
>Wouldn't this apply also to any sequential-looking index (e.g. on
>serial)?

Yes, it does apply to any index on sequential (ordered) data. If you
delete data from the "old" part (but not all, so the pages don't get
completely empty), that space is lost. It's available for new data, but
if we only insert to "new" part of the index, that's useless.

> The main issue with the UUID's is that it almost instantly
>consumes a big part of the total value space (e.g. first value is
>'01...' and second is coming as 'f3...') which I would assume is not
>very efficient with btrees (space reclaim? - bloat).
>

I don't understand what you mean here. Perhaps you misunderstand how
btree indexes grow? It's not like we allocate separate pages for
different values/prefixes - we insert the data until a page gets full,
then it's split in half. There is some dependency on the order in which
the values are inserted, but AFAIK random order is generally fine.

>One of our major concerns is to keep index size small (VACUUM can't be
>run every minute) to fit into memory next to a lot of others.
>

I don't think this has much to do with vacuum - I don't see how it's
related to the ordering of generated UUID values. And I don't see where
the "can't run vacuum every minute" comes from.

>I've experimented with the rollover "prefix" myself but found that it
>makes the index too big (same or larger index size than standard v1
>UUIDs) and VACUUM too slow (almost as slow as a standard V1 UUID),
>although INSERT performance wasn't that bad, our sequential UUID's were
>way faster (at least pre-generated and imported with COPY to eliminate
>any value generation impact).
>

I very much doubt that has anything to do with the prefix. You'll need
to share more details about how you did your tests.


regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: UUID v1 optimizations...

Morris de Oryx
I'm not worthy to post here, but a bit of a random thought.

If I've followed the conversation correctly, the reason for a V1 UUID is partly to order and partition rows by a timestamp value, but without the cost of a timestamp column. As I was told as a boy, "Smart numbers aren't." Is it _absolutely_ the case that you can't afford another column? I don't know the ins and outs of the Postgres row format, but my impression is that it's a fixed size, so you may be able to add the column without splitting rows? Anyway, even if that's not true and the extra column costs you disk space, is it the index that concerns you? Have you considered a timestamp column, or a numeric column with an epoch offset, and a BRIN index? If your insert data arrives in pretty much chronological order, that might work well for you.

Best of luck, I've enjoyed following the commentary.


On Sun, May 26, 2019 at 11:09 AM Tomas Vondra <[hidden email]> wrote:
> [...]



Re: UUID v1 optimizations...

Ancoron Luciferis
In reply to this post by Tomas Vondra-4
On 26/05/2019 03:09, Tomas Vondra wrote:

> On Sun, May 26, 2019 at 01:49:30AM +0200, Ancoron Luciferis wrote:
>> On 26/05/2019 00:14, Tomas Vondra wrote:
>>> On Sat, May 25, 2019 at 05:54:15PM -0400, Tom Lane wrote:
>>>> Ancoron Luciferis <[hidden email]> writes:
>>>>> On 25/05/2019 16:57, Tom Lane wrote:
>>>>>> (4) it in fact *wouldn't* do anything useful, because we'd still have
>>>>>> to sort UUIDs in the same order as today, meaning that btree index
>>>>>> behavior
>>>>>> would remain the same as before.  Plus UUID comparison would get a
>>>>>> lot
>>>>>> more complicated and slower than it is now.
>>>>
>>>>> I get your first sentence, but not your second. I know that when
>>>>> changing the internal byte order we'd have to completely re-compute
>>>>> everything on-disk (from table to index data), but why would the
>>>>> sorting
>>>>> in the index have to be the same?
>>>>
>>>> Because we aren't going to change the existing sort order of UUIDs.
>>>> We have no idea what applications might be dependent on that.
>>>>
>>>> As Vitalii correctly pointed out, your beef is not with the physical
>>>> storage of UUIDs anyway: you just wish they'd sort differently, since
>>>> that is what determines the behavior of a btree index.  But we aren't
>>>> going to change the sort ordering because that's an even bigger
>>>> compatibility break than changing the physical storage; it'd affect
>>>> application-visible semantics.
>>>>
>>>> What you might want to think about is creating a function that maps
>>>> UUIDs into an ordering that makes sense to you, and then creating
>>>> a unique index over that function instead of the raw UUIDs.  That
>>>> would give the results you want without having to negotiate with the
>>>> rest of the world about whether it's okay to change the semantics
>>>> of type uuid.
>>>>
>>>
>>> FWIW that's essentially what I implemented as an extension some time
>>> ago. See [1] for a more detailed explanation and some benchmarks.
>>
>> Yes, I've seen that before. Pretty nice work you put together there and
>> I'll surely have a look at it but we certainly need the node id in
>> compliance with v1 UUID's so that's why we've been generating UUID's at
>> the application side from day 1.
>>
>>>
>>> The thing is - it's not really desirable to get perfectly ordered
>>> ordering, because that would mean we never get back to older parts of
>>> the index (so if you delete data, we'd never fill that space).
>>
>> Wouldn't this apply also to any sequential-looking index (e.g. on
>> serial)?
>
> Yes, it does apply to any index on sequential (ordered) data. If you
> delete data from the "old" part (but not all, so the pages don't get
> completely empty), that space is lost. It's available for new data, but
> if we only insert to "new" part of the index, that's useless.

OK, thanks for clearing that up for me.
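[Editor's note: Tom Lane's suggestion upthread (a function mapping UUIDs into a time-ordered key, plus a unique index over that function) could be sketched for v1 UUIDs roughly as below. This is an illustrative sketch, not an existing API; `time_ordered_key` is a made-up name.]

```python
import uuid

def time_ordered_key(u: uuid.UUID) -> bytes:
    # Rearrange the v1 byte layout so that a plain bytewise comparison
    # follows the embedded 60-bit timestamp: time_hi first, then
    # time_mid, then time_low, then the rest (clock_seq + node).
    b = u.bytes  # big-endian: time_low=b[0:4], time_mid=b[4:6], time_hi_version=b[6:8]
    return b[6:8] + b[4:6] + b[0:4] + b[8:]

# Two hand-crafted v1-style UUIDs: u2 carries the *later* timestamp,
# yet its raw byte order sorts before u1. The mapped keys restore
# creation-time order.
u1 = uuid.UUID(fields=(0xFFFFFFFF, 0x0000, 0x1000, 0, 0, 0))
u2 = uuid.UUID(fields=(0x00000000, 0x0001, 0x1000, 0, 0, 0))
assert u1.bytes > u2.bytes                          # raw order disagrees with time
assert time_ordered_key(u1) < time_ordered_key(u2)  # mapped order follows time
```

On the database side this would correspond to a unique functional index over such a mapping, leaving the stored uuid values and their default sort order untouched.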

>
>> The main issue with the UUID's is that they almost instantly
>> consume a big part of the total value space (e.g. first value is
>> '01...' and the second comes as 'f3...'), which I would assume is not
>> very efficient with btrees (space reclaim? - bloat).
>>
>
> I don't understand what you mean here. Perhaps you misunderstand how
> btree indexes grow? It's not like we allocate separate pages for
> different values/prefixes - we insert the data until a page gets full,
> then it's split in half. There is some dependency on the order in which
> the values are inserted, but AFAIK random order is generally fine.

OK, I might not understand the basics of the btree implementation. Sorry
for that.

However, one of our issues with standard v1 UUID's was bloat of the
indexes, although we kept only a few months of data in them. I think this
was due to pages still containing at least one value and thus never being
reclaimed by vacuum. The indexes just continued to grow.

Now, as we have this different, ever-increasing prefix, we still see some
constant growth, but whenever historic data gets deleted (in a separate
process), space gets reclaimed.


>
>> One of our major concerns is to keep index size small (VACUUM can't be
>> run every minute) to fit into memory next to a lot of others.
>>
>
> I don't think this has much to do with vacuum - I don't see how it's
> related to the ordering of generated UUID values. And I don't see where
> the "can't run vacuum every minute" comes from.

OK, here are numbers (after VACUUM) which I found really strange, using
the query from pgexperts [1]:

     index_name      | bloat_pct | bloat_mb | index_mb | table_mb
---------------------+-----------+----------+----------+----------
 uuid_v1_pkey        |        38 |      363 |  965.742 |  950.172
 uuid_serial_pkey    |        11 |       74 |  676.844 |  950.172
 uuid_serial_8_pkey  |        46 |      519 | 1122.031 |  950.172
 uuid_serial_16_pkey |        39 |      389 |  991.195 |  950.172

...where the "8" and "16" denote a "shift" of the timestamp value,
implemented (in Java, where ">>>" is the unsigned right shift) with:

timestamp = (timestamp >>> (60 - shift)) | (timestamp << (shift + 4))

If someone could shed some light on why that is (the huge difference in
index sizes) I'd be happy.
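[Editor's note: to make the "shift" concrete, here is a rough Python sketch of reassembling the 60-bit timestamp from the shuffled v1 fields and applying the rotation, modelled on 64-bit unsigned arithmetic. Helper names are illustrative.]

```python
import uuid

def v1_timestamp(u: uuid.UUID) -> int:
    # The 60-bit timestamp, reassembled from the shuffled v1 fields:
    # time_hi (12 bits) | time_mid (16 bits) | time_low (32 bits)
    return ((u.time_hi_version & 0x0FFF) << 48) | (u.time_mid << 32) | u.time_low

def shifted(ts: int, shift: int) -> int:
    # The "shift" variant quoted above, on 64-bit unsigned integers
    # (the & emulates Java's fixed 64-bit width).
    mask64 = (1 << 64) - 1
    return ((ts >> (60 - shift)) | (ts << (shift + 4))) & mask64

u = uuid.uuid1()
assert v1_timestamp(u) == u.time  # matches Python's own field reassembly
```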

>
>> I've experimented with the rollover "prefix" myself but found that it
>> makes the index too big (same or larger index size than standard v1
>> UUIDs) and VACUUM too slow (almost as slow as a standard V1 UUID),
>> although INSERT performance wasn't that bad, our sequential UUID's were
>> way faster (at least when pre-generated and imported with COPY to eliminate
>> any value generation impact).
>>
>
> I very much doubt that has anything to do with the prefix. You'll need
> to share more details about how you did your tests.

OK, I'll see if I can prepare something and publish it.

>
>
> regards
>


Refs:
[1]
https://github.com/pgexperts/pgx_scripts/blob/master/bloat/index_bloat_check.sql



Re: UUID v1 optimizations...

Ancoron Luciferis
In reply to this post by Morris de Oryx
On 26/05/2019 06:27, Morris de Oryx wrote:

> I'm not worthy to post here, but a bit of a random thought.
>
> If I've followed the conversation correctly, the reason for a V1 UUID is
> partly to order and partition rows by a timestamp value, but without the
> cost of a timestamp column. As I was told as a boy, "Smart numbers
> aren't." Is it _absolutely_ the case that you can't afford another
> column? I don't know the ins and outs of the Postgres row format, but my
> impression is that it's a fixed size, so you may be able to add the
> column without splitting rows? Anyway, even if that's not true and the
> extra column costs you disk space, is it the index that concerns you? 
> Have you considered a timestamp column, or a numeric column with an
> epoch offset, and a BRIN index? If you insert data in pretty much
> chronological order, that might work well for you.

Exactly. We are using the information embedded within a v1 UUID in
multiple places and would like to avoid redundant information in the
database, as we strive to keep as much of it in memory as possible;
partitioning as well as timestamp (range) searching and sorting are
pretty common things for us.

For us, it's not an absolute "no-go" to have an additional column, but
the semantics of the v1 UUID already guarantee us uniqueness across all
partitions; it is the internal primary key and carries additional
information we are using (creation time, node).

In addition, the extra column would need yet another index, which would
bring our write performance down again. So while it would improve reads,
we're currently (still) more concerned about write performance.

The BRIN index is something I might need to test, though.

>
> Best of luck, I've enjoyed following the commentary.
>
>[...]




Re: UUID v1 optimizations...

Morris de Oryx
On Sun, May 26, 2019 at 7:38 PM Ancoron Luciferis <[hidden email]> wrote:

The BRIN index is something I might need to test, though.

Yes, check that out, it might give you some ideas. B-trees (in whatever variant) are inherently a large index type. They're ideal for finding unique values quickly, not ideal for storing redundant values, and pretty decent at finding ranges. A BRIN (Block Range Index), as implemented in Postgres, is good for finding unique values and ranges. But here's the thing: a BRIN index takes some absurdly small percentage of the space of a B-tree. You have to blink and check again to be sure you've figured it right.

How can a BRIN index be so much smaller? By throwing virtually everything out. A BRIN index doesn't store all of the values in a page, it stores the min/max value per block range and that's it. So it's a probabilistic index (of sorts). Is the value you're seeking on such-and-such a page? The answer is "No" or "Maybe." That's a fast test on a very cheap data structure.

When the answer is "maybe", the full page has to be loaded and scanned to determine if a specific value is present. So you have a very small index structure, but have to do more sequential scanning to determine whether a record is indexed in that page or not. In the real world, this can work out to be a high-performance structure at very low cost. But for it to work, your records need to be physically ordered (CLUSTER) by the condition in that index. And, going forward, you ought to be inserting in order too (more or less). So a BRIN index is a great option if you have an insertion pattern that allows it to remain efficient, and if your goal is range searching without a heavy B-tree index to maintain.
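[Editor's note: the "no or maybe" behaviour can be illustrated with a toy model that, like BRIN, keeps only a (min, max) summary per block of pages. Names are illustrative; this is not the actual Postgres implementation.]

```python
def build_summaries(blocks):
    # One (min, max) pair per block -- all a BRIN-style index stores.
    return [(min(b), max(b)) for b in blocks]

def candidate_blocks(summaries, value):
    # The "maybe" answers: blocks whose range could contain the value
    # and therefore must be scanned in full. Empty list = definite "no".
    return [i for i, (lo, hi) in enumerate(summaries) if lo <= value <= hi]

blocks = [[1, 3, 5], [10, 12, 14], [20, 25, 30]]  # physically ordered data
summaries = build_summaries(blocks)
candidate_blocks(summaries, 12)  # -> [1]: only block 1 needs a full scan
candidate_blocks(summaries, 7)   # -> []: rejected without touching the heap
```

If the data is not physically ordered, the per-block ranges overlap and nearly every probe answers "maybe", which is why insertion order matters so much here.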

I have no clue how BRIN indexes and partitioning interact.


Re: UUID v1 optimizations...

Tomas Vondra-4
In reply to this post by Morris de Oryx
On Sun, May 26, 2019 at 02:27:05PM +1000, Morris de Oryx wrote:

>[...]

No, an extra column is not a solution, because it has no impact on the
index on the UUID column. One of the problems with indexes on random
data is that the entries go to random parts of the index. In the extreme
case, each index insert goes to a different index page (since the last
checkpoint) and therefore has to write the whole page into the WAL.
That's what full-page writes do. This inflates the amount of WAL, may
trigger more frequent checkpoints and (of course) reduces the cache hit
ratio for index pages (because we have to touch many of them).

The point of generating UUIDs in a more sequential way is to limit this
behavior by "concentrating" the index inserts into a smaller part of the
index. That's why indexes on sequential data (say, generated from a
SERIAL column) perform better.
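[Editor's note: a minimal sketch of that idea is to put the timestamp in front, so that bytewise (and therefore btree) order follows generation order. This is deliberately *not* a standards-compliant v1 UUID; the function name is illustrative.]

```python
import os
import struct
import time
import uuid

def time_prefixed_uuid() -> uuid.UUID:
    # 8 bytes of big-endian nanosecond timestamp, then 8 random bytes:
    # values generated later compare later under bytewise (memcmp) order,
    # so index inserts concentrate at the right-hand edge of the btree.
    ts = struct.pack(">Q", time.time_ns())
    return uuid.UUID(bytes=ts + os.urandom(8))

a = time_prefixed_uuid()
time.sleep(0.02)             # ensure a strictly later timestamp, even on coarse clocks
b = time_prefixed_uuid()
assert a < b                 # UUID comparison follows the big-endian byte order
```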


regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

