Unique index on hash of jsonb value - correct solution?


Unique index on hash of jsonb value - correct solution?

Albrecht Dreß
Hi all,

in a database I have a table with a text, a jsonb and a bytea column, which together shall be unique, like:

<snip>
  Column  |  Type  | Collation | Nullable |               Default
---------+--------+-----------+----------+-------------------------------------
  id      | bigint |           | not null | nextval('demotab_id_seq'::regclass)
  textval | text   |           | not null |
  jsonval | jsonb  |           | not null |
  blobval | bytea  |           | not null |
Indexes:
     "demo_idx" UNIQUE, btree (textval, jsonval, blobval)
</snip>

This seems to work just fine in most cases, but adding a bigger json value (the text and the bytea columns are always short) results in an “index row size 2840 exceeds maximum 2712 for index "…"” error.  Following the hint in the error message, I replaced the index by

<snip>
Indexes:
     "demo_idx" UNIQUE, btree (textval, md5(jsonval::text), blobval)
</snip>

which seems to solve the issue.
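
For reference, the rebuilt index corresponds to a statement along these lines (the table name is inferred from the sequence default shown above):

<snip>
CREATE UNIQUE INDEX demo_idx ON demotab (textval, md5(jsonval::text), blobval);
</snip>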

My question: is this approach (a) correct and (b) still safe if the items in the jsonb (always a dict in my case) are re-ordered?  I tested a few cases, and trying to insert e.g. '{"a":1,"b":2}' and '{"b":2,"a":1}' actually does produce the same hash (i.e. the 2nd insert attempt is rejected due to the index), but is this guaranteed by design for every case?  Or is there a better solution for this use case?

Thanks in advance,
Albrecht.


Re: Unique index on hash of jsonb value - correct solution?

Laurenz Albe
On Mon, 2020-05-18 at 18:43 +0200, Albrecht Dreß wrote:

> in a database I have a table with a text, a jsonb and a bytea column, which
> together shall be unique, like:
>
> <snip>
>   Column  |  Type  | Collation | Nullable |               Default
> ---------+--------+-----------+----------+-------------------------------------
>   id      | bigint |           | not null | nextval('demotab_id_seq'::regclass)
>   textval | text   |           | not null |
>   jsonval | jsonb  |           | not null |
>   blobval | bytea  |           | not null |
> Indexes:
>      "demo_idx" UNIQUE, btree (textval, jsonval, blobval)
> </snip>
>
> This seems to work just fine in most cases, but adding a bigger json value (the text
> and the bytea columns are always short) results in an “index row size 2840 exceeds
> maximum 2712 for index "…"” error.  Following the hint in the error message,
> I replaced the index by
>
> <snip>
> Indexes:
>      "demo_idx" UNIQUE, btree (textval, md5(jsonval::text), blobval)
> </snip>
>
> which seems to solve the issue.
>
> My question: is this approach (a) correct and (b) still safe if the items in the
> jsonb (always a dict in my case) are re-ordered?  I tested a few cases, and trying
> to insert e.g. '{"a":1,"b":2}' and '{"b":2,"a":1}' actually does
> produce the same hash (i.e. the 2nd insert attempt is rejected due to the index),
> but is this guaranteed by design for every case?  Or is there a better solution
> for this use case?

"jsonb" uses an internal binary representation that reorders the attributes in
a deterministic fashin, so yes, that is guaranteed.
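
This is easy to see in psql, since the key order is normalised as soon as a value is cast to "jsonb":

select '{"b":2,"a":1}'::jsonb;
      jsonb
------------------
 {"a": 1, "b": 2}
(1 row)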

I would use an "md5" hash for the "bytea" column as well to keep the index smaller.

There can be collisions with an md5 hash, so it is possible for duplicates to
creep in.  Besides, if you have an auto-generated "bigint" column, you should
make that the primary key.  Perhaps then the uniqueness condition is no longer
necessary.
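
If you do keep the uniqueness condition, a rough sketch of the smaller index plus the primary key could look like this (still using "md5", despite the collision caveat):

ALTER TABLE demotab ADD PRIMARY KEY (id);
CREATE UNIQUE INDEX demo_idx ON demotab (textval, md5(jsonval::text), md5(blobval));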

Yours,
Laurenz Albe
--
Cybertec | https://www.cybertec-postgresql.com




Re: Unique index on hash of jsonb value - correct solution?

Stephen Frost
Greetings,

* Laurenz Albe ([hidden email]) wrote:

> On Mon, 2020-05-18 at 18:43 +0200, Albrecht Dreß wrote:
> > in a database I have a table with a text, a jsonb and a bytea column, which
> > together shall be unique, like:
> >
> > <snip>
> >   Column  |  Type  | Collation | Nullable |               Default
> > ---------+--------+-----------+----------+-------------------------------------
> >   id      | bigint |           | not null | nextval('demotab_id_seq'::regclass)
> >   textval | text   |           | not null |
> >   jsonval | jsonb  |           | not null |
> >   blobval | bytea  |           | not null |
> > Indexes:
> >      "demo_idx" UNIQUE, btree (textval, jsonval, blobval)
> > </snip>
> >
> > This seems to work just fine in most cases, but adding a bigger json value (the text
> > and the bytea columns are always short) results in an “index row size 2840 exceeds
> > maximum 2712 for index "…"” error.  Following the hint in the error message,
> > I replaced the index by
> >
> > <snip>
> > Indexes:
> >      "demo_idx" UNIQUE, btree (textval, md5(jsonval::text), blobval)
> > </snip>
> >
> > which seems to solve the issue.
That's only going to work up to a certain size for that text and blob
value too, of course.  This is looking like it might be some kind of KV
store, which is generally discouraged.

> > My question: is this approach (a) correct and (b) still safe if the items in the
> > jsonb (always a dict in my case) are re-ordered?  I tested a few cases, and trying
> > to insert e.g. '{"a":1,"b":2}' and '{"b":2,"a":1}' actually does
> > produce the same hash (i.e. the 2nd insert attempt is rejected due to the index),
> > but is this guaranteed by design for every case?  Or is there a better solution
> > for this use case?
>
> "jsonb" uses an internal binary representation that reorders the attributes in
> a deterministic fashion, so yes, that is guaranteed.

Sure, but the md5() isn't going to run on the jsonb data directly, it'll
run on the text representation that's returned, and I'm not at all
convinced that the project has agreed to make that text form always be
canonical and identical forever, including across major version
upgrades.  Further, there are some other cases to consider, such as: do
you think that 0.00001230 is different from 0.0000123?  Since we'll
store numbers in jsonb as numeric, we preserve trailing fractional
zeroes.

e.g.:

select md5(c1::text), md5(c2::text), c1, c2 from (select '{"reading": 1.230e-5}'::jsonb, '{"reading": 1.23e-5}'::jsonb) as x(c1,c2);
               md5                |               md5                |           c1            |           c2          
----------------------------------+----------------------------------+-------------------------+------------------------
 d793380db9196092889ccdb4c84f2bd4 | 4fb5efd00c7a0f0e0e6f6375b8a834d5 | {"reading": 0.00001230} | {"reading": 0.0000123}
(1 row)

You might wonder what PG normally returns when comparing those
numbers, but that's easy to see:

select '1.230e-5'::numeric = '1.23e-5'::numeric;
 ?column?
----------
 t
(1 row)

> I would use an "md5" hash for the "bytea" column as well to keep the index smaller.

This goes to the point above that you might want to include the other
columns in the hash, or hash them independently if needed, to avoid
hitting the max index row size.
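
For instance, hashing each column independently keeps every index entry at a fixed, small size (with the same md5 collision caveat as above):

CREATE UNIQUE INDEX demo_idx ON demotab (md5(textval), md5(jsonval::text), md5(blobval));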

> There can be collisions with an md5 hash, so it is possible for duplicates to
> creep in.  Besides, if you have an auto-generated "bigint" column, you should
> make that the primary key.  Perhaps then the uniqueness condition is no longer
> necessary.

The issue isn't just with the md5 hash, as illustrated above.  Having an
alternative column that can be used as a primary key certainly does
*not* remove the general need to ensure that these columns are unique,
if that's what the data model calls for.

Overall, if what you need is uniqueness here, I'd strongly reconsider
the choice to use jsonb to store this data and instead see if you can
break the data out into proper columns with a proper unique constraint
across them (or across some hash of the combination of them that's done
in a way that is clear and unambiguous).
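
As a minimal sketch, assuming the jsonb dict can be flattened into ordinary columns ("reading" stands in here for one extracted jsonb key):

CREATE TABLE demotab (
    id      bigserial PRIMARY KEY,
    textval text      NOT NULL,
    reading numeric   NOT NULL,  -- formerly a key inside the jsonb value
    blobval bytea     NOT NULL,
    UNIQUE (textval, reading, blobval)
);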

Thanks,

Stephen


Re: Unique index on hash of jsonb value - correct solution?

Albrecht Dreß
Hi,

first, thanks a lot for your input!

On 19.05.20 16:16, Stephen Frost wrote:
> That's only going to work up to a certain size for that text and blob value too, of course.

These three columns (stripped-down code, there are more in the table) are a “lazy” import from various Python scripts.  The text and bytea (actually a sha256 hash) fields always have a fixed, short size, but the json varies between a single item and a quite large dict, so extracting the data into columns seems cumbersome.

> This is looking like it might be some kind of KV store which is generally discouraged.

Well, yes, somehow…

>> "jsonb" uses an internal binary representation that reorders the attributes in a deterministic fashin, so yes, that is guaranteed.
>
> Sure, but the md5() isn't going to run on the jsonb data directly, it'll run on the text representation that's returned, and I'm not at all convinced that the project has agreed to make that text form always be canonical and identical forever, including across major version upgrades.

Ok, I see.

> Further, there are some other cases to consider, such as: do you think that 0.00001230 is different from 0.0000123?  Since we'll store numbers in jsonb as numeric, we preserve trailing fractional zeroes.

Good point; currently this is no problem for me (strings and ints only in the json dicts), but it might become a serious issue in the future.

> Overall, if what you need is uniqueness here, I'd strongly reconsider the choice to use jsonb to store this data and instead see if you can break the data out into proper columns with a proper unique constraint across them (or across some hash of the combination of them that's done in a way that is clear and unambiguous).

Got your point, I hope: probably the clean solution would be a defined, sorted serialisation plus hashing of the json (and the text and bytea columns), either in the Python producers or in the (already existing) db function called by the Python scripts for inserting data…
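
To make that concrete, a hypothetical sketch of the db-function variant: the caller passes the JSON already serialised in a canonical form (sorted keys, no insignificant whitespace), and the function stores a hash of that canonical text in an extra column ("json_hash" and the function name are invented here; sha256() requires PostgreSQL 11+):

CREATE FUNCTION demo_insert(p_textval text, p_canonical_json text, p_blobval bytea)
RETURNS bigint
LANGUAGE sql AS $$
    INSERT INTO demotab (textval, jsonval, blobval, json_hash)
    VALUES (p_textval, p_canonical_json::jsonb, p_blobval,
            sha256(convert_to(p_canonical_json, 'UTF8')))
    RETURNING id;
$$;

together with a unique index like:

CREATE UNIQUE INDEX demo_idx ON demotab (textval, json_hash, blobval);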

Thanks again,
Albrecht.
