GiST limits on contrib/cube with dimension > 100?

classic Classic list List threaded Threaded
10 messages Options
Reply | Threaded
Open this post in threaded view
|

GiST limits on contrib/cube with dimension > 100?

Siarhei Siniak-2
I've been using cube extension recompiled with
#define MAX_DIM 256.

But with a version 11.3 I'm getting the following error:
failed to add item to index page in <index_name>

There's a regression unit test in contrib/cube/expected/cube.out:

CREATE TABLE test_cube (c cube);
\copy test_cube from 'data/test_cube.data'
CREATE INDEX test_cube_ix ON test_cube USING gist (c);
SELECT * FROM test_cube WHERE c && '(3000,1000),(0,0)' ORDER BY c;

I've created gist index in the same way, i.e. create index <index_name> on <table_name> using gist(<column_name>);

If MAX_DIM equals to 512, btree index complaints as:
index row size 4112 exceeds maximum 2712 for index <index_name>
HINT:  Values larger than 1/3 of a buffer page cannot be indexed.                                      
Consider a function index of an MD5 hash of the value, or use full text indexing.   

That's why 256 has been set.

But gist doesn't provide explanation on its error.

These are the places where the message might have been generated:
src/backend/access/gist/gist.c:418:                                     elog(ERROR, "failed to add item to index page in \"%s\"", RelationGetRelationName(rel));
src/backend/access/gist/gist.c:540:                                     elog(ERROR, "failed to add item to index page in \"%s\"",

Question is what restrains from setting MAX_DIM bigger than 100 in a custom recompiled cube extension version?
In practice the error messages are too cryptic.

contrib/cube/cube.c has the following methods regarding GIST:
/*
** GiST support methods
*/

PG_FUNCTION_INFO_V1(g_cube_consistent);
PG_FUNCTION_INFO_V1(g_cube_compress);
PG_FUNCTION_INFO_V1(g_cube_decompress);
PG_FUNCTION_INFO_V1(g_cube_penalty);
PG_FUNCTION_INFO_V1(g_cube_picksplit);
PG_FUNCTION_INFO_V1(g_cube_union);
PG_FUNCTION_INFO_V1(g_cube_same);
PG_FUNCTION_INFO_V1(g_cube_distance);

g_cube_compress has the following body:
    PG_RETURN_DATUM(PG_GETARG_DATUM(0));

Does it just returns void pointer to the underlying x array?
cube data structure:
typedef struct NDBOX
{
    /* varlena header (do not touch directly!) */
    int32        vl_len_;

    /*----------
     * Header contains info about NDBOX. For binary compatibility with old
     * versions, it is defined as "unsigned int".
     *
     * Following information is stored:
     *
     *    bits 0-7  : number of cube dimensions;
     *    bits 8-30 : unused, initialize to zero;
     *    bit  31   : point flag. If set, the upper right coordinates are not
     *                stored, and are implicitly the same as the lower left
     *                coordinates.
     *----------
     */
    unsigned int header;

    /*
     * The lower left coordinates for each dimension come first, followed by
     * upper right coordinates unless the point flag is set.
     */
    double        x[FLEXIBLE_ARRAY_MEMBER];
} NDBOX;

Can it be a problem of not fitting into some limits when building or updating gist index for cube with MAX_DIM > 100?
 

Reply | Threaded
Open this post in threaded view
|

Re: GiST limits on contrib/cube with dimension > 100?

Daniel Gustafsson
> On 9 Jun 2019, at 20:05, Siarhei Siniak <[hidden email]> wrote:
>
> I've been using cube extension recompiled with
> #define MAX_DIM 256.
>
> But with a version 11.3 I'm getting the following error:
> failed to add item to index page in <index_name>

This sounds like a variant of the issue reported on -bugs in
[hidden email]
and is also reproducible on HEAD.

cheers ./daniel

Reply | Threaded
Open this post in threaded view
|

Re: GiST limits on contrib/cube with dimension > 100?

Siarhei Siniak-2
Can you point out a failling unit test in the codebase?

P.S sorry for a late reply, has got this message in the spam folder )
Le lundi 10 juin 2019 à 14:57:32 UTC+3, Daniel Gustafsson <[hidden email]> a écrit :


> On 9 Jun 2019, at 20:05, Siarhei Siniak <[hidden email]> wrote:

>
> I've been using cube extension recompiled with
> #define MAX_DIM 256.
>
> But with a version 11.3 I'm getting the following error:
> failed to add item to index page in <index_name>


This sounds like a variant of the issue reported on -bugs in
[hidden email]
and is also reproducible on HEAD.

cheers ./daniel
Reply | Threaded
Open this post in threaded view
|

Re: GiST limits on contrib/cube with dimension > 100?

Siarhei Siniak-2
I've added debug prints to cube extension.
g_custom_cube_a_f8
g_custom_cube_picksplit
are the only called methods
after that it prints
import psycopg2
import logging
import numpy
import unittest


import python.utils.logging
import python.custom_db.backends
import python.custom_db.backends.postgresql


class TestPostgresql(unittest.TestCase):
    def test_gist(self):
        b = python.custom_db.backends.postgresql.Postgresql(
            databases=dict(
                test=dict(
                    minconn=1,
                    maxconn=1
                )
            )
        )

        b.connect()

        try:
            c = b.get_connection(use='test')

            c2 = c[0]

            with c2.cursor() as cur:
                cur.execute(r'''
                    drop table if exists test;
                    create table test(image_id integer primary key, latent_code custom_cube);
                    create index lc_idx on test using gist(latent_code);
                ''')
                c2.commit()

                with self.assertRaises(psycopg2.errors.InternalError_):
                    for k in range(10):
                        logging.info('test_postgresql.test_gist, k = %d' % k)
                        cur.execute(
                            r'''
                                insert into test (image_id, latent_code)
                                values (%s, custom_cube(%s))
                            ''',
                            [
                                k,
                                [float(x) for x in numpy.random.uniform(0, 1, 512)],
                            ]
                        )
                        c2.commit()
        finally:
            b.put_connection(c2, 'test')

```
Reply | Threaded
Open this post in threaded view
|

Re: GiST limits on contrib/cube with dimension > 100?

Siarhei Siniak-2
In reply to this post by Siarhei Siniak-2
I've added debug prints to cube extension. g_custom_cube_a_f8 and g_custom_cube_picksplit are the only called methods.
After that it prints:
     ERROR:  failed to add item to index page in "lc_idx"
Cube extension modifications:
    #define MAX_DIM (512)
Python test source code has been attached to the letter.

P.S.
sorry for the previous letter, didn't configure plain text composition


test_postgresql.py (2K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: GiST limits on contrib/cube with dimension > 100?

Andrey Borodin-2
In reply to this post by Siarhei Siniak-2
Hi!

> 9 июня 2019 г., в 23:05, Siarhei Siniak <[hidden email]> написал(а):
>
> I've been using cube extension recompiled with
> #define MAX_DIM 256.
>
> But with a version 11.3 I'm getting the following error:
> failed to add item to index page in <index_name>

So, you have changed source code (removing dim constraint) and get cryptic error after that. In some way this is expected...

Though, the reason why cube has this limit is not physical. R-tree's (cube+gist) just do not work effectively with more than 10 non-correlated dimensions.
If you have some correlated dimensions - you, probably, should invent something more clever that just cube - plain array of D*2*doubles for each tuple.

100 is upper bound for sane data that can be indexed in case of cube...

Nevertheless, we can improve AddTuple messages. But there is not such strict guidelines as with B-tree. Probably, tuples should not be bigger than half of usable page space.

Best regards, Andrey Borodin.

Reply | Threaded
Open this post in threaded view
|

Re: GiST limits on contrib/cube with dimension > 100?

Siarhei Siniak-2
A uniform set of points with a dimension 128 and type cube. That has a size of 50 ** 3. Can be queried for a nearest neighbor at a speed of 10 queries per second with limit varying from 1 to 25.
It works better than when no index used at all. So gist gives here a speed up.

The documentation of postgresql doesn't mention complexity of such an index. I've got confused as to its speed.

Does postgresql allow for an approximate nearest neighbor search?
https://github.com/erikbern/ann-benchmarks has a lot of efficient implementations.
Reply | Threaded
Open this post in threaded view
|

Re: GiST limits on contrib/cube with dimension > 100?

Andrey Borodin-2


> 12 июня 2019 г., в 15:11, Siarhei Siniak <[hidden email]> написал(а):
>
> A uniform set of points with a dimension 128 and type cube. That has a size of 50 ** 3. Can be queried for a nearest neighbor at a speed of 10 queries per second with limit varying from 1 to 25.
> It works better than when no index used at all. So gist gives here a speed up.

Then, I think, data is correlated. I believe it is possible to implement something better for high dimensional KNN in GiST than cube.


>
> The documentation of postgresql doesn't mention complexity of such an index. I've got confused as to its speed.
>
> Does postgresql allow for an approximate nearest neighbor search?
> https://github.com/erikbern/ann-benchmarks has a lot of efficient implementations.

ANN is beyond concepts of SQL standard: database with index must return same results as without index.
I can add ANN to github.com/x4m/ags which is a fork of GiST.

But PostgreSQL adds a lot of OLTP overhead:
1. it is crash-safe
2. it supports concurrent operations
2a. in a very unexpected way, for example in serializable mode it guaranties you that if you were looking for nearest neighbor there will not appear any new closer neighbor until the end of your transaction.
3. it allows extensibility and has abstraction layers
4. it has declarative querying language
etcetc

All this comes at a cost of database that can do anything and be anything. It its very hard to compete with specialized indexes that only do ANN.

Yet, as far as I know, no one really pursued the idea of fast high dimensional ANN in PG.

Best regards, Andrey Borodin.

Reply | Threaded
Open this post in threaded view
|

Re: GiST limits on contrib/cube with dimension > 100?

Siarhei Siniak-2

>ANN is beyond concepts of SQL standard: database with index must return same results as without index.
>I can add ANN to github.com/x4m/ags which is a fork of GiST.
How to recompile that extension and not to get a name conflict with a standard one?
I've renamed everything for cube extension. When I needed to fork it.
But it's impractical.
Reply | Threaded
Open this post in threaded view
|

Re: GiST limits on contrib/cube with dimension > 100?

Tom Lane-2
In reply to this post by Andrey Borodin-2
Andrey Borodin <[hidden email]> writes:
>> 9 июня 2019 г., в 23:05, Siarhei Siniak <[hidden email]> написал(а):
>> I've been using cube extension recompiled with
>> #define MAX_DIM 256.
>> But with a version 11.3 I'm getting the following error:
>> failed to add item to index page in <index_name>

> So, you have changed source code (removing dim constraint) and get cryptic error after that. In some way this is expected...

Yeah.  There is necessarily a limit on the size of index entries,
and it's getting exceeded.

> Nevertheless, we can improve AddTuple messages. But there is not such strict guidelines as with B-tree. Probably, tuples should not be bigger than half of usable page space.

I do not think "improve AddTuple messages" is going to be enough to fix
this.  Daniel was correct that this is the same problem seen in

https://www.postgresql.org/message-id/flat/AM6PR06MB57318C9882C021879DD4101EA3100%40AM6PR06MB5731.eurprd06.prod.outlook.com

See my reply there.  I think we should continue this discussion on that
thread, since it has seniority.

                        regards, tom lane