Efficiently searching for the most recent rows where a column matches any result from a different query

Next Topic
 
classic Classic list List threaded Threaded
5 messages Options
Reply | Threaded
Open this post in threaded view
|

Efficiently searching for the most recent rows where a column matches any result from a different query

mkslaf
Hello,

I have the following schema:

    CREATE TABLE users (
        id   BIGSERIAL PRIMARY KEY,
        name TEXT      NOT NULL UNIQUE
    );
    
    CREATE TABLE friends (
        user_id        BIGINT NOT NULL REFERENCES users,
        friend_user_id BIGINT NOT NULL REFERENCES users,
        UNIQUE (user_id, friend_user_id)
    );
    
    CREATE TABLE posts (
        id      BIGSERIAL PRIMARY KEY,
        user_id BIGINT    NOT NULL REFERENCES users,
        content TEXT      NOT NULL
    );
    CREATE INDEX posts_user_id_id_index ON posts(user_id, id);

Each user can unilaterally follow any number of friends. The posts table has a large number of rows and is rapidly growing.

My goal is to retrieve the 10 most recent posts of a user's friends. This query gives the correct result, but is inefficient:

    SELECT posts.id, users.name, posts.content
    FROM posts JOIN users ON posts.user_id = users.id
    WHERE posts.user_id IN (SELECT friend_user_id FROM friends WHERE user_id = 1)
    ORDER BY posts.id DESC LIMIT 10;

If the user's friends have recently posted, the query is still reasonably fast (https://explain.depesz.com/s/6ykR). But if the user's friends haven't recently posted or the user has no friends, it quickly deteriorates (https://explain.depesz.com/s/OnoG).

If I match only a single post author (e.g. WHERE posts.user_id = 5), Postgres uses the index posts_user_id_id_index. But if I use IN, the index doesn't appear to be used at all.

How can I get these results more efficiently?

I've uploaded the schema and the queries I've tried to dbfiddle at http://dbfiddle.uk/?rdbms=postgres_9.6&fiddle=cf1489b7f6d53c3fe0b55ed7ccbad1f0. The output of "SELECT version()" is "PostgreSQL 9.6.5 on x86_64-pc-linux-gnu, compiled by gcc (Debian 4.9.2-10) 4.9.2, 64-bit" for me.

Thank you in advance for any insights, pointers or suggestions you are able to give me.

Regards,
Milo
Reply | Threaded
Open this post in threaded view
|

Re: Efficiently searching for the most recent rows where a column matches any result from a different query

Hellmuth Vargas
Hello:


EXPLAIN (ANALYZE, BUFFERS)
select * from (
SELECT posts.id, users.name, posts.content
FROM posts JOIN users ON posts.user_id = users.id
WHERE posts.user_id IN (SELECT friend_user_id FROM friends WHERE user_id = 1)

ORDER BY posts.id DESC 
) as a
ORDER BY a.id DESC 
LIMIT 10;

------


EXPLAIN (ANALYZE, BUFFERS)
select * from (
SELECT posts.id, users.name, posts.content
FROM posts JOIN users ON posts.user_id = users.id
WHERE posts.user_id IN (SELECT friend_user_id FROM friends WHERE user_id = 2)

ORDER BY posts.id DESC 
) as a
ORDER BY a.id DESC 
LIMIT 10;

2018-02-13 8:28 GMT-05:00 <[hidden email]>:
Hello,

I have the following schema:

    CREATE TABLE users (
        id   BIGSERIAL PRIMARY KEY,
        name TEXT      NOT NULL UNIQUE
    );
    
    CREATE TABLE friends (
        user_id        BIGINT NOT NULL REFERENCES users,
        friend_user_id BIGINT NOT NULL REFERENCES users,
        UNIQUE (user_id, friend_user_id)
    );
    
    CREATE TABLE posts (
        id      BIGSERIAL PRIMARY KEY,
        user_id BIGINT    NOT NULL REFERENCES users,
        content TEXT      NOT NULL
    );
    CREATE INDEX posts_user_id_id_index ON posts(user_id, id);

Each user can unilaterally follow any number of friends. The posts table has a large number of rows and is rapidly growing.

My goal is to retrieve the 10 most recent posts of a user's friends. This query gives the correct result, but is inefficient:

    SELECT posts.id, users.name, posts.content
    FROM posts JOIN users ON posts.user_id = users.id
    WHERE posts.user_id IN (SELECT friend_user_id FROM friends WHERE user_id = 1)
    ORDER BY posts.id DESC LIMIT 10;

If the user's friends have recently posted, the query is still reasonably fast (https://explain.depesz.com/s/6ykR). But if the user's friends haven't recently posted or the user has no friends, it quickly deteriorates (https://explain.depesz.com/s/OnoG).

If I match only a single post author (e.g. WHERE posts.user_id = 5), Postgres uses the index posts_user_id_id_index. But if I use IN, the index doesn't appear to be used at all.

How can I get these results more efficiently?

I've uploaded the schema and the queries I've tried to dbfiddle at http://dbfiddle.uk/?rdbms=postgres_9.6&fiddle=cf1489b7f6d53c3fe0b55ed7ccbad1f0. The output of "SELECT version()" is "PostgreSQL 9.6.5 on x86_64-pc-linux-gnu, compiled by gcc (Debian 4.9.2-10) 4.9.2, 64-bit" for me.

Thank you in advance for any insights, pointers or suggestions you are able to give me.

Regards,
Milo



--
Cordialmente,

Ing. Hellmuth I. Vargas S.
Esp. Telemática y Negocios por Internet 
Oracle Database 10g Administrator Certified Associate
EnterpriseDB Certified PostgreSQL 9.3 Associate

Reply | Threaded
Open this post in threaded view
|

Re: Efficiently searching for the most recent rows where a column matches any result from a different query

mkslaf
In reply to this post by mkslaf

Hello Hellmuth,

Thank you for your response.

I've uploaded the query plan for the first query (user_id=2) here: https://gist.github.com/anonymous/6d251b277ef71f8977b03cab91fedccd
The query plan for the second query (user_id=1) can be found here: https://gist.github.com/anonymous/32ed485b40cce2651ddc52661f3e7f7b

Just like in the original queries, posts_user_id_id_index is not used.

Kind regards,
Milo

13. Feb 2018 22:13 by [hidden email]:

Hello:


EXPLAIN (ANALYZE, BUFFERS)
select * from (
SELECT posts.id, users.name, posts.content
FROM posts JOIN users ON posts.user_id = users.id
WHERE posts.user_id IN (SELECT friend_user_id FROM friends WHERE user_id = 1)

ORDER BY posts.id DESC 
) as a
ORDER BY a.id DESC 
LIMIT 10;

------


EXPLAIN (ANALYZE, BUFFERS)
select * from (
SELECT posts.id, users.name, posts.content
FROM posts JOIN users ON posts.user_id = users.id
WHERE posts.user_id IN (SELECT friend_user_id FROM friends WHERE user_id = 2)

ORDER BY posts.id DESC 
) as a
ORDER BY a.id DESC 
LIMIT 10;

2018-02-13 8:28 GMT-05:00 <[hidden email]>:
Hello,

I have the following schema:

    CREATE TABLE users (
        id   BIGSERIAL PRIMARY KEY,
        name TEXT      NOT NULL UNIQUE
    );
    
    CREATE TABLE friends (
        user_id        BIGINT NOT NULL REFERENCES users,
        friend_user_id BIGINT NOT NULL REFERENCES users,
        UNIQUE (user_id, friend_user_id)
    );
    
    CREATE TABLE posts (
        id      BIGSERIAL PRIMARY KEY,
        user_id BIGINT    NOT NULL REFERENCES users,
        content TEXT      NOT NULL
    );
    CREATE INDEX posts_user_id_id_index ON posts(user_id, id);

Each user can unilaterally follow any number of friends. The posts table has a large number of rows and is rapidly growing.

My goal is to retrieve the 10 most recent posts of a user's friends. This query gives the correct result, but is inefficient:

    SELECT posts.id, users.name, posts.content
    FROM posts JOIN users ON posts.user_id = users.id
    WHERE posts.user_id IN (SELECT friend_user_id FROM friends WHERE user_id = 1)
    ORDER BY posts.id DESC LIMIT 10;

If the user's friends have recently posted, the query is still reasonably fast (https://explain.depesz.com/s/6ykR). But if the user's friends haven't recently posted or the user has no friends, it quickly deteriorates (https://explain.depesz.com/s/OnoG).

If I match only a single post author (e.g. WHERE posts.user_id = 5), Postgres uses the index posts_user_id_id_index. But if I use IN, the index doesn't appear to be used at all.

How can I get these results more efficiently?

I've uploaded the schema and the queries I've tried to dbfiddle at http://dbfiddle.uk/?rdbms=postgres_9.6&fiddle=cf1489b7f6d53c3fe0b55ed7ccbad1f0. The output of "SELECT version()" is "PostgreSQL 9.6.5 on x86_64-pc-linux-gnu, compiled by gcc (Debian 4.9.2-10) 4.9.2, 64-bit" for me.

Thank you in advance for any insights, pointers or suggestions you are able to give me.

Regards,
Milo



--
Cordialmente,

Ing. Hellmuth I. Vargas S.
Esp. Telemática y Negocios por Internet 
Oracle Database 10g Administrator Certified Associate
EnterpriseDB Certified PostgreSQL 9.3 Associate

Reply | Threaded
Open this post in threaded view
|

Re: Efficiently searching for the most recent rows where a column matches any result from a different query

Nandakumar M
Hi,

I myself am new to performance tuning queries. But, from what you have
said it looks like Postgres has to go through all the posts using the
backward index scan and find out whether their author is amongst the
user's friends list.

Since the number of friends is arbitrary for any user, even if a user
has few friends (or no friends at all), the stats will not reflect
this and so the planner cannot take advantage of this to directly
fetch the posts from the small set of friends.

My suggestion (which involves changing the schema and query) is to
have a last_post_id or last_posted_time column in user table, find the
last 10 friends who have posted first and then use it to find the last
10 posts. Something like,

select * from posts where posts.author_id in (select id from users
where id in (select friend_id from user_friend where user_id = 1) and
last_posted_time is not null order by last_posted_time desc limit 10);

I am not sure if this is the best way to solve this. If there are
better solutions I would be happy to learn the same.

Regards
Nanda

On Thu, Feb 15, 2018 at 5:48 PM,  <[hidden email]> wrote:

>
> Hello Hellmuth,
>
> Thank you for your response.
>
> I've uploaded the query plan for the first query (user_id=2) here:
> https://gist.github.com/anonymous/6d251b277ef71f8977b03cab91fedccd
> The query plan for the second query (user_id=1) can be found here:
> https://gist.github.com/anonymous/32ed485b40cce2651ddc52661f3e7f7b
>
> Just like in the original queries, posts_user_id_id_index is not used.
>
> Kind regards,
> Milo
>
> 13. Feb 2018 22:13 by [hidden email]:
>
> Hello:
>
>
> EXPLAIN (ANALYZE, BUFFERS)
> select * from (
> SELECT posts.id, users.name, posts.content
> FROM posts JOIN users ON posts.user_id = users.id
> WHERE posts.user_id IN (SELECT friend_user_id FROM friends WHERE user_id =
> 1)
>
> ORDER BY posts.id DESC
> ) as a
> ORDER BY a.id DESC
> LIMIT 10;
>
> ------
>
>
> EXPLAIN (ANALYZE, BUFFERS)
> select * from (
> SELECT posts.id, users.name, posts.content
> FROM posts JOIN users ON posts.user_id = users.id
> WHERE posts.user_id IN (SELECT friend_user_id FROM friends WHERE user_id =
> 2)
>
> ORDER BY posts.id DESC
> ) as a
> ORDER BY a.id DESC
> LIMIT 10;
>
> 2018-02-13 8:28 GMT-05:00 <[hidden email]>:
>>
>> Hello,
>>
>> I have the following schema:
>>
>>     CREATE TABLE users (
>>         id   BIGSERIAL PRIMARY KEY,
>>         name TEXT      NOT NULL UNIQUE
>>     );
>>
>>     CREATE TABLE friends (
>>         user_id        BIGINT NOT NULL REFERENCES users,
>>         friend_user_id BIGINT NOT NULL REFERENCES users,
>>         UNIQUE (user_id, friend_user_id)
>>     );
>>
>>     CREATE TABLE posts (
>>         id      BIGSERIAL PRIMARY KEY,
>>         user_id BIGINT    NOT NULL REFERENCES users,
>>         content TEXT      NOT NULL
>>     );
>>     CREATE INDEX posts_user_id_id_index ON posts(user_id, id);
>>
>> Each user can unilaterally follow any number of friends. The posts table
>> has a large number of rows and is rapidly growing.
>>
>> My goal is to retrieve the 10 most recent posts of a user's friends. This
>> query gives the correct result, but is inefficient:
>>
>>     SELECT posts.id, users.name, posts.content
>>     FROM posts JOIN users ON posts.user_id = users.id
>>     WHERE posts.user_id IN (SELECT friend_user_id FROM friends WHERE
>> user_id = 1)
>>     ORDER BY posts.id DESC LIMIT 10;
>>
>> If the user's friends have recently posted, the query is still reasonably
>> fast (https://explain.depesz.com/s/6ykR). But if the user's friends haven't
>> recently posted or the user has no friends, it quickly deteriorates
>> (https://explain.depesz.com/s/OnoG).
>>
>> If I match only a single post author (e.g. WHERE posts.user_id = 5),
>> Postgres uses the index posts_user_id_id_index. But if I use IN, the index
>> doesn't appear to be used at all.
>>
>> How can I get these results more efficiently?
>>
>> I've uploaded the schema and the queries I've tried to dbfiddle at
>> http://dbfiddle.uk/?rdbms=postgres_9.6&fiddle=cf1489b7f6d53c3fe0b55ed7ccbad1f0.
>> The output of "SELECT version()" is "PostgreSQL 9.6.5 on
>> x86_64-pc-linux-gnu, compiled by gcc (Debian 4.9.2-10) 4.9.2, 64-bit" for
>> me.
>>
>> Thank you in advance for any insights, pointers or suggestions you are
>> able to give me.
>>
>> Regards,
>> Milo
>
>
>
>
> --
> Cordialmente,
>
> Ing. Hellmuth I. Vargas S.
> Esp. Telemática y Negocios por Internet
> Oracle Database 10g Administrator Certified Associate
> EnterpriseDB Certified PostgreSQL 9.3 Associate
>

Reply | Threaded
Open this post in threaded view
|

Re: Efficiently searching for the most recent rows where a column matches any result from a different query

Nandakumar M
Hi,

Correction in the query. I missed to add limit 10 in the outer most query..

> select * from posts where posts.author_id in (select id from users
> where id in (select friend_id from user_friend where user_id = 1) and
> last_posted_time is not null order by last_posted_time desc limit 10);
>

select * from posts where posts.author_id in (select id from users
where id in (select friend_id from user_friend where user_id = 1) and
last_posted_time is not null order by last_posted_time desc limit 10)
order by post_id desc limit 10;

Regards,
Nanda

Next Thread