Add parallelism and glibc dependent only options to reindexdb


Re: Add parallelism and glibc dependent only options to reindexdb

Sergei Kornilov
The following review has been posted through the commitfest application:
make installcheck-world:  tested, passed
Implements feature:       tested, failed
Spec compliant:           not tested
Documentation:            tested, passed

Hi

I did some review and have a few notes about the behavior.

A database-wide reindex does not work with the --concurrently option:

> ./inst/bin/reindexdb --echo -d postgres -j 8 --concurrently
> SELECT pg_catalog.set_config('search_path', '', false);
> REINDEX SYSTEM CONCURRENTLY postgres;
> reindexdb: error: reindexing of system catalogs on database "postgres" failed: ERROR:  cannot reindex system catalogs concurrently

I think we need to print a message and skip the system catalogs for a concurrent reindex.
Or we could disallow a concurrent database reindex with multiple jobs. I prefer the first option.

> + reindex_one_database(dbname, REINDEX_SCHEMA, &schemas, host,
> + port, username, prompt_password, progname,
> + echo, verbose, concurrently,
> + Min(concurrentCons, nsp_count));

This should be just concurrentCons instead of Min(concurrentCons, nsp_count).
reindex_one_database for REINDEX_SCHEMA will build the table list and then process it with the available workers. So:
-j 8 -S public -S public -S public -S public -S public -S public - will work with 6 jobs (and without processing the same table more than once)
-j 8 -S public - will have only one worker regardless of the table count

> if (concurrentCons > FD_SETSIZE - 1)

Wouldn't "if (concurrentCons >= FD_SETSIZE)" be cleaner? Well, pgbench uses the >= condition, vacuumdb uses > FD_SETSIZE - 1. There are no other FD_SETSIZE conditions =)

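For reference, the two spellings side by side (a rough sketch; the error handling shown here is only illustrative and not taken from the patch):

    /* style currently used by vacuumdb and the patch */
    if (concurrentCons > FD_SETSIZE - 1)
    {
        pg_log_error("too many parallel jobs requested (maximum: %d)",
                     FD_SETSIZE - 1);
        exit(1);
    }

    /* equivalent style used by pgbench */
    if (concurrentCons >= FD_SETSIZE)
    {
        pg_log_error("too many parallel jobs requested (maximum: %d)",
                     FD_SETSIZE - 1);
        exit(1);
    }

Both reject exactly the same values; it is purely a readability question.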
regards, Sergei

Re: Add parallelism and glibc dependent only options to reindexdb

Julien Rouhaud-3
Thanks for the review!

On Thu, Jul 25, 2019 at 10:17 AM Sergei Kornilov <[hidden email]> wrote:

>
> The following review has been posted through the commitfest application:
> make installcheck-world:  tested, passed
> Implements feature:       tested, failed
> Spec compliant:           not tested
> Documentation:            tested, passed
>
> Hi
>
> I did some review and have a few notes about the behavior.
>
> A database-wide reindex does not work with the --concurrently option:
>
> > ./inst/bin/reindexdb --echo -d postgres -j 8 --concurrently
> > SELECT pg_catalog.set_config('search_path', '', false);
> > REINDEX SYSTEM CONCURRENTLY postgres;
> > reindexdb: error: reindexing of system catalogs on database "postgres" failed: ERROR:  cannot reindex system catalogs concurrently
>
> I think we need to print a message and skip the system catalogs for a concurrent reindex.
> Or we could disallow a concurrent database reindex with multiple jobs. I prefer the first option.
Good point.  I agree with the 1st option, as that's already what would
happen without the --jobs switch:

$ reindexdb -d postgres --concurrently
WARNING:  cannot reindex system catalogs concurrently, skipping all

(although this is emitted by the backend)
I modified the client code to behave the same and added a regression test.
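Roughly, the change amounts to something like this (a hypothetical sketch of the idea, not the literal patch; the variable names are assumptions):

    /*
     * When reindexing a whole database with --concurrently, warn and skip
     * the system catalogs instead of sending REINDEX SYSTEM CONCURRENTLY,
     * which the backend rejects.
     */
    if (concurrently && process_type == REINDEX_SYSTEM)
    {
        pg_log_warning("cannot reindex system catalogs concurrently, skipping all");
        return;
    }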

> > +                     reindex_one_database(dbname, REINDEX_SCHEMA, &schemas, host,
> > +                                                              port, username, prompt_password, progname,
> > +                                                              echo, verbose, concurrently,
> > +                                                              Min(concurrentCons, nsp_count));
>
> This should be just concurrentCons instead of Min(concurrentCons, nsp_count).

Indeed, that changed with v8 and I forgot to update it, fixed.

> reindex_one_database for REINDEX_SCHEMA will build the table list and then process it with the available workers. So:
> -j 8 -S public -S public -S public -S public -S public -S public - will work with 6 jobs (and without processing the same table more than once)
> -j 8 -S public - will have only one worker regardless of the table count
>
> > if (concurrentCons > FD_SETSIZE - 1)
>
> Wouldn't "if (concurrentCons >= FD_SETSIZE)" be cleaner? Well, pgbench uses the >= condition, vacuumdb uses > FD_SETSIZE - 1. There are no other FD_SETSIZE conditions =)

I don't have a strong opinion here.  If we change to >=, it'd be
better to also adapt vacuumdb for consistency.  I didn't change it for
now, to stay consistent with vacuumdb.

Attachment: reindex_parallel_v9.diff (25K)

Re: Add parallelism and glibc dependent only options to reindexdb

Sergei Kornilov
Hi

Thank you, the v9 code and behavior look good to me. It builds cleanly and works with older releases. I'll mark it Ready for Committer.

> I don't have a strong opinion here. If we change to >=, it'd be
> better to also adapt vacuumdb for consistency. I didn't change it for
> now, to stay consistent with vacuumdb.

Yep, no strong opinion from me either.

regards, Sergei



Re: Add parallelism and glibc dependent only options to reindexdb

Michael Paquier-2
On Thu, Jul 25, 2019 at 12:12:40PM +0300, Sergei Kornilov wrote:
> Thank you, the v9 code and behavior look good to me. It builds cleanly
> and works with older releases. I'll mark it Ready for Committer.

The restriction with --jobs and --system is not documented, and that
would be good to have from the start.  This actually makes the parallel
job handling with --index inconsistent, as we mask parallelism in this
case by enforcing the use of one connection.  I think that we should
revisit the interactions with --index and --jobs actually, because,
assuming that users never read the documentation, they may actually be
surprised to see that something like --index idx1 .. --index idxN
--jobs=N does not lead to any improvements at all, until they find out
the reason why.  It is also much easier to have an error as a starting
point because it can be lifted later on.  There is an argument that
we may actually not have this restriction at all on --index as long as
the user knows what they are doing and does not specify indexes from the
same relation; still, I would keep an error.

Thinking deeper about it, there is also a point in first gathering all
the relations if one combines --schema and --table in the same call of
reindexdb, and then passing down a list of decomposed relations which
are processed in parallel.  The code as currently presented is rather
straightforward, and I don't think that this is worth the extra
complication, but this was not mentioned until now on this thread :)

For the non-parallel case in reindex_one_database(), I would add an
Assert on REINDEX_DATABASE and REINDEX_SYSTEM with a comment to
mention that a NULL list of objects can just be passed down only in
those two cases when the single-object list with the database name is
built.

>> I don't have a strong opinion here. If we change to >=, it'd be
>> better to also adapt vacuumdb for consistency. I didn't change it for
>> now, to stay consistent with vacuumdb.
>
> Yep, no strong opinion from me either.

My opinion tends towards consistency.  Consistency sounds good.
--
Michael


Re: Add parallelism and glibc dependent only options to reindexdb

Sergei Kornilov
Hi

>>>  I don't have a strong opinion here. If we change to >=, it'd be
>>>  better to also adapt vacuumdb for consistency. I didn't change it for
>>>  now, to stay consistent with vacuumdb.
>>
>>  Yep, no strong opinion from me either.
>
> My opinion tends towards consistency. Consistency sounds good.

Which consistency do you prefer? Currently we have just one >= FD_SETSIZE in pgbench and one > FD_SETSIZE - 1 in vacuumdb. That's all.

regards, Sergei



Re: Add parallelism and glibc dependent only options to reindexdb

Julien Rouhaud-3
In reply to this post by Michael Paquier-2
On Thu, Jul 25, 2019 at 12:03 PM Michael Paquier <[hidden email]> wrote:
>
> On Thu, Jul 25, 2019 at 12:12:40PM +0300, Sergei Kornilov wrote:
> > Thank you, the v9 code and behavior look good to me. It builds cleanly
> > and works with older releases. I'll mark it Ready for Committer.
>
> The restriction with --jobs and --system is not documented, and that
> would be good to have from the start.

That's documented in the patch:
+        Note that this option is ignored with the <option>--index</option>
+        option to prevent deadlocks when processing multiple indexes from
+        the same relation, and incompatible with the <option>--system</option>
+        option.

The restriction with --jobs and --concurrently is indeed not
specifically documented in reindexdb.sgml; there's only a mention in
reindex.sgml:

    <command>REINDEX SYSTEM</command> does not support
    <command>CONCURRENTLY</command> since system catalogs cannot be reindexed

The behavior doesn't really change with this patch, though we could
enhance the documentation.

>  This actually makes the parallel job
> handling with --index inconsistent as we mask parallelism in this case
> by enforcing the use of one connection.  I think that we should
> revisit the interactions with --index and --jobs actually, because,
> assuming that users never read the documentation, they may actually be
> surprised to see that something like --index idx1 .. --index idxN
> --jobs=N does not lead to any improvements at all, until they find out
> the reason why.

The problem is that a user doing something like:

reindexdb -j48 -i some_index -S s1 -S s2 -S s3....

will probably be disappointed to learn that he has to run a specific
command for the index(es) that should be reindexed.  Maybe we can
issue a warning that parallelism isn't used when an index list is
processed and the user asked for multiple jobs?
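Something like this, for instance (a sketch of the idea only; the variable names are assumptions, not the actual code):

    /* fall back to a single connection for an index list, but say so */
    if (indexes.head != NULL && concurrentCons > 1)
    {
        pg_log_warning("ignoring --jobs when reindexing individual indexes");
        concurrentCons = 1;
    }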

> Thinking deeper about it, there is also a point in first gathering all
> the relations if one combines --schema and --table in the same call of
> reindexdb, and then passing down a list of decomposed relations which
> are processed in parallel.  The code as currently presented is rather
> straightforward, and I don't think that this is worth the extra
> complication, but this was not mentioned until now on this thread :)

+1


> For the non-parallel case in reindex_one_database(), I would add an
> Assert on REINDEX_DATABASE and REINDEX_SYSTEM with a comment to
> mention that a NULL list of objects can just be passed down only in
> those two cases when the single-object list with the database name is
> built.

Something like that?

    if (!parallel)
    {
-       if (user_list == NULL)
+       /*
+        * Database-wide and system catalogs processing should omit the list
+        * of objects to process.
+        */
+       if (process_type == REINDEX_DATABASE || process_type == REINDEX_SYSTEM)
        {
+           Assert(user_list == NULL);
+
            process_list = pg_malloc0(sizeof(SimpleStringList));
            simple_string_list_append(process_list, PQdb(conn));
        }

There's another assert after the else-parallel branch that checks that a
list is present, so there's no need to also check it here.

I'm not sending a new patch since the wanted behavior for --index is not clear yet.



Re: Add parallelism and glibc dependent only options to reindexdb

Michael Paquier-2
On Thu, Jul 25, 2019 at 01:00:34PM +0200, Julien Rouhaud wrote:
> The problem is that a user doing something like:
>
> reindexdb -j48 -i some_index -S s1 -S s2 -S s3....
>
> will probably be disappointed to learn that he has to run a specific
> command for the index(es) that should be reindexed.  Maybe we can
> issue a warning that parallelism isn't used when an index list is
> processed and the user asked for multiple jobs?

Arguments go in both directions, as some other users may be surprised
by the reindexing performance when serialization is enforced.

> I'm not sending a new patch since the wanted behavior for --index is
> not clear yet.

So I am sending one patch (actually two) after a closer review that I
have spent time shaping into a committable state.  And for this part I
have another suggestion: use a switch/case without a default, so that
any newly-added object type forces somebody to think about those code
paths, as it would generate compiler warnings.
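The idea looks roughly like this (a sketch; enum values beyond the ones already visible in this thread are assumptions):

    /*
     * No default branch: if a new ReindexType value is added later, the
     * compiler warns about this switch, forcing the author to look at it.
     */
    switch (process_type)
    {
        case REINDEX_DATABASE:
        case REINDEX_SYSTEM:
            /* the single-object list is built from the database name */
            Assert(user_list == NULL);
            break;
        case REINDEX_INDEX:
        case REINDEX_SCHEMA:
        case REINDEX_TABLE:
            /* an explicit list of objects must have been given */
            Assert(user_list != NULL);
            break;
    }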

While reviewing I have found an extra bug in the logic: when using a
list of tables, the number of parallel slots is the minimum between
concurrentCons and tbl_count, but this does not get applied after
building a list of tables for a schema or database reindex, so we had
better recompute the number of items in reindex_one_database() before
allocating the number of parallel slots.  There was also a small gem
in the TAP tests for one of the commands using "-j2" in one of the
command arguments.

So here we go:
- 0001 is your original thing, with --jobs enforced to 1 for the index
part.
- 0002 is my addition to forbid --index with --jobs.

I am fine to be outvoted regarding 0002, and it is the case based on
the state of this thread with 2:1.  We could always revisit this
decision in this development cycle anyway.
--
Michael

Attachment: 0001-Base-patch-for-support-of-jobs-in-reindexdb.patch (21K)
Attachment: 0002-Forbid-index-with-jobs.patch (3K)

Re: Add parallelism and glibc dependent only options to reindexdb

Julien Rouhaud
On Fri, Jul 26, 2019 at 5:27 AM Michael Paquier <[hidden email]> wrote:

>
> On Thu, Jul 25, 2019 at 01:00:34PM +0200, Julien Rouhaud wrote:
> > The problem is that a user doing something like:
> >
> > reindexdb -j48 -i some_index -S s1 -S s2 -S s3....
> >
> > will probably be disappointed to learn that he has to run a specific
> > command for the index(es) that should be reindexed.  Maybe we can
> > issue a warning that parallelism isn't used when an index list is
> > processed and the user asked for multiple jobs?
>
> Arguments go in both directions, as some other users may be surprised
> by the reindexing performance when serialization is enforced.

Sure, but there is no easy solution in that case, as you'd have to do
all the work of spawning multiple reindexdb calls according to the
underlying tables.  So what will probably happen here is that there'll
just be two simple calls to reindexdb: one for the indexes, serialized
anyway, and one for everything else.  My vote is still to allow it,
possibly emitting a notice or a warning.

> > I'm not sending a new patch since the wanted behavior for --index is
> > not clear yet.
>
> So I am sending one patch (actually two) after a closer review that I
> have spent time shaping into a committable state.  And for this part I
> have another suggestion: use a switch/case without a default, so that
> any newly-added object type forces somebody to think about those code
> paths, as it would generate compiler warnings.

Thanks for that!  I'm fine with using a switch to avoid future bad surprises.

> While reviewing I have found an extra bug in the logic: when using a
> list of tables, the number of parallel slots is the minimum between
> concurrentCons and tbl_count, but this does not get applied after
> building a list of tables for a schema or database reindex, so we had
> better recompute the number of items in reindex_one_database() before
> allocating the number of parallel slots.

I see that you iterate over the SimpleStringList after it's generated.
Why not compute that while building it in get_parallel_object_list
(and keep the provided table list count) instead?



Re: Add parallelism and glibc dependent only options to reindexdb

Michael Paquier-2
On Fri, Jul 26, 2019 at 09:36:32AM +0200, Julien Rouhaud wrote:
> I see that you iterate over the SimpleStringList after it's generated.
> Why not compute that while building it in get_parallel_object_list
> (and keep the provided table list count) instead?

Yeah.  I was hesitating to do that, or just break out of the counting
loop if there are more objects than concurrent jobs, but that's less
intuitive.
--
Michael


Re: Add parallelism and glibc dependent only options to reindexdb

Sergei Kornilov
In reply to this post by Michael Paquier-2
Hi

> So here we go:
> - 0001 is your original thing, with --jobs enforced to 1 for the index
> part.
> - 0002 is my addition to forbid --index with --jobs.
>
> I am fine to be outvoted regarding 0002, and it is the case based on
> the state of this thread with 2:1. We could always revisit this
> decision in this development cycle anyway.

Explicit is better than implicit, so I am +1 to commit both patches.

regards, Sergei



Re: Add parallelism and glibc dependent only options to reindexdb

Michael Paquier-2
On Fri, Jul 26, 2019 at 10:53:03AM +0300, Sergei Kornilov wrote:
> Explicit is better than implicit, so I am +1 to commit both patches.

Hence my earlier count was incorrect:
- Forbid --jobs and --index: Michael P, Sergei K.
- Enforce --jobs=1 with --index: Julien R.
- Have no restrictions: 0.
--
Michael


Re: Add parallelism and glibc dependent only options to reindexdb

Julien Rouhaud
In reply to this post by Michael Paquier-2
On Fri, Jul 26, 2019 at 9:41 AM Michael Paquier <[hidden email]> wrote:
>
> On Fri, Jul 26, 2019 at 09:36:32AM +0200, Julien Rouhaud wrote:
> > I see that you iterate over the SimpleStringList after it's generated.
> > Why not compute that while building it in get_parallel_object_list
> > (and keep the provided table list count) instead?
>
> Yeah.  I was hesitating to do that, or just break out of the counting
> loop if there are more objects than concurrent jobs, but that's less
> intuitive.

That's probably still more intuitive than having the count coming from
either main() or from get_parallel_object_list() depending on the
process type, so I'm fine with that alternative.  Maybe we could bite
the bullet and add a count member to Simple*List, also providing a
macro to initialize a new list so that next time a field is added
there won't be a massive boilerplate code change?
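Something along these lines, perhaps (a sketch of the suggestion only; the count field and the macro name are of course not in the tree today):

    typedef struct SimpleStringList
    {
        SimpleStringListCell *head;
        SimpleStringListCell *tail;
        int                   count;    /* maintained by the append function */
    } SimpleStringList;

    /* initializer macro, so adding another field later is a one-line change */
    #define SIMPLE_STRING_LIST_INIT   {NULL, NULL, 0}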



Re: Add parallelism and glibc dependent only options to reindexdb

Michael Paquier-2
On Sat, Jul 27, 2019 at 11:44:47AM +0200, Julien Rouhaud wrote:
> That's probably still more intuitive than having the count coming from
> either main() or from get_parallel_object_list() depending on the
> process type, so I'm fine with that alternative.  Maybe we could bite
> the bullet and add a count member to Simple*List, also providing a
> macro to initialize a new list so that next time a field is added
> there won't be a massive boilerplate code change?

Perhaps; we could discuss that on a separate thread.  For now I
have gone with the simplest approach of counting the items, and
stopping the count if there are more items than jobs.  While reviewing
I have found a double-free in your patch when building a list of
relations for schemas or databases.  If the list finishes empty,
PQfinish() was called twice on the connection, leading to a crash.  I
have added a test for that, done an extra pass on the patch adjusting
a couple of things then committed the patch with the restriction on
--index and --jobs.  This entry is now marked as committed in the CF
app.
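For the archive, the counting approach described above boils down to something like this (a minimal sketch; variable names are assumptions):

    SimpleStringListCell *cell;
    int         items = 0;

    /* count the objects, but stop early once there is one per requested job */
    for (cell = process_list->head; cell != NULL; cell = cell->next)
    {
        if (++items >= concurrentCons)
            break;
    }
    concurrentCons = Min(concurrentCons, items);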
--
Michael


Re: Add parallelism and glibc dependent only options to reindexdb

Julien Rouhaud
On Sat, Jul 27, 2019 at 3:27 PM Michael Paquier <[hidden email]> wrote:

>
> On Sat, Jul 27, 2019 at 11:44:47AM +0200, Julien Rouhaud wrote:
> > That's probably still more intuitive than having the count coming from
> > either main() or from get_parallel_object_list() depending on the
> > process type, so I'm fine with that alternative.  Maybe we could bite
> > the bullet and add a count member to Simple*List, also providing a
> > macro to initialize a new list so that next time a field is added
> > there won't be a massive boilerplate code change?
>
> Perhaps; we could discuss that on a separate thread.

Agreed.

>   For now I
> have gone with the simplest approach of counting the items, and
> stopping the count if there are more items than jobs.  While reviewing
> I have found a double-free in your patch when building a list of
> relations for schemas or databases.  If the list finishes empty,
> PQfinish() was called twice on the connection, leading to a crash.  I
> have added a test for that

Oops, thanks for spotting and fixing.

> , done an extra pass on the patch adjusting
> a couple of things then committed the patch with the restriction on
> --index and --jobs.  This entry is now marked as committed in the CF
> app.

Thanks!



Re: Add parallelism and glibc dependent only options to reindexdb

Michael Paquier-2
In reply to this post by Alvaro Herrera-9
On Mon, Jul 22, 2019 at 01:05:32PM -0400, Alvaro Herrera wrote:
> Fair enough.  List has gotten quite sophisticated now, so I understand
> the concern.

Just wondering something...  List cells include one pointer, one
signed integer and an OID.  The last two are basically 4 bytes each,
so could we reduce the bloat a bit by unifying them?
I understand that the distinction exists because both may not be of
the same size..

/me runs and hides
--
Michael


Re: Add parallelism and glibc dependent only options to reindexdb

Tom Lane-2
Michael Paquier <[hidden email]> writes:
> Just wondering something...  List cells include one pointer, one
> signed integer and an OID.  The last two are basically 4 bytes each,
> so could we reduce the bloat a bit by unifying them?

We couldn't really simplify the API that way; for example,
lfirst_int and lfirst_oid *must* be different because they
must return different types.  I think it'd be a bad idea
to have some parts of the API that distinguish the two types
while others pretend they're the same, so there's not much
room for shortening that.
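For reference, the accessors in question look roughly like this in
pg_list.h (paraphrased, so treat it as an approximation):

    typedef union ListCell
    {
        void   *ptr_value;
        int     int_value;
        Oid     oid_value;
    } ListCell;

    /* same representation, but each macro hands back a different type */
    #define lfirst_int(lc)  ((lc)->int_value)
    #define lfirst_oid(lc)  ((lc)->oid_value)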

You could imagine unifying the implementations of many of the
_int and _oid functions, but I can't get excited about that.
It would add confusion for not a lot of code savings.

> I understand that the distinction exists because both may not be of
> the same size..

Well, even more to the point, one's signed and one isn't.

In the long run, might we ever switch to 64-bit OIDs?  I dunno.
Now that we kicked them out of user tables, it might be feasible,
but by the same token there's not much pressure to do it.

                        regards, tom lane



Re: Add parallelism and glibc dependent only options to reindexdb

Andres Freund
Hi,

On 2019-07-28 10:07:27 -0400, Tom Lane wrote:
> In the long run, might we ever switch to 64-bit OIDs?  I dunno.
> Now that we kicked them out of user tables, it might be feasible,
> but by the same token there's not much pressure to do it.

Depends on the table, I'd say. Having toast tables have 64bit ids,
and not advance the oid counter, would be quite the advantage over the
current situation. Toasting performance craters once the oid counter has
wrapped. But obviously there are upgrade problems there - presumably
we'd need 'narrow' and 'wide' toast tables, or such.

Greetings,

Andres Freund



Re: Add parallelism and glibc dependent only options to reindexdb

Tom Lane-2
Andres Freund <[hidden email]> writes:
> On 2019-07-28 10:07:27 -0400, Tom Lane wrote:
>> In the long run, might we ever switch to 64-bit OIDs?  I dunno.

> Depends on the table, I'd say. Having toast tables have 64bit ids,
> and not advance the oid counter, would be quite the advantage over the
> current situation. Toasting performance craters once the oid counter has
> wrapped. But obviously there are upgrade problems there - presumably
> we'd need 'narrow' and 'wide' toast tables, or such.

Yeah, but I'd be inclined to fix toast tables as a special case,
rather than widening OIDs in general.  We could define the chunk
number as being int8 not OID for the "wide" style.

                        regards, tom lane

