Add parallelism and glibc dependent only options to reindexdb

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
78 messages Options
1234
Reply | Threaded
Open this post in threaded view
|

Re: Add parallelism and glibc dependent only options to reindexdb

Julien Rouhaud
On Fri, Jul 12, 2019 at 3:20 AM Michael Paquier <[hidden email]> wrote:

>
> On Thu, Jul 11, 2019 at 06:22:25PM +0200, Julien Rouhaud wrote:
> > I think t hat it makes the code quite cleaner to have the selection
> > outside ConsumeIdleSlot.
>
> Actually, you have an issue with ConsumeIdleSlot() if there is only
> one parallel slot, no?  In this case the current patch returns
> immediately the slot available without waiting.  I think that we
> should wait until the slot becomes free in that case as well, and
> switch isFree to false.

It shouldn't be a problem, I reused the same infrastructure as for
vacuumdb.  so run_reindex_command has a new "async" parameter, so when
there's no parallelism it's using executeMaintenanceCommand (instead
of PQsendQuery) which will block until query completion.  That's why
there's no isFree usage at all in this case.

>  If you want to keep things splitted, that's
> fine by me, I would still use "Get" within the name for the routine,
> and rename the other to get_idle_slot_internal() or
> get_idle_slot_guts() to point out that it has an internal role.

Ok, I'll change to get_idle_slot_internal then.


Reply | Threaded
Open this post in threaded view
|

Re: Add parallelism and glibc dependent only options to reindexdb

Michael Paquier-2
On Fri, Jul 12, 2019 at 07:49:13AM +0200, Julien Rouhaud wrote:
> It shouldn't be a problem, I reused the same infrastructure as for
> vacuumdb.  so run_reindex_command has a new "async" parameter, so when
> there's no parallelism it's using executeMaintenanceCommand (instead
> of PQsendQuery) which will block until query completion.  That's why
> there's no isFree usage at all in this case.

My point is more about consistency and simplification with the case
where n > 1 and that we could actually move the async/sync code paths
into the same banner as the async mode waits as well until a slot is
free, or in short when the query completes.
--
Michael

signature.asc (849 bytes) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Add parallelism and glibc dependent only options to reindexdb

Julien Rouhaud
On Fri, Jul 12, 2019 at 7:57 AM Michael Paquier <[hidden email]> wrote:

>
> On Fri, Jul 12, 2019 at 07:49:13AM +0200, Julien Rouhaud wrote:
> > It shouldn't be a problem, I reused the same infrastructure as for
> > vacuumdb.  so run_reindex_command has a new "async" parameter, so when
> > there's no parallelism it's using executeMaintenanceCommand (instead
> > of PQsendQuery) which will block until query completion.  That's why
> > there's no isFree usage at all in this case.
>
> My point is more about consistency and simplification with the case
> where n > 1 and that we could actually move the async/sync code paths
> into the same banner as the async mode waits as well until a slot is
> free, or in short when the query completes.
I attach v4 with all previous comment addressed.

I also changed to handle parallel and non-parallel case the same way.
I kept the possibility for synchronous behavior in reindexdb, as
there's an early need to run some queries in case of parallel
database-wide reindex.  It avoids to open all the connections in case
anything fails during this preliminary work, and it also avoids
another call for the async wait function.  If we add parallelism to
clusterdb (I'll probably work on that next time I have spare time),
reindexdb would be the only caller left of
executeMaintenanceCommand(), so that's something we may want to
change.

I didn't change the behavior wrt. possible deadlock if user specify
catalog objects using --index or --table and ask for multiple
connection, as I'm afraid that it'll add too much code for a little
benefit.  Please shout if you think otherwise.

0001-Export-vacuumdb-s-parallel-infrastructure-v4.patch (32K) Download Attachment
0002-Add-parallel-processing-to-reindexdb-v4.patch (23K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Add parallelism and glibc dependent only options to reindexdb

Julien Rouhaud
On Fri, Jul 12, 2019 at 11:47 AM Julien Rouhaud <[hidden email]> wrote:
>
> I didn't change the behavior wrt. possible deadlock if user specify
> catalog objects using --index or --table and ask for multiple
> connection, as I'm afraid that it'll add too much code for a little
> benefit.  Please shout if you think otherwise.

Sorry I meant schemas, not indexes.

After more thinking about schema and multiple jobs, I think that
erroring out is quite user unfriendly, as it's entirely ok to ask for
multiple indexes and multiple object that do support parallelism in a
single call.  So I think it's better to remove the error, ignore the
given --jobs options for indexes and document this behavior.


Reply | Threaded
Open this post in threaded view
|

Re: Add parallelism and glibc dependent only options to reindexdb

Michael Paquier-2
On Tue, Jul 16, 2019 at 02:03:16PM +0200, Julien Rouhaud wrote:
> After more thinking about schema and multiple jobs, I think that
> erroring out is quite user unfriendly, as it's entirely ok to ask
> for
> multiple indexes and multiple object that do support parallelism in
> a
> single call.  So I think it's better to remove the error, ignore the
> given --jobs options for indexes and document this behavior.

No objections to that.  I still need to study a bit more 0002 though
to come to a clear conclusion.

Actually, from patch 0002:
+       free_slot = ParallelSlotsGetIdle(slots, concurrentCons, progname);
+       if (!free_slot)
+       {
+           failed = true;
+           goto finish;
+       }
+
+       run_reindex_command(conn, process_type, objname, progname, echo,
+                           verbose, concurrently, true);
The same connection gets reused, shouldn't the connection come from
the free slot?

On top of that quick lookup, I have done an in-depth review on 0001 to
bring it to a committable state, fixing a couple of typos, incorrect
comments (description of ParallelSlotsGetIdle was for example
incorrect) on the way.  Other things include that connectDatabase
should have an assertion for a non-NULL connection, calling pg_free()
on the slots terminate is more consistent as pg_malloc is used first.
A comment at the top of processQueryResult still referred to
vacuuming of a missing relation.  Most of the patch was in a clean
state, with a clear interface for parallel slots, the place of the new
routines also makes sense, so I did not have much to do :)

Another thing I have noticed is that we don't really need to pass down
progname across all those layers as we finish by using pg_log_error()
when processing results, so more simplifications can be done.  Let's
handle all that in the same patch as we are messing with the area.
connectDatabase() and connectMaintenanceDatabase() still need it
though as this is used in the connection string, so
ParallelSlotsSetup() also needs it.  This part is not really your
fault but as I am looking at it, it does not hurt to clean up what we
can.  Attached is an updated version of 0001 that I am comfortable
with.  I'd like to commit that with the cleanups included and then
let's move to the real deal with 0002.
--
Michael

scripts-refactor-parallel.patch (30K) Download Attachment
signature.asc (849 bytes) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Add parallelism and glibc dependent only options to reindexdb

Julien Rouhaud
On Wed, Jul 17, 2019 at 9:59 AM Michael Paquier <[hidden email]> wrote:

>
> On Tue, Jul 16, 2019 at 02:03:16PM +0200, Julien Rouhaud wrote:
> > After more thinking about schema and multiple jobs, I think that
> > erroring out is quite user unfriendly, as it's entirely ok to ask
> > for
> > multiple indexes and multiple object that do support parallelism in
> > a
> > single call.  So I think it's better to remove the error, ignore the
> > given --jobs options for indexes and document this behavior.
>
> No objections to that.  I still need to study a bit more 0002 though
> to come to a clear conclusion.
>
> Actually, from patch 0002:
> +       free_slot = ParallelSlotsGetIdle(slots, concurrentCons, progname);
> +       if (!free_slot)
> +       {
> +           failed = true;
> +           goto finish;
> +       }
> +
> +       run_reindex_command(conn, process_type, objname, progname, echo,
> +                           verbose, concurrently, true);
> The same connection gets reused, shouldn't the connection come from
> the free slot?

Ouch indeed.

> On top of that quick lookup, I have done an in-depth review on 0001 to
> bring it to a committable state, fixing a couple of typos, incorrect
> comments (description of ParallelSlotsGetIdle was for example
> incorrect) on the way.  Other things include that connectDatabase
> should have an assertion for a non-NULL connection,

disconnectDatabase you mean?  Fine by me.

> calling pg_free()
> on the slots terminate is more consistent as pg_malloc is used first.
> A comment at the top of processQueryResult still referred to
> vacuuming of a missing relation.  Most of the patch was in a clean
> state, with a clear interface for parallel slots, the place of the new
> routines also makes sense, so I did not have much to do :)

Thanks :)

> Another thing I have noticed is that we don't really need to pass down
> progname across all those layers as we finish by using pg_log_error()
> when processing results, so more simplifications can be done.  Let's
> handle all that in the same patch as we are messing with the area.
> connectDatabase() and connectMaintenanceDatabase() still need it
> though as this is used in the connection string, so
> ParallelSlotsSetup() also needs it.  This part is not really your
> fault but as I am looking at it, it does not hurt to clean up what we
> can.  Attached is an updated version of 0001 that I am comfortable
> with.  I'd like to commit that with the cleanups included and then
> let's move to the real deal with 0002.

Good catch, I totally missed this progname change.  I read the patch
you attached, I have a few comments:

+/*
+ * Disconnect the given connection, canceling any statement if one is active.
+ */
+void
+disconnectDatabase(PGconn *conn)

Nitpicking, but this comment doesn't follow the style of other
functions' comments (it's also the case for existing comment on
executeQuery at least).


While reading the comments you added on ParallelSlotsSetup(), I
wondered if we couldn't also add an Assert(conn) at the beginning?

+void
+ParallelSlotsTerminate(ParallelSlot *slots, int numslots)
+{
+ int i;
+
+ for (i = 0; i < numslots; i++)
+ {
+ PGconn    *conn = slots[i].connection;
+
+ if (conn == NULL)
+ continue;
+
+ disconnectDatabase(conn);
+ }
+
+ pg_free(slots);
+}

Is it ok to call pg_free(slots) and let caller have a pointer pointing
to freed memory?


Reply | Threaded
Open this post in threaded view
|

Re: Add parallelism and glibc dependent only options to reindexdb

Michael Paquier-2
On Wed, Jul 17, 2019 at 07:46:10PM +0200, Julien Rouhaud wrote:
> On Wed, Jul 17, 2019 at 9:59 AM Michael Paquier <[hidden email]> wrote:
>> On top of that quick lookup, I have done an in-depth review on 0001 to
>> bring it to a committable state, fixing a couple of typos, incorrect
>> comments (description of ParallelSlotsGetIdle was for example
>> incorrect) on the way.  Other things include that connectDatabase
>> should have an assertion for a non-NULL connection,
>
> disconnectDatabase you mean?  Fine by me.

Oops, yes.  I meant disconnectDatabase() here.  The patch does so, not
my words.

> +/*
> + * Disconnect the given connection, canceling any statement if one is active.
> + */
> +void
> +disconnectDatabase(PGconn *conn)
>
> Nitpicking, but this comment doesn't follow the style of other
> functions' comments (it's also the case for existing comment on
> executeQuery at least).

connectDatabase, connectMaintenanceDatabase, executeQuery and most of
the others follow that style, so I am just going to simplify
consumeQueryResult and processQueryResult to keep a consistent style.

> While reading the comments you added on ParallelSlotsSetup(), I
> wondered if we couldn't also add an Assert(conn) at the beginning?

That makes sense as this gets associated to the first slot.  There
could be an argument for making a set of slots extensible to simplify
this interface, but that complicates the logic for the scripts.

> Is it ok to call pg_free(slots) and let caller have a pointer pointing
> to freed memory?

The interface has a Setup call which initializes the whole thing, and
Terminate is the logical end point, so having the free logic within
the termination looks more consistent to me.  We could now have actual
Init() and Free() but I am not sure that this justifies the move as
this complicates the scripts using it.
--
Michael

signature.asc (849 bytes) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Add parallelism and glibc dependent only options to reindexdb

Michael Paquier-2
On Thu, Jul 18, 2019 at 09:45:14AM +0900, Michael Paquier wrote:
> On Wed, Jul 17, 2019 at 07:46:10PM +0200, Julien Rouhaud wrote:
>> Is it ok to call pg_free(slots) and let caller have a pointer pointing
>  to freed memory?
>
> The interface has a Setup call which initializes the whole thing, and
> Terminate is the logical end point, so having the free logic within
> the termination looks more consistent to me.  We could now have actual
> Init() and Free() but I am not sure that this justifies the move as
> this complicates the scripts using it.

I have reconsidered this point, moved the pg_free() call out of the
termination logic, and committed the first patch after an extra lookup
and more polishing.

For the second patch, could you send a rebase with a fix for the
connection slot when processing the reindex commands?
--
Michael

signature.asc (849 bytes) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Add parallelism and glibc dependent only options to reindexdb

Julien Rouhaud
On Fri, Jul 19, 2019 at 2:35 AM Michael Paquier <[hidden email]> wrote:

>
> On Thu, Jul 18, 2019 at 09:45:14AM +0900, Michael Paquier wrote:
> > On Wed, Jul 17, 2019 at 07:46:10PM +0200, Julien Rouhaud wrote:
> >> Is it ok to call pg_free(slots) and let caller have a pointer pointing
> >  to freed memory?
> >
> > The interface has a Setup call which initializes the whole thing, and
> > Terminate is the logical end point, so having the free logic within
> > the termination looks more consistent to me.  We could now have actual
> > Init() and Free() but I am not sure that this justifies the move as
> > this complicates the scripts using it.
>
> I have reconsidered this point, moved the pg_free() call out of the
> termination logic, and committed the first patch after an extra lookup
> and more polishing.
Thanks!

> For the second patch, could you send a rebase with a fix for the
> connection slot when processing the reindex commands?

Attached, I also hopefully removed all the now unneeded progname usage.

reindex_parallel_v5.diff (22K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Add parallelism and glibc dependent only options to reindexdb

Michael Paquier-2
On Fri, Jul 19, 2019 at 08:29:27AM +0200, Julien Rouhaud wrote:
> On Fri, Jul 19, 2019 at 2:35 AM Michael Paquier <[hidden email]> wrote:
>> For the second patch, could you send a rebase with a fix for the
>> connection slot when processing the reindex commands?
>
> Attached, I also hopefully removed all the now unneeded progname usage.

+        Note that this mode is not compatible the <option>-i / --index</option>
+        or the <option>-s / --system</option> options.
Nits: this is not a style consistent with the documentation.  When
referring to both the long and short options the formulation "-i or
--index" gets used.  Here we could just use the long option.  This
sentence is missing a "with".

       simple_string_list_append(&tables, optarg);
+      tbl_count++;
       break;
The number of items in a simple list is not counted, and vacuumdb does
the same thing to count objects.  What do you think about extending
simple lists to track the number of items stored?

+$node->issues_sql_like([qw(reindexdb -j2)],
+   qr/statement: REINDEX TABLE public.test1/,
+   'Global and parallel reindex will issue per-table REINDEX');
Would it make sense to have some tests for schemas here?

One of my comments in [1] has not been answered.  What about
the decomposition of a list of schemas into a list of tables when
using the parallel mode?

[1]: https://www.postgresql.org/message-id/20190711040433.GG4500@...
--
Michael

signature.asc (849 bytes) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Add parallelism and glibc dependent only options to reindexdb

Julien Rouhaud
On Mon, Jul 22, 2019 at 6:11 AM Michael Paquier <[hidden email]> wrote:

>
> On Fri, Jul 19, 2019 at 08:29:27AM +0200, Julien Rouhaud wrote:
> > On Fri, Jul 19, 2019 at 2:35 AM Michael Paquier <[hidden email]> wrote:
> >> For the second patch, could you send a rebase with a fix for the
> >> connection slot when processing the reindex commands?
> >
> > Attached, I also hopefully removed all the now unneeded progname usage.
>
> +        Note that this mode is not compatible the <option>-i / --index</option>
> +        or the <option>-s / --system</option> options.
> Nits: this is not a style consistent with the documentation.  When
> referring to both the long and short options the formulation "-i or
> --index" gets used.  Here we could just use the long option.  This
> sentence is missing a "with".
Right, so I kept the long option.  Also this comment was outdated, as
the --jobs is now just ignored with a list of indexes, so I fixed that
too.

>
>        simple_string_list_append(&tables, optarg);
> +      tbl_count++;
>        break;
> The number of items in a simple list is not counted, and vacuumdb does
> the same thing to count objects.  What do you think about extending
> simple lists to track the number of items stored?

I considered this, but it would require to adapt all code that declare
SimpleStringList stack variable (vacuumdb.c, clusterdb.c,
createuser.c, pg_dumpall.c and pg_dump.c), so it looked like too much
trouble to avoid 2 local variables here and 1 in vacuumdb.c.  I don't
have a strong opinion here, so I can go for it if you prefer.

>
> +$node->issues_sql_like([qw(reindexdb -j2)],
> +   qr/statement: REINDEX TABLE public.test1/,
> +   'Global and parallel reindex will issue per-table REINDEX');
> Would it make sense to have some tests for schemas here?
>
> One of my comments in [1] has not been answered.  What about
> the decomposition of a list of schemas into a list of tables when
> using the parallel mode?

I did that in attached v6, and added suitable regression tests.

reindex_parallel_v6.diff (25K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Add parallelism and glibc dependent only options to reindexdb

Alvaro Herrera-9
On 2019-Jul-22, Julien Rouhaud wrote:

> On Mon, Jul 22, 2019 at 6:11 AM Michael Paquier <[hidden email]> wrote:

> >        simple_string_list_append(&tables, optarg);
> > +      tbl_count++;
> >        break;
> > The number of items in a simple list is not counted, and vacuumdb does
> > the same thing to count objects.  What do you think about extending
> > simple lists to track the number of items stored?
>
> I considered this, but it would require to adapt all code that declare
> SimpleStringList stack variable (vacuumdb.c, clusterdb.c,
> createuser.c, pg_dumpall.c and pg_dump.c), so it looked like too much
> trouble to avoid 2 local variables here and 1 in vacuumdb.c.  I don't
> have a strong opinion here, so I can go for it if you prefer.

Can we use List for this instead?

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Reply | Threaded
Open this post in threaded view
|

Re: Add parallelism and glibc dependent only options to reindexdb

Alvaro Herrera-9
In reply to this post by Julien Rouhaud
On 2019-Jul-19, Julien Rouhaud wrote:

> > For the second patch, could you send a rebase with a fix for the
> > connection slot when processing the reindex commands?
>
> Attached, I also hopefully removed all the now unneeded progname usage.

BTW "progname" is a global variable in logging.c, and it's initialized
by pg_logging_init(), so there's no point in having a local variable in
main() that's called the same and initialized the same way.  You could
just remove it from the signature of all those functions
(connectDatabase and callers), and there would be no visible change.

Also: [see attached]

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

less-progname.patch (1K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Add parallelism and glibc dependent only options to reindexdb

Julien Rouhaud
In reply to this post by Alvaro Herrera-9
On Mon, Jul 22, 2019 at 5:11 PM Alvaro Herrera <[hidden email]> wrote:

>
> On 2019-Jul-22, Julien Rouhaud wrote:
>
> > On Mon, Jul 22, 2019 at 6:11 AM Michael Paquier <[hidden email]> wrote:
>
> > >        simple_string_list_append(&tables, optarg);
> > > +      tbl_count++;
> > >        break;
> > > The number of items in a simple list is not counted, and vacuumdb does
> > > the same thing to count objects.  What do you think about extending
> > > simple lists to track the number of items stored?
> >
> > I considered this, but it would require to adapt all code that declare
> > SimpleStringList stack variable (vacuumdb.c, clusterdb.c,
> > createuser.c, pg_dumpall.c and pg_dump.c), so it looked like too much
> > trouble to avoid 2 local variables here and 1 in vacuumdb.c.  I don't
> > have a strong opinion here, so I can go for it if you prefer.
>
> Can we use List for this instead?

Isn't that for backend code only?


Reply | Threaded
Open this post in threaded view
|

Re: Add parallelism and glibc dependent only options to reindexdb

Alvaro Herrera-9
On 2019-Jul-22, Julien Rouhaud wrote:

> On Mon, Jul 22, 2019 at 5:11 PM Alvaro Herrera <[hidden email]> wrote:
> >
> > > I considered this, but it would require to adapt all code that declare
> > > SimpleStringList stack variable (vacuumdb.c, clusterdb.c,
> > > createuser.c, pg_dumpall.c and pg_dump.c), so it looked like too much
> > > trouble to avoid 2 local variables here and 1 in vacuumdb.c.  I don't
> > > have a strong opinion here, so I can go for it if you prefer.
> >
> > Can we use List for this instead?
>
> Isn't that for backend code only?

Well, we already have palloc() on the frontend side, and list.c doesn't
have any elog()/ereport(), so it should be possible to use it ... I do
see that it uses MemoryContextAlloc() in a few places.  Maybe we can
just #define that to palloc()?

(Maybe we can use the impulse to get rid of these "simple lists"
altogether?)

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Reply | Threaded
Open this post in threaded view
|

Re: Add parallelism and glibc dependent only options to reindexdb

Tom Lane-2
Alvaro Herrera <[hidden email]> writes:
> On 2019-Jul-22, Julien Rouhaud wrote:
>> On Mon, Jul 22, 2019 at 5:11 PM Alvaro Herrera <[hidden email]> wrote:
>>> Can we use List for this instead?

>> Isn't that for backend code only?

> Well, we already have palloc() on the frontend side, and list.c doesn't
> have any elog()/ereport(), so it should be possible to use it ... I do
> see that it uses MemoryContextAlloc() in a few places.  Maybe we can
> just #define that to palloc()?

I'm not happy about either the idea of pulling all of list.c into
frontend programs, or restricting it to be frontend-safe.  That's
very fundamental infrastructure and I don't want it laboring under
such a restriction.  Furthermore, List usage generally leaks memory
like mad (cf nearby list_concat discussion) which doesn't seem like
something we want for frontend code.

                        regards, tom lane


Reply | Threaded
Open this post in threaded view
|

Re: Add parallelism and glibc dependent only options to reindexdb

Alvaro Herrera-9
On 2019-Jul-22, Tom Lane wrote:

> Alvaro Herrera <[hidden email]> writes:
> > On 2019-Jul-22, Julien Rouhaud wrote:
> >> On Mon, Jul 22, 2019 at 5:11 PM Alvaro Herrera <[hidden email]> wrote:
> >>> Can we use List for this instead?
>
> >> Isn't that for backend code only?
>
> > Well, we already have palloc() on the frontend side, and list.c doesn't
> > have any elog()/ereport(), so it should be possible to use it ... I do
> > see that it uses MemoryContextAlloc() in a few places.  Maybe we can
> > just #define that to palloc()?
>
> I'm not happy about either the idea of pulling all of list.c into
> frontend programs, or restricting it to be frontend-safe.  That's
> very fundamental infrastructure and I don't want it laboring under
> such a restriction.  Furthermore, List usage generally leaks memory
> like mad (cf nearby list_concat discussion) which doesn't seem like
> something we want for frontend code.

Fair enough.  List has gotten quite sophisticated now, so I understand
the concern.

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Reply | Threaded
Open this post in threaded view
|

Re: Add parallelism and glibc dependent only options to reindexdb

Michael Paquier-2
In reply to this post by Alvaro Herrera-9
On Mon, Jul 22, 2019 at 11:18:06AM -0400, Alvaro Herrera wrote:
> BTW "progname" is a global variable in logging.c, and it's initialized
> by pg_logging_init(), so there's no point in having a local variable in
> main() that's called the same and initialized the same way.  You could
> just remove it from the signature of all those functions
> (connectDatabase and callers), and there would be no visible change.

Sure, and I was really tempted to do that until I noticed that we pass
down progname for fallback_application_name in the connection string
and that we would basically need to externalize progname in logging.h,
as well as switch all the callers of pg_logging_init to now include
their own definition of progname, which was much more invasive than
the initial refactoring intended.  I am also under the impression that
we had better keep get_progname() and pg_logging_init() as rather
independent things.

> Also: [see attached]

Missed those in the initial cleanup.  Applied, thanks!
--
Michael

signature.asc (849 bytes) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Add parallelism and glibc dependent only options to reindexdb

Michael Paquier-2
In reply to this post by Julien Rouhaud
On Mon, Jul 22, 2019 at 02:40:19PM +0200, Julien Rouhaud wrote:
> Right, so I kept the long option.  Also this comment was outdated, as
> the --jobs is now just ignored with a list of indexes, so I fixed that
> too.

+   if (!parallel)
+   {
+       if (user_list == NULL)
+       {
+           /*
+            * Create a dummy list with an empty string, as user requires an
+            * element.
+            */
+           process_list = pg_malloc0(sizeof(SimpleStringList));
+           simple_string_list_append(process_list, "");
+       }
+   }
This part looks a bit hacky and this is needed because we don't have a
list of objects when doing a non-parallel system or database reindex.
The deal is that we just want a list with one element: the database of
the connection.  Wouldn't it be more natural to assign the database
name here using PQdb(conn)?  Then add an assertion at the top of
run_reindex_command() checking for a non-NULL name?

> I considered this, but it would require to adapt all code that declare
> SimpleStringList stack variable (vacuumdb.c, clusterdb.c,
> createuser.c, pg_dumpall.c and pg_dump.c), so it looked like too much
> trouble to avoid 2 local variables here and 1 in vacuumdb.c.  I don't
> have a strong opinion here, so I can go for it if you prefer.

Okay.  This is a tad wider than the original patch proposal, and this
adds two lines.  So let's discard that for now and keep it simple.

>> +$node->issues_sql_like([qw(reindexdb -j2)],
>> +   qr/statement: REINDEX TABLE public.test1/,
>> +   'Global and parallel reindex will issue per-table REINDEX');
>> Would it make sense to have some tests for schemas here?
>>
>> One of my comments in [1] has not been answered.  What about
>> the decomposition of a list of schemas into a list of tables when
>> using the parallel mode?
>
> I did that in attached v6, and added suitable regression tests.
The two tests for "reindexdb -j2" can be combined into a single call,
checking for both commands to be generated in the same output.  The
second command triggering a reindex on two schemas cannot be used to
check for the generation of both s1.t1 and s2.t2 as the ordering may
not be guaranteed.  The commands arrays also looked inconsistent with
the rest.  Attached is an updated patch with some format modifications
and the fixes for the tests.
--
Michael

reindex_parallel_v7.patch (19K) Download Attachment
signature.asc (849 bytes) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Add parallelism and glibc dependent only options to reindexdb

Julien Rouhaud
Sorry for the late answer,

On Tue, Jul 23, 2019 at 9:38 AM Michael Paquier <[hidden email]> wrote:

>
> On Mon, Jul 22, 2019 at 02:40:19PM +0200, Julien Rouhaud wrote:
> > Right, so I kept the long option.  Also this comment was outdated, as
> > the --jobs is now just ignored with a list of indexes, so I fixed that
> > too.
>
> +   if (!parallel)
> +   {
> +       if (user_list == NULL)
> +       {
> +           /*
> +            * Create a dummy list with an empty string, as user requires an
> +            * element.
> +            */
> +           process_list = pg_malloc0(sizeof(SimpleStringList));
> +           simple_string_list_append(process_list, "");
> +       }
> +   }
> This part looks a bit hacky and this is needed because we don't have a
> list of objects when doing a non-parallel system or database reindex.
> The deal is that we just want a list with one element: the database of
> the connection.  Wouldn't it be more natural to assign the database
> name here using PQdb(conn)?  Then add an assertion at the top of
> run_reindex_command() checking for a non-NULL name?
Good point, fixed it that way.
>
> > I considered this, but it would require to adapt all code that declare
> > SimpleStringList stack variable (vacuumdb.c, clusterdb.c,
> > createuser.c, pg_dumpall.c and pg_dump.c), so it looked like too much
> > trouble to avoid 2 local variables here and 1 in vacuumdb.c.  I don't
> > have a strong opinion here, so I can go for it if you prefer.
>
> Okay.  This is a tad wider than the original patch proposal, and this
> adds two lines.  So let's discard that for now and keep it simple.

Ok!

> >> +$node->issues_sql_like([qw(reindexdb -j2)],
> >> +   qr/statement: REINDEX TABLE public.test1/,
> >> +   'Global and parallel reindex will issue per-table REINDEX');
> >> Would it make sense to have some tests for schemas here?
> >>
> >> One of my comments in [1] has not been answered.  What about
> >> the decomposition of a list of schemas into a list of tables when
> >> using the parallel mode?
> >
> > I did that in attached v6, and added suitable regression tests.
>
> The two tests for "reindexdb -j2" can be combined into a single call,
> checking for both commands to be generated in the same output.  The
> second command triggering a reindex on two schemas cannot be used to
> check for the generation of both s1.t1 and s2.t2 as the ordering may
> not be guaranteed.  The commands arrays also looked inconsistent with
> the rest.  Attached is an updated patch with some format modifications
> and the fixes for the tests.
Attached v8 based on your v7 + the fix for the dummy part.

reindex_parallel_v8.diff (25K) Download Attachment
1234