Add parallelism and glibc dependent only options to reindexdb

Re: Add parallelism and glibc dependent only options to reindexdb

Tomas Vondra
On Tue, Jul 02, 2019 at 10:45:44AM +0200, Julien Rouhaud wrote:

>On Tue, Jul 2, 2019 at 10:28 AM Tomas Vondra
><[hidden email]> wrote:
>>
>> I wonder why this is necessary:
>>
>> pg_log_error("cannot reindex glibc dependent objects and a subset of objects");
>>
>> What's the reasoning behind that? It seems like a valid use case to me -
>> imagine you have a bug database, but only a couple of tables are used by
>> the application regularly (the rest may be archive tables, for example).
>> Why not allow rebuilding glibc-dependent indexes on the used tables, so
>> that the database can be opened for users sooner.
>
>It just seemed wrong to me to allow partial processing for something
>that's aimed at preventing corruption.  I'd think that if users are
>knowledgeable enough to only reindex a subset of indexes/tables in
>such cases, they can also discard indexes that don't get affected by a
>collation lib upgrade.  I'm not strongly opposed to supporting it
>though, as there indeed can be valid use cases.
>

I don't know, it just seems like an unnecessary limitation.

>> BTW now that we allow rebuilding only some of the indexes, it'd be great
>> to have a dry-run mode, where we just print which indexes will be rebuilt
>> without actually rebuilding them.
>
>+1.  If we end up doing the filtering in the backend, we'd have to add
>such an option to the REINDEX command, and actually issue all the
>commands to retrieve the list.

Hmmm, yeah. FWIW I'm not requesting v0 to have that feature, but it'd be
good to design the feature in a way that allows adding it later.
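(For reference, and ignoring any glibc filter, the list such a dry run
would print for a single table is essentially what this query returns,
with 'some_table' standing in for whatever would be passed with -t:)

    SELECT indexrelid::regclass AS index_name
    FROM pg_index
    WHERE indrelid = 'some_table'::regclass;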


regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Add parallelism and glibc dependent only options to reindexdb

Peter Eisentraut
In reply to this post by Julien Rouhaud
On 2019-07-02 10:30, Julien Rouhaud wrote:
> That's a great idea, and would make the parallelism in reindexdb much
> simpler.  There's however a downside, as users won't have a way to
> benefit from index filtering until they upgrade to this version.  OTOH
> glibc 2.28 is already there, and a hypothetical fancy reindexdb is far
> from being released.

Isn't that also the case for your proposal?  We are not going to release
a new reindexdb before a new REINDEX.

--
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Add parallelism and glibc dependent only options to reindexdb

Peter Eisentraut
In reply to this post by Julien Rouhaud
On 2019-07-02 10:45, Julien Rouhaud wrote:
> It just seemed wrong to me to allow partial processing for something
> that's aimed at preventing corruption.  I'd think that if users are
> knowledgeable enough to only reindex a subset of indexes/tables in
> such cases, they can also discard indexes that don't get affected by a
> collation lib upgrade.  I'm not strongly opposed to supporting it
> though, as there indeed can be valid use cases.

We are moving in this direction.  Thomas Munro has proposed an approach
for tracking collation versions on a per-object level rather than
per-database.  So then we'd need a way to reindex not all indexes
affected by collation, but only those affected by collation and not yet
fixed.

One could also imagine a behavior where not-yet-fixed indexes are simply
ignored by the planner.  So the gradual upgrading approach that Tomas
described is absolutely a possibility.
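(As a point of reference, the per-collation version tracking that exists
today can already be queried like this -- a rough sketch only, since
whether a version is recorded at all depends on the collation provider
and server version:)

    SELECT collname,
           collversion AS recorded_version,
           pg_collation_actual_version(oid) AS actual_version
    FROM pg_collation
    WHERE collversion IS NOT NULL;

Rows where the two versions differ are the suspect ones; the per-object
tracking discussed above would narrow that down to the individual
indexes that still need a rebuild.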

--
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Add parallelism and glibc dependent only options to reindexdb

Julien Rouhaud
In reply to this post by Peter Eisentraut
On Fri, Jul 5, 2019 at 6:16 PM Peter Eisentraut
<[hidden email]> wrote:

>
> On 2019-07-02 10:30, Julien Rouhaud wrote:
> > That's a great idea, and would make the parallelism in reindexdb much
> > simpler.  There's however a downside, as users won't have a way to
> > benefit from index filtering until they upgrade to this version.  OTOH
> > glibc 2.28 is already there, and a hypothetical fancy reindexdb is far
> > from being released.
>
> Isn't that also the case for your proposal?  We are not going to release
> a new reindexdb before a new REINDEX.

Sure, but my point was that once the new reindexdb is released (or if
you're so desperate, using a nightly build or compiling your own), it
can be used against any previous major version.  There is probably a
large fraction of users who don't perform a postgres upgrade when they
upgrade their OS, so that's IMHO also something to consider.


Re: Add parallelism and glibc dependent only options to reindexdb

Michael Paquier
On Fri, Jul 05, 2019 at 07:25:41PM +0200, Julien Rouhaud wrote:
> On Fri, Jul 5, 2019 at 6:16 PM Peter Eisentraut <[hidden email]> wrote:
>> Isn't that also the case for your proposal?  We are not going to release
>> a new reindexdb before a new REINDEX.
>
> Sure, but my point was that once the new reindexdb is released (or if
> you're so desperate, using a nightly build or compiling your own), it
> can be used against any previous major version.  There is probably a
> large fraction of users who don't perform a postgres upgrade when they
> upgrade their OS, so that's IMHO also something to consider.

I think that we need to think long-term here and accept that we will
keep seeing breakages with collations and glibc, so we should pick a
solution that we think is the right API.  Peter's idea to make the
backend command aware of the filtering is cool.  On top of that, there
is no need to add any conflict logic in reindexdb and we can live with
restricting --jobs support for non-index objects.
--
Michael


Re: Add parallelism and glibc dependent only options to reindexdb

Julien Rouhaud
On Mon, Jul 8, 2019 at 9:57 AM Michael Paquier <[hidden email]> wrote:

>
> On Fri, Jul 05, 2019 at 07:25:41PM +0200, Julien Rouhaud wrote:
> > On Fri, Jul 5, 2019 at 6:16 PM Peter Eisentraut <[hidden email]> wrote:
> >> Isn't that also the case for your proposal?  We are not going to release
> >> a new reindexdb before a new REINDEX.
> >
> > Sure, but my point was that once the new reindexdb is released (or if
> > you're so desperate, using a nightly build or compiling your own), it
> > can be used against any previous major version.  There is probably a
> > large fraction of users who don't perform a postgres upgrade when they
> > upgrade their OS, so that's IMHO also something to consider.
>
> I think that we need to think long-term here and accept that we will
> keep seeing breakages with collations and glibc, so we should pick a
> solution that we think is the right API.  Peter's idea to make the
> backend command aware of the filtering is cool.  On top of that, there
> is no need to add any conflict logic in reindexdb and we can live with
> restricting --jobs support for non-index objects.

Don't get me wrong, I do agree that implementing filtering in the
backend is a better design.  What's bothering me is that I also agree
that there will be more glibc breakage, and if that happens within a
few years, a lot of people will still be using PG 12 or older, and they
still won't have an efficient way to rebuild their indexes.  Now, it'd
be easy to publish an external tool that does a simple
parallel-and-glibc-filtering reindex, which would serve that purpose
for the few years it'll be needed, so everyone can be happy.

For now, I'll resubmit the parallel patch using a per-table-only
approach, and will submit the filtering in the backend using a new
REINDEX option in a different thread.



Re: Add parallelism and glibc dependent only options to reindexdb

Julien Rouhaud
On Mon, Jul 8, 2019 at 9:08 PM Julien Rouhaud <[hidden email]> wrote:
>
> I'll resubmit the parallel patch using per-table only
> approach

Attached.

0001-Export-vacuumdb-s-parallel-infrastructure.patch (22K)
0002-Add-parallel-processing-to-reindexdb.patch (22K)
Re: Add parallelism and glibc dependent only options to reindexdb

Bruce Momjian
In reply to this post by Alvaro Herrera
On Mon, Jul  1, 2019 at 09:51:12AM -0400, Alvaro Herrera wrote:

> Now that we have REINDEX CONCURRENTLY, I think reindexdb is going to
> gain more popularity.
>
> Please don't reuse a file name as generic as "parallel.c" -- it's
> annoying when navigating source.  Maybe conn_parallel.c multiconn.c
> connscripts.c admconnection.c ...?
>
> If your server crashes or is stopped midway during the reindex, you
> would have to start again from scratch, and it's tedious (if it's
> possible at all) to determine which indexes were missed.  I think it
> would be useful to have a two-phase mode: in the initial phase reindexdb
> computes the list of indexes to be reindexed and saves them into a work
> table somewhere.  In the second phase, it reads indexes from that table
> and processes them, marking them as done in the work table.  If the
> second phase crashes or is stopped, it can be restarted and consults the
> work table.  I would keep the work table, as it provides a bit of an
> audit trail.  It may be important to be able to run even if unable to
> create such a work table (because of the <ironic>numerous</> users that
> DROP DATABASE postgres).
>
> Maybe we'd have two flags in the work table for each index:
> "reindex requested", "reindex done".

I think we have a similar issue with adding checksums, so let's address
it with a generic framework and use it for all cases, like vacuumdb too.
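
(For illustration, a minimal sketch of the work table Alvaro describes --
all names here are made up, nothing from an actual patch:)

    CREATE TABLE reindex_worklist (
        index_name        text PRIMARY KEY,
        reindex_requested boolean NOT NULL DEFAULT true,
        reindex_done      boolean NOT NULL DEFAULT false
    );

    -- phase 1: compute and record the list of indexes to process
    INSERT INTO reindex_worklist (index_name)
    SELECT indexrelid::regclass::text FROM pg_index;  -- or a filtered subset

    -- phase 2: mark each entry once its REINDEX has completed
    UPDATE reindex_worklist SET reindex_done = true
    WHERE index_name = 'public.some_index';

A restarted run would simply skip the rows where reindex_done is already
set, and the table doubles as the audit trail mentioned above.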

--
  Bruce Momjian  <[hidden email]>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +


Re: Add parallelism and glibc dependent only options to reindexdb

Michael Paquier
In reply to this post by Julien Rouhaud
On Mon, Jul 08, 2019 at 11:02:14PM +0200, Julien Rouhaud wrote:
> On Mon, Jul 8, 2019 at 9:08 PM Julien Rouhaud <[hidden email]> wrote:
>> I'll resubmit the parallel patch using per-table only
>> approach
>
> Attached.

I have had a look at this patch set with a focus on the refactoring
part, and the split is a bit confusing.

+void
+DisconnectDatabase(ParallelSlot *slot)
+{
+   char        errbuf[256];
common.c has already an API to connect to a database.  It would be
more natural to move the disconnection part also to common.c and have
the caller of DisconnectDatabase reset the slot connection by itself?
disconnectDatabase() (lower case for the first character) would make
the set more consistent.  We could also have a wrapper like say
DiscardSlot() which does this work, but that seems like an overkill
for a single slot if one API could do the cleanup of the full set.

$ git grep select_loop
scripts_parallel.c:     /* We must reconstruct the fd_set for each call to select_loop */
scripts_parallel.c:     i = select_loop(maxFd, &slotset, &aborting);
scripts_parallel.c:select_loop(int maxFd, fd_set *workerset, bool *aborting)
scripts_parallel.h:extern int   select_loop(int maxFd, fd_set *workerset, bool *aborting);

select_loop is only present in scripts_parallel.c, so it can remain
static.

+       slots = (ParallelSlot *) pg_malloc(sizeof(ParallelSlot) * concurrentCons);
+       init_slot(slots, conn);
+       if (parallel)
+       {
+               for (i = 1; i < concurrentCons; i++)
+               {
+                       conn = connectDatabase(dbname, host, port, username, prompt_password,
+                                              progname, echo, false, true);
+                       init_slot(slots + i, conn);
+               }
+       }

This comes from 0002 and could be refactored further, as vacuumdb does the
same thing.  Based on 0001, init_slot() is now called in vacuumdb.c
and initializes a set of slots while connecting to a given database.
In short, the input is a set of parameters and a request to open
connections with N slots, and the return result is a pg_malloc'd
array of slots ready to be used.  We could just call that
ParallelSlotInit() (if you have a better name feel free).

+    /*
+     * Get the connection slot to use.  If in parallel mode, here we wait
+     * for one connection to become available if none already is. In
+     * non-parallel mode we simply use the only slot we have, which we
+     * know to be free.
+     */
+    if (parallel)
This business also is duplicated in both reindexdb.c and vacuumdb.c.

+bool
+GetQueryResult(PGconn *conn, const char *progname)
+{
This also does not stick with the parallel stuff, as that's basically
only getting a query result.  We could stick that into common.c.

Patch 2 has no documentation.  The option as well as the restrictions
in place need to be documented properly.

Here is a small idea about the set of routines we could have for the
parallel stuff, with only three of them needed to work on the parallel
slots and get free connections:
- Initialization of the full slot set.
- Cleanup and disconnection of the slots.
- Fetch an idle connection, waiting until one becomes available.
--
Michael


Re: Add parallelism and glibc dependent only options to reindexdb

Julien Rouhaud
Thanks for the review.

On Tue, Jul 9, 2019 at 9:24 AM Michael Paquier <[hidden email]> wrote:

>
> On Mon, Jul 08, 2019 at 11:02:14PM +0200, Julien Rouhaud wrote:
> > On Mon, Jul 8, 2019 at 9:08 PM Julien Rouhaud <[hidden email]> wrote:
> >> I'll resubmit the parallel patch using per-table only
> >> approach
> >
> > Attached.
>
> I have done a lookup of this patch set with a focus on the refactoring
> part, and the split is a bit confusing.

Yes, that wasn't a smart split  :(

> +void
> +DisconnectDatabase(ParallelSlot *slot)
> +{
> +   char        errbuf[256];
> common.c has already an API to connect to a database.  It would be
> more natural to move the disconnection part also to common.c and have
> the caller of DisconnectDatabase reset the slot connection by itself?

Ok.

> $ git grep select_loop
> scripts_parallel.c:     /* We must reconstruct the fd_set for each call to select_loop */
> scripts_parallel.c:     i = select_loop(maxFd, &slotset, &aborting);
> scripts_parallel.c:select_loop(int maxFd, fd_set *workerset, bool *aborting)
> scripts_parallel.h:extern int   select_loop(int maxFd, fd_set *workerset, bool *aborting);
>
> select_loop is only present in scripts_parallel.c, so it can remain
> static.

Good point.

> +       slots = (ParallelSlot *) pg_malloc(sizeof(ParallelSlot) * concurrentCons);
> +       init_slot(slots, conn);
> +       if (parallel)
> +       {
> +               for (i = 1; i < concurrentCons; i++)
> +               {
> +                       conn = connectDatabase(dbname, host, port, username, prompt_password,
> +                                              progname, echo, false, true);
> +                       init_slot(slots + i, conn);
> +               }
> +       }
>
> This comes from 0002 and could be more refactored as vacuumdb does the
> same thing.  Based on 0001, init_slot() is called now in vacuumdb.c
> and initializes a set of slots while connecting to a given database.
> In short, in input we have a set of parameters and the ask to open
> connections with N slots, and the return result is an pg_malloc'd
> array of slots ready to be used.  We could just call that
> ParallelSlotInit() (if you have a better name feel free).

Given how the rest of the functions are named, I'll probably use
InitParallelSlots().

>
> +    /*
> +     * Get the connection slot to use.  If in parallel mode, here we wait
> +     * for one connection to become available if none already is. In
> +     * non-parallel mode we simply use the only slot we have, which we
> +     * know to be free.
> +     */
> +    if (parallel)
> This business also is duplicated in both reindexdb.c and vacuumdb.c.
>
> +bool
> +GetQueryResult(PGconn *conn, const char *progname)
> +{
> This also does not stick with the parallel stuff, as that's basically
> only getting a query result.  We could stick that into common.c.

This function also has a bad name, as it's discarding the result via
ProcessQueryResult.  Maybe we should rename them to GetQuerySuccess()
and ConsumeAndTrashQueryResult()?

> Patch 2 has no documentation.  The option as well as the restrictions
> in place need to be documented properly.

I forgot that I had forgotten to add documentation :( will fix this time.


Re: Add parallelism and glibc dependent only options to reindexdb

Peter Eisentraut
In reply to this post by Julien Rouhaud
On 2019-07-08 21:08, Julien Rouhaud wrote:
> Don't get me wrong, I do agree that implementing filtering in the
> backend is a better design.  What's bothering me is that I also agree
> that there will be more glibc breakage, and if that happens within a
> few years, a lot of people will still be using PG 12 or older, and they
> still won't have an efficient way to rebuild their indexes.  Now, it'd
> be easy to publish an external tool that does a simple
> parallel-and-glibc-filtering reindex, which would serve that purpose
> for the few years it'll be needed, so everyone can be happy.

You can already do that: Run a query through psql to get a list of
affected tables or indexes and feed those to reindexdb using -i or -t
options.
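
For example, something along these lines (a rough over-approximation: it
flags every index that has at least one column using a collation other
than C or POSIX, which includes the database default collation even on
systems where that default does not come from glibc):

    SELECT DISTINCT i.indexrelid::regclass AS index_name
    FROM pg_index AS i
    JOIN pg_collation AS coll
      ON coll.oid = ANY (string_to_array(i.indcollation::text, ' ')::oid[])
    WHERE coll.collname NOT IN ('C', 'POSIX');

The resulting names can then be passed to reindexdb, one -i switch per
index.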

--
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Add parallelism and glibc dependent only options to reindexdb

Julien Rouhaud
In reply to this post by Julien Rouhaud
On Tue, Jul 9, 2019 at 9:52 AM Julien Rouhaud <[hidden email]> wrote:
>
> On Tue, Jul 9, 2019 at 9:24 AM Michael Paquier <[hidden email]> wrote:
> >
> > I have done a lookup of this patch set with a focus on the refactoring
> > part, and the split is a bit confusing.
> [...]

I've finished a better refactoring, and ended up with this API in
scripts_parallel:

extern ParallelSlot *ConsumeIdleSlot(ParallelSlot *slots, int numslots,
const char *progname);

extern ParallelSlot *SetupParallelSlots(const char *dbname, const char *host,
const char *port,
const char *username, bool prompt_password,
const char *progname, bool echo,
PGconn *conn, int numslots);

extern bool WaitForSlotsCompletion(ParallelSlot *slots, int numslots,
   const char *progname);

ConsumeIdleSlot() is a wrapper on top of the (now static) GetIdleSlot,
which handles parallelism and possible failure.

Attached v3, including updated documentation for the new -j option.

0002-Add-parallel-processing-to-reindexdb-v3.patch (23K)
0001-Export-vacuumdb-s-parallel-infrastructure-v3.patch (30K)

Re: Add parallelism and glibc dependent only options to reindexdb

Michael Paquier
In reply to this post by Peter Eisentraut
On Tue, Jul 09, 2019 at 01:09:38PM +0200, Peter Eisentraut wrote:
> You can already do that: Run a query through psql to get a list of
> affected tables or indexes and feed those to reindexdb using -i or -t
> options.

Sure, but that's limited if one can only afford a limited amount of
downtime for an upgrade window, and you still need to properly handle
the index-level conflicts when doing the processing in parallel.
--
Michael

Re: Add parallelism and glibc dependent only options to reindexdb

Alvaro Herrera
In reply to this post by Julien Rouhaud
On 2019-Jul-09, Julien Rouhaud wrote:

> I finished to do a better refactoring, and ended up with this API in
> scripts_parallel:

Looking good!  I'm not sure about the "Consume" word in ConsumeIdleSlot;
maybe "Reserve"? "Obtain"? "Get"?

Code commentary: I think the comment that sits atop the function should
describe what the function does without getting too much into how it does
it.  For example in ConsumeIdleSlot you have "If there are multiples
slots, here we wait for one connection to become available if none
already is, returning NULL if an error occured.  Otherwise, we simply
use the only slot we have, which we know to be free." which seems like
it should be in another comment *inside* the function; make the external
one something like "Reserve and return a connection that is currently
idle, waiting until one becomes idle if none is".  Maybe you can put the
part I first quoted as a second paragraph in the comment at the top of
the function and keep the second part I quoted as the first paragraph; we
seem to use that style too.

Placement: I think it's good if related functions stay together, or
there is some other rationale for placement within the file.  I have two
favorite approaches: one is to put all externally callable functions at
top of file, followed by all the static helpers in the lower half of the
file.  The other is to put each externally accessible function
immediately followed by its specific static helpers.  If you choose one
of those,
that means that SetupParallelSlots should either move upwards, or move
downwards.  The current ordering seems a dartboard kind of thing where
the thrower is not Green Arrow.

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Add parallelism and glibc dependent only options to reindexdb

Julien Rouhaud
Hi Alvaro,

Thanks a lot for the review

On Wed, Jul 10, 2019 at 4:15 PM Alvaro Herrera <[hidden email]> wrote:
>
> On 2019-Jul-09, Julien Rouhaud wrote:
>
> > I finished to do a better refactoring, and ended up with this API in
> > scripts_parallel:
>
> Looking good!

Thanks!

>  I'm not sure about the "Consume" word in ConsumeIdleSlot;
> maybe "Reserve"? "Obtain"? "Get"?

Yes, Consume is maybe a little bit weird.  I wanted to make it clear
that this function is actually removing a slot from the
free list, especially since there's a (historical) get_idle_slot().  I
like Reserve, but Obtain and Get are probably too ambiguous.

> Code commentary: I think the comment that sits atop the function should
> describe what the function does without getting too much in how it does
> it.  For example in ConsumeIdleSlot you have "If there are multiples
> slots, here we wait for one connection to become available if none
> already is, returning NULL if an error occured.  Otherwise, we simply
> use the only slot we have, which we know to be free." which seems like
> it should be in another comment *inside* the function; make the external
> one something like "Reserve and return a connection that is currently
> idle, waiting until one becomes idle if none is".  Maybe you can put the
> part I first quoted as a second paragraph in the comment at top of
> function and keeping the second part I quoted as first paragraph; we
> seem to use that style too.

Good point, I'll fix as you say.

> Placement: I think it's good if related functions stay together, or
> there is some other rationale for placement within the file.  I have two
> favorite approaches: one is to put all externally callable functions at
> top of file, followed by all the static helpers in the lower half of the
> file.  The other is to put each externally accessible immediately
> followed by its specific static helpers.  If you choose one of those,
> that means that SetupParallelSlots should either move upwards, or move
> downwards.  The current ordering seems a dartboard kind of thing where
> the thrower is not Green Arrow.

:)  I tried to put everything in alphabetic order as it was previously
being done, but I apparently failed again at sorting more than 3
characters.

I usually prefer to group exported functions together and static ones
together, as I find it more maintainable in the long term, so upwards
it'll be.


Re: Add parallelism and glibc dependent only options to reindexdb

Michael Paquier
On Wed, Jul 10, 2019 at 09:44:14PM +0200, Julien Rouhaud wrote:
> On Wed, Jul 10, 2019 at 4:15 PM Alvaro Herrera <[hidden email]> wrote:
>> Looking good!
>
> Thanks!

Confirmed.  The last set is much easier to go through.

>>  I'm not sure about the "Consume" word in ConsumeIdleSlot;
>> maybe "Reserve"? "Obtain"? "Get"?
>
> Yes, Consume is maybe a little bit weird.  I wanted to make it clear
> that this function is actually removing a slot from the
> free list, especially since there's a (historical) get_idle_slot().  I
> like Reserve, but Obtain and Get are probably too ambiguous.

The refactoring patch is getting in shape.  Now reindex_one_database()
is the only place setting and manipulating the slots.  I am wondering
if we should have a wrapper which disconnects all the slots (doing
conn = NULL after the disconnectDatabase() call does not matter).
Get* would be my choice, because we look at the set of parallel slots,
and get an idle one.  It would be nice to have more consistency in the
names for the routines, say:
- ParallelSlotInit() instead of SetupParallelSlots (still my
suggestion is not perfect either as that sounds like one single slot,
but we have a set of these).
- ParallelSlotGetIdle() instead of ConsumeIdleSlot().  Still that's
more a wait-then-get behavior.
- ParallelSlotWaitCompletion() instead of WaitForSlotsCompletion()
- ParallelSlotDisconnect, as a wrapper for the calls to
DisconnectDatabase().

>> Placement: I think it's good if related functions stay together, or
>> there is some other rationale for placement within the file.  I have two
>> favorite approaches: one is to put all externally callable functions at
>> top of file, followed by all the static helpers in the lower half of the
>> file.  The other is to put each externally accessible immediately
>> followed by its specific static helpers.  If you choose one of those,
>> that means that SetupParallelSlots should either move upwards, or move
>> downwards.  The current ordering seems a dartboard kind of thing where
>> the thrower is not Green Arrow.
>
> I usually prefer to group exported functions together and static ones
> together, as I find it more maintainable in the long term, so upwards
> it'll be.
That's mainly a matter of taste.  Depending on the code path in the
tree, sometimes the two approaches from above are used.  We have some
other files where the static routines are listed first at the top,
followed by the exported ones at the bottom, as it saves some forward
declarations at the top of the file.  Keeping the
footprint of the author is not that bad either, and that depends also
on the context.  For this one, as the static functions are linked to
the exported ones in a close manner, I would keep each set grouped
though.

+    /*
+     * Database-wide parallel reindex requires special processing.  If
+     * multiple jobs were asked, we have to reindex system catalogs first,
+     * as they can't be processed in parallel.
+     */
+               if (process_type == REINDEX_DATABASE)

In patch 0002, a parallel database REINDEX first processes the catalog
relations serially, and then all the other relations
in parallel, which is right.  Could we have schema-level reindexes also
process things in parallel with all the relations from all the schemas
listed?  This would be profitable in particular for callers listing
multiple schemas with an unbalanced number of tables in each, and we'd
need to be careful of the same where pg_catalog is listed.  Actually,
your patch breaks if we do a parallel run with pg_catalog and another
schema, no?
--
Michael


Re: Add parallelism and glibc dependent only options to reindexdb

Julien Rouhaud
On Thu, Jul 11, 2019 at 6:04 AM Michael Paquier <[hidden email]> wrote:

>
> On Wed, Jul 10, 2019 at 09:44:14PM +0200, Julien Rouhaud wrote:
> > On Wed, Jul 10, 2019 at 4:15 PM Alvaro Herrera <[hidden email]> wrote:
> >> Looking good!
> >
> > Thanks!
>
> Confirmed.  The last set is much easier to go through.
>
> >>  I'm not sure about the "Consume" word in ConsumeIdleSlot;
> >> maybe "Reserve"? "Obtain"? "Get"?
> >
> > Yes, Consume is maybe a little bit weird.  I wanted to make it clear
> > that this function is actually removing a slot from the
> > free list, especially since there's a (historical) get_idle_slot().  I
> > like Reserve, but Obtain and Get are probably too ambiguous.
>
> The refactoring patch is getting in shape.  Now reindex_one_database()
> is the only place setting and manipulating the slots.  I am wondering
> if we should have a wrapper which disconnects all the slots (doing
> conn = NULL after the disconnectDatabase() call does not matter).

You already mentioned that in a previous mail.  I was afraid it'd be
overkill, but it'll make caller code easier, so let's do it.

> Get* would be my choice, because we look at the set of parallel slots,
> and get an idle one.

That's what the former GetIdleSlot (which I renamed to get_idle_slot as
it's now static) is doing.  ConsumeIdleSlot() actually gets an idle
slot and marks it as being used.  That's probably some leakage of
internal implementation, but having a GetIdleParallelSlot (or
ParallelSlotGetIdle) *and* a get_idle_slot sounds like a bad idea, and
I don't have a better idea on how to rename get_idle_slot.  If you
really prefer Get* and you're fine with a static get_idle_slot, that's
fine by me.

>  It would be nice to have more consistency in the
> names for the routines, say:
> - ParallelSlotInit() instead of SetupParallelSlots (still my
> suggestion is not perfect either as that sounds like one single slot,
> but we have a set of these).
> - ParallelSlotGetIdle() instead of ConsumeIdleSlot().  Still that's
> more a wait-then-get behavior.
> - ParallelSlotWaitCompletion() instead of WaitForSlotsCompletion()
> - ParallelSlotDisconnect, as a wrapper for the calls to
> DisconnectDatabase().

I don't have an opinion on whether to use parallel slot as prefix or
postfix, so I'm fine with postfixing.

> +    /*
> +     * Database-wide parallel reindex requires special processing.  If
> +     * multiple jobs were asked, we have to reindex system catalogs first,
> +     * as they can't be processed in parallel.
> +     */
> +               if (process_type == REINDEX_DATABASE)
>
> In patch 0002, a parallel database REINDEX first processes the catalog
> relations serially, and then all the other relations
> in parallel, which is right.  Could we have schema-level reindexes also
> process things in parallel with all the relations from all the schemas
> listed?  This would be profitable in particular for callers listing
> multiple schemas with an unbalanced number of tables in each

It would also be beneficial for a parallel reindex of a single schema.

> and we'd
> need to be careful of the same where pg_catalog is listed.  Actually,
> your patch breaks if we do a parallel run with pg_catalog and another
> schema, no?

It can definitely cause problems if you ask for pg_catalog and another
schema, same as if you ask for a parallel reindex of some catalog tables
(possibly with other tables).  There's a --system switch for that
need, so I don't know if documenting the limitation to avoid some
extra code to deal with it is a good enough solution?


Re: Add parallelism and glibc dependent only options to reindexdb

Michael Paquier
On Thu, Jul 11, 2019 at 11:48:20AM +0200, Julien Rouhaud wrote:

> On Thu, Jul 11, 2019 at 6:04 AM Michael Paquier <[hidden email]> wrote:
>> Get* would be my choice, because we look at the set of parallel slots,
>> and get an idle one.
>
> That's what the former GetIdleSlot (which I renamed to get_idle_slot as
> it's now static) is doing.  ConsumeIdleSlot() actually gets an idle
> slot and marks it as being used.  That's probably some leakage of
> internal implementation, but having a GetIdleParallelSlot (or
> ParallelSlotGetIdle) *and* a get_idle_slot sounds like a bad idea, and
> I don't have a better idea on how to rename get_idle_slot.  If you
> really prefer Get* and you're fine with a static get_idle_slot, that's
> fine by me.
Do we actually need get_idle_slot?  ConsumeIdleSlot is its only
caller.

>> and we'd
>> need to be careful of the same where pg_catalog is listed.  Actually,
>> your patch breaks if we do a parallel run with pg_catalog and another
>> schema, no?
>
> It can definitely cause problems if you ask for pg_catalog and other
> schema, same as if you ask a parallel reindex of some catalog tables
> (possibly with other tables).  There's a --system switch for that
> need, so I don't know if documenting the limitation to avoid some
> extra code to deal with it is a good enough solution?
vacuumdb --full still has limitations in this area, and we have had
some reports about this behavior being annoying.  Its
documentation also mentions that mixing catalog relations with --full
can cause deadlocks.

Documenting it may be fine in the end, but my take is that it would be
nice to make sure that we don't have deadlocks if we can avoid them
easily.  It is also a matter of balance.  If, for example, the patch
gets 3 times bigger because of that, we may have an argument
for not doing it and keeping the code simple.  What do people think about
that?  It would be nice to get more opinions here.

And while scanning the code...

+ * getQuerySucess
Typo here.

- * Pump the conn till it's dry of results; return false if any are errors.
This comment could be improved on the way, like "Go through all the
connections and make sure to consume any remaining results.  If any
error is found, false is returned after processing all the parallel
slots."
--
Michael


Re: Add parallelism and glibc dependent only options to reindexdb

Julien Rouhaud
On Thu, Jul 11, 2019 at 3:34 PM Michael Paquier <[hidden email]> wrote:

>
> On Thu, Jul 11, 2019 at 11:48:20AM +0200, Julien Rouhaud wrote:
> > On Thu, Jul 11, 2019 at 6:04 AM Michael Paquier <[hidden email]> wrote:
> >> Get* would be my choice, because we look at the set of parallel slots,
> >> and get an idle one.
> >
> > That's what the former GetIdleSlot (which I renamed to get_idle_slot as
> > it's now static) is doing.  ConsumeIdleSlot() actually gets an idle
> > slot and marks it as being used.  That's probably some leakage of
> > internal implementation, but having a GetIdleParallelSlot (or
> > ParallelSlotGetIdle) *and* a get_idle_slot sounds like a bad idea, and
> > I don't have a better idea on how to rename get_idle_slot.  If you
> > really prefer Get* and you're fine with a static get_idle_slot, that's
> > fine by me.
>
> Do we actually need get_idle_slot?  ConsumeIdleSlot is its only
> caller.

I think that it makes the code quite a bit cleaner to have the selection
outside ConsumeIdleSlot.

> And while scanning the code...
>
> + * getQuerySucess
> Typo here.

Argh, I thought I caught all of them, thanks!

> - * Pump the conn till it's dry of results; return false if any are errors.
> This comment could be improved on the way, like "Go through all the
> connections and make sure to consume any remaining results.  If any
> error is found, false is returned after processing all the parallel
> slots."

You're talking about getQuerySuccess, right?  That was actually the
original comment of a function I renamed.  +1 to improve it, but this
function is in common.c and doesn't deal with parallel slots at all, so
I'll just drop the slang parts.


Re: Add parallelism and glibc dependent only options to reindexdb

Michael Paquier
On Thu, Jul 11, 2019 at 06:22:25PM +0200, Julien Rouhaud wrote:
> I think that it makes the code quite a bit cleaner to have the selection
> outside ConsumeIdleSlot.

Actually, you have an issue with ConsumeIdleSlot() if there is only
one parallel slot, no?  In this case the current patch immediately
returns the available slot without waiting.  I think that we
should wait until the slot becomes free in that case as well, and
switch isFree to false.  If you want to keep things split, that's
fine by me, I would still use "Get" within the name for the routine,
and rename the other to get_idle_slot_internal() or
get_idle_slot_guts() to point out that it has an internal role.

> You're talking about getQuerySuccess right? That was actually the
> original comment of a function I renamed.  +1 to improve it, but this
> function is in common.c and doesn't deal with parallel slot at all, so
> I'll just drop the slang parts.

If we can design a clean interface with better comments, we can use
this occasion to browse the whole thing and make it better.
--
Michael
