Add parallelism and glibc dependent only options to reindexdb

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
43 messages Options
123
Reply | Threaded
Open this post in threaded view
|

Add parallelism and glibc dependent only options to reindexdb

Julien Rouhaud
Hi,

With the glibc 2.28 coming, all users will have to reindex almost
every indexes after a glibc upgrade to guarantee the lack of
corruption.  Unfortunately, reindexdb is not ideal for that as it's
processing everything using a single connexion and isn't able to
discard indexes that doesn't depend on a glibc collation.

PFA a patchset to add parallelism to reindexdb (reusing the
infrastructure in vacuumdb with some additions) and an option to
discard indexes that doesn't depend on glibc (without any specific
collation filtering or glibc version detection), with updated
regression tests.  Note that this should be applied on top of the
existing reindexdb cleanup & refactoring patch
(https://commitfest.postgresql.org/23/2115/).

This was sponsored by VMware, and has been discussed internally with
Kevin and Michael, in Cc.

0003-Add-parallel-processing-to-reindexdb.patch (36K) Download Attachment
0001-Export-vacuumdb-s-parallel-infrastructure.patch (22K) Download Attachment
0004-Add-a-glibc-dependent-option.patch (34K) Download Attachment
0002-Add-a-SimplePtrList-implementation.patch (2K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Add parallelism and glibc dependent only options to reindexdb

Michael Paquier-2
On Sun, Jun 30, 2019 at 11:45:47AM +0200, Julien Rouhaud wrote:
> With the glibc 2.28 coming, all users will have to reindex almost
> every indexes after a glibc upgrade to guarantee the lack of
> corruption.  Unfortunately, reindexdb is not ideal for that as it's
> processing everything using a single connexion and isn't able to
> discard indexes that doesn't depend on a glibc collation.

We have seen that with a database of up to 100GB we finish by cutting
the reindex time from 30 minutes to a couple of minutes with a schema
we work on.  Julien, what were the actual numbers?

> PFA a patchset to add parallelism to reindexdb (reusing the
> infrastructure in vacuumdb with some additions) and an option to
> discard indexes that doesn't depend on glibc (without any specific
> collation filtering or glibc version detection), with updated
> regression tests.  Note that this should be applied on top of the
> existing reindexdb cleanup & refactoring patch
> (https://commitfest.postgresql.org/23/2115/).

Please note that patch 0003 does not seem to apply correctly on HEAD
as of c74d49d4.  Here is also a small description of each patch:
- 0001 refactors the connection slot facility from vacuumdb.c into a
new, separate file called parallel.c in src/bin/scripts/.  This is not
really fancy as some code only moves around.
- 0002 adds an extra option for simple lists to be able to use
pointers, with an interface to append elements in it.
- 0003 begins to be the actual fancy thing with the addition of a
--jobs option into reindexdb.  The main issue here which should be
discussed is that when it comes to reindex of tables, you basically
are not going to have any conflicts between the objects manipulated.
However if you wish to do a reindex on a set of indexes then things
get more tricky as it is necessary to list items per-table so as
multiple connections do not conflict with each other if attempting to
work on multiple indexes of the same table.  What this patch does is
to select the set of indexes which need to be worked on (see the
addition of cell in ParallelSlot), and then does a kind of
pre-planning of each item into the connection slots so as each
connection knows from the beginning which items it needs to process.
This is quite different from vacuumdb where a new item is distributed
only on a free connection from a unique list.  I'd personally prefer
if we keep the facility in parallel.c so as it is only
execution-dependent and that we have no pre-planning.  This would
require keeping within reindexdb.c an array of lists, with one list
corresponding to one connection instead which feels more natural.
- 0004 is the part where the concurrent additions really matter as
this consists in applying an extra filter to the indexes selected so
as only the glibc-sensitive indexes are chosen for the processing.
--
Michael

signature.asc (849 bytes) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Add parallelism and glibc dependent only options to reindexdb

Julien Rouhaud
On Mon, Jul 1, 2019 at 10:55 AM Michael Paquier <[hidden email]> wrote:

>
> On Sun, Jun 30, 2019 at 11:45:47AM +0200, Julien Rouhaud wrote:
> > With the glibc 2.28 coming, all users will have to reindex almost
> > every indexes after a glibc upgrade to guarantee the lack of
> > corruption.  Unfortunately, reindexdb is not ideal for that as it's
> > processing everything using a single connexion and isn't able to
> > discard indexes that doesn't depend on a glibc collation.
>
> We have seen that with a database of up to 100GB we finish by cutting
> the reindex time from 30 minutes to a couple of minutes with a schema
> we work on.  Julien, what were the actual numbers?

I did my benchmarking using a quite ideal database, having a large
number of tables and various set of indexes, for a 75 GB total size.
This was done on my laptop which has 6 multithreaded cores (and crappy
IO), also keeping the default max_parallel_maintenance_worker = 2.

A naive reindexdb took approximately 33 minutes.  Filtering the list
of indexes took that down to slightly less than 15 min, but of course
each database will have a different ratio there.

Then, keeping the --glibc-dependent and using different level of parallelism:

-j1: ~ 14:50
-j3: ~ 7:30
-j6: ~ 5:23
-j8: ~ 4:45

That's pretty much the kind of results I was expecting given the
hardware I used.

> > PFA a patchset to add parallelism to reindexdb (reusing the
> > infrastructure in vacuumdb with some additions) and an option to
> > discard indexes that doesn't depend on glibc (without any specific
> > collation filtering or glibc version detection), with updated
> > regression tests.  Note that this should be applied on top of the
> > existing reindexdb cleanup & refactoring patch
> > (https://commitfest.postgresql.org/23/2115/).
>
> Please note that patch 0003 does not seem to apply correctly on HEAD
> as of c74d49d4.

Yes, this is because this patchset has to be applied on top of the
reindexdb refactoring patch mentioned.  It's sad that we don't have a
good way to deal with that kind of dependency, as it's also breaking
Thomas' cfbot :(

> - 0003 begins to be the actual fancy thing with the addition of a
> --jobs option into reindexdb.  The main issue here which should be
> discussed is that when it comes to reindex of tables, you basically
> are not going to have any conflicts between the objects manipulated.
> However if you wish to do a reindex on a set of indexes then things
> get more tricky as it is necessary to list items per-table so as
> multiple connections do not conflict with each other if attempting to
> work on multiple indexes of the same table.  What this patch does is
> to select the set of indexes which need to be worked on (see the
> addition of cell in ParallelSlot), and then does a kind of
> pre-planning of each item into the connection slots so as each
> connection knows from the beginning which items it needs to process.
> This is quite different from vacuumdb where a new item is distributed
> only on a free connection from a unique list.  I'd personally prefer
> if we keep the facility in parallel.c so as it is only
> execution-dependent and that we have no pre-planning.  This would
> require keeping within reindexdb.c an array of lists, with one list
> corresponding to one connection instead which feels more natural.

My fear here is that this approach would add some extra complexity,
especially requiring to deal with free connection handling both in
GetIdleSlot() and the main reindexdb loop.  Also, the pre-planning
allows us to start processing the biggest tables first, which
optimises the overall runtime.


Reply | Threaded
Open this post in threaded view
|

Re: Add parallelism and glibc dependent only options to reindexdb

Alvaro Herrera-9
In reply to this post by Michael Paquier-2
Now that we have REINDEX CONCURRENTLY, I think reindexdb is going to
gain more popularity.

Please don't reuse a file name as generic as "parallel.c" -- it's
annoying when navigating source.  Maybe conn_parallel.c multiconn.c
connscripts.c admconnection.c ...?

If your server crashes or is stopped midway during the reindex, you
would have to start again from scratch, and it's tedious (if it's
possible at all) to determine which indexes were missed.  I think it
would be useful to have a two-phase mode: in the initial phase reindexdb
computes the list of indexes to be reindexed and saves them into a work
table somewhere.  In the second phase, it reads indexes from that table
and processes them, marking them as done in the work table.  If the
second phase crashes or is stopped, it can be restarted and consults the
work table.  I would keep the work table, as it provides a bit of an
audit trail.  It may be important to be able to run even if unable to
create such a work table (because of the <ironic>numerous</> users that
DROP DATABASE postgres).

Maybe we'd have two flags in the work table for each index:
"reindex requested", "reindex done".
   
The "glibc filter" thing (which I take to mean "indexes that depend on
collations") would apply to the first phase: it just skips adding other
indexes to the work table.  I suppose ICU collations are not affected,
so the filter would be for glibc collations only?  The --glibc-dependent
switch seems too ad-hoc.  Maybe "--exclude-rule=glibc"?  That way we can
add other rules later.  (Not "--exclude=foo" because we'll want to add
the possibility to ignore specific indexes by name.)

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Reply | Threaded
Open this post in threaded view
|

Re: Add parallelism and glibc dependent only options to reindexdb

Tom Lane-2
In reply to this post by Michael Paquier-2
Michael Paquier <[hidden email]> writes:

> - 0003 begins to be the actual fancy thing with the addition of a
> --jobs option into reindexdb.  The main issue here which should be
> discussed is that when it comes to reindex of tables, you basically
> are not going to have any conflicts between the objects manipulated.
> However if you wish to do a reindex on a set of indexes then things
> get more tricky as it is necessary to list items per-table so as
> multiple connections do not conflict with each other if attempting to
> work on multiple indexes of the same table.  What this patch does is
> to select the set of indexes which need to be worked on (see the
> addition of cell in ParallelSlot), and then does a kind of
> pre-planning of each item into the connection slots so as each
> connection knows from the beginning which items it needs to process.
> This is quite different from vacuumdb where a new item is distributed
> only on a free connection from a unique list.  I'd personally prefer
> if we keep the facility in parallel.c so as it is only
> execution-dependent and that we have no pre-planning.  This would
> require keeping within reindexdb.c an array of lists, with one list
> corresponding to one connection instead which feels more natural.

Couldn't we make this enormously simpler and less bug-prone by just
dictating that --jobs applies only to reindex-table operations?

> - 0004 is the part where the concurrent additions really matter as
> this consists in applying an extra filter to the indexes selected so
> as only the glibc-sensitive indexes are chosen for the processing.

I think you'd be better off to define and document this as "reindex
only collation-sensitive indexes", without any particular reference
to a reason why somebody might want to do that.

                        regards, tom lane


Reply | Threaded
Open this post in threaded view
|

Re: Add parallelism and glibc dependent only options to reindexdb

Julien Rouhaud
In reply to this post by Alvaro Herrera-9
On Mon, Jul 1, 2019 at 3:51 PM Alvaro Herrera <[hidden email]> wrote:
>
> Please don't reuse a file name as generic as "parallel.c" -- it's
> annoying when navigating source.  Maybe conn_parallel.c multiconn.c
> connscripts.c admconnection.c ...?

I could use scripts_parallel.[ch] as I've already used it in the #define part?

> If your server crashes or is stopped midway during the reindex, you
> would have to start again from scratch, and it's tedious (if it's
> possible at all) to determine which indexes were missed.  I think it
> would be useful to have a two-phase mode: in the initial phase reindexdb
> computes the list of indexes to be reindexed and saves them into a work
> table somewhere.  In the second phase, it reads indexes from that table
> and processes them, marking them as done in the work table.  If the
> second phase crashes or is stopped, it can be restarted and consults the
> work table.  I would keep the work table, as it provides a bit of an
> audit trail.  It may be important to be able to run even if unable to
> create such a work table (because of the <ironic>numerous</> users that
> DROP DATABASE postgres).

Or we could create a table locally in each database, that would fix
this problem and probably make the code simpler?

It also raises some additional concerns about data expiration.  I
guess that someone could launch the tool by mistake, kill reindexdb,
and run it again 2 months later while a lot of new objects have been
added for instance.

> The "glibc filter" thing (which I take to mean "indexes that depend on
> collations") would apply to the first phase: it just skips adding other
> indexes to the work table.  I suppose ICU collations are not affected,
> so the filter would be for glibc collations only?

Indeed, ICU shouldn't need such a filter.  xxx_pattern_ops based
indexes are also excluded.

>  The --glibc-dependent
> switch seems too ad-hoc.  Maybe "--exclude-rule=glibc"?  That way we can
> add other rules later.  (Not "--exclude=foo" because we'll want to add
> the possibility to ignore specific indexes by name.)

That's a good point, I like the --exclude-rule switch.


Reply | Threaded
Open this post in threaded view
|

Re: Add parallelism and glibc dependent only options to reindexdb

Julien Rouhaud
In reply to this post by Tom Lane-2
On Mon, Jul 1, 2019 at 4:10 PM Tom Lane <[hidden email]> wrote:

>
> Michael Paquier <[hidden email]> writes:
> > - 0003 begins to be the actual fancy thing with the addition of a
> > --jobs option into reindexdb.  The main issue here which should be
> > discussed is that when it comes to reindex of tables, you basically
> > are not going to have any conflicts between the objects manipulated.
> > However if you wish to do a reindex on a set of indexes then things
> > get more tricky as it is necessary to list items per-table so as
> > multiple connections do not conflict with each other if attempting to
> > work on multiple indexes of the same table.  What this patch does is
> > to select the set of indexes which need to be worked on (see the
> > addition of cell in ParallelSlot), and then does a kind of
> > pre-planning of each item into the connection slots so as each
> > connection knows from the beginning which items it needs to process.
> > This is quite different from vacuumdb where a new item is distributed
> > only on a free connection from a unique list.  I'd personally prefer
> > if we keep the facility in parallel.c so as it is only
> > execution-dependent and that we have no pre-planning.  This would
> > require keeping within reindexdb.c an array of lists, with one list
> > corresponding to one connection instead which feels more natural.
>
> Couldn't we make this enormously simpler and less bug-prone by just
> dictating that --jobs applies only to reindex-table operations?

That would also mean that we'll have to fallback on doing reindex at
table-level, even if we only want to reindex indexes that depends on
glibc.  I'm afraid that this will often add a huge penalty.

> > - 0004 is the part where the concurrent additions really matter as
> > this consists in applying an extra filter to the indexes selected so
> > as only the glibc-sensitive indexes are chosen for the processing.
>
> I think you'd be better off to define and document this as "reindex
> only collation-sensitive indexes", without any particular reference
> to a reason why somebody might want to do that.

We should still document that indexes based on ICU would be exluded?
I also realize that I totally forgot to update reindexdb.sgml.  Sorry
about that, I'll fix with the next versions.


Reply | Threaded
Open this post in threaded view
|

Re: Add parallelism and glibc dependent only options to reindexdb

Daniel Verite
        Julien Rouhaud wrote:

> > I think you'd be better off to define and document this as "reindex
> > only collation-sensitive indexes", without any particular reference
> > to a reason why somebody might want to do that.
>
> We should still document that indexes based on ICU would be exluded?

But why exclude them?
As a data point, in the last 5 years, the en_US collation in ICU
had 7 different versions (across 11 major versions of ICU):

ICU Unicode en_US

54.1 7.0 137.56
55.1 7.0 153.56
56.1 8.0 153.64
57.1 8.0 153.64
58.2 9.0 153.72
59.1 9.0 153.72
60.2 10.0 153.80
61.1 10.0 153.80
62.1 11.0 153.88
63.2 11.0 153.88
64.2 12.1 153.97

The rightmost column corresponds to pg_collation.collversion
in Postgres.
Each time there's a new Unicode version, it seems
all collation versions are bumped. And there's a new Unicode
version pretty much every year these days.
Based on this, most ICU upgrades in practice would require reindexing
affected indexes.


Best regards,
--
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite


Reply | Threaded
Open this post in threaded view
|

Re: Add parallelism and glibc dependent only options to reindexdb

Alvaro Herrera-9
On 2019-Jul-01, Daniel Verite wrote:

> But why exclude them?
> As a data point, in the last 5 years, the en_US collation in ICU
> had 7 different versions (across 11 major versions of ICU):

So we need a switch --include-rule=icu-collations?

(I mentioned "--exclude-rule=glibc" elsewhere in the thread, but I think
it should be --include-rule=glibc-collations instead, no?)

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Reply | Threaded
Open this post in threaded view
|

Re: Add parallelism and glibc dependent only options to reindexdb

Julien Rouhaud
In reply to this post by Daniel Verite
On Mon, Jul 1, 2019 at 10:13 PM Daniel Verite <[hidden email]> wrote:

>
> > > I think you'd be better off to define and document this as "reindex
> > > only collation-sensitive indexes", without any particular reference
> > > to a reason why somebody might want to do that.
> >
> > We should still document that indexes based on ICU would be exluded?
>
> But why exclude them?
> As a data point, in the last 5 years, the en_US collation in ICU
> had 7 different versions (across 11 major versions of ICU):
>
> ICU     Unicode en_US
>
> 54.1    7.0     137.56
> 55.1    7.0     153.56
> 56.1    8.0     153.64
> 57.1    8.0     153.64
> 58.2    9.0     153.72
> 59.1    9.0     153.72
> 60.2    10.0    153.80
> 61.1    10.0    153.80
> 62.1    11.0    153.88
> 63.2    11.0    153.88
> 64.2    12.1    153.97
>
> The rightmost column corresponds to pg_collation.collversion
> in Postgres.
> Each time there's a new Unicode version, it seems
> all collation versions are bumped. And there's a new Unicode
> version pretty much every year these days.
> Based on this, most ICU upgrades in practice would require reindexing
> affected indexes.

I have a vague recollection that ICU was providing some backward
compatibility so that even if you upgrade your lib you can still get
the sort order that was active when you built your indexes, though
maybe for a limited number of versions.

Even if that's just me being delusional, I'd still prefer Alvaro's
approach to have distinct switches for each collation system.


Reply | Threaded
Open this post in threaded view
|

Re: Add parallelism and glibc dependent only options to reindexdb

Peter Geoghegan-4
On Mon, Jul 1, 2019 at 1:34 PM Julien Rouhaud <[hidden email]> wrote:
> I have a vague recollection that ICU was providing some backward
> compatibility so that even if you upgrade your lib you can still get
> the sort order that was active when you built your indexes, though
> maybe for a limited number of versions.

That isn't built in. Another database system that uses ICU handles
this by linking to multiple versions of ICU, each with its own UCA
version and associated collations. I don't think that we want to go
there, so it makes sense to make an upgrade that crosses ICU or glibc
versions as painless as possible.

Note that ICU does at least provide a standard way to use multiple
versions at once; the symbol names have the ICU version baked in.
You're actually calling the functions using the versioned symbol names
without realizing it, because there is macro trickery involved.

--
Peter Geoghegan


Reply | Threaded
Open this post in threaded view
|

Re: Add parallelism and glibc dependent only options to reindexdb

Thomas Munro-5
In reply to this post by Julien Rouhaud
On Tue, Jul 2, 2019 at 8:34 AM Julien Rouhaud <[hidden email]> wrote:
> Even if that's just me being delusional, I'd still prefer Alvaro's
> approach to have distinct switches for each collation system.

Hi Julien,

Makes sense.  But why use the name "glibc" in the code and user
interface?  The name of the collation provider in PostgreSQL is "libc"
(for example in the CREATE COLLATION command), and the problem applies
no matter who makes your libc.

--
Thomas Munro
https://enterprisedb.com


Reply | Threaded
Open this post in threaded view
|

Re: Add parallelism and glibc dependent only options to reindexdb

Alvaro Herrera-9
On 2019-Jul-02, Thomas Munro wrote:

> On Tue, Jul 2, 2019 at 8:34 AM Julien Rouhaud <[hidden email]> wrote:
> > Even if that's just me being delusional, I'd still prefer Alvaro's
> > approach to have distinct switches for each collation system.
>
> Hi Julien,
>
> Makes sense.  But why use the name "glibc" in the code and user
> interface?  The name of the collation provider in PostgreSQL is "libc"
> (for example in the CREATE COLLATION command), and the problem applies
> no matter who makes your libc.

Makes sense.  "If your libc is glibc and you go across an upgrade over
version X, please use --include-rule=libc-collation"

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Reply | Threaded
Open this post in threaded view
|

Re: Add parallelism and glibc dependent only options to reindexdb

Michael Paquier-2
In reply to this post by Julien Rouhaud
On Mon, Jul 01, 2019 at 06:28:13PM +0200, Julien Rouhaud wrote:
> On Mon, Jul 1, 2019 at 4:10 PM Tom Lane <[hidden email]> wrote:
>> Couldn't we make this enormously simpler and less bug-prone by just
>> dictating that --jobs applies only to reindex-table operations?

I had the same argument about the first patch sets actually, but... :)

> That would also mean that we'll have to fallback on doing reindex at
> table-level, even if we only want to reindex indexes that depends on
> glibc.  I'm afraid that this will often add a huge penalty.

Yes, I would expect that most of the time glibc-sensible indexes are
also mixed with other ones which we don't care about here.  One
advantage of the argument from Tom though is that it is possible to
introduce --jobs with minimal steps:
1) Refactor the code for connection slots, without the cell addition
2) Introduce --jobs without INDEX support.

In short, the conflict business between indexes is something which
could be tackled afterwards and with a separate patch.  Parallel
indexes at table-level has value in itself, particularly with
CONCURRENTLY coming in the picture.
--
Michael

signature.asc (849 bytes) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Add parallelism and glibc dependent only options to reindexdb

Michael Paquier-2
In reply to this post by Julien Rouhaud
On Mon, Jul 01, 2019 at 06:14:20PM +0200, Julien Rouhaud wrote:
> On Mon, Jul 1, 2019 at 3:51 PM Alvaro Herrera <[hidden email]> wrote:
> >
> > Please don't reuse a file name as generic as "parallel.c" -- it's
> > annoying when navigating source.  Maybe conn_parallel.c multiconn.c
> > connscripts.c admconnection.c ...?
>
> I could use scripts_parallel.[ch] as I've already used it in the
> #define part?

multiconn.c sounds rather good, but I have a poor ear for any kind of
naming..

>> If your server crashes or is stopped midway during the reindex, you
>> would have to start again from scratch, and it's tedious (if it's
>> possible at all) to determine which indexes were missed.  I think it
>> would be useful to have a two-phase mode: in the initial phase reindexdb
>> computes the list of indexes to be reindexed and saves them into a work
>> table somewhere.  In the second phase, it reads indexes from that table
>> and processes them, marking them as done in the work table.  If the
>> second phase crashes or is stopped, it can be restarted and consults the
>> work table.  I would keep the work table, as it provides a bit of an
>> audit trail.  It may be important to be able to run even if unable to
>> create such a work table (because of the <ironic>numerous</> users that
>> DROP DATABASE postgres).
>
> Or we could create a table locally in each database, that would fix
> this problem and probably make the code simpler?
>
> It also raises some additional concerns about data expiration.  I
> guess that someone could launch the tool by mistake, kill reindexdb,
> and run it again 2 months later while a lot of new objects have been
> added for instance.
This looks like fancy additions, still that's not the core of the
problem, no?  If you begin to play in this area you would need more
control options, basically a "continue" mode to be able to restart a
previously failed attempt, and a "reinit" mode able to restart the
operation completely from scratch, and perhaps even a "reset" mode
which cleans up any data already present.  Not really a complexity,
but this has to be maintained a database level.

>>  The --glibc-dependent
>> switch seems too ad-hoc.  Maybe "--exclude-rule=glibc"?  That way we can
>> add other rules later.  (Not "--exclude=foo" because we'll want to add
>> the possibility to ignore specific indexes by name.)
>
> That's a good point, I like the --exclude-rule switch.

Sounds kind of nice.
--
Michael

signature.asc (849 bytes) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Add parallelism and glibc dependent only options to reindexdb

Peter Eisentraut-6
In reply to this post by Alvaro Herrera-9
On 2019-07-01 22:46, Alvaro Herrera wrote:

> On 2019-Jul-02, Thomas Munro wrote:
>> On Tue, Jul 2, 2019 at 8:34 AM Julien Rouhaud <[hidden email]> wrote:
>>> Even if that's just me being delusional, I'd still prefer Alvaro's
>>> approach to have distinct switches for each collation system.
>>
>> Makes sense.  But why use the name "glibc" in the code and user
>> interface?  The name of the collation provider in PostgreSQL is "libc"
>> (for example in the CREATE COLLATION command), and the problem applies
>> no matter who makes your libc.
>
> Makes sense.  "If your libc is glibc and you go across an upgrade over
> version X, please use --include-rule=libc-collation"

I think it might be better to put the logic of what indexes are
collation affected etc. into the backend REINDEX command.  We are likely
to enhance the collation version and dependency tracking over time,
possibly soon, possibly multiple times, and it would be very cumbersome
to have to keep updating reindexdb with this.  Moreover, since for
performance you likely want to reindex by table, implementing a logic of
"reindex all collation-affected indexes on this table" would be much
easier to do in the backend.

--
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Reply | Threaded
Open this post in threaded view
|

Re: Add parallelism and glibc dependent only options to reindexdb

Julien Rouhaud
In reply to this post by Peter Geoghegan-4
On Mon, Jul 1, 2019 at 11:21 PM Peter Geoghegan <[hidden email]> wrote:

>
> On Mon, Jul 1, 2019 at 1:34 PM Julien Rouhaud <[hidden email]> wrote:
> > I have a vague recollection that ICU was providing some backward
> > compatibility so that even if you upgrade your lib you can still get
> > the sort order that was active when you built your indexes, though
> > maybe for a limited number of versions.
>
> That isn't built in. Another database system that uses ICU handles
> this by linking to multiple versions of ICU, each with its own UCA
> version and associated collations. I don't think that we want to go
> there, so it makes sense to make an upgrade that crosses ICU or glibc
> versions as painless as possible.
>
> Note that ICU does at least provide a standard way to use multiple
> versions at once; the symbol names have the ICU version baked in.
> You're actually calling the functions using the versioned symbol names
> without realizing it, because there is macro trickery involved.

Ah, thanks for the clarification!


Reply | Threaded
Open this post in threaded view
|

Re: Add parallelism and glibc dependent only options to reindexdb

Tomas Vondra-4
In reply to this post by Julien Rouhaud
On Sun, Jun 30, 2019 at 11:45:47AM +0200, Julien Rouhaud wrote:

>Hi,
>
>With the glibc 2.28 coming, all users will have to reindex almost
>every indexes after a glibc upgrade to guarantee the lack of
>corruption.  Unfortunately, reindexdb is not ideal for that as it's
>processing everything using a single connexion and isn't able to
>discard indexes that doesn't depend on a glibc collation.
>
>PFA a patchset to add parallelism to reindexdb (reusing the
>infrastructure in vacuumdb with some additions) and an option to
>discard indexes that doesn't depend on glibc (without any specific
>collation filtering or glibc version detection), with updated
>regression tests.  Note that this should be applied on top of the
>existing reindexdb cleanup & refactoring patch
>(https://commitfest.postgresql.org/23/2115/).
>
>This was sponsored by VMware, and has been discussed internally with
>Kevin and Michael, in Cc.

I wonder why this is necessary:

pg_log_error("cannot reindex glibc dependent objects and a subset of objects");

What's the reasoning behind that? It seems like a valid use case to me -
imagine you have a bug database, but only a couple of tables are used by
the application regularly (the rest may be archive tables, for example).
Why not to allow rebuilding glibc-dependent indexes on the used tables, so
that the database can be opened for users sooner.

BTW now that we allow rebuilding only some of the indexes, it'd be great
to have a dry-run mode, were we just print which indexes will be rebuilt
without actually rebuilding them.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Reply | Threaded
Open this post in threaded view
|

Re: Add parallelism and glibc dependent only options to reindexdb

Julien Rouhaud
In reply to this post by Peter Eisentraut-6
On Tue, Jul 2, 2019 at 9:19 AM Peter Eisentraut
<[hidden email]> wrote:

>
> On 2019-07-01 22:46, Alvaro Herrera wrote:
> > On 2019-Jul-02, Thomas Munro wrote:
> >> On Tue, Jul 2, 2019 at 8:34 AM Julien Rouhaud <[hidden email]> wrote:
> >>> Even if that's just me being delusional, I'd still prefer Alvaro's
> >>> approach to have distinct switches for each collation system.
> >>
> >> Makes sense.  But why use the name "glibc" in the code and user
> >> interface?  The name of the collation provider in PostgreSQL is "libc"
> >> (for example in the CREATE COLLATION command), and the problem applies
> >> no matter who makes your libc.
> >
> > Makes sense.  "If your libc is glibc and you go across an upgrade over
> > version X, please use --include-rule=libc-collation"
>
> I think it might be better to put the logic of what indexes are
> collation affected etc. into the backend REINDEX command.  We are likely
> to enhance the collation version and dependency tracking over time,
> possibly soon, possibly multiple times, and it would be very cumbersome
> to have to keep updating reindexdb with this.  Moreover, since for
> performance you likely want to reindex by table, implementing a logic of
> "reindex all collation-affected indexes on this table" would be much
> easier to do in the backend.

That's a great idea, and would make the parallelism in reindexdb much
simpler.  There's however a downside, as users won't have a way to
benefit from index filtering until they upgrade to this version.  OTOH
glibc 2.28 is already there, and a hypothetical fancy reindexdb is far
from being released.


Reply | Threaded
Open this post in threaded view
|

Re: Add parallelism and glibc dependent only options to reindexdb

Julien Rouhaud
In reply to this post by Tomas Vondra-4
On Tue, Jul 2, 2019 at 10:28 AM Tomas Vondra
<[hidden email]> wrote:

>
> I wonder why this is necessary:
>
> pg_log_error("cannot reindex glibc dependent objects and a subset of objects");
>
> What's the reasoning behind that? It seems like a valid use case to me -
> imagine you have a bug database, but only a couple of tables are used by
> the application regularly (the rest may be archive tables, for example).
> Why not to allow rebuilding glibc-dependent indexes on the used tables, so
> that the database can be opened for users sooner.

It just seemed wrong to me to allow a partial processing for something
that's aimed to prevent corruption.  I'd think that if users are
knowledgeable enough to only reindex a subset of indexes/tables in
such cases, they can also discard indexes that don't get affected by a
collation lib upgrade.  I'm not strongly opposed to supporting if
though, as there indeed can be valid use cases.

> BTW now that we allow rebuilding only some of the indexes, it'd be great
> to have a dry-run mode, were we just print which indexes will be rebuilt
> without actually rebuilding them.

+1.  If we end up doing the filter in the backend, we'd have to add
such option in the REINDEX command, and actually issue all the orders
to retrieve the list.


123