Hello,
Now that we have the infrastructure to track indexes that might be corrupted due to changes in collation libraries, I think it would be a good idea to offer an easy way for users to reindex all indexes that might be corrupted. I'm attaching a POC patch as a discussion basis. It implements a new "COLLATION" option to reindex, with "not_current" being the only accepted value. Note that I didn't spent too much efforts on the grammar part yet. So for instance you can do: REINDEX (COLLATION 'not_current') DATABASE mydb; The filter is also implemented so that you could cumulate multiple filters, so it could be easy to add more filtering, for instance: REINDEX (COLLATION 'libc', COLLATION 'not_current') DATABASE mydb; to only rebuild indexes depending on outdated libc collations, or REINDEX (COLLATION 'libc', VERSION 'X.Y') DATABASE mydb; to only rebuild indexes depending on a specific version of libc. |
On Thu, Dec 03, 2020 at 05:31:43PM +0800, Julien Rouhaud wrote:
> Now that we have the infrastructure to track indexes that might be corrupted > due to changes in collation libraries, I think it would be a good idea to offer > an easy way for users to reindex all indexes that might be corrupted. Yes. It would be a good thing. > The filter is also implemented so that you could cumulate multiple filters, so > it could be easy to add more filtering, for instance: > > REINDEX (COLLATION 'libc', COLLATION 'not_current') DATABASE mydb; > > to only rebuild indexes depending on outdated libc collations, or > > REINDEX (COLLATION 'libc', VERSION 'X.Y') DATABASE mydb; > > to only rebuild indexes depending on a specific version of libc. to satisfy. From what I heard on this topic, the goal is to reduce the amount of time necessary to reindex a system so as REINDEX only works on indexes whose dependent collation versions are not known or works on indexes in need of a collation refresh (like a reindexdb --all --collation -j $jobs). What would be the benefit in having more complexity with library-dependent settings while we could take care of the use cases that matter the most with a simple grammar? Perhaps "not_current" is not the best match as a keyword, we could just use "collation" and handle that as a boolean. As long as we don't need new operators in the grammar rules.. -- Michael |
On Mon, Dec 14, 2020 at 3:45 PM Michael Paquier <[hidden email]> wrote:
> > On Thu, Dec 03, 2020 at 05:31:43PM +0800, Julien Rouhaud wrote: > > Now that we have the infrastructure to track indexes that might be corrupted > > due to changes in collation libraries, I think it would be a good idea to offer > > an easy way for users to reindex all indexes that might be corrupted. > > Yes. It would be a good thing. > > > The filter is also implemented so that you could cumulate multiple filters, so > > it could be easy to add more filtering, for instance: > > > > REINDEX (COLLATION 'libc', COLLATION 'not_current') DATABASE mydb; > > > > to only rebuild indexes depending on outdated libc collations, or > > > > REINDEX (COLLATION 'libc', VERSION 'X.Y') DATABASE mydb; > > > > to only rebuild indexes depending on a specific version of libc. > > Deciding on the grammar to use depends on the use cases we would like > to satisfy. From what I heard on this topic, the goal is to reduce > the amount of time necessary to reindex a system so as REINDEX only > works on indexes whose dependent collation versions are not known or > works on indexes in need of a collation refresh (like a reindexdb > --all --collation -j $jobs). What would be the benefit in having more > complexity with library-dependent settings while we could take care > of the use cases that matter the most with a simple grammar? Perhaps > "not_current" is not the best match as a keyword, we could just use > "collation" and handle that as a boolean. As long as we don't need > new operators in the grammar rules.. I'm not sure what the DBA usual pattern here. If the reindexing runtime is really critical, I'm assuming that at least some people will dig into library details to see what are the collations that actually broke in the last upgrade and will want to reindex only those, and force the version for the rest of the indexes. And obviously, they probably won't wait to have multiple collation versions dependencies before taking care of that. In that case the filters that would matters would be one to only keep indexes with an outdated collation version, and an additional one for a specific collation name. Or we could have the COLLATION keyword without additional argument mean all outdated collations, and COLLATION 'collation_name' to specify a specific one. This is maybe a bit ugly, and would probably require a different approach for reindexdb. |
On Tue, Dec 15, 2020 at 12:22 PM Julien Rouhaud <[hidden email]> wrote:
> > On Mon, Dec 14, 2020 at 3:45 PM Michael Paquier <[hidden email]> wrote: > > > > On Thu, Dec 03, 2020 at 05:31:43PM +0800, Julien Rouhaud wrote: > > > Now that we have the infrastructure to track indexes that might be corrupted > > > due to changes in collation libraries, I think it would be a good idea to offer > > > an easy way for users to reindex all indexes that might be corrupted. > > > > Yes. It would be a good thing. > > > > > The filter is also implemented so that you could cumulate multiple filters, so > > > it could be easy to add more filtering, for instance: > > > > > > REINDEX (COLLATION 'libc', COLLATION 'not_current') DATABASE mydb; > > > > > > to only rebuild indexes depending on outdated libc collations, or > > > > > > REINDEX (COLLATION 'libc', VERSION 'X.Y') DATABASE mydb; > > > > > > to only rebuild indexes depending on a specific version of libc. > > > > Deciding on the grammar to use depends on the use cases we would like > > to satisfy. From what I heard on this topic, the goal is to reduce > > the amount of time necessary to reindex a system so as REINDEX only > > works on indexes whose dependent collation versions are not known or > > works on indexes in need of a collation refresh (like a reindexdb > > --all --collation -j $jobs). What would be the benefit in having more > > complexity with library-dependent settings while we could take care > > of the use cases that matter the most with a simple grammar? Perhaps > > "not_current" is not the best match as a keyword, we could just use > > "collation" and handle that as a boolean. As long as we don't need > > new operators in the grammar rules.. > > I'm not sure what the DBA usual pattern here. If the reindexing > runtime is really critical, I'm assuming that at least some people > will dig into library details to see what are the collations that > actually broke in the last upgrade and will want to reindex only > those, and force the version for the rest of the indexes. And > obviously, they probably won't wait to have multiple collation > versions dependencies before taking care of that. In that case the > filters that would matters would be one to only keep indexes with an > outdated collation version, and an additional one for a specific > collation name. Or we could have the COLLATION keyword without > additional argument mean all outdated collations, and COLLATION > 'collation_name' to specify a specific one. This is maybe a bit ugly, > and would probably require a different approach for reindexdb. Is this really a common enough operation that we need it i the main grammar? Having the functionality, definitely, but what if it was "just" a function instead? So you'd do something like: SELECT 'reindex index ' || i FROM pg_blah(some, arguments, here) \gexec Or even a function that returns the REINDEX commands directly (taking a parameter to turn on/off concurrency for example). That also seems like it would be easier to make flexible, and just as easy to plug into reindexdb? -- Magnus Hagander Me: https://www.hagander.net/ Work: https://www.redpill-linpro.com/ |
On Tue, Dec 15, 2020 at 06:34:16PM +0100, Magnus Hagander wrote:
> Is this really a common enough operation that we need it in the main grammar? > > Having the functionality, definitely, but what if it was "just" a > function instead? So you'd do something like: > SELECT 'reindex index ' || i FROM pg_blah(some, arguments, here) > \gexec > > Or even a function that returns the REINDEX commands directly (taking > a parameter to turn on/off concurrency for example). > > That also seems like it would be easier to make flexible, and just as > easy to plug into reindexdb? table is very useful when it comes to parallel reindexing, because it is no-brainer in terms of knowing which index to distribute to one job or another. In short, with this grammar you can just issue a set of REINDEX TABLE commands that we know will not conflict with each other. You cannot get that level of control with REINDEX INDEX as it may be possible that more or more commands conflict if they work on an index of the same relation because it is required to take lock also on the parent table. Of course, we could decide to implement a redistribution logic in all frontend tools that need such things, like reindexdb, but that's not something I think we should let the client decide of. A backend-side filtering is IMO much simpler, less code, and more elegant. -- Michael |
On Wed, Dec 16, 2020 at 8:27 AM Michael Paquier <[hidden email]> wrote:
> > On Tue, Dec 15, 2020 at 06:34:16PM +0100, Magnus Hagander wrote: > > Is this really a common enough operation that we need it in the main grammar? > > > > Having the functionality, definitely, but what if it was "just" a > > function instead? So you'd do something like: > > SELECT 'reindex index ' || i FROM pg_blah(some, arguments, here) > > \gexec > > > > Or even a function that returns the REINDEX commands directly (taking > > a parameter to turn on/off concurrency for example). > > > > That also seems like it would be easier to make flexible, and just as > > easy to plug into reindexdb? > > Having control in the grammar to choose which index to reindex for a > table is very useful when it comes to parallel reindexing, because > it is no-brainer in terms of knowing which index to distribute to one > job or another. In short, with this grammar you can just issue a set > of REINDEX TABLE commands that we know will not conflict with each > other. You cannot get that level of control with REINDEX INDEX as it > may be possible that more or more commands conflict if they work on an > index of the same relation because it is required to take lock also on > the parent table. Of course, we could decide to implement a > redistribution logic in all frontend tools that need such things, like > reindexdb, but that's not something I think we should let the client > decide of. A backend-side filtering is IMO much simpler, less code, > and more elegant. required frequently, but I'm pretty sure that reindexing only indexes that might be corrupt is something that will be required often.. So I agree, having all that logic in the backend makes everything easier for users, having the choice of the tools they want to issue the query while still having all features available. There was a conflict with a3dc926009be8 (Refactor option handling of CLUSTER, REINDEX and VACUUM), so rebased version attached. No other changes included yet. |
On Thu, Jan 21, 2021 at 11:12:56AM +0800, Julien Rouhaud wrote:
> > There was a conflict with a3dc926009be8 (Refactor option handling of > CLUSTER, REINDEX and VACUUM), so rebased version attached. No other > changes included yet. New conflict, v3 attached. |
Hi, For index_has_deprecated_collation(),+ object.objectSubId = 0; The objectSubId field is not accessed by do_check_index_has_deprecated_collation(). Does it need to be assigned ? For RelationGetIndexListFiltered(), it seems when (options & REINDEXOPT_COLL_NOT_CURRENT) == 0, the full_list would be returned. This can be checked prior to entering the foreach loop. Cheers On Sat, Feb 6, 2021 at 11:20 PM Julien Rouhaud <[hidden email]> wrote: On Thu, Jan 21, 2021 at 11:12:56AM +0800, Julien Rouhaud wrote: |
Hi,
Thanks for the review! On Mon, Feb 8, 2021 at 12:14 AM Zhihong Yu <[hidden email]> wrote: > > Hi, > For index_has_deprecated_collation(), > > + object.objectSubId = 0; > > The objectSubId field is not accessed by do_check_index_has_deprecated_collation(). Does it need to be assigned ? Indeed it's not strictly necessary I think, but it makes things cleaner and future proof, and that's how things are already done nearby. So I think it's better to keep it this way. > For RelationGetIndexListFiltered(), it seems when (options & REINDEXOPT_COLL_NOT_CURRENT) == 0, the full_list would be returned. > This can be checked prior to entering the foreach loop. That's already the case with this test: /* Fast exit if no filtering was asked, or if the list if empty. */ if (!reindexHasFilter(options) || full_list == NIL) return full_list; knowing that #define reindexHasFilter(x) ((x & REINDEXOPT_COLL_NOT_CURRENT) != 0) The code as-is written to be extensible with possibly other filters (e.g. specific library or specific version). Feedback so far seems to indicate that it may be overkill and only filtering indexes with deprecated collation is enough. I'll simplify this code in a future version, getting rid of reindexHasFilter, unless someone thinks more filter is a good idea. For now I'm attaching a rebased version, there was a conflict with the recent patch to add the missing_ok parameter to get_collation_version_for_oid() |
This post was updated on .
In reply to this post by Magnus Hagander-2
From what I heard on this topic, the goal is to reduce
the amount of time necessary to reindex a system so as REINDEX only works on indexes whose dependent collation versions are not known or works on indexes in need of a collation refresh (like a reindexdb --all --collation -j $jobs). Thanks CMO - UMC toronto (on the web) -- Sent from: https://www.postgresql-archive.org/PostgreSQL-hackers-f1928748.html |
Hi,
On Thu, Feb 25, 2021 at 12:11 AM mariakatosvich <[hidden email]> wrote: > > From what I heard on this topic, the goal is to reduce > the amount of time necessary to reindex a system so as REINDEX only > works on indexes whose dependent collation versions are not known or > works on indexes in need of a collation refresh (like a reindexdb > --all --collation -j $jobs). That's indeed the goal. The current patch only adds infrastructure for the REINDEX command, which will make easy to add the option for reindexdb. I'll add the reindexdb part too in the next version of the patch. |
In reply to this post by Julien Rouhaud
Hello, The PostGIS project needed this from time to time. Would be great if reindex by opclass can be made possible. We changed the semantics of btree at least twice (in 2.4 and 3.0), fixed some ND mixed-dimension indexes semantics in 3.0, fixed hash index on 32 bit arch in 3.0. On Thu, Dec 3, 2020 at 12:32 PM Julien Rouhaud <[hidden email]> wrote: Hello, |
In reply to this post by Julien Rouhaud
On Thu, Feb 25, 2021 at 1:22 AM Julien Rouhaud <[hidden email]> wrote:
> #define reindexHasFilter(x) ((x & REINDEXOPT_COLL_NOT_CURRENT) != 0) It's better to use "(x) & ..." in macros to avoid weird operator precedence problems in future code. It seems like there are several different names for similar things in this patch: "outdated", "not current", "deprecated". Can we settle on one, maybe "outdated"? > The code as-is written to be extensible with possibly other filters > (e.g. specific library or specific version). Feedback so far seems to > indicate that it may be overkill and only filtering indexes with > deprecated collation is enough. I'll simplify this code in a future > version, getting rid of reindexHasFilter, unless someone thinks more > filter is a good idea. Hmm, yeah, I think it should probably be very general. Suppose someone invents versioned dependencies for (say) functions or full text stemmers etc, then do we want to add more syntax here to rebuild indexes (assuming we don't use INVALID for such cases, IDK)? I don't think we'd want to have more syntax just to be able to say "hey, please fix my collation problems but not my stemmer problems". What about just REINDEX (OUTDATED)? It's hard to find a single word that means "depends on an object whose version has changed". |
In reply to this post by Darafei "Komяpa" Praliaskouski
Hi,
On Wed, Feb 24, 2021 at 09:34:59PM +0300, Darafei "Komяpa" Praliaskouski wrote: > Hello, > > The PostGIS project needed this from time to time. Would be great if > reindex by opclass can be made possible. > > We changed the semantics of btree at least twice (in 2.4 and 3.0), fixed > some ND mixed-dimension indexes semantics in 3.0, fixed hash index on 32 > bit arch in 3.0. Oh, I wasn't aware of that. Thanks for bringing this up! Looking at the last occurence (c00f9525a3c6c) I see that the NEWS item does mention the need to do a REINDEX. As far as I understand there wouldn't be any hard error if one doesn't do to a REINDEX, and you'd end up with some kind of "logical" corruption as the new lib version won't have the same semantics depending on the number of dimensions, so more or less the same kind of problems that would happen in case of breaking update of a collation library. It seems to me that it's a legitimate use case, especially since GiST doesn't have a metapage to store an index version. But I'm wondering if the right answer is to allow REINDEX / reindexdb to look for specific opclass or to add an API to let third party code register a custom dependency. In this case it would be some kind of gist ABI versioning. We could then have a single REINDEX option, like REINDEX (OUTDATED) as Thomas suggested in https://www.postgresql.org/message-id/CA+hUKG+WWioP6xV5Xf1pPhiWNGD1B7hdBBCdQoKfp=zymJajBQ@... for both cases. |
In reply to this post by Thomas Munro-5
On Thu, Feb 25, 2021 at 07:36:02AM +1300, Thomas Munro wrote:
> On Thu, Feb 25, 2021 at 1:22 AM Julien Rouhaud <[hidden email]> wrote: > > #define reindexHasFilter(x) ((x & REINDEXOPT_COLL_NOT_CURRENT) != 0) > > It's better to use "(x) & ..." in macros to avoid weird operator > precedence problems in future code. Ah indeed, thanks! I usually always protect the arguments wth parenthesis but I somehow missed this one. I'll send a new version of the patch shortly with the rest of the problems you mentioned. > It seems like there are several different names for similar things in > this patch: "outdated", "not current", "deprecated". Can we settle on > one, maybe "outdated"? Oops, I apparently missed a lot of places during the various rewrite of the patch. +1 for oudated. > > > The code as-is written to be extensible with possibly other filters > > (e.g. specific library or specific version). Feedback so far seems to > > indicate that it may be overkill and only filtering indexes with > > deprecated collation is enough. I'll simplify this code in a future > > version, getting rid of reindexHasFilter, unless someone thinks more > > filter is a good idea. > > Hmm, yeah, I think it should probably be very general. Suppose someone > invents versioned dependencies for (say) functions or full text > stemmers etc, then do we want to add more syntax here to rebuild > indexes (assuming we don't use INVALID for such cases, IDK)? I don't > think we'd want to have more syntax just to be able to say "hey, > please fix my collation problems but not my stemmer problems". What > about just REINDEX (OUTDATED)? It's hard to find a single word that > means "depends on an object whose version has changed". That quite make sense. I agree that it would make the solution simpler and better. Looking at the other use case for PostGIS mentioned by Darafei, it seems that it would help to make concept of index dependency extensible for third party code too (see https://www.postgresql.org/message-id/20210226074531.dhkfneao2czzqk6n%40nol). Would that make sense? |
In reply to this post by Michael Paquier-2
On Wed, Dec 16, 2020 at 1:27 AM Michael Paquier <[hidden email]> wrote:
> > On Tue, Dec 15, 2020 at 06:34:16PM +0100, Magnus Hagander wrote: > > Is this really a common enough operation that we need it in the main grammar? > > > > Having the functionality, definitely, but what if it was "just" a > > function instead? So you'd do something like: > > SELECT 'reindex index ' || i FROM pg_blah(some, arguments, here) > > \gexec > > > > Or even a function that returns the REINDEX commands directly (taking > > a parameter to turn on/off concurrency for example). > > > > That also seems like it would be easier to make flexible, and just as > > easy to plug into reindexdb? > > Having control in the grammar to choose which index to reindex for a > table is very useful when it comes to parallel reindexing, because > it is no-brainer in terms of knowing which index to distribute to one > job or another. In short, with this grammar you can just issue a set > of REINDEX TABLE commands that we know will not conflict with each > other. You cannot get that level of control with REINDEX INDEX as it (oops, seems I forgot to reply to this one, sorry!) Uh, isn't it almost exactly the opposite? If you do it in the backend grammar you *cannot* parallelize it between indexes, because you can only run one at a time. Whereas if you do it in the frontend, you can. Either with something like reindexdb that could do it automatically, or with psql as a copy/paste job? > may be possible that more or more commands conflict if they work on an > index of the same relation because it is required to take lock also on > the parent table. Of course, we could decide to implement a > redistribution logic in all frontend tools that need such things, like > reindexdb, but that's not something I think we should let the client > decide of. A backend-side filtering is IMO much simpler, less code, > and more elegant. It's simpler in the simple case, i agree with that. But ISTM it's also a lot less flexible for anything except the simplest case... -- Magnus Hagander Me: https://www.hagander.net/ Work: https://www.redpill-linpro.com/ |
In reply to this post by Julien Rouhaud
On Thu, Jan 21, 2021 at 4:13 AM Julien Rouhaud <[hidden email]> wrote:
> > On Wed, Dec 16, 2020 at 8:27 AM Michael Paquier <[hidden email]> wrote: > > > > On Tue, Dec 15, 2020 at 06:34:16PM +0100, Magnus Hagander wrote: > > > Is this really a common enough operation that we need it in the main grammar? > > > > > > Having the functionality, definitely, but what if it was "just" a > > > function instead? So you'd do something like: > > > SELECT 'reindex index ' || i FROM pg_blah(some, arguments, here) > > > \gexec > > > > > > Or even a function that returns the REINDEX commands directly (taking > > > a parameter to turn on/off concurrency for example). > > > > > > That also seems like it would be easier to make flexible, and just as > > > easy to plug into reindexdb? > > > > Having control in the grammar to choose which index to reindex for a > > table is very useful when it comes to parallel reindexing, because > > it is no-brainer in terms of knowing which index to distribute to one > > job or another. In short, with this grammar you can just issue a set > > of REINDEX TABLE commands that we know will not conflict with each > > other. You cannot get that level of control with REINDEX INDEX as it > > may be possible that more or more commands conflict if they work on an > > index of the same relation because it is required to take lock also on > > the parent table. Of course, we could decide to implement a > > redistribution logic in all frontend tools that need such things, like > > reindexdb, but that's not something I think we should let the client > > decide of. A backend-side filtering is IMO much simpler, less code, > > and more elegant. > > Maybe additional filtering capabilities is not something that will be > required frequently, but I'm pretty sure that reindexing only indexes > that might be corrupt is something that will be required often.. So I > agree, having all that logic in the backend makes everything easier > for users, having the choice of the tools they want to issue the query > while still having all features available. I agree with that scenario -- in that the most common case will be exactly that of reindexing only indexes that might be corrupt. I don't agree with the conclusion though. The most common case of that will be in the case of an upgrade. In that case I want to reindex all of those indexes as quickly as possible. So I'll want to parallelize it across multiple sessions (like reindexdb -j 4 or whatever). But doesn't putting the filter in the grammar prevent me from doing exactly that? Each of those 4 (or whatever) sessions would have to guess which would go where and then speculatively run the command on that, instead of being able to directly distribute the worload? -- Magnus Hagander Me: https://www.hagander.net/ Work: https://www.redpill-linpro.com/ |
On Fri, Feb 26, 2021 at 5:50 PM Magnus Hagander <[hidden email]> wrote:
> > I don't agree with the conclusion though. > > The most common case of that will be in the case of an upgrade. In > that case I want to reindex all of those indexes as quickly as > possible. So I'll want to parallelize it across multiple sessions > (like reindexdb -j 4 or whatever). But doesn't putting the filter in > the grammar prevent me from doing exactly that? Each of those 4 (or > whatever) sessions would have to guess which would go where and then > speculatively run the command on that, instead of being able to > directly distribute the worload? It means that you'll have to distribute the work on a per-table basis rather than a per-index basis. The time spent to find out that a table doesn't have any impacted index should be negligible compared to the cost of running a reindex. This obviously won't help that much if you have a lot of table but only one being gigantic. But even if we put the logic in the client, this still won't help as reindexdb doesn't support multiple job with an index list: * Index-level REINDEX is not supported with multiple jobs as we * cannot control the concurrent processing of multiple indexes * depending on the same relation. */ if (concurrentCons > 1 && indexes.head != NULL) { pg_log_error("cannot use multiple jobs to reindex indexes"); exit(1); } |
On Fri, Feb 26, 2021 at 11:07 AM Julien Rouhaud <[hidden email]> wrote:
> > On Fri, Feb 26, 2021 at 5:50 PM Magnus Hagander <[hidden email]> wrote: > > > > I don't agree with the conclusion though. > > > > The most common case of that will be in the case of an upgrade. In > > that case I want to reindex all of those indexes as quickly as > > possible. So I'll want to parallelize it across multiple sessions > > (like reindexdb -j 4 or whatever). But doesn't putting the filter in > > the grammar prevent me from doing exactly that? Each of those 4 (or > > whatever) sessions would have to guess which would go where and then > > speculatively run the command on that, instead of being able to > > directly distribute the worload? > > It means that you'll have to distribute the work on a per-table basis > rather than a per-index basis. The time spent to find out that a > table doesn't have any impacted index should be negligible compared to > the cost of running a reindex. This obviously won't help that much if > you have a lot of table but only one being gigantic. Yeah -- or at least a couple of large and many small, which I find to be a very common scenario. Or the case of some tables having many affected indexes and some having few. You'd basically want to order the operation by table on something like "total size of the affected indexes on table x" -- which may very well put a smaller table with many indexes earlier in the queue. But you can't do that without having access to the filter.... > But even if we put the logic in the client, this still won't help as > reindexdb doesn't support multiple job with an index list: > > * Index-level REINDEX is not supported with multiple jobs as we > * cannot control the concurrent processing of multiple indexes > * depending on the same relation. > */ > if (concurrentCons > 1 && indexes.head != NULL) > { > pg_log_error("cannot use multiple jobs to reindex indexes"); > exit(1); > } That sounds like it would be a fixable problem though, in principle. It could/should probably still limit all indexes on the same table to be processed in the same connection for the locking reasons of course, but doing an order by the total size of the indexes like above, and ensuring that they are grouped that way, doesn't sound *that* hard. I doubt it's that important in the current usecase of manually listing the indexes, but it would be useful for something like this. -- Magnus Hagander Me: https://www.hagander.net/ Work: https://www.redpill-linpro.com/ |
On Fri, Feb 26, 2021 at 11:17:26AM +0100, Magnus Hagander wrote:
> On Fri, Feb 26, 2021 at 11:07 AM Julien Rouhaud <[hidden email]> wrote: > > > > It means that you'll have to distribute the work on a per-table basis > > rather than a per-index basis. The time spent to find out that a > > table doesn't have any impacted index should be negligible compared to > > the cost of running a reindex. This obviously won't help that much if > > you have a lot of table but only one being gigantic. > > Yeah -- or at least a couple of large and many small, which I find to > be a very common scenario. Or the case of some tables having many > affected indexes and some having few. > > You'd basically want to order the operation by table on something like > "total size of the affected indexes on table x" -- which may very well > put a smaller table with many indexes earlier in the queue. But you > can't do that without having access to the filter.... So, long running reindex due to some gigantic and/or numerous indexes on a single (or few) table is not something that we can solve, but inefficient reindex due to wrong table size / to-be-reindexed-indexes-size correlation can be addressed. I would still prefer to go to backend implementation, so that all client tools can benefit from it by default. We could simply export the current index_has_oudated_collation(oid) function in sql, and tweak pg_dump to order tables by the cumulated size of such indexes as you mentioned below, would that work for you? Also, given Thomas proposal in a nearby email this function would be renamed to index_has_oudated_dependencies(oid) or something like that. > > But even if we put the logic in the client, this still won't help as > > reindexdb doesn't support multiple job with an index list: > > > > * Index-level REINDEX is not supported with multiple jobs as we > > * cannot control the concurrent processing of multiple indexes > > * depending on the same relation. > > */ > > if (concurrentCons > 1 && indexes.head != NULL) > > { > > pg_log_error("cannot use multiple jobs to reindex indexes"); > > exit(1); > > } > > That sounds like it would be a fixable problem though, in principle. > It could/should probably still limit all indexes on the same table to > be processed in the same connection for the locking reasons of course, > but doing an order by the total size of the indexes like above, and > ensuring that they are grouped that way, doesn't sound *that* hard. I > doubt it's that important in the current usecase of manually listing > the indexes, but it would be useful for something like this. Yeah, I don't think that in case of oudated dependency the --index will be useful, it's likely that there will be too many indexes to process. We can still try to improve reindexdb to be able to process index lists with parallel connections, but I would rather keep that separated from this patch. |
Free forum by Nabble | Edit this page |