index bloat estimation

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
9 messages Options
Reply | Threaded
Open this post in threaded view
|

index bloat estimation

Victor Sudakov
Dear Colleagues,

What queries do you use to estimate index and table bloat?

I've researched some on the Net and found multiple scripts mentioned in
https://wiki.postgresql.org/wiki/Index_Maintenance#Index_Bloat, also
in https://github.com/pgexperts/pgx_scripts etc.

Most of the stuff I've looked at is pretty old, much seems unsupported.
What is the current best practice?

I'd be grateful if you could share your personal favourite ways of
estimating bloat.

--
Victor Sudakov,  VAS4-RIPE, VAS47-RIPN
2:5005/49@fidonet http://vas.tomsk.ru/


Reply | Threaded
Open this post in threaded view
|

Re: index bloat estimation

Guillaume Lelarge-3
Hi,

Le ven. 12 févr. 2021 à 09:26, Victor Sudakov <[hidden email]> a écrit :
Dear Colleagues,

What queries do you use to estimate index and table bloat?

I've researched some on the Net and found multiple scripts mentioned in
https://wiki.postgresql.org/wiki/Index_Maintenance#Index_Bloat, also
in https://github.com/pgexperts/pgx_scripts etc.

Most of the stuff I've looked at is pretty old, much seems unsupported.
What is the current best practice?

I'd be grateful if you could share your personal favourite ways of
estimating bloat.


I use https://github.com/ioguix/pgsql-bloat-estimation . It's pretty good at estimating bloat on tables and Btree indexes (though deduplication in v13 makes it harder).

check_pgactivitu uses the queries from this repository.


--
Guillaume.
Reply | Threaded
Open this post in threaded view
|

Re: index bloat estimation

Keith Fiske-2
In reply to this post by Victor Sudakov

On Fri, Feb 12, 2021 at 3:26 AM Victor Sudakov <[hidden email]> wrote:
Dear Colleagues,

What queries do you use to estimate index and table bloat?

I've researched some on the Net and found multiple scripts mentioned in
https://wiki.postgresql.org/wiki/Index_Maintenance#Index_Bloat, also
in https://github.com/pgexperts/pgx_scripts etc.

Most of the stuff I've looked at is pretty old, much seems unsupported.
What is the current best practice?

I'd be grateful if you could share your personal favourite ways of
estimating bloat.

--
Victor Sudakov,  VAS4-RIPE, VAS47-RIPN
2:5005/49@fidonet http://vas.tomsk.ru/




Why estimate when you can get the exact amount? At least for b-tree indexes anyway.


This script uses the pgstattuple extension to get both table and b-tree index bloat information. Since it's actually scanning the table, it can take longer than other queries that try and do estimates based on statistics. But it does give you very accurate information. You can also just use pgstattuple directly without this script, but you do have to run it individually on the table then each index. The script can scan the table and all its indexes in one step and give you a full summary.


--
Keith Fiske
Senior Database Engineer
Crunchy Data - http://crunchydata.com
Reply | Threaded
Open this post in threaded view
|

Re: index bloat estimation

Gregory Smith
In reply to this post by Victor Sudakov
On Fri, 12 Feb 2021, Victor Sudakov wrote:

> Most of the stuff I've looked at is pretty old, much seems unsupported.
> What is the current best practice?

None of the estimate queries ever really worked well.  They just gave a
bit more information than zero in the days before PG's internal functions
were really reliable and useful for bloat measurement, something that
happened in version 9.5 development.

For any modern PG, it's worth the trouble to learn how to directly use
pgstattuple https://www.postgresql.org/docs/current/pgstattuple.html to do
this job.  You run and interpret the output from pgstattuple(relation) and
its faster estimate version pgstattuple_approx.  It's not hard to run some
simulations with deleted rows to see what bloat looks like when it builds
up.

There's still some need for manual estimates if you want to account for
fillfactor in all cases, but I see that as a niche topic, not where people
should start at.

There are also wrapper scripts built on top of pgstattuple around, like
the already mentioned https://github.com/keithf4/pg_bloat_check

A good bit of the work done in that script is around handling multiple
versions of PG and building some long-term idea of bloat state on all
tables.  Workloads that have a bloat problem are sometimes fixed, but in a
lot of cases the best you can do is monitor them and rebuild things when
it gets bad.  That's one context Keith's packaging of this feature aims
at.

--
Greg Smith  [hidden email]
Director of Open Source Strategy
Crunchy Data https://www.crunchydata.com/
Reply | Threaded
Open this post in threaded view
|

Re: index bloat estimation

Victor Sudakov
In reply to this post by Keith Fiske-2
Keith Fiske wrote:
> >
> > What queries do you use to estimate index and table bloat?

[dd]

> Why estimate when you can get the exact amount? At least for b-tree indexes
> anyway.
>
> https://github.com/keithf4/pg_bloat_check

Hello Keith,

Thanks for this script and for reminding about pgstattuple.

The script has the drawback of requiring r/w access to the database, so
it cannot be run on a replica. But the idea is excellent. At least it
can show the top N bloated relations, and then I can further explore with
pgstattuple(), pgstatindex() etc.


--
Victor Sudakov,  VAS4-RIPE, VAS47-RIPN
2:5005/49@fidonet http://vas.tomsk.ru/


Reply | Threaded
Open this post in threaded view
|

Re: index bloat estimation

Keith Fiske-2


On Sun, Feb 14, 2021 at 11:43 PM Victor Sudakov <[hidden email]> wrote:
Keith Fiske wrote:
> >
> > What queries do you use to estimate index and table bloat?

[dd]

> Why estimate when you can get the exact amount? At least for b-tree indexes
> anyway.
>
> https://github.com/keithf4/pg_bloat_check

Hello Keith,

Thanks for this script and for reminding about pgstattuple.

The script has the drawback of requiring r/w access to the database, so
it cannot be run on a replica. But the idea is excellent. At least it
can show the top N bloated relations, and then I can further explore with
pgstattuple(), pgstatindex() etc.


--
Victor Sudakov,  VAS4-RIPE, VAS47-RIPN
2:5005/49@fidonet http://vas.tomsk.ru/


You cannot run the actual scan on the replica, no. But it can be set to run against the replica and just report the statistics so you can have a cron set up to always run on the given system in case of failover. The "--recovery_mode_norun" can be set so it will only run if the target system is actually a primary.

I could possibly see about letting this actually run against the replica, however this can be a rather long running transaction depending on the size of the tables involved. You can set the "--commit_rate" to avoid some of that, but if you have really large tables, it can still run quite long. So this sort of check is really best run against the primary to avoid issues around having to allow long running queries on the replica (delayed replication or even worse bloat buildup).

However, once you get bloat under control, you likely shouldn't need to be running this often, especially against the entire database. If a few tables turn out to be problematic, you can make schedules just for them. And otherwise run an entire database scan at most maybe once a month during off-peak hours.


--
Keith Fiske
Senior Database Engineer
Crunchy Data - http://crunchydata.com
Reply | Threaded
Open this post in threaded view
|

Re: index bloat estimation

Victor Sudakov
Keith Fiske wrote:

[dd]

>
> You cannot run the actual scan on the replica, no. But it can be set to run
> against the replica and just report the statistics so you can have a cron
> set up to always run on the given system in case of failover. The
> "--recovery_mode_norun" can be set so it will only run if the target system
> is actually a primary.
>
> I could possibly see about letting this actually run against the replica,
> however this can be a rather long running transaction depending on the size
> of the tables involved. You can set the "--commit_rate" to avoid some of
> that, but if you have really large tables, it can still run quite long. So
> this sort of check is really best run against the primary to avoid issues
> around having to allow long running queries on the replica (delayed
> replication or even worse bloat buildup).

If you have a dedicated replica for OLAP, even one running from a WAL
archive (not from a replication slot), this is not an issue.

So running pg_bloat_check against a replica would be very useful for
some of us.

--
Victor Sudakov,  VAS4-RIPE, VAS47-RIPN
2:5005/49@fidonet http://vas.tomsk.ru/


Reply | Threaded
Open this post in threaded view
|

Re: index bloat estimation

Keith Fiske-2


On Tue, Feb 16, 2021 at 9:27 PM Victor Sudakov <[hidden email]> wrote:
Keith Fiske wrote:

[dd]
>
> You cannot run the actual scan on the replica, no. But it can be set to run
> against the replica and just report the statistics so you can have a cron
> set up to always run on the given system in case of failover. The
> "--recovery_mode_norun" can be set so it will only run if the target system
> is actually a primary.
>
> I could possibly see about letting this actually run against the replica,
> however this can be a rather long running transaction depending on the size
> of the tables involved. You can set the "--commit_rate" to avoid some of
> that, but if you have really large tables, it can still run quite long. So
> this sort of check is really best run against the primary to avoid issues
> around having to allow long running queries on the replica (delayed
> replication or even worse bloat buildup).

If you have a dedicated replica for OLAP, even one running from a WAL
archive (not from a replication slot), this is not an issue.

So running pg_bloat_check against a replica would be very useful for
some of us.

--
Victor Sudakov,  VAS4-RIPE, VAS47-RIPN
2:5005/49@fidonet http://vas.tomsk.ru/



Actually this could still be an issue, but all really depends on the size of the tables involved. Long running transactions on the replica have the potential to either delay replication entirely (max_standby_archive_delay, max_standby_streaming_delay) or cause more bloat than normal on the primary (hot_standby_feedback). The latter is more often used to avoid the replication delay, but I have frequently seen people push long running transactions that seem to be a "problem" onto the replicas and don't realize that it doesn't solve all the problems of actually running those queries on the primary. You still have a transaction causing autovacuum to not be able to run efficiently.

The approximate/quick mode of pgstattuple could certainly help with this problem, but as the issue you opened up on the github repo pointed out, that skips over scanning toast tables (https://github.com/keithf4/pg_bloat_check/issues/22) and also does not work against indexes which are more often the problem with bloat and query performance. I have seen significant bloat forming in the toast tables (hundreds of GB) when the regular table only reports very minimal bloat. So I don't recommend relying completely on the approximate check.

--
Keith Fiske
Senior Database Engineer
Crunchy Data - http://crunchydata.com
Reply | Threaded
Open this post in threaded view
|

Re: index bloat estimation

Victor Sudakov
Keith Fiske wrote:

> >
> Actually this could still be an issue, but all really depends on the size
> of the tables involved. Long running transactions on the replica have the
> potential to either delay replication entirely
> (max_standby_archive_delay, max_standby_streaming_delay) or cause more
> bloat than normal on the primary (hot_standby_feedback). The latter is more

As I said earlier, if you have a dedicated replica for OLAP running
continuous recovery from a WAL archive (not from a replication slot),
this is not an issue.  You can do whatever you want with this replica:
stop it, make it a delayed replica, it cannot affect the master in any
way.

>
> The approximate/quick mode of pgstattuple could certainly help with this
> problem, but as the issue you opened up on the github repo pointed out,
> that skips over scanning toast tables (
> https://github.com/keithf4/pg_bloat_check/issues/22) and also does not work
> against indexes which are more often the problem with bloat and query
> performance. I have seen significant bloat forming in the toast tables
> (hundreds of GB) when the regular table only reports very minimal bloat. So
> I don't recommend relying completely on the approximate check.

I'll mind that, thank you.

--
Victor Sudakov,  VAS4-RIPE, VAS47-RIPN
2:5005/49@fidonet http://vas.tomsk.ru/