RE: PG12 autovac issues


RE: PG12 autovac issues

Justin King
Hi Andres-

Thanks for the reply, answers below.

On Tue, Mar 17, 2020 at 8:19 PM Andres Freund <[hidden email]> wrote:

>
> Hi,
>
> On 2020-03-17 17:18:57 -0500, Justin King wrote:
> > As you can see in this table, there are only ~80K rows, but billions
> > of updates.  What we have observed is that the frozenxid reaches the
> > 200M mark fairly quickly because of the amount of activity.
>
> And each of those updates is in a separate transaction? Is that
> required? I.e. any chance to perform multiple of those updates in one
> transaction?
>
> Have you considered just increasing the vacuum limit? It's mostly there
> because it can increase space usage a bit, but given today's systems it's
> not usually a problem unless you have hundreds of postgres instances
> on a single system.  It's pretty common to just increase that to 1.8
> billion or so.

We have considered increasing the limit, but as I mentioned, the
problem that we're seeing is that (effectively) an autovac starts on a
system database (postgres, template1) and never completes, or
deadlocks, or something.  This completely stops autovacs from running
until we manually intervene and run a VACUUM FREEZE -- at which point,
the autovacs resume.  If we increase the vacuum limit and this
situation happens, we're going to be in real trouble.

> From a single stats snapshot we can't actually understand the actual xid
> consumption - is it actually the xid usage that triggers the vacuums?

We have looked at this and the xid consumption averages around 1250
xid/sec -- this is when we see the "aggressive" autovac kick off in
the logs.  What I don't understand is why these xid's are being
consumed at this rate on the databases with no activity (postgres,
template1).
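
A rough way to see the consumption rate, for anyone wanting to reproduce the
number, is to sample the xid counter a few minutes apart and divide the delta
by the elapsed seconds, e.g.:

  -- run twice, a few minutes apart; txid_current() itself consumes one xid
  SELECT now(), txid_current();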

>
>
> What makes you think it is a problem that you have all these vacuums? If
> you actually update that much, and you have indexes, you're going to want a
> lot of vacuums?
>

I actually don't think there's a problem with the vacuums (I was
mostly pointing out that they are very regular and not problematic).
The main problem I am having is that something is causing the
autovacuums to completely stop and require manual intervention to
resume -- and it seems to be when the "postgres" or "template1"
database hits the autovacuum_freeze_max_age.

>
> > What is interesting is that this happens with the 'postgres' and
> > 'template1' databases as well and there is absolutely no activity in
> > those databases.
>
> That's normal. They should be pretty darn quick in v12?

Yes, a manual VACUUM FREEZE of either database takes less than 1
second -- which is why it's perplexing that the autovac starts and
never seems to complete and prevents other autovacs from running.

>
> Greetings,
>
> Andres Freund



Re: PG12 autovac issues

Andres Freund
Hi,

On 2020-03-19 10:23:48 -0500, Justin King wrote:
> > From a single stats snapshot we can't actually understand the actual xid
> > consumption - is it actually the xid usage that triggers the vacuums?
>
> We have looked at this and the xid consumption averages around 1250
> xid/sec -- this is when we see the "aggressive" autovac kick off in
> the logs.  What I don't understand is why these xid's are being
> consumed at this rate on the databases with no activity (postgres,
> template1).

The xid counter is global across all databases.


> > What makes you think it is a problem that you have all these vacuums? If
> > you actually update that much, and you have indexes, you're going to want a
> > lot of vacuums?

> I actually don't think there's a problem with the vacuums (I was
> mostly pointing out that they are very regular and not problematic).
> The main problem I am having is that something is causing the
> autovacuums to completely stop and require manual intervention to
> resume -- and it seems to be when the "postgres" or "template1"
> database hits the autovacuum_freeze_max_age.

Did you look at pg_stat_activity for those autovacuums to see whether
they're blocked on something?
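
For example, something like this would show it (wait_event_type/wait_event and
pg_blocking_pids() being the interesting bits):

  SELECT pid, backend_type, state, wait_event_type, wait_event,
         pg_blocking_pids(pid), query
  FROM pg_stat_activity
  WHERE backend_type = 'autovacuum worker';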


> > > What is interesting is that this happens with the 'postgres' and
> > > 'template1' databases as well and there is absolutely no activity in
> > > those databases.
> >
> > That's normal. They should be pretty darn quick in v12?
>
> Yes, a manual VACUUM FREEZE of either database takes less than 1
> second -- which is why it's perplexing that the autovac starts and
> never seems to complete and prevents other autovacs from running.

One big difference between a manual VACUUM and autovacuum is that with
the default settings VACUUM is not throttled, but autovacuum is.

What are your vacuum_cost_delay, autovacuum_vacuum_cost_delay,
vacuum_cost_limit, autovacuum_vacuum_cost_limit, vacuum_cost_page_hit,
vacuum_cost_page_miss set to?

Greetings,

Andres Freund



Re: PG12 autovac issues

Justin King
On Thu, Mar 19, 2020 at 5:35 PM Andres Freund <[hidden email]> wrote:

>
> Hi,
>
> On 2020-03-19 10:23:48 -0500, Justin King wrote:
> > > From a single stats snapshot we can't actually understand the actual xid
> > > consumption - is it actually the xid usage that triggers the vacuums?
> >
> > We have looked at this and the xid consumption averages around 1250
> > xid/sec -- this is when we see the "aggressive" autovac kick off in
> > the logs.  What I don't understand is why these xid's are being
> > consumed at this rate on the databases with no activity (postgres,
> > template1).
>
> The xid counter is global across all databases.

Then what does the "age" value represent for each database in this
case?  Perhaps I'm misunderstanding what I'm looking at?

postgres=#  SELECT datname, age(datfrozenxid),
current_setting('autovacuum_freeze_max_age') FROM pg_database;
  datname  |    age    | current_setting
-----------+-----------+-----------------
 postgres  | 100937449 | 200000000
 template1 |  50244438 | 200000000
 template0 | 160207297 | 200000000
 feedi     | 150147602 | 200000000


>
> > > What makes you think it is a problem that you have all these vacuums? If
> > > you actually update that much, and you have indexes, you're going to want a
> > > lot of vacuums?
>
> > I actually don't think there's a problem with the vacuums (I was
> > mostly pointing out that they are very regular and not problematic).
> > The main problem I am having is that something is causing the
> > autovacuums to completely stop and require manual intervention to
> > resume -- and it seems to be when the "postgres" or "template1"
> > database hits the autovacuum_freeze_max_age.
>
> Did you look at pg_stat_activity for those autovacuums to see whether
> they're blocked on something?

This is not something we've done yet but will next time it occurs.

> > > > What is interesting is that this happens with the 'postgres' and
> > > > 'template1' databases as well and there is absolutely no activity in
> > > > those databases.
> > >
> > > That's normal. They should be pretty darn quick in v12?
> >
> > Yes, a manual VACUUM FREEZE of either database takes less than 1
> > second -- which is why it's perplexing that the autovac starts and
> > never seems to complete and prevents other autovacs from running.
>
> One big difference between a manual VACUUM and autovacuum is that with
> the default settings VACUUM is not throttled, but autovacuum is.
>
> What are your vacuum_cost_delay, autovacuum_vacuum_cost_delay,
> vacuum_cost_limit, autovacuum_vacuum_cost_limit, vacuum_cost_page_hit,
> vacuum_cost_page_miss set to?

Here are all the vacuum related values for the server:

postgres=# select name,setting from pg_settings where name like '%vacuum%';
autovacuum = on
autovacuum_analyze_scale_factor = 0.1
autovacuum_analyze_threshold = 2500
autovacuum_freeze_max_age = 200000000
autovacuum_max_workers = 8
autovacuum_multixact_freeze_max_age = 400000000
autovacuum_naptime = 15
autovacuum_vacuum_cost_delay = 20
autovacuum_vacuum_cost_limit = -1
autovacuum_vacuum_scale_factor = 0.2
autovacuum_vacuum_threshold = 500
autovacuum_work_mem = -1
log_autovacuum_min_duration = 0
vacuum_cleanup_index_scale_factor = 0.1
vacuum_cost_delay = 0
vacuum_cost_limit = 1000
vacuum_cost_page_dirty = 20
vacuum_cost_page_hit = 1
vacuum_cost_page_miss = 10
vacuum_defer_cleanup_age = 0
vacuum_freeze_min_age = 50000000
vacuum_freeze_table_age = 150000000
vacuum_multixact_freeze_min_age = 5000000
vacuum_multixact_freeze_table_age = 150000000

I know the database is busy, so the throttling makes sense, but it
seems like it would complete eventually. We see blocked autovacs for
many hours.

> Greetings,
>
> Andres Freund



Re: PG12 autovac issues

Andres Freund
Hi,

On 2020-03-19 18:07:14 -0500, Justin King wrote:

> On Thu, Mar 19, 2020 at 5:35 PM Andres Freund <[hidden email]> wrote:
> >
> > Hi,
> >
> > On 2020-03-19 10:23:48 -0500, Justin King wrote:
> > > > From a single stats snapshot we can't actually understand the actual xid
> > > > consumption - is it actually the xid usage that triggers the vacuums?
> > >
> > > We have looked at this and the xid consumption averages around 1250
> > > xid/sec -- this is when we see the "aggressive" autovac kick off in
> > > the logs.  What I don't understand is why these xid's are being
> > > consumed at this rate on the databases with no activity (postgres,
> > > template1).
> >
> > The xid counter is global across all databases.
>
> Then what does the "age" value represent for each database in this
> case?  Perhaps I'm misunderstanding what I'm looking at?
>
> postgres=#  SELECT datname, age(datfrozenxid),
> current_setting('autovacuum_freeze_max_age') FROM pg_database;
>   datname  |    age    | current_setting
> -----------+-----------+-----------------
>  postgres  | 100937449 | 200000000
>  template1 |  50244438 | 200000000
>  template0 | 160207297 | 200000000
>  feedi     | 150147602 | 200000000

Look at datfrozenxid without the age(). age(xid) computes how "old" xid
is compared to the "next" xid to be assigned. Until vacuum comes around
and performs work, pg_database.datfrozenxid / pg_class.relfrozenxid are
constant, since they represent the values actually present in the
table.  But if xids are being consumed, their "age" increases, because
they're further and further in the past relative to the "newest" xids.
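
For example, comparing the raw value with its age makes that visible -- the raw
datfrozenxid stays constant while age(datfrozenxid) keeps growing:

  SELECT datname, datfrozenxid, age(datfrozenxid)
  FROM pg_database ORDER BY age(datfrozenxid) DESC;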


> > One big difference between a manual VACUUM and autovacuum is that with
> > the default settings VACUUM is not throttled, but autovacuum is.
> >
> > What are your vacuum_cost_delay, autovacuum_vacuum_cost_delay,
> > vacuum_cost_limit, autovacuum_vacuum_cost_limit, vacuum_cost_page_hit,
> > vacuum_cost_page_miss set to?
>
> Here are all the vacuum related values for the server:
>
> postgres=# select name,setting from pg_settings where name like '%vacuum%';
> autovacuum = on
> autovacuum_analyze_scale_factor = 0.1
> autovacuum_analyze_threshold = 2500
> autovacuum_freeze_max_age = 200000000
> autovacuum_max_workers = 8
> autovacuum_multixact_freeze_max_age = 400000000
> autovacuum_naptime = 15
> autovacuum_vacuum_cost_delay = 20
> autovacuum_vacuum_cost_limit = -1
> autovacuum_vacuum_scale_factor = 0.2
> autovacuum_vacuum_threshold = 500
> autovacuum_work_mem = -1
> log_autovacuum_min_duration = 0
> vacuum_cleanup_index_scale_factor = 0.1
> vacuum_cost_delay = 0
> vacuum_cost_limit = 1000
> vacuum_cost_page_dirty = 20
> vacuum_cost_page_hit = 1
> vacuum_cost_page_miss = 10
> vacuum_defer_cleanup_age = 0
> vacuum_freeze_min_age = 50000000
> vacuum_freeze_table_age = 150000000
> vacuum_multixact_freeze_min_age = 5000000
> vacuum_multixact_freeze_table_age = 150000000
>
> I know the database is busy, so the throttling makes sense, but it
> seems like it would complete eventually.

The cost limit/delay are way too small/long respectively for a busy
postgres instance.


> We see blocked autovacs for many hours.

On the same table, or just generally being busy?

Greetings,

Andres Freund



Re: PG12 autovac issues

Justin King
On Thu, Mar 19, 2020 at 6:56 PM Andres Freund <[hidden email]> wrote:

>
> Hi,
>
> On 2020-03-19 18:07:14 -0500, Justin King wrote:
> > On Thu, Mar 19, 2020 at 5:35 PM Andres Freund <[hidden email]> wrote:
> > >
> > > Hi,
> > >
> > > On 2020-03-19 10:23:48 -0500, Justin King wrote:
> > > > > From a single stats snapshot we can't actually understand the actual xid
> > > > > consumption - is it actually the xid usage that triggers the vacuums?
> > > >
> > > > We have looked at this and the xid consumption averages around 1250
> > > > xid/sec -- this is when we see the "aggressive" autovac kick off in
> > > > the logs.  What I don't understand is why these xid's are being
> > > > consumed at this rate on the databases with no activity (postgres,
> > > > template1).
> > >
> > > The xid counter is global across all databases.
> >
> > Then what does the "age" value represent for each database in this
> > case?  Perhaps I'm misunderstanding what I'm looking at?
> >
> > postgres=#  SELECT datname, age(datfrozenxid),
> > current_setting('autovacuum_freeze_max_age') FROM pg_database;
> >   datname  |    age    | current_setting
> > -----------+-----------+-----------------
> >  postgres  | 100937449 | 200000000
> >  template1 |  50244438 | 200000000
> >  template0 | 160207297 | 200000000
> >  feedi     | 150147602 | 200000000
>
> Look at datfrozenxid without the age(). age(xid) computes how "old" xid
> is compared to the "next" xid to be assigned. Until vacuum comes around
> and performs work, pg_database.datfrozenxid / pg_class.relfrozenxid are
> constant, since they represent the values actually present in the
> table.  But if xids are being consumed, their "age" increases, because
> they're further and further in the past relative to the "newest" xids.
>
>
> > > One big difference between a manual VACUUM and autovacuum is that with
> > > the default settings VACUUM is not throttled, but autovacuum is.
> > >
> > > What are your vacuum_cost_delay, autovacuum_vacuum_cost_delay,
> > > vacuum_cost_limit, autovacuum_vacuum_cost_limit, vacuum_cost_page_hit,
> > > vacuum_cost_page_miss set to?
> >
> > Here are all the vacuum related values for the server:
> >
> > postgres=# select name,setting from pg_settings where name like '%vacuum%';
> > autovacuum = on
> > autovacuum_analyze_scale_factor = 0.1
> > autovacuum_analyze_threshold = 2500
> > autovacuum_freeze_max_age = 200000000
> > autovacuum_max_workers = 8
> > autovacuum_multixact_freeze_max_age = 400000000
> > autovacuum_naptime = 15
> > autovacuum_vacuum_cost_delay = 20
> > autovacuum_vacuum_cost_limit = -1
> > autovacuum_vacuum_scale_factor = 0.2
> > autovacuum_vacuum_threshold = 500
> > autovacuum_work_mem = -1
> > log_autovacuum_min_duration = 0
> > vacuum_cleanup_index_scale_factor = 0.1
> > vacuum_cost_delay = 0
> > vacuum_cost_limit = 1000
> > vacuum_cost_page_dirty = 20
> > vacuum_cost_page_hit = 1
> > vacuum_cost_page_miss = 10
> > vacuum_defer_cleanup_age = 0
> > vacuum_freeze_min_age = 50000000
> > vacuum_freeze_table_age = 150000000
> > vacuum_multixact_freeze_min_age = 5000000
> > vacuum_multixact_freeze_table_age = 150000000
> >
> > I know the database is busy, so the throttling makes sense, but it
> > seems like it would complete eventually.
>
> The cost limit/delay are way too small/long respectively for a busy
> postgres instance.

This does make sense, we will look into adjusting those values.
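
Probably something along these lines (the values here are just for
illustration):

  ALTER SYSTEM SET autovacuum_vacuum_cost_delay = '2ms';
  ALTER SYSTEM SET autovacuum_vacuum_cost_limit = 2000;
  SELECT pg_reload_conf();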

> > We see blocked autovacs for many hours.
>
> On the same table, or just generally being busy?

We haven't isolated *which* table it is blocked on (assuming it is),
but all autovacs cease running until we manually intervene.

When we get into this state again, is there some other information
(other than what is in pg_stat_statement or pg_stat_activity) that
would be useful for folks here to help understand what is going on?
>
> Greetings,
>
> Andres Freund



Re: PG12 autovac issues

Andres Freund
Hi,

On 2020-03-20 12:42:31 -0500, Justin King wrote:
> When we get into this state again, is there some other information
> (other than what is in pg_stat_statement or pg_stat_activity) that
> would be useful for folks here to help understand what is going on?

If it's actually stuck on a single table, and that table is not large,
it would be useful to get a backtrace with gdb.

Greetings,

Andres Freund



Re: PG12 autovac issues

Michael Lewis
In reply to this post by Justin King
> We haven't isolated *which* table it is blocked on (assuming it is),
> but all autovacs cease running until we manually intervene.
>
> When we get into this state again, is there some other information
> (other than what is in pg_stat_statement or pg_stat_activity) that
> would be useful for folks here to help understand what is going on?


Re: PG12 autovac issues

Julien Rouhaud
In reply to this post by Andres Freund
On Fri, Mar 20, 2020 at 12:03:17PM -0700, Andres Freund wrote:
> Hi,
>
> On 2020-03-20 12:42:31 -0500, Justin King wrote:
> > When we get into this state again, is there some other information
> > (other than what is in pg_stat_statement or pg_stat_activity) that
> > would be useful for folks here to help understand what is going on?
>
> If it's actually stuck on a single table, and that table is not large,
> it would be useful to get a backtrace with gdb.

FTR, we're facing a very similar issue at work (adding Michael and Kevin in Cc)
during performance tests since a recent upgrade to pg12.

What seems to be happening is that after reaching 200M transactions a first pass
of autovacuum freeze is run, bumping pg_database.datfrozenxid by ~ 800k
(age(datfrozenxid) still being more than autovacuum_freeze_max_age afterwards).
After that point, all available information seems to indicate that no
autovacuum worker is scheduled anymore:

- log_autovacuum_min_duration is set to 0 and no activity is logged (while
  having thousands of those per hour before that)
- 15 min interval snapshot of pg_database and pg_class shows that
  datfrozenxid/relfrozenxid keeps increasing at a consistent rate and never
  goes down
- 15 min interval snapshot of pg_stat_activity doesn't show any autovacuum
  worker
- the autovacuum launcher is up and running and doesn't show any sign of
  problem
- n_mod_since_analyze keeps growing at a consistent rate, never going down
- 15 min delta of tup_updated and tup_deleted shows that the global write
  activity doesn't change before and after the autovacuum problem

The situation continues for ~2h, at which point the bloat is so heavy that the
main filesystem becomes full, and postgres panics after a failed write in
the pg_logical directory or similar.

The same bench was run against pg11 many times and never triggered this issue.
So far our best guess is a side effect of 2aa6e331ead7.

Michael and I have been trying to reproduce this issue locally (drastically
reducing the various freeze_age parameters) for hours, but no luck for now.

This is using a vanilla pg 12.1, with some OLTP workload.  The only possibly
relevant configuration changes are quite aggressive autovacuum settings on some
tables (no delay, analyze/vacuum threshold to 1k and analyze/vacuum scale
factor to 0, for both heap and toast).



Re: PG12 autovac issues

Andres Freund
Hi,

On 2020-03-23 16:22:47 +0100, Julien Rouhaud wrote:

> On Fri, Mar 20, 2020 at 12:03:17PM -0700, Andres Freund wrote:
> > Hi,
> >
> > On 2020-03-20 12:42:31 -0500, Justin King wrote:
> > > When we get into this state again, is there some other information
> > > (other than what is in pg_stat_statement or pg_stat_activity) that
> > > would be useful for folks here to help understand what is going on?
> >
> > If it's actually stuck on a single table, and that table is not large,
> > it would be useful to get a backtrace with gdb.
>
> FTR, we're facing a very similar issue at work (adding Michael and Kevin in Cc)
> during performance tests since a recent upgrade to pg12.
>
> What seems to be happening is that after reaching 200M transactions a first pass
> of autovacuum freeze is run, bumping pg_database.datfrozenxid by ~ 800k
> (age(datfrozenxid) still being more than autovacuum_freeze_max_age
> afterwards).

If you have older transactions around that'd not be surprising. Do you
have autovacuum logging output for this case?


> After that point, all available information seems to indicate that no
> autovacuum worker is scheduled anymore:

Do you mean that this table doesn't get autovac'ed at all anymore, that
no table is getting autovac'd, or just that there's nothing further
increasing relfrozenxid for that table?

It sounds like:

> - log_autovacuum_min_duration is set to 0 and no activity is logged (while
>   having thousands of those per hour before that)

no table at all?


> - 15 min interval snapshot of pg_database and pg_class shows that
>   datfrozenxid/relfrozenxid keeps increasing at a consistent rate and never
>   goes down

I assume you mean their age?

What is:
- datfrozenxid, age(datfrozenxid) for the oldest database
  SELECT datname, age(datfrozenxid), datfrozenxid FROM pg_database ORDER BY age(datfrozenxid) DESC LIMIT 1;
- relfrozenxid, age(relfrozenxid) for the oldest table in the oldest database
  SELECT oid::regclass, age(relfrozenxid), relfrozenxid FROM pg_class WHERE relfrozenxid <> 0 ORDER BY age(relfrozenxid) DESC LIMIT 1;
- Oldest backend xmin
  SELECT pid, backend_xmin, age(backend_xmin) FROM pg_stat_activity WHERE backend_xmin <> 0 ORDER BY age(backend_xmin) DESC LIMIT 3;
- oldest replication xmin:
  SELECT pid, backend_xmin, age(backend_xmin) FROM pg_stat_replication WHERE backend_xmin <> 0 ORDER BY age(backend_xmin) DESC LIMIT 3;
- oldest slot xmin:
  SELECT slot_name, xmin, age(xmin), catalog_xmin, age(catalog_xmin) FROM pg_replication_slots WHERE xmin <> 0 OR catalog_xmin <> 0 ORDER BY greatest(age(xmin), age(catalog_xmin)) DESC LIMIT 3;
- oldest prepared transaction (unfortunately xmin not available)
  SELECT gid, database, transaction FROM pg_prepared_xacts ORDER BY age(transaction) LIMIT 3;


> The same bench was run against pg11 many times and never triggered this issue.
> So far our best guess is a side effect of 2aa6e331ead7.

In that case you'd likely see DEBUG1 output, right? Did you try running
with that?


> Michael and I have been trying to reproduce this issue locally (drastically
> reducing the various freeze_age parameters) for hours, but no luck for now.

Have you considered using gdb to investigate?


> This is using a vanilla pg 12.1, with some OLTP workload.  The only possibly
> relevant configuration changes are quite aggressive autovacuum settings on some
> tables (no delay, analyze/vacuum threshold to 1k and analyze/vacuum scale
> factor to 0, for both heap and toast).

That, uh, is not an insignificant set of changes for an autovac
scheduling issue.

It sounds like what might be happening is that you have something
holding back the "oldest needed transaction" horizon. Before
2aa6e331ead7, if the xmin horizon hasn't increased, there'd be repeated
vacuums unable to increase the horizon, but afterwards they'll all
immediately exit without any messages.

I wonder if what might be happening is that we somehow missed/failed
to update relfrozenxid and/or datfrozenxid. If you manually vacuum some
table in the oldest database, but that is *NOT* the oldest table itself,
does the problem "resolve" itself?
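
I.e. something like (the table name below is just a placeholder -- any small
table other than the oldest one will do):

  -- find the table(s) currently holding back relfrozenxid in that database
  SELECT oid::regclass, age(relfrozenxid) FROM pg_class
  WHERE relfrozenxid <> 0 ORDER BY age(relfrozenxid) DESC LIMIT 5;
  -- then vacuum some *other*, small table and re-check pg_database.datfrozenxid
  VACUUM (VERBOSE) some_other_small_table;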

Greetings,

Andres Freund



Re: PG12 autovac issues

Julien Rouhaud
Hi,

On Mon, Mar 23, 2020 at 09:23:03AM -0700, Andres Freund wrote:

> Hi,
>
> On 2020-03-23 16:22:47 +0100, Julien Rouhaud wrote:
> > On Fri, Mar 20, 2020 at 12:03:17PM -0700, Andres Freund wrote:
> > > Hi,
> > >
> > > On 2020-03-20 12:42:31 -0500, Justin King wrote:
> > > > When we get into this state again, is there some other information
> > > > (other than what is in pg_stat_statement or pg_stat_activity) that
> > > > would be useful for folks here to help understand what is going on?
> > >
> > > If it's actually stuck on a single table, and that table is not large,
> > > it would be useful to get a backtrace with gdb.
> >
> > FTR, we're facing a very similar issue at work (adding Michael and Kevin in Cc)
> > during performance tests since a recent upgrade to pg12.
> >
> > What seems to be happening is that after reaching 200M transactions a first pass
> > of autovacuum freeze is run, bumping pg_database.datfrozenxid by ~ 800k
> > (age(datfrozenxid) still being more than autovacuum_freeze_max_age
> > afterwards).
>
> If you have older transactions around that'd not be surprising. Do you
> have autovacuum logging output for this case?

There's an hourly long-running query that can retain xmin up to a few million
xid, but definitely not something close to 200M.

When I said a first pass, I meant a batch of vacuums.  Here's an overview of the
"to prevent wraparound" logs per hour (UTC timezone):

 442 2020-03-14 07
1686 2020-03-14 08
  14 2020-03-14 10

and an overview of all autovacuum activity (still in UTC):

1366 2020-03-14 00
1457 2020-03-14 01
1387 2020-03-14 02
1440 2020-03-14 03
1349 2020-03-14 04
7383 2020-03-14 05
13909 2020-03-14 06
14240 2020-03-14 07
2094 2020-03-14 08
   0 2020-03-14 09
  16 2020-03-14 10
   3 2020-03-14 11
   4 2020-03-14 12
   3 2020-03-14 13

The final outage being:
2020-03-14 10:27:23.280 UTC [...] ERROR: could not extend file "base/16386/20245.4": No space left on device

536222:2020-03-14 10:40:00.089 UTC [...] PANIC:  could not write to file "pg_logical/replorigin_checkpoint.tmp": No space left on device

the last autovacuum evidence before that being:

274177:2020-03-14 08:54:11.797 UTC 5e6c8ed0.d665 0   LOG:  automatic vacuum of table "pg_toast.pg_toast_20237": index scans: 1
274178- pages: 0 removed, 273666 remain, 0 skipped due to pins, 251545 skipped frozen
274179- tuples: 83585 removed, 749 remain, 209 are dead but not yet removable, oldest xmin: 210363848
274180- buffer usage: 50096 hits, 23521 misses, 19996 dirtied
274181- avg read rate: 99.195 MB/s, avg write rate: 84.329 MB/s
274182- system usage: CPU: user: 0.28 s, system: 0.21 s, elapsed: 1.85 s

After the crash-and-restart autovacuum is working again (as seen in the 10AM -
1PM logs), although the bench stays stopped.


What I can see in pg_database is (GMT+1 here):

          timestamp            | datfrozenxid | age(datfrozenxid)
 [...]
 2020-03-14 09:12:11.279515+01 |          480 | 202554804
 2020-03-14 09:27:12.723617+01 |          480 | 205408276
 2020-03-14 09:42:13.868701+01 |          480 | 208239967
 2020-03-14 09:57:15.685827+01 |       827585 | 210143294
 2020-03-14 10:12:17.488993+01 |       827585 | 213143797
 2020-03-14 10:27:18.899525+01 |       827585 | 216104516
 2020-03-14 10:42:19.926601+01 |       827585 | 219075040
 2020-03-14 10:57:21.023513+01 |       827585 | 222085423
 2020-03-14 11:12:22.85198+01  |       827585 | 225057731


> > After that point, all available information seems to indicate that no
> > autovacuum worker is scheduled anymore:
>
> Do you mean that this table doesn't get autovac'ed at all anymore, that
> no table is getting autovac'd, or just that there's nothing further
> increasing relfrozenxid for that table?
>
> It sounds like:
>
> > - log_autovacuum_min_duration is set to 0 and no activity is logged (while
> >   having thousands of those per hour before that)
>
> no table at all?

Correct, no table being autovacuumed, no sign of autovacuum being scheduled or
anything.

> > - 15 min interval snapshot of pg_database and pg_class shows that
> >   datfrozenxid/relfrozenxid keeps increasing at a consistent rate and never
> >   goes down
>
> I assume you mean their age?

Yes sorry.

> What is:
> - datfrozenxid, age(datfrozenxid) for the oldest database
>   SELECT datname, age(datfrozenxid), datfrozenxid FROM pg_database ORDER BY age(datfrozenxid) DESC LIMIT 1;

see above.  FTR there's only one database being used.

> - relfrozenxid, age(relfrozenxid) for the oldest table in the oldest database
>   SELECT oid::regclass, age(relfrozenxid), relfrozenxid FROM pg_class WHERE relfrozenxid <> 0 ORDER BY age(relfrozenxid) DESC LIMIT 1;

The vm has been trashed since, and I don't have that level of detail available
in the gathered stats unfortunately (and the available information I have is a
little bit messy, sorry for that).

> - Oldest backend xmin
>   SELECT pid, backend_xmin, age(backend_xmin) FROM pg_stat_activity WHERE backend_xmin <> 0 ORDER BY age(backend_xmin) DESC LIMIT 3;
> - oldest replication xmin:
>   SELECT pid, backend_xmin, age(backend_xmin) FROM pg_stat_replication WHERE backend_xmin <> 0 ORDER BY age(backend_xmin) DESC LIMIT 3;
> - oldest slot xmin:
>   SELECT slot_name, xmin, age(xmin), catalog_xmin, age(catalog_xmin) FROM pg_replication_slots WHERE xmin <> 0 OR catalog_xmin <> 0 ORDER BY greatest(age(xmin), age(catalog_xmin)) DESC LIMIT 3;
> - oldest prepared transaction (unfortunately xmin not available)
>   SELECT gid, database, transaction FROM pg_prepared_xacts ORDER BY age(transaction) LIMIT 3;

I have output from this query: https://github.com/OPMDG/check_pgactivity/blob/oldest_xmin/check_pgactivity#L5669-L5732
(which may be dumb).  Here are the non-null rows after 8AM GMT+1:

           timestamp           |    kind     |   age   |   x_min
-------------------------------+-------------+---------+-----------
 2020-03-14 08:12:04.082523+01 | active_xact |     968 | 191022091
 2020-03-14 08:12:04.082523+01 | idle_xact   |     968 | 191022091
 2020-03-14 08:12:04.082523+01 | query       | 2538322 | 188484737
 2020-03-14 08:27:07.046938+01 | active_xact |     606 | 193917749
 2020-03-14 08:27:07.046938+01 | idle_xact   |     606 | 193917749
 2020-03-14 08:27:07.046938+01 | query       |   39646 | 193878709
 2020-03-14 08:42:08.902248+01 | active_xact |     502 | 196764934
 2020-03-14 08:42:08.902248+01 | idle_xact   |     502 | 196764934
 2020-03-14 08:42:08.902248+01 | query       |   12787 | 196752649
 2020-03-14 08:57:10.089606+01 | active_xact |      91 | 199684930
 2020-03-14 08:57:10.089606+01 | idle_xact   |      91 | 199684930
 2020-03-14 08:57:10.089606+01 | query       |   25453 | 199659568
 2020-03-14 09:12:11.279515+01 | active_xact |     264 | 202555020
 2020-03-14 09:12:11.279515+01 | idle_xact   |     264 | 202555020
 2020-03-14 09:12:11.279515+01 | query       | 2316084 | 200239200
 2020-03-14 09:27:12.723617+01 | active_xact |      90 | 205408666
 2020-03-14 09:27:12.723617+01 | idle_xact   |      90 | 205408666
 2020-03-14 09:27:12.723617+01 | query       | 5169556 | 200239200
 2020-03-14 09:42:13.868701+01 | active_xact |     381 | 208240066
 2020-03-14 09:42:13.868701+01 | idle_xact   |     381 | 208240066
 2020-03-14 09:42:13.868701+01 | query       |  934819 | 207305628
 2020-03-14 09:57:15.685827+01 | active_xact |     791 | 210970088
 2020-03-14 09:57:15.685827+01 | idle_xact   |     791 | 210970088
 2020-03-14 09:57:15.685827+01 | query       |     791 | 210970088
 2020-03-14 10:12:17.488993+01 | active_xact |     243 | 213971139
 2020-03-14 10:12:17.488993+01 | idle_xact   |      61 | 213971321
 2020-03-14 10:27:18.899525+01 | active_xact |     105 | 216931996
 2020-03-14 10:27:18.899525+01 | idle_xact   |     105 | 216931996
 2020-03-14 10:27:18.899525+01 | query       |     109 | 216931992
 2020-03-14 10:42:19.926601+01 | active_xact | 1046073 | 218856552
 2020-03-14 10:42:19.926601+01 | idle_xact   |      14 | 219902611
 2020-03-14 10:42:19.926601+01 | query       |     234 | 219902391
 2020-03-14 10:57:21.023513+01 | active_xact |     707 | 222912301
 2020-03-14 10:57:21.023513+01 | idle_xact   |     707 | 222912301
 2020-03-14 10:57:21.023513+01 | query       |     707 | 222912301
 2020-03-14 11:12:22.85198+01  | active_xact |     302 | 225885014
 2020-03-14 11:12:22.85198+01  | idle_xact   |     302 | 225885014
 2020-03-14 11:12:22.85198+01  | query       |     302 | 225885014

> > The same bench was run against pg11 many times and never triggered this issue.
> > So far our best guess is a side effect of 2aa6e331ead7.
>
> In that case you'd likely see DEBUG1 output, right? Did you try running
> with that?

That's unfortunately not an option, as while the issue is reproducible, it
happens after roughly 16h with an average of 6.5ktps, and we definitely don't
have enough disk space for that amount of logs.  We'll try next time with a
recompiled version with the DEBUG1 promoted to WARNING, and maybe other elog calls.

> > Michael and I have been trying to reproduce this issue locally (drastically
> > reducing the various freeze_age parameters) for hours, but no luck for now.
>
> Have you considered using gdb to investigate?

Not for now, because it's hard to be around for the ~ 90min interval the system
stays alive after autovacuum gets stuck, especially since it's a weekend bench.

> > This is using a vanilla pg 12.1, with some OLTP workload.  The only possibly
> > relevant configuration changes are quite aggressive autovacuum settings on some
> > tables (no delay, analyze/vacuum threshold to 1k and analyze/vacuum scale
> > factor to 0, for both heap and toast).
>
> That, uh, is not an insignificant set of changes for an autovac
> scheduling issue.
>
> It sounds like what might be happening is that you have something
> holding back the "oldest needed transaction" horizon. Before
> 2aa6e331ead7, if the xmin horizon hasn't increased, there'd be repeated
> vacuums unable to increase the horizon, but afterwards they'll all
> immediately exit without any messages.

The xmin horizon is increasing AFAICS.  Grepping the autovacuum logs, I can see
for the 08:00 AM UTC - 08:59 AM UTC interval the oldest xmin going from
200069684 to 210363848.  E.g.:

[...]
266603:2020-03-14 08:49:59.927 UTC LOG:  automatic aggressive vacuum to prevent wraparound of table "pg_toast.pg_toast_27683": index scans: 0
266604- pages: 0 removed, 0 remain, 0 skipped due to pins, 0 skipped frozen
266605- tuples: 0 removed, 0 remain, 0 are dead but not yet removable, oldest xmin: 209635921
266606- buffer usage: 25 hits, 1 misses, 1 dirtied
266607- avg read rate: 15.440 MB/s, avg write rate: 15.440 MB/s
266608- system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s
266609:2020-03-14 08:49:59.929 UTC LOG:  automatic aggressive vacuum to prevent wraparound of table "pg_toast.pg_toast_27694": index scans: 0
266610- pages: 0 removed, 0 remain, 0 skipped due to pins, 0 skipped frozen
266611- tuples: 0 removed, 0 remain, 0 are dead but not yet removable, oldest xmin: 209635934
266612- buffer usage: 25 hits, 1 misses, 1 dirtied
266613- avg read rate: 21.058 MB/s, avg write rate: 21.058 MB/s
266614- system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s
266615:2020-03-14 08:49:59.931 UTC LOG:  automatic aggressive vacuum to prevent wraparound of table "pg_toast.pg_toast_24845": index scans: 0
266616- pages: 0 removed, 0 remain, 0 skipped due to pins, 0 skipped frozen
266617- tuples: 0 removed, 0 remain, 0 are dead but not yet removable, oldest xmin: 209635934
266618- buffer usage: 25 hits, 1 misses, 2 dirtied
266619- avg read rate: 8.005 MB/s, avg write rate: 16.009 MB/s
266620- system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s
266621:2020-03-14 08:49:59.945 UTC LOG:  automatic aggressive vacuum to prevent wraparound of table "pg_toast.pg_toast_27712": index scans: 0
266622- pages: 0 removed, 0 remain, 0 skipped due to pins, 0 skipped frozen
266623- tuples: 0 removed, 0 remain, 0 are dead but not yet removable, oldest xmin: 209635953
[...]


> I wonder if what might be happening is that we somehow missed/failed
> to update relfrozenxid and/or datfrozenxid. If you manually vacuum some
> table in the oldest database, but that is *NOT* the oldest table itself,
> does the problem "resolve" itself?

I'll also add a pg_class snapshot for next time we run the bench.  I'm not sure
if we'll be able to catch the 1h45 interval when the system stays alive
after the issue though.
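
The snapshot will probably be something like this, gathered at the same 15 min
interval as the rest:

  SELECT now(), oid::regclass, relfrozenxid, age(relfrozenxid)
  FROM pg_class WHERE relfrozenxid <> 0
  ORDER BY age(relfrozenxid) DESC LIMIT 20;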



Re: PG12 autovac issues

Andres Freund
Hi,

On 2020-03-23 20:47:25 +0100, Julien Rouhaud wrote:

> > - relfrozenxid, age(relfrozenxid) for the oldest table in the oldest database
> >   SELECT oid::regclass, age(relfrozenxid), relfrozenxid FROM pg_class WHERE relfrozenxid <> 0 ORDER BY age(relfrozenxid) DESC LIMIT 1;
>
> The vm has been trashed since, and I don't have that level of detail available
> in the gathered stats unfortunately (and the available information I have is a
> little bit messy, sorry for that).
>
> > - Oldest backend xmin
> >   SELECT pid, backend_xmin, age(backend_xmin) FROM pg_stat_activity WHERE backend_xmin <> 0 ORDER BY age(backend_xmin) DESC LIMIT 3;
> > - oldest replication xmin:
> >   SELECT pid, backend_xmin, age(backend_xmin) FROM pg_stat_replication WHERE backend_xmin <> 0 ORDER BY age(backend_xmin) DESC LIMIT 3;
> > - oldest slot xmin:
> >   SELECT slot_name, xmin, age(xmin), catalog_xmin, age(catalog_xmin) FROM pg_replication_slots WHERE xmin <> 0 OR catalog_xmin <> 0 ORDER BY greatest(age(xmin), age(catalog_xmin)) DESC LIMIT 3;
> > - oldest prepared transaction (unfortunately xmin not available)
> >   SELECT gid, database, transaction FROM pg_prepared_xacts ORDER BY age(transaction) LIMIT 3;
>
> I have output from this query: https://github.com/OPMDG/check_pgactivity/blob/oldest_xmin/check_pgactivity#L5669-L5732
> (which may be dumb).  Here are the non-null rows after 8AM GMT+1:

Could you share what the config of the server was?


> > > The same bench was run against pg11 many times and never triggered this issue.
> > > So far our best guess is a side effect of 2aa6e331ead7.
> >
> > In that case you'd likely see DEBUG1 output, right? Did you try running
> > with that?
>
> That's unfortunately not an option, as while the issue is reproducible, it
> happens after roughly 16h with an average of 6.5ktps, and we definitely don't
> have enough disk space for that amount of logs.  We'll try next time with a
> recompiled version with the DEBUG1 promoted to WARNING, and maybe other elog.

You wouldn't need to run it with DEBUG1 the whole time, you could just
change the config and reload config once you hit the issue.
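
E.g. something like:

  ALTER SYSTEM SET log_min_messages = 'debug1';
  SELECT pg_reload_conf();
  -- and once enough has been captured:
  ALTER SYSTEM RESET log_min_messages;
  SELECT pg_reload_conf();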


> > > Michael and I have been trying to reproduce this issue locally (drastically
> > > reducing the various freeze_age parameters) for hours, but no luck for now.
> >
> > Have you considered using gdb to investigate?
>
> Not for now, because it's hard to be around for the ~ 90min interval the system
> stays alive after autovacuum gets stuck, especially since it's a weekend bench.
>
> > > This is using a vanilla pg 12.1, with some OLTP workload.  The only possibly
> > > relevant configuration changes are quite aggressive autovacuum settings on some
> > > tables (no delay, analyze/vacuum threshold to 1k and analyze/vacuum scale
> > > factor to 0, for both heap and toast).
> >
> > That, uh, is not an insignificant set of changes for an autovac
> > scheduling issue.
> >
> > It sounds like what might be happening is that you have something
> > holding back the "oldest needed transaction" horizon. Before
> > 2aa6e331ead7, if the xmin horizon hasn't increased, there'd be repeated
> > vacuums unable to increase the horizon, but afterwards they'll all
> > immediately exit without any messages.
>
> The xmin horizon is increasing AFAICS.  Grepping the autovacuum logs, I can see
> for the 08:00 AM UTC - 08:59 AM UTC interval the oldest xmin going from
> 200069684 to 210363848.  E.g.:

Which database is this on?


> [...]
> 266603:2020-03-14 08:49:59.927 UTC LOG:  automatic aggressive vacuum to prevent wraparound of table "pg_toast.pg_toast_27683": index scans: 0
> 266604- pages: 0 removed, 0 remain, 0 skipped due to pins, 0 skipped frozen
> 266605- tuples: 0 removed, 0 remain, 0 are dead but not yet removable, oldest xmin: 209635921
> 266606- buffer usage: 25 hits, 1 misses, 1 dirtied
> 266607- avg read rate: 15.440 MB/s, avg write rate: 15.440 MB/s
> 266608- system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s
> 266609:2020-03-14 08:49:59.929 UTC LOG:  automatic aggressive vacuum to prevent wraparound of table "pg_toast.pg_toast_27694": index scans: 0
> 266610- pages: 0 removed, 0 remain, 0 skipped due to pins, 0 skipped frozen
> 266611- tuples: 0 removed, 0 remain, 0 are dead but not yet removable, oldest xmin: 209635934
> 266612- buffer usage: 25 hits, 1 misses, 1 dirtied
> 266613- avg read rate: 21.058 MB/s, avg write rate: 21.058 MB/s
> 266614- system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s
> 266615:2020-03-14 08:49:59.931 UTC LOG:  automatic aggressive vacuum to prevent wraparound of table "pg_toast.pg_toast_24845": index scans: 0
> 266616- pages: 0 removed, 0 remain, 0 skipped due to pins, 0 skipped frozen
> 266617- tuples: 0 removed, 0 remain, 0 are dead but not yet removable, oldest xmin: 209635934
> 266618- buffer usage: 25 hits, 1 misses, 2 dirtied
> 266619- avg read rate: 8.005 MB/s, avg write rate: 16.009 MB/s
> 266620- system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s
> 266621:2020-03-14 08:49:59.945 UTC LOG:  automatic aggressive vacuum to prevent wraparound of table "pg_toast.pg_toast_27712": index scans: 0
> 266622- pages: 0 removed, 0 remain, 0 skipped due to pins, 0 skipped frozen
> 266623- tuples: 0 removed, 0 remain, 0 are dead but not yet removable, oldest xmin: 209635953
> [...]

Do you have any non-toast ones?


> > I wonder if what might be happening is that we somehow missed/failed
> > to update relfrozenxid and/or datfrozenxid. If you manually vacuum some
> > table in the oldest database, but that is *NOT* the oldest table itself,
> > does the problem "resolve" itself?
>
> I'll also add pg_class snapshot for next time we run the bench.  I'm not sure
> if we'll be able to catch the 1h45 interval when the system stays alive
> after the issue though.

Could you just script something to stop the benchmark once the disk is
90% full or so?

Did you see any errors / fatals around the time autovacuum stopped
working?

Greetings,

Andres Freund



Re: PG12 autovac issues

Justin King
On Mon, Mar 23, 2020 at 3:00 PM Andres Freund <[hidden email]> wrote:

>
> Hi,
>
> On 2020-03-23 20:47:25 +0100, Julien Rouhaud wrote:
> > > - relfrozenxid, age(relfrozenxid) for the oldest table in the oldest database
> > >   SELECT oid::regclass, age(relfrozenxid), relfrozenxid FROM pg_class WHERE relfrozenxid <> 0 ORDER BY age(relfrozenxid) DESC LIMIT 1;
> >
> > The vm has been trashed since, and I don't have that level of detail available
> > in the gathered stats unfortunately (and the available information I have is a
> > little bit messy, sorry for that).
> >
> > > - Oldest backend xmin
> > >   SELECT pid, backend_xmin, age(backend_xmin) FROM pg_stat_activity WHERE backend_xmin <> 0 ORDER BY age(backend_xmin) DESC LIMIT 3;
> > > - oldest replication xmin:
> > >   SELECT pid, backend_xmin, age(backend_xmin) FROM pg_stat_replication WHERE backend_xmin <> 0 ORDER BY age(backend_xmin) DESC LIMIT 3;
> > > - oldest slot xmin:
> > >   SELECT slot_name, xmin, age(xmin), catalog_xmin, age(catalog_xmin) FROM pg_replication_slots WHERE xmin <> 0 OR catalog_xmin <> 0 ORDER BY greatest(age(xmin), age(catalog_xmin)) DESC LIMIT 3;
> > > - oldest prepared transaction (unfortunately xmin not available)
> > >   SELECT gid, database, transaction FROM pg_prepared_xacts ORDER BY age(transaction) LIMIT 3;
> >
> > I have output from this query: https://github.com/OPMDG/check_pgactivity/blob/oldest_xmin/check_pgactivity#L5669-L5732
> > (which may be dumb).  Here are the non-null rows after 8AM GMT+1:
>
> Could you share what the config of the server was?
>
>
> > > > The same bench was run against pg11 many times and never triggered this issue.
> > > > So far our best guess is a side effect of 2aa6e331ead7.
> > >
> > > In that case you'd likely see DEBUG1 output, right? Did you try running
> > > with that?
> >
> > That's unfortunately not an option, as while the issue is reproducible, it
> > happens after roughly 16h with an average of 6.5ktps, and we definitely don't
> > have enough disk space for that amount of logs.  We'll try next time with a
> > recompiled version with the DEBUG1 promoted to WARNING, and maybe other elog.
>
> You wouldn't need to run it with DEBUG1 the whole time, you could just
> change the config and reload config once you hit the issue.
>
>
> > > > Michael and I have been trying to reproduce this issue locally (drastically
> > > > reducing the various freeze_age parameters) for hours, but no luck for now.
> > >
> > > Have you considered using gdb to investigate?
> >
> > Not for now, because it's hard to be around for the ~ 90min interval the system
> > stays alive after autovacuum gets stuck, especially since it's a weekend bench.
> >
> > > > This is using a vanilla pg 12.1, with some OLTP workload.  The only possibly
> > > > relevant configuration changes are quite aggressive autovacuum settings on some
> > > > tables (no delay, analyze/vacuum threshold to 1k and analyze/vacuum scale
> > > > factor to 0, for both heap and toast).
> > >
> > > That, uh, is not an insignificant set of changes for an autovac
> > > scheduling issue.
> > >
> > > It sounds like what might be happening is that you have something
> > > holding back the "oldest needed transaction" horizon. Before
> > > 2aa6e331ead7, if the xmin horizon hasn't increased, there'd be repeated
> > > vacuums unable to increase the horizon, but afterwards they'll all
> > > immediately exit without any messages.
> >
> > The xmin horizon is increasing AFAICS.  Grepping the autovacuum logs, I can see
> > for the 08:00 AM UTC - 08:59 AM UTC interval the oldest xmin going from
> > 200069684 to 210363848.  E.g.:
>
> Which database is this on?
>
>
> > [...]
> > 266603:2020-03-14 08:49:59.927 UTC LOG:  automatic aggressive vacuum to prevent wraparound of table "pg_toast.pg_toast_27683": index scans: 0
> > 266604-       pages: 0 removed, 0 remain, 0 skipped due to pins, 0 skipped frozen
> > 266605-       tuples: 0 removed, 0 remain, 0 are dead but not yet removable, oldest xmin: 209635921
> > 266606-       buffer usage: 25 hits, 1 misses, 1 dirtied
> > 266607-       avg read rate: 15.440 MB/s, avg write rate: 15.440 MB/s
> > 266608-       system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s
> > 266609:2020-03-14 08:49:59.929 UTC LOG:  automatic aggressive vacuum to prevent wraparound of table "pg_toast.pg_toast_27694": index scans: 0
> > 266610-       pages: 0 removed, 0 remain, 0 skipped due to pins, 0 skipped frozen
> > 266611-       tuples: 0 removed, 0 remain, 0 are dead but not yet removable, oldest xmin: 209635934
> > 266612-       buffer usage: 25 hits, 1 misses, 1 dirtied
> > 266613-       avg read rate: 21.058 MB/s, avg write rate: 21.058 MB/s
> > 266614-       system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s
> > 266615:2020-03-14 08:49:59.931 UTC LOG:  automatic aggressive vacuum to prevent wraparound of table "pg_toast.pg_toast_24845": index scans: 0
> > 266616-       pages: 0 removed, 0 remain, 0 skipped due to pins, 0 skipped frozen
> > 266617-       tuples: 0 removed, 0 remain, 0 are dead but not yet removable, oldest xmin: 209635934
> > 266618-       buffer usage: 25 hits, 1 misses, 2 dirtied
> > 266619-       avg read rate: 8.005 MB/s, avg write rate: 16.009 MB/s
> > 266620-       system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s
> > 266621:2020-03-14 08:49:59.945 UTC LOG:  automatic aggressive vacuum to prevent wraparound of table "pg_toast.pg_toast_27712": index scans: 0
> > 266622-       pages: 0 removed, 0 remain, 0 skipped due to pins, 0 skipped frozen
> > 266623-       tuples: 0 removed, 0 remain, 0 are dead but not yet removable, oldest xmin: 209635953
> > [...]
>
> Do you have any non-toast ones?

This is occurring in our environment right now (started about 30 min
ago).  Here are the latest logs (grepped for vacuum):

Mar 23 20:54:16 cowtn postgres[15569]: [12-1] 2020-03-23 20:54:16.542
GMT [15569] LOG:  automatic vacuum of table "feedi.production.tita":
index scans: 1
Mar 23 20:54:27 cowtn postgres[15654]: [8-1] 2020-03-23 20:54:27.964
GMT [15654] LOG:  automatic vacuum of table
"feedi.production.distributed_virtual_schedule": index scans: 1
Mar 23 20:54:29 cowtn postgres[15654]: [10-1] 2020-03-23 20:54:29.103
GMT [15654] LOG:  automatic vacuum of table "feedi.production.tita":
index scans: 1
Mar 23 20:54:29 cowtn postgres[15654]: [12-1] 2020-03-23 20:54:29.284
GMT [15654] LOG:  automatic vacuum of table
"feedi.pg_toast.pg_toast_2619": index scans: 1
Mar 23 20:54:44 cowtn postgres[15759]: [8-1] 2020-03-23 20:54:44.005
GMT [15759] LOG:  automatic vacuum of table
"feedi.production.flightplans": index scans: 1
Mar 23 20:54:45 cowtn postgres[15759]: [11-1] 2020-03-23 20:54:45.979
GMT [15759] LOG:  automatic vacuum of table "feedi.production.tita":
index scans: 1
Mar 23 20:54:54 cowtn postgres[15814]: [7-1] 2020-03-23 20:54:54.367
GMT [15814] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.pg_catalog.pg_statistic": index scans: 0
Mar 23 20:54:54 cowtn postgres[15814]: [8-1] 2020-03-23 20:54:54.378
GMT [15814] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.pg_catalog.pg_type": index scans: 0
Mar 23 20:54:54 cowtn postgres[15814]: [9-1] 2020-03-23 20:54:54.389
GMT [15814] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.pg_catalog.pg_foreign_server": index scans: 0
Mar 23 20:54:54 cowtn postgres[15814]: [10-1] 2020-03-23 20:54:54.410
GMT [15814] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.pg_catalog.pg_statistic_ext_data": index scans: 0
Mar 23 20:54:54 cowtn postgres[15814]: [11-1] 2020-03-23 20:54:54.420
GMT [15814] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.pg_catalog.pg_user_mapping": index scans: 0
Mar 23 20:55:09 cowtn postgres[15890]: [7-1] 2020-03-23 20:55:09.194
GMT [15890] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.pg_catalog.pg_attribute": index scans: 0
Mar 23 20:55:09 cowtn postgres[15890]: [8-1] 2020-03-23 20:55:09.205
GMT [15890] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.pg_catalog.pg_proc": index scans: 0
Mar 23 20:55:09 cowtn postgres[15890]: [9-1] 2020-03-23 20:55:09.216
GMT [15890] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.pg_catalog.pg_class": index scans: 0
Mar 23 20:55:09 cowtn postgres[15890]: [10-1] 2020-03-23 20:55:09.227
GMT [15890] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.pg_catalog.pg_attrdef": index scans: 0
Mar 23 20:55:09 cowtn postgres[15890]: [11-1] 2020-03-23 20:55:09.239
GMT [15890] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.pg_catalog.pg_constraint": index scans: 0
Mar 23 20:55:09 cowtn postgres[15890]: [12-1] 2020-03-23 20:55:09.250
GMT [15890] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.pg_catalog.pg_inherits": index scans: 0
Mar 23 20:55:09 cowtn postgres[15890]: [13-1] 2020-03-23 20:55:09.260
GMT [15890] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.pg_catalog.pg_index": index scans: 0
Mar 23 20:55:09 cowtn postgres[15890]: [14-1] 2020-03-23 20:55:09.272
GMT [15890] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.pg_catalog.pg_operator": index scans: 0
Mar 23 20:55:09 cowtn postgres[15890]: [15-1] 2020-03-23 20:55:09.283
GMT [15890] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.pg_catalog.pg_opfamily": index scans: 0
Mar 23 20:55:09 cowtn postgres[15890]: [16-1] 2020-03-23 20:55:09.294
GMT [15890] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.pg_catalog.pg_opclass": index scans: 0
Mar 23 20:55:09 cowtn postgres[15890]: [17-1] 2020-03-23 20:55:09.305
GMT [15890] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.pg_catalog.pg_am": index scans: 0
Mar 23 20:55:09 cowtn postgres[15890]: [18-1] 2020-03-23 20:55:09.317
GMT [15890] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.pg_catalog.pg_amop": index scans: 0
Mar 23 20:55:09 cowtn postgres[15890]: [19-1] 2020-03-23 20:55:09.328
GMT [15890] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.pg_catalog.pg_amproc": index scans: 0
Mar 23 20:55:09 cowtn postgres[15890]: [20-1] 2020-03-23 20:55:09.339
GMT [15890] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.pg_catalog.pg_language": index scans: 0
Mar 23 20:55:09 cowtn postgres[15890]: [21-1] 2020-03-23 20:55:09.349
GMT [15890] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.pg_catalog.pg_largeobject_metadata": index scans: 0
Mar 23 20:55:09 cowtn postgres[15890]: [22-1] 2020-03-23 20:55:09.360
GMT [15890] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.pg_catalog.pg_aggregate": index scans: 0
Mar 23 20:55:09 cowtn postgres[15890]: [23-1] 2020-03-23 20:55:09.371
GMT [15890] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.pg_catalog.pg_largeobject": index scans: 0
Mar 23 20:55:09 cowtn postgres[15890]: [24-1] 2020-03-23 20:55:09.382
GMT [15890] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.pg_catalog.pg_statistic_ext": index scans: 0
Mar 23 20:55:09 cowtn postgres[15890]: [25-1] 2020-03-23 20:55:09.393
GMT [15890] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.pg_catalog.pg_rewrite": index scans: 0
Mar 23 20:55:09 cowtn postgres[15890]: [26-1] 2020-03-23 20:55:09.404
GMT [15890] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.pg_catalog.pg_trigger": index scans: 0
Mar 23 20:55:09 cowtn postgres[15890]: [27-1] 2020-03-23 20:55:09.415
GMT [15890] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.pg_catalog.pg_event_trigger": index scans: 0
Mar 23 20:55:09 cowtn postgres[15890]: [28-1] 2020-03-23 20:55:09.426
GMT [15890] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.pg_catalog.pg_description": index scans: 0
Mar 23 20:55:09 cowtn postgres[15890]: [29-1] 2020-03-23 20:55:09.437
GMT [15890] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.pg_catalog.pg_cast": index scans: 0
Mar 23 20:55:09 cowtn postgres[15890]: [30-1] 2020-03-23 20:55:09.448
GMT [15890] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.pg_catalog.pg_enum": index scans: 0
Mar 23 20:55:09 cowtn postgres[15890]: [31-1] 2020-03-23 20:55:09.459
GMT [15890] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.pg_catalog.pg_namespace": index scans: 0
Mar 23 20:55:09 cowtn postgres[15890]: [32-1] 2020-03-23 20:55:09.470
GMT [15890] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.pg_catalog.pg_conversion": index scans: 0
Mar 23 20:55:09 cowtn postgres[15890]: [33-1] 2020-03-23 20:55:09.481
GMT [15890] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.pg_catalog.pg_depend": index scans: 0
Mar 23 20:55:09 cowtn postgres[15890]: [34-1] 2020-03-23 20:55:09.502
GMT [15890] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.pg_catalog.pg_db_role_setting": index scans: 0
Mar 23 20:55:09 cowtn postgres[15890]: [35-1] 2020-03-23 20:55:09.524
GMT [15890] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.pg_catalog.pg_pltemplate": index scans: 0
Mar 23 20:55:09 cowtn postgres[15890]: [36-1] 2020-03-23 20:55:09.545
GMT [15890] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.pg_catalog.pg_shdepend": index scans: 1
Mar 23 20:55:09 cowtn postgres[15890]: [37-1] 2020-03-23 20:55:09.556
GMT [15890] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.pg_catalog.pg_shdescription": index scans: 0
Mar 23 20:55:09 cowtn postgres[15890]: [38-1] 2020-03-23 20:55:09.567
GMT [15890] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.pg_catalog.pg_ts_config": index scans: 0
Mar 23 20:55:09 cowtn postgres[15890]: [39-1] 2020-03-23 20:55:09.578
GMT [15890] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.pg_catalog.pg_ts_config_map": index scans: 0
Mar 23 20:55:09 cowtn postgres[15890]: [40-1] 2020-03-23 20:55:09.589
GMT [15890] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.pg_catalog.pg_ts_dict": index scans: 0
Mar 23 20:55:09 cowtn postgres[15890]: [41-1] 2020-03-23 20:55:09.600
GMT [15890] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.pg_catalog.pg_ts_parser": index scans: 0
Mar 23 20:55:09 cowtn postgres[15890]: [42-1] 2020-03-23 20:55:09.611
GMT [15890] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.pg_catalog.pg_ts_template": index scans: 0
Mar 23 20:55:09 cowtn postgres[15890]: [43-1] 2020-03-23 20:55:09.622
GMT [15890] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.pg_catalog.pg_extension": index scans: 0
Mar 23 20:55:09 cowtn postgres[15890]: [44-1] 2020-03-23 20:55:09.633
GMT [15890] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.pg_catalog.pg_foreign_data_wrapper": index scans: 0
Mar 23 20:55:09 cowtn postgres[15890]: [45-1] 2020-03-23 20:55:09.643
GMT [15890] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.pg_catalog.pg_foreign_table": index scans: 0
Mar 23 20:55:09 cowtn postgres[15890]: [46-1] 2020-03-23 20:55:09.654
GMT [15890] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.pg_catalog.pg_policy": index scans: 0
Mar 23 20:55:09 cowtn postgres[15890]: [47-1] 2020-03-23 20:55:09.675
GMT [15890] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.pg_catalog.pg_default_acl": index scans: 0
Mar 23 20:55:09 cowtn postgres[15890]: [48-1] 2020-03-23 20:55:09.686
GMT [15890] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.pg_catalog.pg_init_privs": index scans: 0
Mar 23 20:55:09 cowtn postgres[15890]: [49-1] 2020-03-23 20:55:09.696
GMT [15890] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.pg_catalog.pg_seclabel": index scans: 0
Mar 23 20:55:09 cowtn postgres[15890]: [50-1] 2020-03-23 20:55:09.719
GMT [15890] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.pg_catalog.pg_collation": index scans: 0
Mar 23 20:55:09 cowtn postgres[15890]: [51-1] 2020-03-23 20:55:09.729
GMT [15890] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.pg_catalog.pg_partitioned_table": index scans: 0
Mar 23 20:55:09 cowtn postgres[15890]: [52-1] 2020-03-23 20:55:09.740
GMT [15890] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.pg_catalog.pg_range": index scans: 0
Mar 23 20:55:09 cowtn postgres[15890]: [53-1] 2020-03-23 20:55:09.751
GMT [15890] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.pg_catalog.pg_transform": index scans: 0
Mar 23 20:55:09 cowtn postgres[15890]: [54-1] 2020-03-23 20:55:09.762
GMT [15890] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.pg_catalog.pg_sequence": index scans: 0
Mar 23 20:55:09 cowtn postgres[15890]: [55-1] 2020-03-23 20:55:09.772
GMT [15890] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.pg_catalog.pg_publication": index scans: 0
Mar 23 20:55:09 cowtn postgres[15890]: [56-1] 2020-03-23 20:55:09.783
GMT [15890] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.pg_catalog.pg_publication_rel": index scans: 0
Mar 23 20:55:09 cowtn postgres[15890]: [57-1] 2020-03-23 20:55:09.794
GMT [15890] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.pg_catalog.pg_subscription_rel": index scans: 0
Mar 23 20:55:09 cowtn postgres[15890]: [58-1] 2020-03-23 20:55:09.805
GMT [15890] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.information_schema.sql_packages": index scans: 0
Mar 23 20:55:09 cowtn postgres[15890]: [59-1] 2020-03-23 20:55:09.816
GMT [15890] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.information_schema.sql_features": index scans: 0
Mar 23 20:55:09 cowtn postgres[15890]: [60-1] 2020-03-23 20:55:09.827
GMT [15890] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.information_schema.sql_implementation_info": index
scans: 0
Mar 23 20:55:09 cowtn postgres[15890]: [61-1] 2020-03-23 20:55:09.838
GMT [15890] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.information_schema.sql_parts": index scans: 0
Mar 23 20:55:09 cowtn postgres[15890]: [62-1] 2020-03-23 20:55:09.849
GMT [15890] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.information_schema.sql_languages": index scans: 0
Mar 23 20:55:09 cowtn postgres[15890]: [63-1] 2020-03-23 20:55:09.859
GMT [15890] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.information_schema.sql_sizing": index scans: 0
Mar 23 20:55:09 cowtn postgres[15890]: [64-1] 2020-03-23 20:55:09.870
GMT [15890] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.information_schema.sql_sizing_profiles": index scans:
0
Mar 23 20:55:09 cowtn postgres[15890]: [65-1] 2020-03-23 20:55:09.880
GMT [15890] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.pg_toast.pg_toast_2600": index scans: 0
Mar 23 20:55:09 cowtn postgres[15890]: [66-1] 2020-03-23 20:55:09.891
GMT [15890] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.pg_toast.pg_toast_2604": index scans: 0
Mar 23 20:55:09 cowtn postgres[15890]: [67-1] 2020-03-23 20:55:09.902
GMT [15890] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.pg_toast.pg_toast_3456": index scans: 0
Mar 23 20:55:09 cowtn postgres[15890]: [68-1] 2020-03-23 20:55:09.912
GMT [15890] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.pg_toast.pg_toast_2606": index scans: 0
Mar 23 20:55:09 cowtn postgres[15890]: [69-1] 2020-03-23 20:55:09.923
GMT [15890] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.pg_toast.pg_toast_826": index scans: 0
Mar 23 20:55:09 cowtn postgres[15890]: [70-1] 2020-03-23 20:55:09.933
GMT [15890] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.pg_toast.pg_toast_2609": index scans: 0
Mar 23 20:55:09 cowtn postgres[15890]: [71-1] 2020-03-23 20:55:09.944
GMT [15890] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.pg_toast.pg_toast_3466": index scans: 0
Mar 23 20:55:09 cowtn postgres[15890]: [72-1] 2020-03-23 20:55:09.955
GMT [15890] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.pg_toast.pg_toast_3079": index scans: 0
Mar 23 20:55:09 cowtn postgres[15890]: [73-1] 2020-03-23 20:55:09.965
GMT [15890] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.pg_toast.pg_toast_2328": index scans: 0
Mar 23 20:55:09 cowtn postgres[15890]: [74-1] 2020-03-23 20:55:09.976
GMT [15890] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.pg_toast.pg_toast_1417": index scans: 0
Mar 23 20:55:09 cowtn postgres[15890]: [75-1] 2020-03-23 20:55:09.987
GMT [15890] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.pg_toast.pg_toast_3118": index scans: 0
Mar 23 20:55:09 cowtn postgres[15890]: [76-1] 2020-03-23 20:55:09.997
GMT [15890] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.pg_toast.pg_toast_3394": index scans: 0
Mar 23 20:55:10 cowtn postgres[15890]: [77-1] 2020-03-23 20:55:10.008
GMT [15890] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.pg_toast.pg_toast_2612": index scans: 0
Mar 23 20:55:10 cowtn postgres[15890]: [78-1] 2020-03-23 20:55:10.018
GMT [15890] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.pg_toast.pg_toast_2615": index scans: 0
Mar 23 20:55:10 cowtn postgres[15890]: [79-1] 2020-03-23 20:55:10.029
GMT [15890] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.pg_toast.pg_toast_3350": index scans: 0
Mar 23 20:55:10 cowtn postgres[15890]: [80-1] 2020-03-23 20:55:10.040
GMT [15890] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.pg_toast.pg_toast_3256": index scans: 0
Mar 23 20:55:10 cowtn postgres[15890]: [81-1] 2020-03-23 20:55:10.050
GMT [15890] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.pg_toast.pg_toast_1255": index scans: 0
Mar 23 20:55:10 cowtn postgres[15890]: [82-1] 2020-03-23 20:55:10.062
GMT [15890] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.pg_toast.pg_toast_2618": index scans: 0
Mar 23 20:55:10 cowtn postgres[15890]: [83-1] 2020-03-23 20:55:10.072
GMT [15890] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.pg_toast.pg_toast_3596": index scans: 0
Mar 23 20:55:10 cowtn postgres[15890]: [84-1] 2020-03-23 20:55:10.083
GMT [15890] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.pg_toast.pg_toast_2619": index scans: 0
Mar 23 20:55:10 cowtn postgres[15890]: [85-1] 2020-03-23 20:55:10.094
GMT [15890] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.pg_toast.pg_toast_3381": index scans: 0
Mar 23 20:55:10 cowtn postgres[15890]: [86-1] 2020-03-23 20:55:10.105
GMT [15890] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.pg_toast.pg_toast_3429": index scans: 0
Mar 23 20:55:10 cowtn postgres[15890]: [87-1] 2020-03-23 20:55:10.115
GMT [15890] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.pg_toast.pg_toast_2620": index scans: 0
Mar 23 20:55:10 cowtn postgres[15890]: [88-1] 2020-03-23 20:55:10.126
GMT [15890] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.pg_toast.pg_toast_3600": index scans: 0
Mar 23 20:55:10 cowtn postgres[15890]: [89-1] 2020-03-23 20:55:10.137
GMT [15890] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.pg_toast.pg_toast_1247": index scans: 0
Mar 23 20:55:10 cowtn postgres[15890]: [90-1] 2020-03-23 20:55:10.147
GMT [15890] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.pg_toast.pg_toast_1418": index scans: 0
Mar 23 20:55:10 cowtn postgres[15890]: [91-1] 2020-03-23 20:55:10.158
GMT [15890] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.pg_toast.pg_toast_1260": index scans: 0
Mar 23 20:55:10 cowtn postgres[15890]: [92-1] 2020-03-23 20:55:10.169
GMT [15890] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.pg_toast.pg_toast_1262": index scans: 0
Mar 23 20:55:10 cowtn postgres[15890]: [93-1] 2020-03-23 20:55:10.179
GMT [15890] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.pg_toast.pg_toast_2964": index scans: 0
Mar 23 20:55:10 cowtn postgres[15890]: [94-1] 2020-03-23 20:55:10.190
GMT [15890] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.pg_toast.pg_toast_1136": index scans: 0
Mar 23 20:55:10 cowtn postgres[15890]: [95-1] 2020-03-23 20:55:10.201
GMT [15890] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.pg_toast.pg_toast_6000": index scans: 0
Mar 23 20:55:10 cowtn postgres[15890]: [96-1] 2020-03-23 20:55:10.211
GMT [15890] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.pg_toast.pg_toast_2396": index scans: 0
Mar 23 20:55:10 cowtn postgres[15890]: [97-1] 2020-03-23 20:55:10.222
GMT [15890] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.pg_toast.pg_toast_3592": index scans: 0
Mar 23 20:55:10 cowtn postgres[15890]: [98-1] 2020-03-23 20:55:10.232
GMT [15890] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.pg_toast.pg_toast_6100": index scans: 0
Mar 23 20:55:10 cowtn postgres[15890]: [99-1] 2020-03-23 20:55:10.243
GMT [15890] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.pg_toast.pg_toast_1213": index scans: 0
Mar 23 20:55:10 cowtn postgres[15890]: [100-1] 2020-03-23 20:55:10.254
GMT [15890] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.pg_toast.pg_toast_13267": index scans: 0
Mar 23 20:55:10 cowtn postgres[15890]: [101-1] 2020-03-23 20:55:10.264
GMT [15890] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.pg_toast.pg_toast_13272": index scans: 0
Mar 23 20:55:10 cowtn postgres[15890]: [102-1] 2020-03-23 20:55:10.275
GMT [15890] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.pg_toast.pg_toast_13287": index scans: 0
Mar 23 20:55:10 cowtn postgres[15890]: [103-1] 2020-03-23 20:55:10.286
GMT [15890] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.pg_toast.pg_toast_13277": index scans: 0
Mar 23 20:55:10 cowtn postgres[15890]: [104-1] 2020-03-23 20:55:10.296
GMT [15890] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.pg_toast.pg_toast_13282": index scans: 0
Mar 23 20:55:10 cowtn postgres[15890]: [105-1] 2020-03-23 20:55:10.307
GMT [15890] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.pg_toast.pg_toast_13292": index scans: 0
Mar 23 20:55:10 cowtn postgres[15890]: [106-1] 2020-03-23 20:55:10.317
GMT [15890] LOG:  automatic aggressive vacuum to prevent wraparound of
table "postgres.pg_toast.pg_toast_13297": index scans: 0

>
>
> > > I wonder if what might be happening is that we're somehow missed/failed
> > > to update relfrozenxid and/or datfrozenxid. If you manually vacuum some
> > > table in the oldest database, but that is *NOT* the oldest table itself,
> > > does the problem "resolve" itself?

postgres=# SELECT datname
    , age(datfrozenxid)
    , current_setting('autovacuum_freeze_max_age')
FROM pg_database;
  datname  |    age    | current_setting
-----------+-----------+-----------------
 postgres  | 202375735 | 200000000
 template1 | 202345459 | 200000000
 template0 | 132459914 | 200000000
 feedi     | 132459914 | 200000000
(4 rows)


> >
> > I'll also add pg_class snapshot for next time we run the bench.  I'm not sure
> > if we'll be able to catch the 1h45 interval when the system stays alive
> > after the issue though.
>
> Could you just script something to stop the benchmark once the disk is
> 90% full or so?
>
> Did you see any errors / fatals around the time autovacuum stopped
> working?

There are no ERROR or FATAL messages around that time (or at all).

Since this is occurring right now, what else would be useful to
capture?  You'd asked about gdb -- do you want a backtrace of the main
process or of the autovac worker?

>
> Greetings,
>
> Andres Freund



Re: PG12 autovac issues

Andres Freund
Hi,

On 2020-03-23 16:31:21 -0500, Justin King wrote:
> This is occurring in our environment right now (started about 30 min
> ago).  Here 's the latest logs (grepped by vacuum):
>
> Mar 23 20:54:16 cowtn postgres[15569]: [12-1] 2020-03-23 20:54:16.542
> GMT [15569] LOG:  automatic vacuum of table "feedi.production.tita":
> index scans: 1
> Mar 23 20:54:27 cowtn postgres[15654]: [8-1] 2020-03-23 20:54:27.964
> GMT [15654] LOG:  automatic vacuum of table
> "feedi.production.distributed_virtual_schedule": index scans: 1

Hm, unfortunately you've cut off the details in the subsequent
lines. There are a few newlines in the output. Any chance you could
re-post with those included?


> > > > I wonder if what might be happening is that we're somehow missed/failed
> > > > to update relfrozenxid and/or datfrozenxid. If you manually vacuum some
> > > > table in the oldest database, but that is *NOT* the oldest table itself,
> > > > does the problem "resolve" itself?
>
> postgres=# SELECT datname
>     , age(datfrozenxid)
>     , current_setting('autovacuum_freeze_max_age')
> FROM pg_database;
>   datname  |    age    | current_setting
> -----------+-----------+-----------------
>  postgres  | 202375735 | 200000000
>  template1 | 202345459 | 200000000
>  template0 | 132459914 | 200000000
>  feedi     | 132459914 | 200000000
> (4 rows)

Can you show the oldest tables in 'feedi'? Or, hm, actually, could you
just post the result of all the queries from the "What is:" section in
https://postgr.es/m/20200323162303.s7ay5hjdvimtkz6u%40alap3.anarazel.de


> Since this is occurring right now, what else would be useful to
> capture?  You'd asked about a GDB -- do you want that of the main
> process or the autovac worker?

Unless you can give me gdb access directly, I don't yet have enough data
to suggest what exactly we would want to analyze with gdb in your case.


It'd be helpful if you could change log_min_messages to DEBUG1 and
reload the configuration (not restart!).
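
For reference, a minimal sketch of doing that from psql (assuming
superuser access), rather than editing postgresql.conf by hand:

    ALTER SYSTEM SET log_min_messages = 'debug1';
    SELECT pg_reload_conf();   -- applies the change without a restart
    -- once enough output has been captured, revert with:
    ALTER SYSTEM RESET log_min_messages;
    SELECT pg_reload_conf();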

Greetings,

Andres Freund



Re: PG12 autovac issues

Justin King
In reply to this post by Justin King
On Mon, Mar 23, 2020 at 4:31 PM Justin King <[hidden email]> wrote:

>
> On Mon, Mar 23, 2020 at 3:00 PM Andres Freund <[hidden email]> wrote:
> >
> > Hi,
> >
> > On 2020-03-23 20:47:25 +0100, Julien Rouhaud wrote:
> > > > - relfrozenxid, age(relfrozenxid) for the oldest table in the oldest database
> > > >   SELECT oid::regclass, age(relfrozenxid), relfrozenxid FROM pg_class WHERE relfrozenxid <> 0 ORDER BY age(relfrozenxid) DESC LIMIT 1;
> > >
> > > The vm has been trashed since, and I don't have that level of detail available
> > > in the gathered stats unfortunately (and the available information I have is a
> > > little bit messy, sorry for that).
> > >
> > > > - Oldest backend xmin
> > > >   SELECT pid, backend_xmin, age(backend_xmin) FROM pg_stat_activity WHERE backend_xmin <> 0 ORDER BY age(backend_xmin) DESC LIMIT 3;
> > > > - oldest replication xmin:
> > > >   SELECT pid, backend_xmin, age(backend_xmin) FROM pg_stat_replication WHERE backend_xmin <> 0 ORDER BY age(backend_xmin) DESC LIMIT 3;
> > > > - oldest slot xmin:
> > > >   SELECT slot_name, xmin, age(xmin), catalog_xmin, age(catalog_xmin) FROM pg_replication_slots WHERE xmin <> 0 OR catalog_xmin <> 0 ORDER BY greatest(age(xmin), age(catalog_xmin)) DESC LIMIT 3;
> > > > - oldest prepared transaction (unfortunately xmin not available)
> > > >   SELECT gid, database, transaction FROM pg_prepared_xacts ORDER BY age(transaction) LIMIT 3;
> > >
> > > I have output from this query: https://github.com/OPMDG/check_pgactivity/blob/oldest_xmin/check_pgactivity#L5669-L5732
> > > (which may be dumb).  Here are the non-null rows after 8AM GMT+1:
> >
> > Could you share what the config of the server was?
> >
> >
> > > > > The same bench was run against pg11 many times and never triggered this issue.
> > > > > So far our best guess is a side effect of 2aa6e331ead7.
> > > >
> > > > In that case you'd likely see DEBUG1 output, right? Did you try running
> > > > with that?
> > >
> > > That's unfortunately not an option, as while the issue is reproducible, it
> > > happens after roughly 16h with an average of 6.5ktps, and we definitely don't
> > > have enough disk space for that amount of logs.  We'll try next time with a
> > > recompiled version with the DEBUG1 promoted to WARNING, and maybe other elog.
> >
> > You wouldn't need to run it with DEBUG1 the whole time, you could just
> > change the config and reload config once you hit the issue.
> >
> >
> > > > > Michael and I have been trying to reproduce this issue locally (drastically
> > > > > reducing the various freeze_age parameters) for hours, but no luck for now.
> > > >
> > > > Have you considered using gdb to investigate?
> > >
> > > Not for now, because it's hard to be around for the ~ 90min interval the system
> > > stays alive after atuvacuum get stuck, especially since it's a weekend bench.
> > >
> > > > > This is using a vanilla pg 12.1, with some OLTP workload.  The only possibly
> > > > > relevant configuration changes are quite aggressive autovacuum settings on some
> > > > > tables (no delay, analyze/vacuum threshold to 1k and analyze/vacuum scale
> > > > > factor to 0, for both heap and toast).
> > > >
> > > > That, uh, is not an insignificant set of changes for an autovac
> > > > scheduling issues.
> > > >
> > > > It sounds like what might be happening is that you have something
> > > > holding back the "oldest needed transaction" horizon. Before
> > > > 2aa6e331ead7, if the xmin horizon hasn't increased, there'd be repeated
> > > > vacuums unable to increase the horizon, but afterwards they'll all
> > > > immediately exit without any messages.
> > >
> > > The xmin horizon is increasing AFAICS.  Grepping the autovacuum logs, I can see
> > > for the 08:00 AM UTC - 08:59 AM UTC interval the oldest xmin going from
> > > 200069684 to 210363848.  E.g.:
> >
> > Which database is this on?
> >
> >
> > > [...]
> > > 266603:2020-03-14 08:49:59.927 UTC LOG:  automatic aggressive vacuum to prevent wraparound of table "pg_toast.pg_toast_27683": index scans: 0
> > > 266604-       pages: 0 removed, 0 remain, 0 skipped due to pins, 0 skipped frozen
> > > 266605-       tuples: 0 removed, 0 remain, 0 are dead but not yet removable, oldest xmin: 209635921
> > > 266606-       buffer usage: 25 hits, 1 misses, 1 dirtied
> > > 266607-       avg read rate: 15.440 MB/s, avg write rate: 15.440 MB/s
> > > 266608-       system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s
> > > 266609:2020-03-14 08:49:59.929 UTC LOG:  automatic aggressive vacuum to prevent wraparound of table "pg_toast.pg_toast_27694": index scans: 0
> > > 266610-       pages: 0 removed, 0 remain, 0 skipped due to pins, 0 skipped frozen
> > > 266611-       tuples: 0 removed, 0 remain, 0 are dead but not yet removable, oldest xmin: 209635934
> > > 266612-       buffer usage: 25 hits, 1 misses, 1 dirtied
> > > 266613-       avg read rate: 21.058 MB/s, avg write rate: 21.058 MB/s
> > > 266614-       system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s
> > > 266615:2020-03-14 08:49:59.931 UTC LOG:  automatic aggressive vacuum to prevent wraparound of table "pg_toast.pg_toast_24845": index scans: 0
> > > 266616-       pages: 0 removed, 0 remain, 0 skipped due to pins, 0 skipped frozen
> > > 266617-       tuples: 0 removed, 0 remain, 0 are dead but not yet removable, oldest xmin: 209635934
> > > 266618-       buffer usage: 25 hits, 1 misses, 2 dirtied
> > > 266619-       avg read rate: 8.005 MB/s, avg write rate: 16.009 MB/s
> > > 266620-       system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s
> > > 266621:2020-03-14 08:49:59.945 UTC LOG:  automatic aggressive vacuum to prevent wraparound of table "pg_toast.pg_toast_27712": index scans: 0
> > > 266622-       pages: 0 removed, 0 remain, 0 skipped due to pins, 0 skipped frozen
> > > 266623-       tuples: 0 removed, 0 remain, 0 are dead but not yet removable, oldest xmin: 209635953
> > > [...]
> >
> > Do you have any non-toast ones?
>
> This is occurring in our environment right now (started about 30 min
> ago).  Here 's the latest logs (grepped by vacuum):
>
> [...]
>
> >
> >
> > > > I wonder if what might be happening is that we're somehow missed/failed
> > > > to update relfrozenxid and/or datfrozenxid. If you manually vacuum some
> > > > table in the oldest database, but that is *NOT* the oldest table itself,
> > > > does the problem "resolve" itself?
Andres-

I did confirm that if you run a manual vacuum on a table in the oldest
database but NOT the oldest table, the problem does resolve itself.
Here is the before and after:

postgres=# SELECT datname
    , age(datfrozenxid)
    , current_setting('autovacuum_freeze_max_age')
FROM pg_database;
  datname  |    age    | current_setting
-----------+-----------+-----------------
 postgres  | 202375735 | 200000000
 template1 | 202345459 | 200000000
 template0 | 132459914 | 200000000
 feedi     | 132459914 | 200000000
(4 rows)

postgres=#  SELECT c.oid::regclass
    , age(c.relfrozenxid)
    , pg_size_pretty(pg_total_relation_size(c.oid))
    , age(c.relfrozenxid)/200000000::float AS pct
FROM pg_class c
JOIN pg_namespace n on c.relnamespace = n.oid
WHERE relkind IN ('r', 't', 'm')
ORDER BY 2 DESC LIMIT 10;
          oid          |    age    | pg_size_pretty |     pct
-----------------------+-----------+----------------+-------------
 pg_subscription       | 203640681 | 24 kB          | 1.018203405
 pg_authid             | 203640681 | 80 kB          | 1.018203405
 pg_tablespace         | 203620449 | 80 kB          | 1.018102245
 pg_shseclabel         | 203620449 | 16 kB          | 1.018102245
 pg_auth_members       | 203620449 | 72 kB          | 1.018102245
 pg_database           | 203620449 | 80 kB          | 1.018102245
 pg_replication_origin | 203620449 | 24 kB          | 1.018102245
 pg_statistic          |  53646423 | 456 kB         | 0.268232115
 pg_statistic_ext_data |  53646423 | 16 kB          | 0.268232115
 pg_user_mapping       |  53646423 | 24 kB          | 0.268232115
(10 rows)

postgres=# vacuum FREEZE VERBOSE ANALYZE pg_statistic;
INFO:  aggressively vacuuming "pg_catalog.pg_statistic"
INFO:  "pg_statistic": found 0 removable, 18 nonremovable row versions
in 1 out of 38 pages
DETAIL:  0 dead row versions cannot be removed yet, oldest xmin: 39602876
There were 18 unused item identifiers.
Skipped 0 pages due to buffer pins, 37 frozen pages.
0 pages are entirely empty.
CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s.
INFO:  aggressively vacuuming "pg_toast.pg_toast_2619"
INFO:  "pg_toast_2619": found 0 removable, 13 nonremovable row
versions in 4 out of 4 pages
DETAIL:  0 dead row versions cannot be removed yet, oldest xmin: 39602876
There were 12 unused item identifiers.
Skipped 0 pages due to buffer pins, 0 frozen pages.
0 pages are entirely empty.
CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s.
VACUUM

postgres=# SELECT datname
    , age(datfrozenxid)
    , current_setting('autovacuum_freeze_max_age')
FROM pg_database;
  datname  |    age    | current_setting
-----------+-----------+-----------------
 postgres  |  54640707 | 200000000
 template1 |     38248 | 200000000
 template0 | 134719144 | 200000000
 feedi     | 134719144 | 200000000
(4 rows)

Immediately after running this, all the autovacs started working
again.  I've attached logfiles that might be easier to read.

We are going to try and recreate the issue; in the meantime, I've
turned on DEBUG1, and when it reoccurs I'll run the queries in the
"What Is" section as well.

I also have a backtrace of the autovac launcher, but there appears to
be nothing useful at all there (as you'd probably expect) and there
were no autovac workers running.
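
For reference, one way to confirm that no workers are active is a
pg_stat_activity check along these lines (a sketch; backend_type is
available in v10 and later):

    SELECT pid, backend_type, state, xact_start, query
    FROM pg_stat_activity
    WHERE backend_type IN ('autovacuum launcher', 'autovacuum worker');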

Thank you again-
Justin


vacuum_logs.txt (70K) Download Attachment

Re: PG12 autovac issues

Michael Paquier-2
In reply to this post by Andres Freund
On Mon, Mar 23, 2020 at 01:00:51PM -0700, Andres Freund wrote:

> On 2020-03-23 20:47:25 +0100, Julien Rouhaud wrote:
>>> - relfrozenxid, age(relfrozenxid) for the oldest table in the oldest database
>>>   SELECT oid::regclass, age(relfrozenxid), relfrozenxid FROM pg_class WHERE relfrozenxid <> 0 ORDER BY age(relfrozenxid) DESC LIMIT 1;
>>
>> The vm has been trashed since, and I don't have that level of detail available
>> in the gathered stats unfortunately (and the available information I have is a
>> little bit messy, sorry for that).
>>
>>> - Oldest backend xmin
>>>   SELECT pid, backend_xmin, age(backend_xmin) FROM pg_stat_activity WHERE backend_xmin <> 0 ORDER BY age(backend_xmin) DESC LIMIT 3;
>>> - oldest replication xmin:
>>>   SELECT pid, backend_xmin, age(backend_xmin) FROM pg_stat_replication WHERE backend_xmin <> 0 ORDER BY age(backend_xmin) DESC LIMIT 3;
>>> - oldest slot xmin:
>>>   SELECT slot_name, xmin, age(xmin), catalog_xmin, age(catalog_xmin) FROM pg_replication_slots WHERE xmin <> 0 OR catalog_xmin <> 0 ORDER BY greatest(age(xmin), age(catalog_xmin)) DESC LIMIT 3;
>>> - oldest prepared transaction (unfortunately xmin not available)
>>>   SELECT gid, database, transaction FROM pg_prepared_xacts ORDER BY age(transaction) LIMIT 3;
>>
>> I have output from this query: https://github.com/OPMDG/check_pgactivity/blob/oldest_xmin/check_pgactivity#L5669-L5732
>> (which may be dumb).  Here are the non-null rows after 8AM GMT+1:
This stuff does not use 2PC or logical slots (there is one physical
slot for a WAL archiver), but we got a dump of pg_stat_activity.
Grepping more info about the evolution of pg_database and pg_class is
on our list.

> Could you share what the config of the server was?

Nothing really fancy:
- autovacuum_vacuum_cost_delay = 2ms (the default in v12, but we used it
in v11 as well).
- autovacuum_naptime = 15s
- autovacuum_max_workers = 6
- log_autovacuum_min_duration = 0
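
As a sketch, the same settings expressed as SQL (autovacuum_max_workers
only takes effect after a restart, the rest on reload):

    ALTER SYSTEM SET autovacuum_vacuum_cost_delay = '2ms';
    ALTER SYSTEM SET autovacuum_naptime = '15s';
    ALTER SYSTEM SET autovacuum_max_workers = 6;
    ALTER SYSTEM SET log_autovacuum_min_duration = 0;
    SELECT pg_reload_conf();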

>>> In that case you'd likely see DEBUG1 output, right? Did you try running
>>> with that?
>>
>> That's unfortunately not an option, as while the issue is reproducible, it
>> happens after roughly 16h with an average of 6.5ktps, and we definitely don't
>> have enough disk space for that amount of logs.  We'll try next time with a
>> recompiled version with the DEBUG1 promoted to WARNING, and maybe other elog.
>
> You wouldn't need to run it with DEBUG1 the whole time, you could just
> change the config and reload config once you hit the issue.
Sure, though it would reduce the window when the cluster is still up
and running.  One option that we have is just to increase the elevel
from DEBUG1 to WARNING for the log added in 2aa6e33, to keep the log
volume down without losing information, and we have no issue doing
runs with custom patches.

>> Not for now, because it's hard to be around for the ~ 90min interval the system
>> stays alive after atuvacuum get stuck, especially since it's a weekend bench.

Yeah, our room is very limited here.  I would likely not be able to do
that, or I would just get very lucky with my timezone.  Not sure about
Julien.

>>> It sounds like what might be happening is that you have something
>>> holding back the "oldest needed transaction" horizon. Before
>>> 2aa6e331ead7, if the xmin horizon hasn't increased, there'd be repeated
>>> vacuums unable to increase the horizon, but afterwards they'll all
>>> immediately exit without any messages.
>>
>> The xmin horizon is increasing AFAICS.  Grepping the autovacuum logs, I can see
>> for the 08:00 AM UTC - 08:59 AM UTC interval the oldest xmin going from
>> 200069684 to 210363848.  E.g.:
>
> Which database is this on?
In this case this was the database used by the application, FWIW.  But
I can see from those logs that it kept increasing for all the other
databases, like postgres or template1 when this set of aggressive jobs
happened.

>> 266621:2020-03-14 08:49:59.945 UTC LOG:  automatic aggressive vacuum to prevent wraparound of table "pg_toast.pg_toast_27712": index scans: 0
>> 266622- pages: 0 removed, 0 remain, 0 skipped due to pins, 0 skipped frozen
>> 266623- tuples: 0 removed, 0 remain, 0 are dead but not yet removable, oldest xmin: 209635953
>> [...]
>
> Do you have any non-toast ones?

Yep, this includes catalogs and normal tables; based on the logs, all
the relations triggered aggressive and anti-wraparound jobs.

>>> I wonder if what might be happening is that we're somehow missed/failed
>>> to update relfrozenxid and/or datfrozenxid. If you manually vacuum some
>>> table in the oldest database, but that is *NOT* the oldest table itself,
>>> does the problem "resolve" itself?
>>
>> I'll also add pg_class snapshot for next time we run the bench.  I'm not sure
>> if we'll be able to catch the 1h45 interval when the system stays alive
>> after the issue though.
>
> Could you just script something to stop the benchmark once the disk is
> 90% full or so?
Hmm.  I think that this one is possibly still tricky.  There are some
alarms in place in this system.

> Did you see any errors / fatals around the time autovacuum stopped
> working?

Before going rogue (we are not sure if autovacuum didn't launch any
workers, or if the workers were spawned and exited early, as we did not
capture any worker information in pg_stat_activity), we saw a bunch of
aggressive wraparound jobs.  Even after that, we have traces in the
logs of one autovacuum analyze happening at equal intervals (every 17
minutes) on one single table, which is...  Er...  uncommon, to say the
least.
--
Michael


Re: PG12 autovac issues

Andres Freund
Hi,

On 2020-03-24 14:26:06 +0900, Michael Paquier wrote:
> > Could you share what the config of the server was?
>
> Nothing really fancy:
> - autovacuum_vacuum_cost_delay to 2ms (default of v12, but we used it
> in v11 as well).
> - autovacuum_naptime = 15s
> - autovacuum_max_workers = 6
> - log_autovacuum_min_duration = 0

Oh, so you're also involved in this? I'm starting to get a bit confused
as to who is reporting what.


> > Did you see any errors / fatals around the time autovacuum stopped
> > working?
>
> Before going rogue (we are not sure if autovacuum didn't launch any
> workers or if the workers were spawned and exited early as we did not
> capture any worker information in pg_stat_activity), we saw a bunch of
> aggressive wraparound jobs.  Even after that, we have traces in the
> logs of one autovacuum analyze happening at equal interval of time (17
> minutes) on one single table, which is...  Er...  uncommon to say the
> least.

Well, there's no logging of autovacuum launchers that don't do anything
due to the "skipping redundant" logic, with normal log level. If somehow
the horizon logic of autovacuum workers gets out of whack with what
vacuumlazy.c does, then you'd not get any vacuums. But a usage level
triggered analyze could still happen on such a table, I think.
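
For what it's worth, a rough way to check for that pattern from the
stats side (just a sketch, with no filtering applied) would be
something like:

SELECT relname, last_autovacuum, last_autoanalyze,
       autovacuum_count, autoanalyze_count
FROM pg_stat_all_tables
ORDER BY last_autoanalyze DESC NULLS LAST
LIMIT 10;

A table that keeps getting analyzed but never vacuumed would show a
recent last_autoanalyze, a stale last_autovacuum, and a steadily
growing autoanalyze_count.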

While looking at this issue I found a few problems, btw. That seems more
like a -hackers discussion, so I started:
https://postgr.es/m/20200323235036.6pje6usrjjx22zv3%40alap3.anarazel.de

I think I might just have figured out another one...

Greetings,

Andres Freund



Re: PG12 autovac issues

Michael Paquier-2
On Mon, Mar 23, 2020 at 10:40:39PM -0700, Andres Freund wrote:

> On 2020-03-24 14:26:06 +0900, Michael Paquier wrote:
>> Nothing really fancy:
>> - autovacuum_vacuum_cost_delay to 2ms (default of v12, but we used it
>> in v11 as well).
>> - autovacuum_naptime = 15s
>> - autovacuum_max_workers = 6
>> - log_autovacuum_min_duration = 0
>
> Oh, so you're also involved in this? I'm starting to get a bit confused
> as to who is reporting what.
Yeah, sort of.  Julien has done a lot of work on that and I have
access to the data and test beds, so we are just saying the same
things.

> Well, there's no logging of autovacuum launchers that don't do anything
> due to the "skipping redundant" logic, with normal log level. If somehow
> the horizon logic of autovacuum workers gets out of whack with what
> vacuumlazy.c does, then you'd not get any vacuums. But a usage level
> triggered analyze could still happen on such a table, I think.

What surprised me the most is that the same table happened to be
analyzed again and again after the launcher began its blackout.

> While looking at this issue I found a few problems, btw. That seems more
> like a -hackers discussion, so I started:
> https://postgr.es/m/20200323235036.6pje6usrjjx22zv3%40alap3.anarazel.de

Yes, let's discuss there.

> I think I might just have figured out another one...

Ouch.
--
Michael


Re: PG12 autovac issues

Andres Freund
Hi,

On 2020-03-24 15:12:38 +0900, Michael Paquier wrote:
> > Well, there's no logging of autovacuum launchers that don't do anything
> > due to the "skipping redundant" logic, with normal log level. If somehow
> > the horizon logic of autovacuum workers gets out of whack with what
> > vacuumlazy.c does, then you'd not get any vacuums. But a usage level
> > triggered analyze could still happen on such a table, I think.
>
> What surprised me the most is that the same table happened to be
> analyzed again and again after the launcher began its blackout.

Well, if there's an issue with the "redundant" logic, that would be a
not too surprising outcome. It's quite plausible that one or two tables
in the database would get enough changes to occasionally need to be
analyzed. If the workload is steady, that could e.g. work out to every
~17 minutes. All tables that autovacuum thinks are not wraparound
threatened will be skipped, but ones that are will get both vacuum and
analyze queued. The redundant logic could then entirely skip all
vacuums - but there's no equivalent for analyze.
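
A rough way to see which relations fall on the wraparound-threatened
side - a sketch that ignores per-table reloptions and the multixact
limits - is something like:

SELECT c.oid::regclass, age(c.relfrozenxid)
FROM pg_class c
WHERE c.relkind IN ('r', 'm', 't')
  AND c.relfrozenxid <> 0
  AND age(c.relfrozenxid) > current_setting('autovacuum_freeze_max_age')::int
ORDER BY age(c.relfrozenxid) DESC;

Per the above, anything returned there gets both vacuum and analyze
queued, and only the vacuum side can be skipped as "redundant".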

> > While looking at this issue I found a few problems, btw. That seems more
> > like a -hackers discussion, so I started:
> > https://postgr.es/m/20200323235036.6pje6usrjjx22zv3%40alap3.anarazel.de
>
> Yes, let's discuss there.

Cool. Would also be good if you could expand on the thread introducing
the "redundant" logic.

Greetings,

Andres Freund



Re: PG12 autovac issues

Justin King
All-

This started happening again.  DEBUG1 is enabled:

Mar 25 14:48:03 cowtn postgres[39720]: [35294-1] 2020-03-25
14:48:03.972 GMT [39720] DEBUG:  autovacuum: processing database
"template0"
Mar 25 14:48:06 cowtn postgres[39735]: [35294-1] 2020-03-25
14:48:06.545 GMT [39735] DEBUG:  autovacuum: processing database
"postgres"
Mar 25 14:48:11 cowtn postgres[39759]: [35294-1] 2020-03-25
14:48:11.284 GMT [39759] DEBUG:  autovacuum: processing database
"template1"
Mar 25 14:48:14 cowtn postgres[39772]: [35294-1] 2020-03-25
14:48:14.564 GMT [39772] DEBUG:  autovacuum: processing database
"feedi"
Mar 25 14:48:14 cowtn postgres[39772]: [35295-1] 2020-03-25
14:48:14.588 GMT [39772] LOG:  automatic vacuum of table
"feedi.pg_catalog.pg_statistic": index scans: 1
Mar 25 14:48:14 cowtn postgres[39772]: [35295-2] #011pages: 0 removed,
117 remain, 0 skipped due to pins, 0 skipped frozen
Mar 25 14:48:14 cowtn postgres[39772]: [35295-3] #011tuples: 477
removed, 704 remain, 0 are dead but not yet removable, oldest xmin:
189591147
Mar 25 14:48:14 cowtn postgres[39772]: [35295-4] #011buffer usage: 315
hits, 0 misses, 0 dirtied
Mar 25 14:48:14 cowtn postgres[39772]: [35295-5] #011avg read rate:
0.000 MB/s, avg write rate: 0.000 MB/s
Mar 25 14:48:14 cowtn postgres[39772]: [35295-6] #011system usage:
CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s
Mar 25 14:48:15 cowtn postgres[39772]: [35296-1] 2020-03-25
14:48:15.606 GMT [39772] LOG:  automatic vacuum of table
"feedi.production.flightplans": index scans: 1
Mar 25 14:48:15 cowtn postgres[39772]: [35296-2] #011pages: 0 removed,
61331 remain, 0 skipped due to pins, 29772 skipped frozen
Mar 25 14:48:15 cowtn postgres[39772]: [35296-3] #011tuples: 21807
removed, 509983 remain, 1581 are dead but not yet removable, oldest
xmin: 189591147
Mar 25 14:48:15 cowtn postgres[39772]: [35296-4] #011buffer usage:
113684 hits, 0 misses, 1 dirtied
Mar 25 14:48:15 cowtn postgres[39772]: [35296-5] #011avg read rate:
0.000 MB/s, avg write rate: 0.008 MB/s
Mar 25 14:48:15 cowtn postgres[39772]: [35296-6] #011system usage:
CPU: user: 0.64 s, system: 0.12 s, elapsed: 1.00 s
Mar 25 14:48:16 cowtn postgres[39772]: [35297-1] 2020-03-25
14:48:16.537 GMT [39772] LOG:  automatic analyze of table
"feedi.production.flightplans" system usage: CPU: user: 0.83 s,
system: 0.03 s, elapsed: 0.93 s
Mar 25 14:48:16 cowtn postgres[39772]: [35298-1] 2020-03-25
14:48:16.627 GMT [39772] LOG:  automatic vacuum of table
"feedi.production.tita": index scans: 1
Mar 25 14:48:16 cowtn postgres[39772]: [35298-2] #011pages: 0 removed,
1711 remain, 0 skipped due to pins, 0 skipped frozen
Mar 25 14:48:16 cowtn postgres[39772]: [35298-3] #011tuples: 5936
removed, 75280 remain, 1079 are dead but not yet removable, oldest
xmin: 189591147
Mar 25 14:48:16 cowtn postgres[39772]: [35298-4] #011buffer usage:
5748 hits, 0 misses, 1 dirtied
Mar 25 14:48:16 cowtn postgres[39772]: [35298-5] #011avg read rate:
0.000 MB/s, avg write rate: 0.100 MB/s
Mar 25 14:48:16 cowtn postgres[39772]: [35298-6] #011system usage:
CPU: user: 0.06 s, system: 0.00 s, elapsed: 0.07 s
Mar 25 14:48:16 cowtn postgres[39772]: [35299-1] 2020-03-25
14:48:16.959 GMT [39772] LOG:  automatic analyze of table
"feedi.production.tita" system usage: CPU: user: 0.32 s, system: 0.00
s, elapsed: 0.33 s
Mar 25 14:48:16 cowtn postgres[39772]: [35300-1] 2020-03-25
14:48:16.969 GMT [39772] LOG:  automatic analyze of table
"feedi.production.virtual_clocks" system usage: CPU: user: 0.00 s,
system: 0.00 s, elapsed: 0.00 s
Mar 25 14:48:18 cowtn postgres[39790]: [35294-1] 2020-03-25
14:48:18.975 GMT [39790] DEBUG:  autovacuum: processing database
"template0"
Mar 25 14:48:21 cowtn postgres[39799]: [35294-1] 2020-03-25
14:48:21.546 GMT [39799] DEBUG:  autovacuum: processing database
"postgres"
Mar 25 14:48:22 cowtn postgres[39853]: [35294-1] 2020-03-25
14:48:22.446 GMT [39853] LOG:  connection received: host=10.4.4.11
port=25424
Mar 25 14:48:22 cowtn postgres[39853]: [35295-1] 2020-03-25
14:48:22.451 GMT [39853] LOG:  connection authorized: user=feedi
database=feedi SSL enabled (protocol=TLSv1.3,
cipher=TLS_AES_256_GCM_SHA384, bits=256, compression=off)
Mar 25 14:48:26 cowtn postgres[39875]: [35294-1] 2020-03-25
14:48:26.277 GMT [39875] DEBUG:  autovacuum: processing database
"postgres"
Mar 25 14:48:26 cowtn postgres[39875]: [35295-1] 2020-03-25
14:48:26.298 GMT [39875] DEBUG:  skipping redundant vacuum to prevent
wraparound of table "postgres.pg_catalog.pg_authid"
Mar 25 14:48:26 cowtn postgres[39875]: [35296-1] 2020-03-25
14:48:26.309 GMT [39875] DEBUG:  skipping redundant vacuum to prevent
wraparound of table "postgres.pg_catalog.pg_subscription"
Mar 25 14:48:26 cowtn postgres[39875]: [35297-1] 2020-03-25
14:48:26.319 GMT [39875] DEBUG:  skipping redundant vacuum to prevent
wraparound of table "postgres.pg_catalog.pg_database"
Mar 25 14:48:26 cowtn postgres[39875]: [35298-1] 2020-03-25
14:48:26.329 GMT [39875] DEBUG:  skipping redundant vacuum to prevent
wraparound of table "postgres.pg_catalog.pg_tablespace"
Mar 25 14:48:26 cowtn postgres[39875]: [35299-1] 2020-03-25
14:48:26.339 GMT [39875] DEBUG:  skipping redundant vacuum to prevent
wraparound of table "postgres.pg_catalog.pg_auth_members"
Mar 25 14:48:26 cowtn postgres[39875]: [35300-1] 2020-03-25
14:48:26.350 GMT [39875] DEBUG:  skipping redundant vacuum to prevent
wraparound of table "postgres.pg_catalog.pg_replication_origin"
Mar 25 14:48:26 cowtn postgres[39875]: [35301-1] 2020-03-25
14:48:26.360 GMT [39875] DEBUG:  skipping redundant vacuum to prevent
wraparound of table "postgres.pg_catalog.pg_shseclabel"
Mar 25 14:48:26 cowtn postgres[39875]: [35302-1] 2020-03-25
14:48:26.361 GMT [39875] DEBUG:  transaction ID wrap limit is
2137086523, limited by database with OID 13432
Mar 25 14:48:26 cowtn postgres[39875]: [35303-1] 2020-03-25
14:48:26.361 GMT [39875] DEBUG:  MultiXactId wrap limit is 2147483648,
limited by database with OID 13432
Mar 25 14:48:26 cowtn postgres[39875]: [35304-1] 2020-03-25
14:48:26.361 GMT [39875] DEBUG:  MultiXact member stop limit is now
4294914944 based on MultiXact 1

postgres=# SELECT datname
    , oid
    , age(datfrozenxid)
    , current_setting('autovacuum_freeze_max_age')
FROM pg_database;
  datname  |   oid    |    age    | current_setting
-----------+----------+-----------+-----------------
 postgres  |    13432 | 202588645 | 200000000
 template1 |        1 | 152588645 | 200000000
 template0 |    13431 |  87271781 | 200000000
 feedi     | 32178861 |  87271781 | 200000000

Here is the output from the "What is" queries referenced:

postgres=# SELECT datname, age(datfrozenxid), datfrozenxid FROM
pg_database ORDER BY age(datfrozenxid) DESC LIMIT 1;
 datname  |    age    | datfrozenxid
----------+-----------+--------------
 postgres | 202773709 |   4284570172


 postgres=# SELECT oid::regclass, age(relfrozenxid), relfrozenxid FROM
pg_class WHERE relfrozenxid <> 0 ORDER BY age(relfrozenxid) DESC LIMIT
1;
    oid    |    age    | relfrozenxid
-----------+-----------+--------------
 pg_authid | 202793549 |   4284570172


 postgres=# SELECT pid, backend_xmin, age(backend_xmin) FROM
pg_stat_activity WHERE backend_xmin <> 0 ORDER BY age(backend_xmin)
DESC LIMIT 3;
  pid  | backend_xmin | age
-------+--------------+------
 17032 |    192412343 | 7199
 17033 |    192412343 | 7199
 52320 |    192419542 |    0
(3 rows)


postgres=# SELECT pid, backend_xmin, age(backend_xmin) FROM
pg_stat_replication WHERE backend_xmin <> 0 ORDER BY age(backend_xmin)
DESC LIMIT 3;
  pid  | backend_xmin | age
-------+--------------+------
 17032 |    192434622 | 2004
 17033 |    192434622 | 2004
(2 rows)


postgres=# SELECT slot_name, xmin, age(xmin), catalog_xmin,
age(catalog_xmin) FROM pg_replication_slots WHERE xmin <> 0 OR
catalog_xmin <> 0 ORDER BY greatest(age(xmin), age(catalog_xmin)) DESC
LIMIT 3;
 slot_name | xmin | age | catalog_xmin | age
-----------+------+-----+--------------+-----
(0 rows)


postgres=# SELECT gid, database, transaction FROM pg_prepared_xacts
ORDER BY age(transaction) LIMIT 3;
 gid | database | transaction
-----+----------+-------------
(0 rows)

Let me know if there's anything else useful I can provide.

Thanks-
Justin

On Tue, Mar 24, 2020 at 2:43 PM Andres Freund <[hidden email]> wrote:

>
> Hi,
>
> On 2020-03-24 15:12:38 +0900, Michael Paquier wrote:
> > > Well, there's no logging of autovacuum launchers that don't do anything
> > > due to the "skipping redundant" logic, with normal log level. If somehow
> > > the horizon logic of autovacuum workers gets out of whack with what
> > > vacuumlazy.c does, then you'd not get any vacuums. But a usage level
> > > triggered analyze could still happen on such a table, I think.
> >
> > What surprised me the most is that the same table happened to be
> > analyzed again and again after the launcher began its blackout.
>
> Well, if there's an issue with the "redundant" logic, that would be a
> not too surprising outcome. It's quite plausible that one or two tables
> in the database would get enough changes to occasionally need to be
> analyzed. If the workload is steady, that could e.g. work out to every
> ~17 minutes. All tables that autovacuum thinks are not wraparound
> threatened will be skipped, but ones that are will get both vacuum and
> analyze queued. The redundant logic could then entirely skip all
> vacuums - but there's no equivalent for analyze.
>
> > > While looking at this issue I found a few problems, btw. That seems more
> > > like a -hackers discussion, so I started:
> > > https://postgr.es/m/20200323235036.6pje6usrjjx22zv3%40alap3.anarazel.de
> >
> > Yes, let's discuss there.
>
> Cool. Would also be good if you could expand on the thread introducing
> the "redundant" logic.
>
> Greetings,
>
> Andres Freund



Re: PG12 autovac issues

Michael Paquier-2
On Wed, Mar 25, 2020 at 10:39:17AM -0500, Justin King wrote:
> This started happening again.  DEBUG1 is enabled:

Thanks for enabling DEBUG1 logs while this happened.

> Mar 25 14:48:26 cowtn postgres[39875]: [35298-1] 2020-03-25
> 14:48:26.329 GMT [39875] DEBUG:  skipping redundant vacuum to prevent
> wraparound of table "postgres.pg_catalog.pg_tablespace"
> Mar 25 14:48:26 cowtn postgres[39875]: [35299-1] 2020-03-25
> 14:48:26.339 GMT [39875] DEBUG:  skipping redundant vacuum to prevent
> wraparound of table "postgres.pg_catalog.pg_auth_members"
> Mar 25 14:48:26 cowtn postgres[39875]: [35300-1] 2020-03-25
> 14:48:26.350 GMT [39875] DEBUG:  skipping redundant vacuum to prevent
> wraparound of table "postgres.pg_catalog.pg_replication_origin"

Are you seeing such log entries happening multiple times for the same
relations, meaning that autovacuum workers are just looping on the
same relations all over again?  This part of the logs points to catalog
tables, but are there other tables in your system showing the same log
entries, particularly in the database "feedi" with some of your own tables?

>  postgres=# SELECT oid::regclass, age(relfrozenxid), relfrozenxid FROM
> pg_class WHERE relfrozenxid <> 0 ORDER BY age(relfrozenxid) DESC LIMIT
> 1;
>     oid    |    age    | relfrozenxid
> -----------+-----------+--------------
>  pg_authid | 202793549 |   4284570172

Ugh.  I think that this is exactly the thing I was suspecting
previously:
- The database stats look sane.
- The relation stats don't look good and are visibly in such a state
that only one type of job gets triggered (non-aggressive but
anti-wraparound); those jobs just keep being skipped, so the relation
stats never get refreshed.  (Such autovacuum jobs should never happen,
and we have some underlying issues that will need separate care.)

If you still have the cluster in this current state (perhaps you are
not able to keep it longer), could you provide more data about
pg_class.relfrozenxid for the tables which are mentioned in the logs
of the type "skipping redundant vacuum to prevent wraparound of table"?
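
Something along these lines should be enough, with the table list
simply taken from the DEBUG entries quoted above (run in "postgres",
so that the values can be compared with its datfrozenxid):

SELECT oid::regclass, relfrozenxid, age(relfrozenxid)
FROM pg_class
WHERE oid IN ('pg_authid'::regclass,
              'pg_subscription'::regclass,
              'pg_database'::regclass,
              'pg_tablespace'::regclass,
              'pg_auth_members'::regclass,
              'pg_replication_origin'::regclass,
              'pg_shseclabel'::regclass)
ORDER BY age(relfrozenxid) DESC;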

> Let me know if there's anything else useful I can provide.

Thanks!
--
Michael
