Fwd: PG12 autovac issues

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
10 messages Options
Reply | Threaded
Open this post in threaded view
|

Fwd: PG12 autovac issues

Justin King
Apologies, I accidentally sent this to the pgsql-admin list initially
but intended it go here:

We have a database that isn't overly large (~20G), but gets incredibly
frequent updates.  Here's an example table:

feedi=# select * from pg_stat_all_tables where schemaname =
'production' and relname = 'tita';
relid = 16786
schemaname = production
relname = tita
seq_scan = 23365
seq_tup_read = 403862091
idx_scan = 26612759248
idx_tup_fetch = 19415752701
n_tup_ins = 24608806
n_tup_upd = 4207076934
n_tup_del = 24566916
n_tup_hot_upd = 4073821079
n_live_tup = 79942
n_dead_tup = 71969
n_mod_since_analyze = 12020
last_vacuum = 2020-03-17 15:35:19.588859+00
last_autovacuum = 2020-03-17 21:31:08.248598+00
last_analyze = 2020-03-17 15:35:20.372875+00
last_autoanalyze = 2020-03-17 22:04:41.76743+00
vacuum_count = 9
autovacuum_count = 135693
analyze_count = 9
autoanalyze_count = 495877

As you can see in this table, there are only ~80K rows, but billions
of updates.  What we have observed is that the frozenxid reaches the
200M mark fairly quickly because of the amount of activity.  What is
interesting is that this happens with the 'postgres' and 'template1'
databases as well and there is absolutely no activity in those
databases.

When the 'postgres' and/or 'template1' databases hit the
freeze_max_age, there are cases where it kicks off an aggressive
autovac of those tables which seems to prevent autovacs from running
elsewhere.  Oddly, this is not consistent, but that condition seems to
be required.  We have observed this across multiple PG12 servers (dev,
test, staging, production) all with similar workloads.

$ grep -i vacuum /var/log/postgresql/postgres.log | cut -b 1-9 | uniq -c
     17 Mar 17 06
     34 Mar 17 07
     31 Mar 17 08
     31 Mar 17 09
     30 Mar 17 10
     34 Mar 17 11
     33 Mar 17 12
     19 Mar 17 13
     40 Mar 17 15
     31 Mar 17 16
     36 Mar 17 17
     34 Mar 17 18
     35 Mar 17 19
     35 Mar 17 20
     33 Mar 17 21

As you can see above, we end up having around ~33 autovac/hr, and
about 13:30 today, they stopped until we ran a "vacuum freeze verbose
analyze;" against the 'postgres' database (around 15:30) which then
caused the autovacs to resume running against the "feedi" database.

I'm completely perplexed as to what is happening and why it suddenly
started when we moved from PG10 > PG12.  The configs and workload are
essentially the same between versions.  We realize we could simply
increase the autovacuum_freeze_max_age, but that doesn't seem to
actually resolve anything -- it just pushes the problem out.  Has
anyone seen anything similar to this?

Thanks very much for the consideration.

Justin King
http://flightaware.com/


Reply | Threaded
Open this post in threaded view
|

Re: Fwd: PG12 autovac issues

Adrian Klaver-4
On 3/17/20 3:22 PM, Justin King wrote:

> Apologies, I accidentally sent this to the pgsql-admin list initially
> but intended it go here:
>
> We have a database that isn't overly large (~20G), but gets incredibly
> frequent updates.  Here's an example table:
>
> feedi=# select * from pg_stat_all_tables where schemaname =
> 'production' and relname = 'tita';
> relid = 16786
> schemaname = production
> relname = tita
> seq_scan = 23365
> seq_tup_read = 403862091
> idx_scan = 26612759248
> idx_tup_fetch = 19415752701
> n_tup_ins = 24608806
> n_tup_upd = 4207076934
> n_tup_del = 24566916
> n_tup_hot_upd = 4073821079
> n_live_tup = 79942
> n_dead_tup = 71969
> n_mod_since_analyze = 12020
> last_vacuum = 2020-03-17 15:35:19.588859+00
> last_autovacuum = 2020-03-17 21:31:08.248598+00
> last_analyze = 2020-03-17 15:35:20.372875+00
> last_autoanalyze = 2020-03-17 22:04:41.76743+00
> vacuum_count = 9
> autovacuum_count = 135693
> analyze_count = 9
> autoanalyze_count = 495877
>
> As you can see in this table, there are only ~80K rows, but billions
> of updates.  What we have observed is that the frozenxid reaches the
> 200M mark fairly quickly because of the amount of activity.  What is
> interesting is that this happens with the 'postgres' and 'template1'
> databases as well and there is absolutely no activity in those
> databases.
>
> When the 'postgres' and/or 'template1' databases hit the
> freeze_max_age, there are cases where it kicks off an aggressive
> autovac of those tables which seems to prevent autovacs from running
> elsewhere.  Oddly, this is not consistent, but that condition seems to
> be required.  We have observed this across multiple PG12 servers (dev,
> test, staging, production) all with similar workloads.

Is there anything in postgres and template1 besides what was created at
init?

What are your settings for autovacuum?:

https://www.postgresql.org/docs/12/runtime-config-autovacuum.html

>
> $ grep -i vacuum /var/log/postgresql/postgres.log | cut -b 1-9 | uniq -c
>       17 Mar 17 06
>       34 Mar 17 07
>       31 Mar 17 08
>       31 Mar 17 09
>       30 Mar 17 10
>       34 Mar 17 11
>       33 Mar 17 12
>       19 Mar 17 13
>       40 Mar 17 15
>       31 Mar 17 16
>       36 Mar 17 17
>       34 Mar 17 18
>       35 Mar 17 19
>       35 Mar 17 20
>       33 Mar 17 21
>
> As you can see above, we end up having around ~33 autovac/hr, and
> about 13:30 today, they stopped until we ran a "vacuum freeze verbose
> analyze;" against the 'postgres' database (around 15:30) which then
> caused the autovacs to resume running against the "feedi" database.
>
> I'm completely perplexed as to what is happening and why it suddenly
> started when we moved from PG10 > PG12.  The configs and workload are
> essentially the same between versions.  We realize we could simply
> increase the autovacuum_freeze_max_age, but that doesn't seem to
> actually resolve anything -- it just pushes the problem out.  Has
> anyone seen anything similar to this?
>
> Thanks very much for the consideration.
>
> Justin King
> http://flightaware.com/
>
>


--
Adrian Klaver
[hidden email]


Reply | Threaded
Open this post in threaded view
|

Re: Fwd: PG12 autovac issues

Justin King
On Tue, Mar 17, 2020 at 5:39 PM Adrian Klaver <[hidden email]> wrote:

>
> On 3/17/20 3:22 PM, Justin King wrote:
> > Apologies, I accidentally sent this to the pgsql-admin list initially
> > but intended it go here:
> >
> > We have a database that isn't overly large (~20G), but gets incredibly
> > frequent updates.  Here's an example table:
> >
> > feedi=# select * from pg_stat_all_tables where schemaname =
> > 'production' and relname = 'tita';
> > relid = 16786
> > schemaname = production
> > relname = tita
> > seq_scan = 23365
> > seq_tup_read = 403862091
> > idx_scan = 26612759248
> > idx_tup_fetch = 19415752701
> > n_tup_ins = 24608806
> > n_tup_upd = 4207076934
> > n_tup_del = 24566916
> > n_tup_hot_upd = 4073821079
> > n_live_tup = 79942
> > n_dead_tup = 71969
> > n_mod_since_analyze = 12020
> > last_vacuum = 2020-03-17 15:35:19.588859+00
> > last_autovacuum = 2020-03-17 21:31:08.248598+00
> > last_analyze = 2020-03-17 15:35:20.372875+00
> > last_autoanalyze = 2020-03-17 22:04:41.76743+00
> > vacuum_count = 9
> > autovacuum_count = 135693
> > analyze_count = 9
> > autoanalyze_count = 495877
> >
> > As you can see in this table, there are only ~80K rows, but billions
> > of updates.  What we have observed is that the frozenxid reaches the
> > 200M mark fairly quickly because of the amount of activity.  What is
> > interesting is that this happens with the 'postgres' and 'template1'
> > databases as well and there is absolutely no activity in those
> > databases.
> >
> > When the 'postgres' and/or 'template1' databases hit the
> > freeze_max_age, there are cases where it kicks off an aggressive
> > autovac of those tables which seems to prevent autovacs from running
> > elsewhere.  Oddly, this is not consistent, but that condition seems to
> > be required.  We have observed this across multiple PG12 servers (dev,
> > test, staging, production) all with similar workloads.
>
> Is there anything in postgres and template1 besides what was created at
> init?

There is nothing in there at all besides system tables created at init.

>
> What are your settings for autovacuum?:
>
> https://www.postgresql.org/docs/12/runtime-config-autovacuum.html

Here are the settings, these are the only ones that are not set to
default with the exception of a few tables that have been overridden
with a different value due to lots of updates and few rows:

autovacuum = on
log_autovacuum_min_duration = 0
autovacuum_max_workers = 8
autovacuum_naptime = 15s
autovacuum_vacuum_threshold = 500
autovacuum_analyze_threshold = 2500
vacuum_cost_limit = 1000

We want fairly aggressive autovacs to keep table bloat limited -- the
application latency suffers if it has to wade through dead tuples and
staying near realtime is important in our environment.

** Also, it should be noted that the autovacuum_analyze_threshold is
probably an incorrect value, we likely intended that to be 250 and
just have now realized it after poking more at the configuration.

>
> >
> > $ grep -i vacuum /var/log/postgresql/postgres.log | cut -b 1-9 | uniq -c
> >       17 Mar 17 06
> >       34 Mar 17 07
> >       31 Mar 17 08
> >       31 Mar 17 09
> >       30 Mar 17 10
> >       34 Mar 17 11
> >       33 Mar 17 12
> >       19 Mar 17 13
> >       40 Mar 17 15
> >       31 Mar 17 16
> >       36 Mar 17 17
> >       34 Mar 17 18
> >       35 Mar 17 19
> >       35 Mar 17 20
> >       33 Mar 17 21
> >
> > As you can see above, we end up having around ~33 autovac/hr, and
> > about 13:30 today, they stopped until we ran a "vacuum freeze verbose
> > analyze;" against the 'postgres' database (around 15:30) which then
> > caused the autovacs to resume running against the "feedi" database.
> >
> > I'm completely perplexed as to what is happening and why it suddenly
> > started when we moved from PG10 > PG12.  The configs and workload are
> > essentially the same between versions.  We realize we could simply
> > increase the autovacuum_freeze_max_age, but that doesn't seem to
> > actually resolve anything -- it just pushes the problem out.  Has
> > anyone seen anything similar to this?
> >
> > Thanks very much for the consideration.
> >
> > Justin King
> > http://flightaware.com/
> >
> >
>
>
> --
> Adrian Klaver
> [hidden email]


Reply | Threaded
Open this post in threaded view
|

Re: Fwd: PG12 autovac issues

Adrian Klaver-4
On 3/17/20 3:48 PM, Justin King wrote:

> On Tue, Mar 17, 2020 at 5:39 PM Adrian Klaver <[hidden email]> wrote:
>>
>> On 3/17/20 3:22 PM, Justin King wrote:
>>> Apologies, I accidentally sent this to the pgsql-admin list initially
>>> but intended it go here:
>>>
>>> We have a database that isn't overly large (~20G), but gets incredibly
>>> frequent updates.  Here's an example table:
>>>
>>> feedi=# select * from pg_stat_all_tables where schemaname =
>>> 'production' and relname = 'tita';
>>> relid = 16786
>>> schemaname = production
>>> relname = tita
>>> seq_scan = 23365
>>> seq_tup_read = 403862091
>>> idx_scan = 26612759248
>>> idx_tup_fetch = 19415752701
>>> n_tup_ins = 24608806
>>> n_tup_upd = 4207076934
>>> n_tup_del = 24566916
>>> n_tup_hot_upd = 4073821079
>>> n_live_tup = 79942
>>> n_dead_tup = 71969
>>> n_mod_since_analyze = 12020
>>> last_vacuum = 2020-03-17 15:35:19.588859+00
>>> last_autovacuum = 2020-03-17 21:31:08.248598+00
>>> last_analyze = 2020-03-17 15:35:20.372875+00
>>> last_autoanalyze = 2020-03-17 22:04:41.76743+00
>>> vacuum_count = 9
>>> autovacuum_count = 135693
>>> analyze_count = 9
>>> autoanalyze_count = 495877
>>>
>>> As you can see in this table, there are only ~80K rows, but billions
>>> of updates.  What we have observed is that the frozenxid reaches the
>>> 200M mark fairly quickly because of the amount of activity.  What is
>>> interesting is that this happens with the 'postgres' and 'template1'
>>> databases as well and there is absolutely no activity in those
>>> databases.
>>>
>>> When the 'postgres' and/or 'template1' databases hit the
>>> freeze_max_age, there are cases where it kicks off an aggressive
>>> autovac of those tables which seems to prevent autovacs from running
>>> elsewhere.  Oddly, this is not consistent, but that condition seems to
>>> be required.  We have observed this across multiple PG12 servers (dev,
>>> test, staging, production) all with similar workloads.
>>
>> Is there anything in postgres and template1 besides what was created at
>> init?
>
> There is nothing in there at all besides system tables created at init.
>
>>
>> What are your settings for autovacuum?:
>>
>> https://www.postgresql.org/docs/12/runtime-config-autovacuum.html
>
> Here are the settings, these are the only ones that are not set to
> default with the exception of a few tables that have been overridden
> with a different value due to lots of updates and few rows:

And those values are?

More below.

>
> autovacuum = on
> log_autovacuum_min_duration = 0
> autovacuum_max_workers = 8
> autovacuum_naptime = 15s
> autovacuum_vacuum_threshold = 500
> autovacuum_analyze_threshold = 2500
> vacuum_cost_limit = 1000

Are either of the below set > 0?:

vacuum_cost_delay

autovacuum_vacuum_cost_delay

>
> We want fairly aggressive autovacs to keep table bloat limited -- the
> application latency suffers if it has to wade through dead tuples and
> staying near realtime is important in our environment.
>
> ** Also, it should be noted that the autovacuum_analyze_threshold is
> probably an incorrect value, we likely intended that to be 250 and
> just have now realized it after poking more at the configuration.
>
>>



--
Adrian Klaver
[hidden email]


Reply | Threaded
Open this post in threaded view
|

Re: Fwd: PG12 autovac issues

Adrian Klaver-4
On 3/18/20 6:57 AM, Justin King wrote:
Please reply to list also
Ccing list


>>> Here are the settings, these are the only ones that are not set to
>>> default with the exception of a few tables that have been overridden
>>> with a different value due to lots of updates and few rows:
>>
>> And those values are?
>
> Thanks for the response, hopefully this will help:

The below is helpful, but what I was referring to above was the settings
for the overridden tables.


>
> postgres=# select name,setting from pg_settings where name like '%vacuum%';
> name = setting
> autovacuum = on
> autovacuum_analyze_scale_factor = 0.1
> autovacuum_analyze_threshold = 2500
> autovacuum_freeze_max_age = 200000000
> autovacuum_max_workers = 8
> autovacuum_multixact_freeze_max_age = 400000000
> autovacuum_naptime = 15
> autovacuum_vacuum_cost_delay = 20
> autovacuum_vacuum_cost_limit = -1
> autovacuum_vacuum_scale_factor = 0.2
> autovacuum_vacuum_threshold = 500
> autovacuum_work_mem = -1
> log_autovacuum_min_duration = 0
> vacuum_cleanup_index_scale_factor = 0.1
> vacuum_cost_delay = 0
> vacuum_cost_limit = 1000
> vacuum_cost_page_dirty = 20
> vacuum_cost_page_hit = 1
> vacuum_cost_page_miss = 10
> vacuum_defer_cleanup_age = 0
> vacuum_freeze_min_age = 50000000
> vacuum_freeze_table_age = 150000000
> vacuum_multixact_freeze_min_age = 5000000
> vacuum_multixact_freeze_table_age = 150000000
>
>>
>> More below.
>>
>>>
>>> autovacuum = on
>>> log_autovacuum_min_duration = 0
>>> autovacuum_max_workers = 8
>>> autovacuum_naptime = 15s
>>> autovacuum_vacuum_threshold = 500
>>> autovacuum_analyze_threshold = 2500
>>> vacuum_cost_limit = 1000
>>
>> Are either of the below set > 0?:
>>
>> vacuum_cost_delay
>>
>> autovacuum_vacuum_cost_delay
>>
>>>
>>> We want fairly aggressive autovacs to keep table bloat limited -- the
>>> application latency suffers if it has to wade through dead tuples and
>>> staying near realtime is important in our environment.
>>>
>>> ** Also, it should be noted that the autovacuum_analyze_threshold is
>>> probably an incorrect value, we likely intended that to be 250 and
>>> just have now realized it after poking more at the configuration.
>>>
>>>>
>>
>>
>>
>> --
>> Adrian Klaver
>> [hidden email]


--
Adrian Klaver
[hidden email]


Reply | Threaded
Open this post in threaded view
|

Re: Fwd: PG12 autovac issues

Michael Lewis
Do you have default fillfactor set on this table? If not, I would wonder if reducing it to 50% or even 20% would allow many more HOT updates that would reduce bloat.

Also, is there any period of lower activity on your system that you could schedule a vacuum freeze for daily or weekly? I believe having frozen pages would also mean all the autovacuums would be able to skip more pages and therefore be faster.

>> autovacuum_vacuum_cost_delay = 20

This was changed to 2ms in PG12. You should reduce that most likely.
Reply | Threaded
Open this post in threaded view
|

Re: Re: PG12 autovac issues

Justin King
In reply to this post by Adrian Klaver-4
On Wed, Mar 18, 2020 at 10:13 AM Adrian Klaver
<[hidden email]> wrote:

>
> On 3/18/20 6:57 AM, Justin King wrote:
> Please reply to list also
> Ccing list
>
>
> >>> Here are the settings, these are the only ones that are not set to
> >>> default with the exception of a few tables that have been overridden
> >>> with a different value due to lots of updates and few rows:
> >>
> >> And those values are?
> >
> > Thanks for the response, hopefully this will help:
>
> The below is helpful, but what I was referring to above was the settings
> for the overridden tables.

Ah, apologies, I missed that.  They are:

alter table production.tita set
(autovacuum_analyze_scale_factor = 0, autovacuum_vacuum_scale_factor =
0, autovacuum_vacuum_threshold = 100000, autovacuum_analyze_threshold
= 50000);

>
> >
> > postgres=# select name,setting from pg_settings where name like '%vacuum%';
> > name = setting
> > autovacuum = on
> > autovacuum_analyze_scale_factor = 0.1
> > autovacuum_analyze_threshold = 2500
> > autovacuum_freeze_max_age = 200000000
> > autovacuum_max_workers = 8
> > autovacuum_multixact_freeze_max_age = 400000000
> > autovacuum_naptime = 15
> > autovacuum_vacuum_cost_delay = 20
> > autovacuum_vacuum_cost_limit = -1
> > autovacuum_vacuum_scale_factor = 0.2
> > autovacuum_vacuum_threshold = 500
> > autovacuum_work_mem = -1
> > log_autovacuum_min_duration = 0
> > vacuum_cleanup_index_scale_factor = 0.1
> > vacuum_cost_delay = 0
> > vacuum_cost_limit = 1000
> > vacuum_cost_page_dirty = 20
> > vacuum_cost_page_hit = 1
> > vacuum_cost_page_miss = 10
> > vacuum_defer_cleanup_age = 0
> > vacuum_freeze_min_age = 50000000
> > vacuum_freeze_table_age = 150000000
> > vacuum_multixact_freeze_min_age = 5000000
> > vacuum_multixact_freeze_table_age = 150000000
> >
> >>
> >> More below.
> >>
> >>>
> >>> autovacuum = on
> >>> log_autovacuum_min_duration = 0
> >>> autovacuum_max_workers = 8
> >>> autovacuum_naptime = 15s
> >>> autovacuum_vacuum_threshold = 500
> >>> autovacuum_analyze_threshold = 2500
> >>> vacuum_cost_limit = 1000
> >>
> >> Are either of the below set > 0?:
> >>
> >> vacuum_cost_delay
> >>
> >> autovacuum_vacuum_cost_delay
> >>
> >>>
> >>> We want fairly aggressive autovacs to keep table bloat limited -- the
> >>> application latency suffers if it has to wade through dead tuples and
> >>> staying near realtime is important in our environment.
> >>>
> >>> ** Also, it should be noted that the autovacuum_analyze_threshold is
> >>> probably an incorrect value, we likely intended that to be 250 and
> >>> just have now realized it after poking more at the configuration.
> >>>
> >>>>
> >>
> >>
> >>
> >> --
> >> Adrian Klaver
> >> [hidden email]
>
>
> --
> Adrian Klaver
> [hidden email]


Reply | Threaded
Open this post in threaded view
|

Re: Fwd: PG12 autovac issues

Justin King
In reply to this post by Michael Lewis
On Wed, Mar 18, 2020 at 1:40 PM Michael Lewis <[hidden email]> wrote:
>
> Do you have default fillfactor set on this table? If not, I would wonder if reducing it to 50% or even 20% would allow many more HOT updates that would reduce bloat.

I don't believe we have a default fillfactor, but I'm still trying to
understand why autovacs would completely stop -- that seems like a
bug.  Especially since there was no change between PG10 and PG12 and
this problem never existed there.

> Also, is there any period of lower activity on your system that you could schedule a vacuum freeze for daily or weekly? I believe having frozen pages would also mean all the autovacuums would be able to skip more pages and therefore be faster.
>
> >> autovacuum_vacuum_cost_delay = 20
>
> This was changed to 2ms in PG12. You should reduce that most likely.

Actually, we set that back from 2ms > 20ms to replicate what PG10 was
doing just in case it had an impact (spoiler: it did not).


Reply | Threaded
Open this post in threaded view
|

Re: Fwd: PG12 autovac issues

Michael Lewis
On Thu, Mar 19, 2020 at 9:31 AM Justin King <[hidden email]> wrote:
On Wed, Mar 18, 2020 at 1:40 PM Michael Lewis <[hidden email]> wrote:
>
> Do you have default fillfactor set on this table? If not, I would wonder if reducing it to 50% or even 20% would allow many more HOT updates that would reduce bloat.

I don't believe we have a default fillfactor, but I'm still trying to
understand why autovacs would completely stop -- that seems like a
bug.  Especially since there was no change between PG10 and PG12 and
this problem never existed there.

Is there any reason to not schedule vacuum freeze for each db daily? Just curious.
Reply | Threaded
Open this post in threaded view
|

Re: Fwd: PG12 autovac issues

Justin King
On Thu, Mar 19, 2020 at 11:02 AM Michael Lewis <[hidden email]> wrote:

>
> On Thu, Mar 19, 2020 at 9:31 AM Justin King <[hidden email]> wrote:
>>
>> On Wed, Mar 18, 2020 at 1:40 PM Michael Lewis <[hidden email]> wrote:
>> >
>> > Do you have default fillfactor set on this table? If not, I would wonder if reducing it to 50% or even 20% would allow many more HOT updates that would reduce bloat.
>>
>> I don't believe we have a default fillfactor, but I'm still trying to
>> understand why autovacs would completely stop -- that seems like a
>> bug.  Especially since there was no change between PG10 and PG12 and
>> this problem never existed there.
>
>
> Is there any reason to not schedule vacuum freeze for each db daily? Just curious.

No, not really -- it just feels like a bandaid for something that
should (and was) working already.  It is possible where I'm headed,
but I just thought I'd reach out to the community to see if anyone has
seen or could think of a reason why this might have started by moving
from PG10 > PG12 and whether it was some possible regression.