[patch] BUG #15005: ANALYZE can make pg_class.reltuples inaccurate.

[patch] BUG #15005: ANALYZE can make pg_class.reltuples inaccurate.

DavidGould

Please add the attached patch and this discussion to the open commitfest. The
original bug thread is here: [hidden email].

Bug reference:      15005
Logged by:          David Gould
Email address:      [hidden email]
PostgreSQL version: 10.1 and earlier
Operating system:   Linux
Description:        

ANALYZE can make pg_class.reltuples wildly inaccurate compared to the actual
row counts for tables larger than the ANALYZE sample size
(300 * default_statistics_target pages).

Example from one of a client's production instances:

# analyze verbose pg_attribute;
INFO:  analyzing "pg_catalog.pg_attribute"
INFO:  "pg_attribute": scanned 30000 of 24519424 pages, containing 6475 live rows and 83 dead rows; 6475 rows in sample, 800983035 estimated total rows.

This is a large complex database - pg_attribute actually has about five
million rows and needs about one hundred thousand pages. However, it has
become extremely bloated and is using 25 million pages (192GB!), about 250
times too much. This happened despite aggressive autovacuum settings and a
periodic bloat monitoring script. Since pg_class.reltuples was 800 million,
the bloat monitoring script did not detect that this table was bloated and
autovacuum did not think it needed vacuuming.
 
When reltuples is very large compared to the actual row count it causes a
number of problems:

- Bad input to the query planner.
- Prevents autovacuum from processing large bloated tables, because
  autovacuum_scale_factor * reltuples is so large that the threshold is
  rarely reached.
- Deceives bloat checking tools that rely on the relationship of relpages
  to reltuples*average_row_size.

I've tracked down how this happens and created a reproduction script and a
patch. Attached:

- analyze_reltuples_bug-v1.patch   Patch against master
- README.txt                       Instructions for testing
- analyze_reltuples_bug.sql        Reproduction script
- analyze_counts.awk               Helper for viewing results of test
- test_standard.txt                Test output for unpatched postgresql 10.1
- test_patched.txt                 Test output with patch

The patch applies cleanly, with some offsets, to 9.4.15, 9.5.10, 9.6.6 and 10.1.

Note that this is not the same as the reltuples calculation bug discussed in the
thread at [hidden email]. That one is
mainly concerned with vacuum, this with analyze. The two bugs do amplify each
other though.

Analysis:
---------

Analyze and vacuum calculate the new value for pg_class.reltuples in
vacuum.c:vac_estimate_reltuples():

    old_density = old_rel_tuples / old_rel_pages;
    new_density = scanned_tuples / scanned_pages;
    multiplier = (double) scanned_pages / (double) total_pages;
    updated_density = old_density + (new_density - old_density) * multiplier;
    return floor(updated_density * total_pages + 0.5);

The comments talk about the difference between VACUUM and ANALYZE and explain
that VACUUM probably only scanned changed pages, so the density of the scanned
pages is not representative of the rest of the unchanged table. Hence the new
overall density of the table should be adjusted in proportion to scanned pages
versus total pages, which makes sense. However, despite the comment noting
that ANALYZE and VACUUM are different, the code actually does the same
calculation for both.

The problem is that it dilutes the impact of ANALYZE on reltuples for large
tables:

- For a table of 3000000 pages an ANALYZE can only move the reltuples
  value about 1% of the way toward what it actually measured (see the
  sketch below).
- When combined with changes in relpages due to bloat the new computed
  reltuples can end up far from reality.
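
To make the dilution concrete, here is a minimal standalone sketch (plain C
with illustrative numbers, not PostgreSQL code) of a single ANALYZE update on
a 3,000,000 page table whose real density has dropped to half of what
reltuples implies:

#include <math.h>
#include <stdio.h>

/* Same arithmetic as vac_estimate_reltuples(), illustrative numbers only. */
int main(void)
{
    double total_pages    = 3000000.0;
    double scanned_pages  = 30000.0;      /* 300 * default_statistics_target */
    double old_rel_tuples = 300000000.0;  /* stale estimate: 100 rows/page   */
    double scanned_tuples = 1500000.0;    /* sample saw only 50 rows/page    */

    /* table has not grown here, so old_rel_pages == total_pages */
    double old_density = old_rel_tuples / total_pages;    /* 100.0 */
    double new_density = scanned_tuples / scanned_pages;  /*  50.0 */
    double multiplier  = scanned_pages / total_pages;     /*  0.01 */
    double updated_density =
        old_density + (new_density - old_density) * multiplier;

    /* 100 + (50 - 100) * 0.01 = 99.5 rows/page: only 1% of the correction
     * is applied, so reltuples barely moves (300M -> 298.5M) even though
     * the sample says the table holds about 150M rows. */
    printf("new reltuples: %.0f\n",
           floor(updated_density * total_pages + 0.5));
    return 0;
}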

Reproducing the reltuples analyze estimate bug.
-----------------------------------------------

The script "analyze_reltuples_bug.sql" creates a table that is large
compared to the analyze sample size and then repeatedly updates about
10% of it, running ANALYZE after each iteration. The bug is that the
value ANALYZE computes for pg_class.reltuples tends to increase each
time even though the actual row count does not change.

To run:

Given a postgresql 10.x server with >= 1GB of shared buffers:

  createdb test
  psql --no-psqlrc -f analyze_reltuples_bug.sql test > test_standard.out 2>&1
  awk -f analyze_counts.awk test_standard.out

To verify the fix, restart postgres with a patched binary and repeat
the above.

Here are the results with an unpatched server:

After 10 iterations of:
  update 10% of rows;
  analyze

reltuples has almost doubled.

               / estimated rows  /    /   pages   /   /sampled rows/
   relname       current  proposed    total scanned    live    dead
reltuples_test  10000001  10000055   153847   3000   195000       0
reltuples_test  10981367   9951346   169231   3000   176410   18590
reltuples_test  11948112  10039979   184615   3000   163150   31850
reltuples_test  12900718  10070666   200000   3000   151060   43940
reltuples_test  13835185   9739305   215384   3000   135655   59345
reltuples_test  14758916   9864947   230768   3000   128245   66755
reltuples_test  15674572  10138631   246153   3000   123565   71435
reltuples_test  16576847   9910944   261537   3000   113685   81315
reltuples_test  17470388  10019961   276922   3000   108550   86450
reltuples_test  18356707  10234607   292306   3000   105040   89960
reltuples_test  19228409   9639927   307690   3000    93990  101010

-dg

--
David Gould                                   [hidden email]
If simplicity worked, the world would be overrun with insects.

Attachment: README.txt (1K)
Attachment: analyze_reltuples_bug.sql (2K)
Attachment: analyze_reltuples_bug-v1.patch (4K)
Attachment: test_standard.txt (5K)
Attachment: test_patched.txt (5K)

Re: [patch] BUG #15005: ANALYZE can make pg_class.reltuples inaccurate.

Alexander Kuzmenkov
Hi David,

I was able to reproduce the problem using your script.
analyze_counts.awk is missing, though.

The idea of using the result of ANALYZE as-is, without additional
averaging, was discussed when vac_estimate_reltuples() was introduced
originally. Ultimately, it was decided not to do so. You can find the
discussion in this thread:
https://www.postgresql.org/message-id/flat/BANLkTinL6QuAm_Xf8teRZboG2Mdy3dR_vw%40mail.gmail.com#BANLkTinL6QuAm_Xf8teRZboG2Mdy3dR_vw@...

The core problem here seems to be that this calculation of moving
average does not converge in your scenario. It can be shown that when
the number of live tuples is constant and the number of pages grows, the
estimated number of tuples will increase at each step. Do you think we
can use some other formula that would converge in this scenario, but
still filter the noise in ANALYZE results? I couldn't think of one yet.
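
For what it's worth, here is a minimal standalone simulation of exactly that
scenario (not PostgreSQL code, just the vac_estimate_reltuples() arithmetic,
with numbers loosely modelled on your test case: 10M live rows, roughly 10%
page growth per step, 3000 sample pages). Even when every sample measures the
true density perfectly, the estimate climbs at each step, because the old
density is effectively scaled by the new, larger page count:

#include <math.h>
#include <stdio.h>

int main(void)
{
    double live_tuples  = 10000000.0;
    double sample_pages = 3000.0;
    double relpages     = 153847.0;     /* pages recorded at the last ANALYZE */
    double reltuples    = live_tuples;  /* start with a correct estimate */

    for (int i = 1; i <= 10; i++)
    {
        double total_pages = relpages + 15385.0;        /* table bloats */

        double old_density = reltuples / relpages;      /* old page count */
        double new_density = live_tuples / total_pages; /* a perfect sample */
        double multiplier  = sample_pages / total_pages;
        double updated_density =
            old_density + (new_density - old_density) * multiplier;

        /* the old density gets multiplied by the *new* page count */
        reltuples = floor(updated_density * total_pages + 0.5);
        relpages  = total_pages;

        printf("step %2d: pages %.0f reltuples %.0f\n", i, relpages, reltuples);
    }
    return 0;
}

Its output climbs step after step, much like the "current" column in the test
results earlier in the thread.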

--
Alexander Kuzmenkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: [patch] BUG #15005: ANALYZE can make pg_class.reltuples inaccurate.

DavidGould
On Wed, 28 Feb 2018 15:55:19 +0300
Alexander Kuzmenkov <[hidden email]> wrote:

> Hi David,
>
> I was able to reproduce the problem using your script.
> analyze_counts.awk is missing, though.

Attached now I hope. I think I also added it to the commitfest page.

 
> The idea of using the result of ANALYZE as-is, without additional
> averaging, was discussed when vac_estimate_reltuples() was introduced
> originally. Ultimately, it was decided not to do so. You can find the
> discussion in this thread:
> https://www.postgresql.org/message-id/flat/BANLkTinL6QuAm_Xf8teRZboG2Mdy3dR_vw%40mail.gmail.com#BANLkTinL6QuAm_Xf8teRZboG2Mdy3dR_vw@...

Well, that was a long discussion. I'm not sure I would agree that there was a
firm conclusion on what to do about ANALYZE results. There was some
recognition that the case of ANALYZE is different from VACUUM, and that is
reflected in the original code comments too. However, the actual code ended up
being the same for both ANALYZE and VACUUM. This patch is about that.

See messages:
https://www.postgresql.org/message-id/BANLkTimVhdO_bKQagRsH0OLp7MxgJZDryg%40mail.gmail.com
https://www.postgresql.org/message-id/BANLkTimaDj950K-298JW09RrmG0eJ_C%3DqQ%40mail.gmail.com
https://www.postgresql.org/message-id/28116.1306609295%40sss.pgh.pa.us

> The core problem here seems to be that this calculation of moving
> average does not converge in your scenario. It can be shown that when
> the number of live tuples is constant and the number of pages grows, the
> estimated number of tuples will increase at each step. Do you think we
> can use some other formula that would converge in this scenario, but
> still filter the noise in ANALYZE results? I couldn't think of one yet.

Besides the test data generated with the script, I have parsed the analyze
verbose output for several large production systems running complex
applications and have found that for tables larger than the statistics
sample size (300 * default_statistics_target pages) the row count you can
calculate from (pages / sample_pages) * live_rows is pretty accurate, within
a few percent of the value from count(*).
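
For instance, plugging the pg_attribute numbers from the start of this thread
into that calculation:

  (24519424 / 30000) * 6475 = 5292175

which is close to the roughly five million rows the table actually holds,
versus the 800983035 that ended up in reltuples.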

In theory the sample pages analyze uses should represent the whole table
fairly well. We rely on this to generate pg_statistic and it is a key
input to the planner. Why should we not believe in it as much only for
reltuples? If the analyze sampling does not work, the fix would be to improve
that, not to disregard it piecemeal.

My motivation is that I have seen large systems fighting mysterious run-away
bloat for years no matter how aggressively autovacuum is tuned. The fact that
an inflated reltuples can cause autovacuum to simply ignore tables forever
seems worth fixing.

-dg



--
David Gould                                   [hidden email]
If simplicity worked, the world would be overrun with insects.

Attachment: analyze_counts.awk (1K)

Re: [patch] BUG #15005: ANALYZE can make pg_class.reltuples inaccurate.

Alexander Kuzmenkov
On 01.03.2018 06:23, David Gould wrote:
> In theory the sample pages analyze uses should represent the whole table
> fairly well. We rely on this to generate pg_statistic and it is a key
> input to the planner. Why should we not believe in it as much only for
> reltuples? If the analyze sampling does not work, the fix would be to improve
> that, not to disregard it piecemeal.

Well, that sounds reasonable. But the problem with the moving average
calculation remains. Suppose you run vacuum and not analyze. If the
updates are random enough, vacuum won't be able to reclaim all the
pages, so the number of pages will grow. Again, we'll have the same
thing where the number of pages grows, the real number of live tuples
stays constant, and the estimated reltuples grows after each vacuum run.

I did some more calculations on paper to try to understand this. If we
average reltuples directly, instead of averaging tuple density, it
converges like it should. The error with this density calculation seems
to be that we're effectively multiplying the old density by the new
number of pages. I'm not sure why we even work with tuple density. We
could just estimate the number of tuples based on analyze/vacuum, and
then apply moving average to it. The calculations would be shorter, too.
What do you think?

--
Alexander Kuzmenkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: [patch] BUG #15005: ANALYZE can make pg_class.reltuples inaccurate.

Tom Lane-2
Alexander Kuzmenkov <[hidden email]> writes:
> On 01.03.2018 06:23, David Gould wrote:
>> In theory the sample pages analyze uses should represent the whole table
>> fairly well. We rely on this to generate pg_statistic and it is a key
>> input to the planner. Why should we not believe in it as much only for
>> reltuples? If the analyze sampling does not work, the fix would be to improve
>> that, not to disregard it piecemeal.

> Well, that sounds reasonable. But the problem with the moving average
> calculation remains. Suppose you run vacuum and not analyze. If the
> updates are random enough, vacuum won't be able to reclaim all the
> pages, so the number of pages will grow. Again, we'll have the same
> thing where the number of pages grows, the real number of live tuples
> stays constant, and the estimated reltuples grows after each vacuum run.

You claimed that before, with no more evidence than this time, and I still
don't follow your argument.  The number of pages may indeed bloat but the
number of live tuples per page will fall.  Ideally, at least, the estimate
would remain on-target.  If it doesn't, there's some other effect that
you haven't explained.  It doesn't seem to me that the use of a moving
average would prevent that from happening.  What it *would* do is smooth
out errors from the inevitable sampling bias in any one vacuum or analyze
run, and that seems like a good thing.

> I did some more calculations on paper to try to understand this. If we
> average reltuples directly, instead of averaging tuple density, it
> converges like it should. The error with this density calculation seems
> to be that we're effectively multiplying the old density by the new
> number of pages. I'm not sure why we even work with tuple density. We
> could just estimate the number of tuples based on analyze/vacuum, and
> then apply moving average to it. The calculations would be shorter, too.
> What do you think?

I think you're reinventing the way we used to do it.  Perhaps consulting
the git history in the vicinity of this code would be enlightening.

                        regards, tom lane


Re: [patch] BUG #15005: ANALYZE can make pg_class.reltuples inaccurate.

Alexander Kuzmenkov
On 01.03.2018 18:09, Tom Lane wrote:

> Ideally, at least, the estimate would remain on-target.

The test shows that under this particular scenario the estimated number
of tuples grows after each ANALYZE. I tried to explain how this happens
in the attached pdf. The direct averaging of the number of tuples, not
using the density, doesn't have this problem, so I suppose it could help.

> I think you're reinventing the way we used to do it.  Perhaps consulting
> the git history in the vicinity of this code would be enlightening.

I see that before vac_estimate_reltuples was introduced, the results of
analyze and vacuum were used directly, without averaging. What I am
suggesting is to use a different way of averaging, not to remove it.

--
Alexander Kuzmenkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Attachment: analyze.tex.pdf (79K)

Re: [patch] BUG #15005: ANALYZE can make pg_class.reltuples inaccurate.

DavidGould
In reply to this post by Alexander Kuzmenkov
On Thu, 1 Mar 2018 17:25:09 +0300
Alexander Kuzmenkov <[hidden email]> wrote:

> Well, that sounds reasonable. But the problem with the moving average
> calculation remains. Suppose you run vacuum and not analyze. If the
> updates are random enough, vacuum won't be able to reclaim all the
> pages, so the number of pages will grow. Again, we'll have the same
> thing where the number of pages grows, the real number of live tuples
> stays constant, and the estimated reltuples grows after each vacuum run.

I agree VACUUM's moving average may be imperfect, but the rationale makes
sense and I don't have a plan to improve it now. This patch only intends to
improve the behavior of ANALYZE by using the estimated row density times
relpages to get reltuples. It does not change VACUUM.

The problem with the moving average for ANALYZE is that it prevents ANALYZE
from changing the reltuples estimate enough for large tables.

Consider this based on the test setup from the patch:

create table big as select id*p, ten, hun, thou, tenk, lahk, meg, padding
  from reltuples_test,
  generate_series(0,9) g(p);
-- SELECT 100000000
alter table big set (autovacuum_enabled=false);

select count(*) from big;
--    count
--  100000000
select reltuples::int, relpages from pg_class where relname = 'big';
--  reltuples | relpages
--          0 |        0

analyze verbose big;
-- INFO:  analyzing "public.big"
-- INFO:  "big": scanned 30000 of 1538462 pages, containing 1950000 live rows and 0 dead rows;
--        30000 rows in sample, 100000030 estimated total rows

select reltuples::int, relpages from pg_class where relname = 'big';
--  reltuples | relpages
--  100000032 |  1538462

delete from big where ten > 1;
-- DELETE 80000000
select count(*) from big;
--   count  
--  20000000
select reltuples::int, relpages from pg_class where relname = 'big';
--  reltuples | relpages
--  100000032 |  1538462

analyze verbose big;
-- INFO:  analyzing "public.big"
-- INFO:  "big": scanned 30000 of 1538462 pages, containing 388775 live rows and 1561225 dead rows;
--        30000 rows in sample, 98438807 estimated total rows

select reltuples::int, relpages from pg_class where relname = 'big';
--  reltuples | relpages
--   98438808 |  1538462
select count(*) from big;
--   count  
--  20000000

analyze verbose big;
-- INFO:  analyzing "public.big"
-- INFO:  "big": scanned 30000 of 1538462 pages, containing 390885 live rows and 1559115 dead rows;
--         30000 rows in sample, 96910137 estimated total rows

select reltuples::int, relpages from pg_class where relname = 'big';
--  reltuples | relpages
--   96910136 |  1538462

Table big has 1.5 million pages. ANALYZE samples 30 thousand. No matter how
many rows we change in big, ANALYZE can only move the reltuples estimate
toward the newly measured density by a factor of 30000/1538462, i.e. about
1.9 percent per run.
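
For reference, plugging the numbers from the second ANALYZE above into the
current formula shows where its 98 million estimate comes from:

  old_density = 100000032 / 1538462              = 65.00
  new_density = 388775 / 30000                   = 12.96
  multiplier  = 30000 / 1538462                  = 0.0195
  updated     = 65.00 + (12.96 - 65.00) * 0.0195 = 63.99
  reltuples   = 63.99 * 1538462                  = ~98.4 million

so each run moves less than 2 percent of the way toward the roughly 20 million
rows the sample actually implies.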

With the patch on this same table we get:

select count(*) from big;
--   count
--  20000000
select reltuples::int, relpages from pg_class where relname = 'big';
--  reltuples | relpages
--   96910136 |  1538462

analyze verbose big;
-- INFO:  analyzing "public.big"
-- INFO:  "big": scanned 30000 of 1538462 pages, containing 390745 live rows and 1559255 dead rows;
--        30000 rows in sample, 20038211 estimated total rows

select reltuples::int, relpages from pg_class where relname = 'big';
--  reltuples | relpages
--   20038212 |  1538462

-dg

--
David Gould                                   [hidden email]
If simplicity worked, the world would be overrun with insects.


Re: [patch] BUG #15005: ANALYZE can make pg_class.reltuples inaccurate.

Tom Lane-2
In reply to this post by Alexander Kuzmenkov
Alexander Kuzmenkov <[hidden email]> writes:
> On 01.03.2018 18:09, Tom Lane wrote:
>> Ideally, at least, the estimate would remain on-target.

> The test shows that under this particular scenario the estimated number
> of tuples grows after each ANALYZE. I tried to explain how this happens
> in the attached pdf.

I looked at this and don't think it really answers the question.  What
happens is that, precisely because we only slowly adapt our estimate of
density towards the new measurement, we will have an overestimate of
density if the true density is decreasing (even if the new measurement is
spot-on), and that corresponds exactly to an overestimate of reltuples.
No surprise there.  The question is why it fails to converge to reality
over time.

I think part of David's point is that because we only allow ANALYZE to
scan a limited number of pages even in a very large table, that creates
an artificial limit on the slew rate of the density estimate; perhaps
what's happening in his tables is that the true density is dropping
faster than that limit allows us to adapt.  Still, if there's that
much going on in his tables, you'd think VACUUM would be touching
enough of the table that it would keep the estimate pretty sane.
So I don't think we yet have a convincing explanation of why the
estimates drift worse over time.

Anyway, I find myself semi-persuaded by his argument that we are
already assuming that ANALYZE has taken a random sample of the table,
so why should we not believe its estimate of density too?  Aside from
simplicity, that would have the advantage of providing a way out of the
situation when the existing reltuples estimate has gotten drastically off.

The sticking point in my mind right now is, if we do that, what to do with
VACUUM's estimates.  If you believe the argument in the PDF that we'll
necessarily overshoot reltuples in the face of declining true density,
then it seems like that argument applies to VACUUM as well.  However,
VACUUM has the issue that we should *not* believe that it looked at a
random sample of pages.  Maybe the fact that it looks exactly at the
changed pages causes it to see something less than the overall density,
cancelling out the problem, but that seems kinda optimistic.

Anyway, as I mentioned in the 2011 thread, the existing computation is
isomorphic to the rule "use the old density estimate for the pages we did
not look at, and the new density estimate --- ie, exactly scanned_tuples
--- for the pages we did look at".  That still has a lot of intuitive
appeal, especially for VACUUM where there's reason to believe those page
populations aren't alike.  We could recast the code to look like it's
doing that rather than doing a moving-average, although the outcome
should be the same up to roundoff error.
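
A sketch of that recast might look like this (not committed code; it should
give the same answer as the moving-average form up to roundoff):

#include <math.h>

/*
 * Sketch of the reformulation described above: assume the old density still
 * holds for the pages this VACUUM/ANALYZE did not visit, and use the tuples
 * actually counted on the pages it did visit.
 */
double
vac_estimate_reltuples_recast(double old_rel_tuples, double old_rel_pages,
                              double total_pages, double scanned_pages,
                              double scanned_tuples)
{
    double old_density     = old_rel_tuples / old_rel_pages;
    double unscanned_pages = total_pages - scanned_pages;

    /* unvisited pages keep the old density; visited pages contribute
     * exactly what was counted on them */
    return floor(old_density * unscanned_pages + scanned_tuples + 0.5);
}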

                        regards, tom lane


Re: [patch] BUG #15005: ANALYZE can make pg_class.reltuples inaccurate.

Alexander Kuzmenkov
On 02.03.2018 02:49, Tom Lane wrote:
> I looked at this and don't think it really answers the question.  What
> happens is that, precisely because we only slowly adapt our estimate of
> density towards the new measurement, we will have an overestimate of
> density if the true density is decreasing (even if the new measurement is
> spot-on), and that corresponds exactly to an overestimate of reltuples.
> No surprise there.  The question is why it fails to converge to reality
> over time.

The calculation I made for the first step applies to the next steps too,
with minor differences. So, the estimate increases at each step. Just
out of interest, I plotted the reltuples for 60 steps, and it doesn't
look like it's going to converge anytime soon (see attached).
Looking at the formula, this overshoot term is created when we multiply
the old density by the new number of pages. I'm not sure how to fix
this. I think we could average the number of tuples, not the densities.
The attached patch demonstrates what I mean.
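
In rough outline, the idea is something like this (a sketch of the approach,
not the attached patch itself):

#include <math.h>

/*
 * Sketch only: apply the moving average to estimated tuple counts rather
 * than to densities.  The sample is extrapolated to the whole table first,
 * and only then blended with the old reltuples value.
 */
double
estimate_reltuples_by_count(double old_rel_tuples, double total_pages,
                            double scanned_pages, double scanned_tuples)
{
    double sampled_total = scanned_tuples / scanned_pages * total_pages;
    double multiplier    = scanned_pages / total_pages;

    return floor(old_rel_tuples
                 + (sampled_total - old_rel_tuples) * multiplier + 0.5);
}

With this form a growing page count no longer inflates the estimate by
itself: when the extrapolated sample keeps reporting the same tuple count,
the result converges toward it instead of drifting upward.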

--
Alexander Kuzmenkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Attachment: analyze.png (19K)
Attachment: reltuples-avg.patch (3K)

Re: [patch] BUG #15005: ANALYZE can make pg_class.reltuples inaccurate.

DavidGould
On Fri, 2 Mar 2018 18:47:44 +0300
Alexander Kuzmenkov <[hidden email]> wrote:

> The calculation I made for the first step applies to the next steps too,
> with minor differences. So, the estimate increases at each step. Just
> out of interest, I plotted the reltuples for 60 steps, and it doesn't
> look like it's going to converge anytime soon (see attached).
> Looking at the formula, this overshoot term is created when we multiply
> the old density by the new number of pages. I'm not sure how to fix
> this. I think we could average the number of tuples, not the densities.
> The attached patch demonstrates what I mean.

I'm confused at this point. I provided a patch that addresses this and a
test case. We seem to be discussing everything as if we had only just noticed
the issue. Have you reviewed the patch and the attached analysis and tested
it? Please comment on that.

Thanks.

Also, here is a datapoint that I found just this morning on a client's
production system:

INFO:  "staging_xyz": scanned 30000 of 50000953 pages, containing 63592 live rows and 964346 dead rows;
       30000 rows in sample, 1959918155 estimated total rows

# select (50000953.0/30000*63592)::int as nrows;
   nrows          
-----------
 105988686

This table's reltuples is 18 times the actual row count. It will never
converge because with 50000953 pages analyze can only adjust reltuples by
about 0.0006 of the difference each time.
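
Roughly, assuming relpages stays around 50000953 and the sample keeps
measuring about the same density:

  multiplier  = 30000 / 50000953 = ~0.0006
  one ANALYZE:  1959918155 - (1959918155 - 105988686) * 0.0006 = ~1958806000

so each run closes only about 0.06% of the gap, and it would take thousands
of ANALYZE runs to get anywhere near the real count.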

It will also almost never get vacuumed, because the autovacuum threshold of
0.2 * 1959918155 = 391983631 is about 3.7 times larger than the actual row
count.

The submitted patch makes ANALYZE effective in setting reltuples to within
a few percent of the count(*) value.

-dg


--
David Gould                                   [hidden email]
If simplicity worked, the world would be overrun with insects.


Re: [patch] BUG #15005: ANALYZE can make pg_class.reltuples inaccurate.

Tom Lane-2
David Gould <[hidden email]> writes:
> I'm confused at this point, I provided a patch that addresses this and a
> test case. We seem to be discussing everything as if we first noticed the
> issue. Have you reviewed the patch and and attached analysis and tested it?
> Please commment on that?

I've looked at the patch.  The questions in my mind are

(1) do we really want to go over to treating ANALYZE's tuple density
result as gospel, contradicting the entire thrust of the 2011 discussion?

(2) what should we do in the VACUUM case?  Alexander's argument seems
to apply with just as much force to the VACUUM case, so either you
discount that or you conclude that VACUUM needs adjustment too.

> This tables reltuples is 18 times the actual row count. It will never converge
> because with 50000953 pages analyze can only adjust reltuples by 0.0006 each time.

But by the same token, analyze only looked at 0.0006 of the pages.  It's
nice that for you, that's enough to get a robust estimate of the density
everywhere; but I have a nasty feeling that that won't hold good for
everybody.  The whole motivation of the 2011 discussion, and the issue
that is also seen in some other nearby discussions, is that the density
can vary wildly.

                        regards, tom lane


Re: [patch] BUG #15005: ANALYZE can make pg_class.reltuples inaccurate.

DavidGould
In reply to this post by Tom Lane-2
On Thu, 01 Mar 2018 18:49:20 -0500
Tom Lane <[hidden email]> wrote:


> The sticking point in my mind right now is, if we do that, what to do with
> VACUUM's estimates.  If you believe the argument in the PDF that we'll
> necessarily overshoot reltuples in the face of declining true density,
> then it seems like that argument applies to VACUUM as well.  However,
> VACUUM has the issue that we should *not* believe that it looked at a
> random sample of pages.  Maybe the fact that it looks exactly at the
> changed pages causes it to see something less than the overall density,
> cancelling out the problem, but that seems kinda optimistic.

For what it's worth, I think the current estimate formula for VACUUM is
pretty reasonable. Consider a table T with N rows and P pages clustered
on serial key k. Assume reltuples is initially correct.

Then after:

 delete from T where k < 0.2 * (select max(k) from T);
 vacuum T;

Vacuum will touch the first 20% of the pages due to the visibility map, the
sample will have 0 live rows, and scanned pages will be 0.2 * P.

Then according to the current code:

    old_density = old_rel_tuples / old_rel_pages;
    new_density = scanned_tuples / scanned_pages;
    multiplier = (double) scanned_pages / (double) total_pages;
    updated_density = old_density + (new_density - old_density) * multiplier;
    return floor(updated_density * total_pages + 0.5);

the new density will be:

   N/P + (0/(0.2*P) - N/P) * 0.2
 = N/P - N/P * 0.2
 = 0.8 * N/P

New reltuples estimate will be 0.8 * old_reltuples. Which is what we wanted.


If we evenly distribute the deletes across the table:

  delete from T where random() < 0.2;

Then vacuum will scan all the pages, the sample will have 0.8 * N live rows,
scanned pages will be 1.0 * P. The new density will be

   N/P + (0.8*N/(1.0*P) - N/P) * 1.0
 = N/P + (0.8 N/P - N/P)
 = N/P - 0.2 * N/P
 = 0.8 * N/P

Which again gives new reltuples as 0.8 * old_reltuples and is again correct.

I believe that, given a good initial estimate of reltuples and relpages, and
assuming that the pages vacuum does not scan do not change in density, the
vacuum calculation does the right thing.
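
A quick numeric check of both cases (a throwaway sketch, not PostgreSQL code)
gives 0.8 * N either way:

#include <stdio.h>

static double
vac_formula(double old_tuples, double old_pages, double total_pages,
            double scanned_pages, double scanned_tuples)
{
    double old_density = old_tuples / old_pages;
    double new_density = scanned_tuples / scanned_pages;
    double multiplier  = scanned_pages / total_pages;

    return (old_density + (new_density - old_density) * multiplier) * total_pages;
}

int main(void)
{
    double N = 1000000.0, P = 10000.0;

    /* clustered delete: first 20% of pages scanned, no live rows found there */
    printf("clustered: %.0f\n", vac_formula(N, P, P, 0.2 * P, 0.0));

    /* even delete: all pages scanned, 0.8 * N live rows found */
    printf("even:      %.0f\n", vac_formula(N, P, P, 1.0 * P, 0.8 * N));
    return 0;
}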

However, for ANALYZE the case is different.

-dg

--
David Gould                                   [hidden email]
If simplicity worked, the world would be overrun with insects.


Re: [patch] BUG #15005: ANALYZE can make pg_class.reltuples inaccurate.

DavidGould
In reply to this post by Tom Lane-2
On Fri, 02 Mar 2018 17:17:29 -0500
Tom Lane <[hidden email]> wrote:


> But by the same token, analyze only looked at 0.0006 of the pages.  It's
> nice that for you, that's enough to get a robust estimate of the density
> everywhere; but I have a nasty feeling that that won't hold good for
> everybody.

My grasp of statistics is somewhat weak, so please inform me if I've got
this wrong, but every time I've looked into it I've found that one can get
pretty good accuracy and confidence with fairly small samples. Typically 1000
samples will serve no matter the population size if the desired margin of
error is 5%. Even with 99% confidence and a 1% margin of error it takes less
than 20,000 samples. See the table at:

http://www.research-advisors.com/tools/SampleSize.htm

Since we have 30000 sample pages by default, and since ANALYZE takes some
trouble to get a random sample, I think we really can rely on the results of
extrapolating reltuples from analyze.

-dg

--
David Gould                                   [hidden email]
If simplicity worked, the world would be overrun with insects.


Re: [patch] BUG #15005: ANALYZE can make pg_class.reltuples inaccurate.

Jeff Janes
In reply to this post by DavidGould
On Wed, Jan 17, 2018 at 4:49 PM, David Gould <[hidden email]> wrote:

> Please add the attached patch and this discussion to the open commitfest. The
> original bug thread is here: [hidden email].
>
> Bug reference:      15005
> Logged by:          David Gould
> Email address:      [hidden email]
> PostgreSQL version: 10.1 and earlier
> Operating system:   Linux
> Description:
>
> ANALYZE can make pg_class.reltuples wildly inaccurate compared to the actual
> row counts for tables larger than the ANALYZE sample size
> (300 * default_statistics_target pages).
>
> Example from one of a client's production instances:
>
> # analyze verbose pg_attribute;
> INFO:  analyzing "pg_catalog.pg_attribute"
> INFO:  "pg_attribute": scanned 30000 of 24519424 pages, containing 6475 live rows and 83 dead rows; 6475 rows in sample, 800983035 estimated total rows.
>
> This is a large complex database - pg_attribute actually has about five
> million rows and needs about one hundred thousand pages. However, it has
> become extremely bloated and is using 25 million pages (192GB!), about 250
> times too much. This happened despite aggressive autovacuum settings and a
> periodic bloat monitoring script. Since pg_class.reltuples was 800 million,
> the bloat monitoring script did not detect that this table was bloated and
> autovacuum did not think it needed vacuuming.

I can see how this issue would prevent ANALYZE from fixing the problem, but I don't see how it could have caused the problem in the first place.  In your demonstration case, you had to turn off autovac in order to get it to happen, and then when autovac is turned back on, it is all primed for an autovac to launch, go through, touch almost all of the pages, and fix it for you.  How did your original table get into a state where this wouldn't happen?

Maybe a well-timed crash caused n_dead_tup to get reset to zero and that is why autovac is not kicking in?  What are the pg_stat_user_tables numbers and the state of the visibility map for your massively bloated table, if you still have them?

In any event, I agree with your analysis that ANALYZE should set the number of tuples from scratch.  After all, it sets the other estimates, such as MCV, from scratch, and those are much more fragile to sampling than the raw number of tuples is.  But if the default target is set to 1, that would scan only 300 pages.  I think that is a rather small sample size to base an estimate on, but it isn't clear to me that using 300 pages plus whacking them around with an exponential averaging is really going to be much better.  And if you set your default target to 1, that is more-or-less what you signed up for.

It is a little weird to have VACUUM incrementally update and then ANALYZE compute from scratch and discard the previous value, but no weirder than what we currently do, where ANALYZE incrementally updates despite being specifically designed to representatively sample the entire table.  So I don't think we need to decide what to do about VACUUM before we can do something about ANALYZE.

So I support your patch.  There is probably more investigation and work that could be done in this area, but those could be different patches, not blocking this one.

Cheers,

Jeff

Re: [patch] BUG #15005: ANALYZE can make pg_class.reltuples inaccurate.

DavidGould
On Sun, 4 Mar 2018 07:49:46 -0800
Jeff Janes <[hidden email]> wrote:

> On Wed, Jan 17, 2018 at 4:49 PM, David Gould <[hidden email]> wrote:

> > # analyze verbose pg_attribute;
> > INFO:  "pg_attribute": scanned 30000 of 24519424 pages, containing 6475
> > live rows and 83 dead rows; 6475 rows in sample, 800983035 estimated total
> > rows.
>
> I can see how this issue would prevent ANALYZE from fixing the problem, but
> I don't see how it could have caused the problem in the first place.  In
> your demonstration case, you had to turn off autovac in order to get it to
> happen, and then when autovac is turned back on, it is all primed for an
> autovac to launch, go through, touch almost all of the pages, and fix it
> for you.  How did your original table get into a state where this wouldn't
> happen?
>
> Maybe a well-timed crash caused n_dead_tup to get reset to zero and that is
> why autovac is not kicking in?  What are the pg_stat_user_table number and
> the state of the visibility map for your massively bloated table, if you
> still have them?

We see this sort of thing pretty routinely on more than just catalogs, but
catalogs are where it really hurts. These systems are 40 cores/80 threads, 1
TB memory, Fusion IO. Databases are 5 to 10 TB with 100,000 to 200,000 tables.
Tables are updated in batches every few minutes, 100 threads at a time. There
are also some long running queries that don't help. Due to the large number of
tables and the high rate of mutation it can take a long time between visits
from autovacuum, especially since autovacuum builds a list of pending work and
then processes it to completion, so new tables in need of vacuum can't even be
seen until all the old work is done. For what it is worth, streaming
replication doesn't work either, as the single threaded recovery can't keep up
with the 80 thread mutator.

We tried relying on external scripts to address the most bloated tables, but
those also depended on reltuples to detect bloat, so they missed a lot. For a
long time we simply had recurring crises. Once I figured out that ANALYZE
could not set reltuples effectively, we worked around it by running ANALYZE
VERBOSE on all the large tables and parsing the notices to calculate the row
count the same way as in the patch. This works, but it is a nuisance.

The main pain points are that when reltuples gets inflated there is no way
to fix it: autovacuum stops looking at the table, and a hand-run ANALYZE can't
reset reltuples. The only cure is VACUUM FULL, but that is not really
practical without unacceptable amounts of downtime.

-dg

--
David Gould                                   [hidden email]
If simplicity worked, the world would be overrun with insects.


Re: [patch] BUG #15005: ANALYZE can make pg_class.reltuples inaccurate.

DavidGould
In reply to this post by Jeff Janes
On Sun, 4 Mar 2018 07:49:46 -0800
Jeff Janes <[hidden email]> wrote:

> In any event, I agree with your analysis that ANALYZE should set the number
> of tuples from scratch.  After all, it sets the other estimates, such as
> MCV, from scratch, and those are much more fragile to sampling than just
> the raw number of tuples are.  But if the default target is set to 1, that
> would scan only 300 pages.  I think that that is a little low of a sample
> size to base an estimate on, but it isn't clear to that using 300 pages
> plus whacking them around with an exponential averaging is really going to
> be much better.  And if you set your default target to 1, that is
> more-or-less what you signed up for.
>
> It is little weird to have VACUUM incrementally update and then ANALYZE
> compute from scratch and discard the previous value, but no weirder than
> what we currently do of having ANALYZE incrementally update despite that it
> is specifically designed to representatively sample the entire table.  So I
> don't think we need to decide what to do about VACUUM before we can do
> something about ANALYZE.

Thanks. I was going to add the point about trusting ANALYZE for the
statistics but not for reltuples, but you beat me to it. 300 sample pages
would be on the small side; as you say, that's asking for it. Even the old
default target of 10 gives 3000 sample pages, which is probably plenty.

I think the method VACUUM uses is appropriate and probably correct for
VACUUM, but not for ANALYZE. This is actually hinted at in the original
comments but not reflected in the code.

-dg

--
David Gould                                   [hidden email]
If simplicity worked, the world would be overrun with insects.


Re: [patch] BUG #15005: ANALYZE can make pg_class.reltuples inaccurate.

DavidGould
In reply to this post by Jeff Janes
On Sun, 4 Mar 2018 07:49:46 -0800
Jeff Janes <[hidden email]> wrote:

> I don't see how it could have caused the problem in the first place.  In
> your demonstration case, you had to turn off autovac in order to get it to
> happen, and then when autovac is turned back on, it is all primed for an
> autovac to launch, go through, touch almost all of the pages, and fix it
> for you.  How did your original table get into a state where this wouldn't
> happen?

One more way for this to happen: vacuum was including the dead tuples in the
estimate in addition to the live tuples. This is a separate bug that tends
to aggravate the one I'm trying to fix. See the thread re BUG #15005 at:

https://www.postgresql.org/message-id/16db4468-edfa-830a-f921-39a50498e77e%402ndquadrant.com

> It seems to me that VACUUM and ANALYZE somewhat disagree on what exactly
> reltuples means. VACUUM seems to be thinking that
>
>      reltuples = live + dead
>
> while ANALYZE apparently believes that
>
>      reltuples = live

There is a patch for this one from Tomas Vondra/Tom Lane that I hope will
land in the next set of releases.

-dg

--
David Gould                                   [hidden email]
If simplicity worked, the world would be overrun with insects.


Re: [patch] BUG #15005: ANALYZE can make pg_class.reltuples inaccurate.

Tom Lane-2
David Gould <[hidden email]> writes:
> On Sun, 4 Mar 2018 07:49:46 -0800
> Jeff Janes <[hidden email]> wrote:
>> I don't see how it could have caused the problem in the first place.  In
>> your demonstration case, you had to turn off autovac in order to get it to
>> happen, and then when autovac is turned back on, it is all primed for an
>> autovac to launch, go through, touch almost all of the pages, and fix it
>> for you.  How did your original table get into a state where this wouldn't
>> happen?

> One more way for this to happen, vacuum was including the dead tuples in the
> estimate in addition to the live tuples.

FWIW, I've been continuing to think about this and poke at your example,
and I am having the same difficulty as Jeff.  While it's clear that if you
managed to get into a state with wildly inflated reltuples, ANALYZE would
fail to get out of it, it's not clear how you could get to such a state
without additional contributing factors.  This ANALYZE behavior only seems
to result in an incremental increase in reltuples per run, and so that
shouldn't prevent autovacuum from happening and fixing the estimate ---
maybe not as quickly as it should happen, but it'd happen.

The reasons I'm harping on this are (a) if there are additional bugs
contributing to the problem, we need to identify them and fix them;
(b) we need to understand what the triggering conditions are in some
detail, so that we can decide whether this bug is bad enough to justify
back-patching a behavioral change.  I remain concerned that the proposed
fix is too simplistic and will have some unforeseen side-effects, so
I'd really rather just put it in HEAD and let it go through a beta test
cycle before it gets loosed on the world.

Another issue I have after thinking more is that we need to consider
what should happen during a combined VACUUM+ANALYZE.  In this situation,
with the proposed patch, we'd overwrite VACUUM's result with an estimate
derived from ANALYZE's sample ... even though VACUUM's result might've
come from a full-table scan and hence be exact.  In the existing code
a small ANALYZE sample can't do a lot of damage to VACUUM's result, but
that would no longer be true with this.  I'm inclined to feel that we
should trust VACUUM's result for reltuples more than ANALYZE's, on the
grounds that if there actually was any large change in reltuples, VACUUM
would have looked at most of the pages and hence have a more reliable
number.  Hence I think we should skip the pg_class update for ANALYZE if
it's in a combined VACUUM+ANALYZE, at least unless ANALYZE looked at all
(most of?) the pages.  That could be mechanized with something like

- if (!inh)
+ if (!inh && !(options & VACOPT_VACUUM))

controlling do_analyze_rel's call to vac_update_relstats, maybe with a
check on scanned_pages vs total_pages.  Perhaps the call to
pgstat_report_analyze needs to be filtered similarly (or maybe we still
want to report that an analyze happened, but somehow tell the stats
collector not to change its counts?)
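
For illustration, the check might be shaped roughly like this (a sketch only;
scanned_pages/totalblocks and the 0.9 cutoff are assumptions for the sake of
the example, not existing do_analyze_rel code):

/* Sketch: skip ANALYZE's pg_class update in a combined VACUUM ANALYZE,
 * unless the sample covered most of the table anyway. */
bool    update_relstats = !inh;

if ((options & VACOPT_VACUUM) &&
    scanned_pages < 0.9 * totalblocks)
    update_relstats = false;    /* keep the VACUUM-derived reltuples */

if (update_relstats)
    vac_update_relstats(onerel, ...);   /* as before */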

Also, as a stylistic matter, I'd be inclined not to touch
vac_estimate_reltuples' behavior.  The place where the rubber is meeting
the road is

    *totalrows = vac_estimate_reltuples(onerel, true,
                                        totalblocks,
                                        bs.m,
                                        liverows);
    if (bs.m > 0)
        *totaldeadrows = floor((deadrows / bs.m) * totalblocks + 0.5);
    else
        *totaldeadrows = 0.0;

and it seems to me it'd make more sense to abandon the use of
vac_estimate_reltuples entirely, and just calculate totalrows in a fashion
exactly parallel to totaldeadrows.  (I think that's how the code used to
look here ...)
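
Concretely, that would amount to something like this (a sketch mirroring the
snippet above, not an actual diff):

if (bs.m > 0)
{
    /* extrapolate live rows exactly the way dead rows are extrapolated,
     * instead of going through vac_estimate_reltuples() */
    *totalrows     = floor((liverows / bs.m) * totalblocks + 0.5);
    *totaldeadrows = floor((deadrows / bs.m) * totalblocks + 0.5);
}
else
{
    *totalrows = 0.0;
    *totaldeadrows = 0.0;
}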

In HEAD, we could then drop vac_estimate_reltuples' is_analyze argument
altogether, though that would be unwise in the back branches (if we
back-patch) since we have no idea whether any third party code is calling
this function.

                        regards, tom lane


Re: [patch] BUG #15005: ANALYZE can make pg_class.reltuples inaccurate.

Tom Lane-2
In reply to this post by DavidGould
David Gould <[hidden email]> writes:
> On Thu, 01 Mar 2018 18:49:20 -0500
> Tom Lane <[hidden email]> wrote:
>> The sticking point in my mind right now is, if we do that, what to do with
>> VACUUM's estimates.

> For what it's worth, I think the current estimate formula for VACUUM is
> pretty reasonable. Consider a table T with N rows and P pages clustered
> on serial key k. Assume reltuples is initially correct.

If the starting condition involves uniform tuple density throughout the
table, with matching reltuples/relpages ratio, then any set of changes
followed by one VACUUM will produce the right reltuples (to within
roundoff error) at the end.  This can be seen by recognizing that VACUUM
will visit every changed page, and the existing calculation is equivalent
to "assume the old tuple density is correct for the unvisited pages, and
then add on the measured tuple count within the visited pages".  I'm a bit
inclined to reformulate and redocument the calculation that way, in hopes
that people would find it more convincing.

However, things get less good if the initial state is nonuniform and
we do a set of updates that line up with the nonuniformity.  For
example, start with a uniformly full table, and update 50% of the
rows lying within the first 20% of the pages.  Now those 20% are
only half full of live tuples, and the table has grown by 10%, with
all those added pages full.  Do a VACUUM.  It will process the first
20% and the new 10% of pages, and arrive at a correct reltuples count
per the above argument.  But now, reltuples/relpages reflects an average
tuple density that's only about 90% of maximum.  Next, delete the
surviving tuples in the first 20% of pages, and again VACUUM.  VACUUM
will examine only the first 20% of pages, and find that they're devoid
of live tuples.  It will then update reltuples using the 90% density
figure as the estimate of what's in the remaining pages, and that's
too small, so that reltuples will drop to about 90% of the correct
value.
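
To put rough numbers on that (illustrative figures: 100 rows per page, 1000
pages, 100000 rows to start):

  after the update:   1100 pages, 100000 live rows
  VACUUM 1 scans 300 pages and counts 20000 live rows:
      density   = 100 + (20000/300 - 100) * (300/1100) = 90.9
      reltuples = 90.9 * 1100 = 100000     (correct, but density now 90.9)
  after the delete:   1100 pages, 90000 live rows
  VACUUM 2 scans 200 pages and counts 0 live rows:
      density   = 90.9 + (0 - 90.9) * (200/1100) = 74.4
      reltuples = 74.4 * 1100 = ~81800     (about 91% of the true 90000)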

Lacking an oracle (small "o"), I do not think there's much we can do
about this, without resorting to very expensive measures such as
scanning the whole table.  (It's somewhat interesting to speculate
about whether scanning the table's FSM could yield useful data, but
I'm unsure that I'd trust the results much.)  The best we can do is
hope that correlated update patterns like this are uncommon.

Maybe this type of situation is an argument for trusting an ANALYZE-based
estimate more than the VACUUM-based estimate.  I remain uncomfortable with
that in cases where VACUUM looked at much more of the table than ANALYZE
did, though.  Maybe we need some heuristic based on the number of pages
actually visited by each pass?

                        regards, tom lane


Re: [patch] BUG #15005: ANALYZE can make pg_class.reltuples inaccurate.

DavidGould
In reply to this post by Tom Lane-2
On Tue, 06 Mar 2018 11:16:04 -0500
Tom Lane <[hidden email]> wrote:

> so that we can decide whether this bug is bad enough to justify
> back-patching a behavioral change.  I remain concerned that the proposed
> fix is too simplistic and will have some unforeseen side-effects, so
> I'd really rather just put it in HEAD and let it go through a beta test
> cycle before it gets loosed on the world.

It happens to us fairly regularly and causes lots of problems. However,
I'm agreeable to putting it in HEAD for now; my client can build from
source to pick up this patch until that ships, but doesn't want to maintain
their own fork forever. That said, if it does get through beta I'd hope we
could back-patch at that time.

-dg

--
David Gould                                   [hidden email]
If simplicity worked, the world would be overrun with insects.
