Stats collector's idx_blks_hit value is highly misleading in practice

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
1 message Options
Reply | Threaded
Open this post in threaded view
|

Stats collector's idx_blks_hit value is highly misleading in practice

Peter Geoghegan-4
It occurs to mean that statistics collector stats such as
pg_statio_*_tables.idx_blks_hit are highly misleading in practice
because they fail to take account of the difference between internal
pages and leaf pages in B-Tree indexes. These two types of pages are
in fundamentally different categories, and I think that failing to
recognize that at the level of these system views makes them much less
useful. Somebody should probably write a patch that makes this
difference clear from the system views. Possibly by using some
generalized notion of "record" pages instead of leaf pages, and
"metadata" pages instead of internal pages. That would even work with
hash indexes, I think.

Consider the following example, which is based on a standard nbtree
index, but could work in almost the same way with other index access
methods:

We have a pgbench_accounts pkey after initialization by pgbench at
scale 1500. It has 409,837 leaf pages and 1,451 internal pages,
meaning that about one third of one percent of all pages in the index
are internal pages. Occasionally, with indexes on large text strings
we might notice that as many as 1% of all index pages are internal
pages, but that's very much on the high side. Generally speaking,
we're virtually guaranteed to have *all* internal pages in
shared_buffers once a steady state has been reached. Once the cache
warms up, point lookups (like the queries pgbench performs) will only
have to access one leaf page at most, which amounts to only one I/O at
most. (This asymmetry is the main reason why B-Trees are generally
very effective when buffered in a buffer cache.)

If we run the pgbench queries against this database/example index
we'll find that we have to access 4 index pages per query execution --
the root, two additional internal pages, plus a leaf page. Based on
the reasonable assumptions I'm making, 3 out of 4 of those pages will
be hits when steady state is reached with pgbench's SELECT-only
workload, regardless of how large shared_buffers is or how bloated the
index is (we only need 1451 buffers for that, and those are bound to
get hot quickly).

The overall effect is idx_blks_hit changes over time in a way that
makes no sense -- even to an expert. Let's say we start with this
entire 3213 MB pgbench index in shared_buffers. We should only get
increments in idx_blks_hit, never increments in idx_blks_read - that
much makes sense. If we then iteratively shrink shared_buffers (or
equivalently, make the index grow without adding a new level), the
proportion of page accesses that increment idx_blks_read (rather than
incrementing idx_blks_hit) goes up roughly linearly as misses increase
linearly - which also makes sense. But here is the silly part: we
cannot really have a hit rate of less than 75% if you compare
idx_blks_hit to idx_blks_read, unless and until we can barely even fit
1% of the index in memory (at which point it's hard to distinguish
from noise). So if we naively consume the current view we'll see a hit
rate that starts at 100%, and very slowly shrinks to 75%, which is
where we bottom out (more or less, roughly speaking). This behavior
seems pretty hard to defend to me.

If somebody fixed this by putting internal pages into their own bucket
in the system view, then motivated users would quickly learn that
internal page stats aren't really useful -- they are only included for
completeness. They're such a small contributor to the overall hit rate
that they can simply be ignored completely. The thing that users ought
to focus on is leaf page hit rate. Now index hit rate (by which I mean
leaf page hit rate) actually makes sense. Note that Heroku promoted
simple heuristics like this for many years.

I suppose that a change like this could end up affecting other things,
such as EXPLAIN ANALYZE statistics. OTOH we only break out index pages
separately for bitmap scans at the moment, so maybe it could be fairly
well targeted. And, maybe this is unappealing given the current
statistics collector limitations. I'm not volunteering to work on it
right now, but it would be nice to fix this. Please don't wait for me
to do it.

--
Peter Geoghegan