[DOC] Document concurrent index builds waiting on each other


Re: [DOC] Document concurrent index builds waiting on each other

Andres Freund
Hi,

On 2020-04-15 21:44:48 -0400, James Coleman wrote:

> On Wed, Apr 15, 2020 at 6:31 PM Andres Freund <[hidden email]> wrote:
> > If it's about the xmin horizon for vacuum: I think we could probably
> > avoid that using the same flag. As vacuum cannot be run against a table
> > that has a CIC running (although it'd theoretically be possible to allow
> > that), it should be safe to ignore PROC_IN_CIC backends in vacuum's
> > GetOldestXmin() call. That might not be true for system relations, but
> > we don't allow CIC on those.
>
> Yeah, I mean that if I have a CIC running on table X then vacuum can't
> remove dead tuples (from after the CIC's snapshot) on table Y.

For me "blocking" evokes waiting for a lock, which is why I thought
you'd not mean that issue.


> That's a pretty significant danger, given the combination of:
> 1. Index builds on very large tables can take many days, and

We at least don't hold a single snapshot over the multiple phases...


> 2. The well understood problems of high update tables with dead tuples
> and poor plans.

Which specific problem are you referring to? The planner probing the end
of the index for values outside of the histogram? I'd hope
3ca930fc39ccf987c1c22fd04a1e7463b5dd0dfd improved the situation there a
bit?


> > [description why we could ignore CIC for vacuum horizon on other tables ]

> I've previously discussed this with other hackers and the reasoning
> they'd understood was that we couldn't always safely ignore
> PROC_IN_CIC backends in the vacuum's oldest xmin call because of
> function indexes, and the fact that (despite clear recommendations to
> the contrary), there's nothing actually preventing someone from adding
> a function index on table X that queries table Y.

Well, even if we consider this an actual problem, we could still use
PROC_IN_CIC for non-expression non-partial indexes (index operator
themselves better ensure this isn't a problem, or they're ridiculously
broken already - they can get called during vacuum).

Even when expressions are involved, I don't think that necessarily would
have to mean that we need to use the same snapshot to run expressions in
for the whole scan. So we could occasionally take a new snapshot for the
purpose of computing expressions.

The hard part presumably would be that we'd need to advertise one xmin
for the expression snapshot to protect tuples potentially accessed from
being removed, but at the same time we also need to advertise the xmin
of the snapshot used by CIC, to avoid HOT pruning in other sessions from
removing tuple versions from the table the index is being created
on.

There's not really infrastructure for doing so. I think we'd basically
have to start publicizing multiple xmin values (as long as PGXACT->xmin
is <= new xmin for expressions, only GetOldestXmin() would need to care,
and it's not that performance critical). Not pretty.
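A sketch of what "publicizing multiple xmin values" might mean. Everything here is hypothetical illustration, not actual PostgreSQL code: a backend advertises its usual snapshot xmin plus an optional extra xmin for the expression snapshot, and a GetOldestXmin()-style computation honors both.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical illustration only -- not actual PostgreSQL structures.
 * A backend advertises its usual snapshot xmin (PGXACT->xmin in the
 * real tree) plus an optional extra xmin for the occasionally-refreshed
 * expression snapshot; 0 means "not set". */
typedef struct BackendXmins {
    uint32_t snapshot_xmin;
    uint32_t expr_xmin;
} BackendXmins;

/* A GetOldestXmin()-style horizon computation has to honor both values.
 * Per the description above, snapshot_xmin would stay <= expr_xmin, so
 * most readers could keep looking at snapshot_xmin alone; only this
 * function needs to care about the extra field. */
static uint32_t oldest_xmin(const BackendXmins *procs, int n, uint32_t latest)
{
    uint32_t result = latest;

    for (int i = 0; i < n; i++) {
        if (procs[i].snapshot_xmin != 0 && procs[i].snapshot_xmin < result)
            result = procs[i].snapshot_xmin;
        if (procs[i].expr_xmin != 0 && procs[i].expr_xmin < result)
            result = procs[i].expr_xmin;
    }
    return result;
}
```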


> I'm not sure I buy that we should care about people doing something
> clearly so dangerous, but...I grant that it'd be nice not to cause new
> crashes.

I don't think it's just dangerous expressions that would be
affected. Normal expression indexes need to be able to do syscache
lookups etc, and they can't safely do so if tuple versions can be
removed in the middle of a scan. We could avoid that by not ignoring
PROC_IN_CIC backends in GetOldestXmin() calls for catalog tables (yuck).
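As a toy illustration of that last point (hypothetical names only; the real GetOldestXmin() logic is far more involved): the rule would be that a PROC_IN_CIC backend's xmin may be disregarded when computing the removal horizon for ordinary user tables, but never for catalog tables, where syscache lookups must stay safe.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PROC_IN_CIC 0x40        /* flag value from the patch downthread */

/* Toy stand-in for a backend's shared state. */
typedef struct Proc {
    uint32_t xmin;
    uint8_t  flags;
} Proc;

/* Sketch of the proposed rule: a CIC backend's xmin may be ignored when
 * computing the removal horizon for ordinary user tables, but not for
 * catalog tables. */
static bool xmin_ignorable(const Proc *p, bool rel_is_catalog)
{
    return (p->flags & PROC_IN_CIC) != 0 && !rel_is_catalog;
}
```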

Regards,

Andres



Re: [DOC] Document concurrent index builds waiting on each other

James Coleman
On Thu, Apr 16, 2020 at 6:12 PM Andres Freund <[hidden email]> wrote:

>
> Hi,
>
> On 2020-04-15 21:44:48 -0400, James Coleman wrote:
> > On Wed, Apr 15, 2020 at 6:31 PM Andres Freund <[hidden email]> wrote:
> > > If it's about the xmin horizon for vacuum: I think we could probably
> > > avoid that using the same flag. As vacuum cannot be run against a table
> > > that has a CIC running (although it'd theoretically be possible to allow
> > > that), it should be safe to ignore PROC_IN_CIC backends in vacuum's
> > > GetOldestXmin() call. That might not be true for system relations, but
> > > we don't allow CIC on those.
> >
> > Yeah, I mean that if I have a CIC running on table X then vacuum can't
> > remove dead tuples (from after the CIC's snapshot) on table Y.
>
> For me "blocking" evokes waiting for a lock, which is why I thought
> you'd not mean that issue.

It was a sloppy choice of language on my part; for better or worse, at
work we've taken to talking about "blocking vacuum" when that's really
shorthand for "blocking [or, if you prefer, preventing] vacuuming dead
tuples".

> > That's a pretty significant danger, given the combination of:
> > 1. Index builds on very large tables can take many days, and
>
> We at least don't hold a single snapshot over the multiple phases...

For sure. And text sorting improvements have made this better also;
still, as you often point out re: xid size, databases are only getting
larger (and handling more TPS).

> > 2. The well understood problems of high update tables with dead tuples
> > and poor plans.
>
> Which specific problem are you referring to? The planner probing the end
> of the index for values outside of the histogram? I'd hope
> 3ca930fc39ccf987c1c22fd04a1e7463b5dd0dfd improved the situation there a
> bit?

Yes, and other commits too, IIRC from the time we spent debugging
exactly the scenario mentioned in that commit.

But by "poor plans" I don't mean specifically "poor planning time" but
that we can still end up choosing the "wrong" plan, right? And dead
tuples can make an index scan be significantly worse than it would
otherwise be. Same for a seq scan: you can end up looking at millions
of dead tuples in a nominally 500 row table.

> > > [description why we could ignore CIC for vacuum horizon on other tables ]
>
> > I've previously discussed this with other hackers and the reasoning
> > they'd understood was that we couldn't always safely ignore
> > PROC_IN_CIC backends in the vacuum's oldest xmin call because of
> > function indexes, and the fact that (despite clear recommendations to
> > the contrary), there's nothing actually preventing someone from adding
> > a function index on table X that queries table Y.
>
> Well, even if we consider this an actual problem, we could still use
> PROC_IN_CIC for non-expression non-partial indexes (index operator
> themselves better ensure this isn't a problem, or they're ridiculously
> broken already - they can get called during vacuum).

Agreed. It'd be unfortunate to have to limit it though.

> Even when expressions are involved, I don't think that necessarily would
> have to mean that we need to use the same snapshot to run expressions in
> for the whole scan. So we could occasionally take a new snapshot for the
> purpose of computing expressions.
>
> The hard part presumably would be that we'd need to advertise one xmin
> for the expression snapshot to protect tuples potentially accessed from
> being removed, but at the same time we also need to advertise the xmin
> of the snapshot used by CIC, to avoid HOT pruning in other sessions from
> removing tuple versions from the table the index is being created
> on.
>
> There's not really infrastructure for doing so. I think we'd basically
> have to start publicizing multiple xmin values (as long as PGXACT->xmin
> is <= new xmin for expressions, only GetOldestXmin() would need to care,
> and it's not that performance critical). Not pretty.

In other words, pretty invasive.

> > I'm not sure I buy that we should care about people doing something
> > clearly so dangerous, but...I grant that it'd be nice not to cause new
> > crashes.
>
> I don't think it's just dangerous expressions that would be
> affected. Normal expression indexes need to be able to do syscache
> lookups etc, and they can't safely do so if tuple versions can be
> removed in the middle of a scan. We could avoid that by not ignoring
> PROC_IN_CIC backends in GetOldestXmin() calls for catalog tables (yuck).

At first glance this sounds a lot less invasive, but I also agree it's gross.

James



Re: [DOC] Document concurrent index builds waiting on each other

David G Johnston
The following review has been posted through the commitfest application:
make installcheck-world:  not tested
Implements feature:       not tested
Spec compliant:           not tested
Documentation:            tested, passed

James,

I'm on board with explicitly pointing out that "concurrent index builds on multiple tables at the same time will not return on any one table until all have completed", with back-patching.  I do not believe the new paragraph is necessary though.  I'd suggest trying to weave it into the existing paragraph ending "Even then, however, the index may not be immediately usable for queries: in the worst case, it cannot be used as long as transactions exist that predate the start of the index build."  Adding "Notably, " in front of the existing sentence fragment above and tacking it onto the end probably suffices.

I don't actually know whether this is the true behavior, though.  Is it something our tests do, or could, demonstrate?

It is sorta weird to say "one will not return until all have completed", though, since usually people think "return" means "completed".  That whole paragraph is a bit unclear for the inexperienced DBA, in particular the notion of an index being marked ready to use but not actually being usable.

That isn't really on this patch to fix though, and the clarity around concurrent CIC seems worthwhile to add, even if imprecise - IMO it doesn't make that whole section any less clear and points out what seems to be a unique dynamic.  IOW I would send the simple fix (inline, not a new paragraph) to a committer.  The bigger doc reworking or actual behavioral improvements shouldn't hold up such a simple improvement.

David J.

The new status of this patch is: Waiting on Author

Re: [DOC] Document concurrent index builds waiting on each other

James Coleman
On Thu, Jul 16, 2020 at 7:34 PM David Johnston
<[hidden email]> wrote:

>
> The following review has been posted through the commitfest application:
> make installcheck-world:  not tested
> Implements feature:       not tested
> Spec compliant:           not tested
> Documentation:            tested, passed
>
> James,
>
> I'm on board with explicitly pointing out that "concurrent index builds on multiple tables at the same time will not return on any one table until all have completed", with back-patching.  I do not believe the new paragraph is necessary though.  I'd suggest trying to weave it into the existing paragraph ending "Even then, however, the index may not be immediately usable for queries: in the worst case, it cannot be used as long as transactions exist that predate the start of the index build."  Adding "Notably, " in front of the existing sentence fragment above and tacking it onto the end probably suffices.

I'm not sure "the index may not be immediately usable for queries" is
really accurate/sufficient: it seems to imply the CREATE INDEX has
returned but for some reason the index isn't yet valid. The issue I'm
trying to describe here is that the CREATE INDEX query itself will not
return until all preceding queries have completed *including*
concurrent index creations on unrelated tables.

> I don't actually know whether this is true behavior though.  Is it something our tests do, or could, demonstrate?

It'd take tests that exercise parallelism, but it's pretty simple to
demonstrate (though you do have to catch the first index build in a scan
phase, so you either need lots of data or a hack). Here's an example
that uses a bit of a hack to simulate a slow scan phase:

Setup:
create table items(i int);
create table others(i int);
create function slow_expr() returns text as $$ select pg_sleep(15);
select '5'; $$ language sql immutable;
insert into items(i) values (1), (2);
insert into others(i) values (1), (2);

Then the following in order:
1. In session A: create index concurrently on items((i::text || slow_expr()));
2. In session B (at the same time): create index concurrently on others(i);

You'll notice that the 2nd command, which should be practically
instantaneous, waits on the first ~30s scan phase of (1) before it
returns. The same is true if after (2) completes you immediately run
it again -- it waits on the second ~30s scan phase of (1).

That does reveal a bit of complexity, though, that the current patch
doesn't address: this behavior can be phase dependent (and that
complexity gets a lot more non-obvious when there's real live activity,
particularly long-running transactions, in the system as well).

I've attached a new patch series with two items:
1. A simpler (and I believe more correct) doc change for "cic blocks
cic on other tables".
2. A patch to document that all index builds can prevent tuples from
being vacuumed away on other tables.

If it's preferable we could commit the first and discuss the second
separately, but since that limitation was also discussed up-thread, I
decided to include them both here for now.

James

v2-0001-Document-concurrent-indexes-waiting-on-each-other.patch (2K)
v2-0002-Document-vacuum-on-one-table-depending-on-concurr.patch (1K)

Re: [DOC] Document concurrent index builds waiting on each other

Alvaro Herrera-9
In reply to this post by Andres Freund
On 2020-Mar-25, Andres Freund wrote:

> What I was thinking of was a new flag, with a distinct value from
> PROC_IN_VACUUM. It'd currently just be specified in the
> GetCurrentVirtualXIDs() calls in WaitForOlderSnapshots(). That'd avoid
> needing to wait for other CICs on different relations. Since CIC is not
> permitted on system tables, and CIC doesn't do DML on normal tables, it
> seems fairly obviously correct to exclude other CICs.

Hmm, that does work, and seems a pretty small patch -- attached.  Of
course, some more commentary is necessary, but the theory of operation
is as Andres says.  (It does not solve the vacuuming problem I was
describing in the other thread, only the spurious waiting that James is
complaining about in this thread.)

I'm going to try and poke holes in this now ... (Expression indexes with
falsely immutable functions?)

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

0001-Flag-CREATE-INDEX-CONCURRENTLY-to-avoid-spurious-wai.patch (3K)

Re: [DOC] Document concurrent index builds waiting on each other

Alvaro Herrera-9
On 2020-Aug-04, Alvaro Herrera wrote:

> diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
> index b20e2ad4f6..43c8ea3e31 100644
> --- a/src/include/storage/proc.h
> +++ b/src/include/storage/proc.h
> @@ -53,6 +53,8 @@ struct XidCache
>  #define PROC_IS_AUTOVACUUM 0x01 /* is it an autovac worker? */
>  #define PROC_IN_VACUUM 0x02 /* currently running lazy vacuum */
>  #define PROC_IN_ANALYZE 0x04 /* currently running analyze */
> +#define PROC_IN_CIC 0x40 /* currently running CREATE INDEX
> +   CONCURRENTLY */
>  #define PROC_VACUUM_FOR_WRAPAROUND 0x08 /* set by autovac only */
>  #define PROC_IN_LOGICAL_DECODING 0x10 /* currently doing logical
>   * decoding outside xact */

Hah, I missed adding the new bit to PROC_VACUUM_STATE_MASK here.
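For context, the mask in question collects the flag bits describing a backend's vacuum-related state, so the fix is simply to OR the new bit in. A compilable sketch (values mirror the quoted header; treat this as illustration rather than the exact patch):

```c
#include <assert.h>

/* Flag bits, mirroring the quoted src/include/storage/proc.h */
#define PROC_IS_AUTOVACUUM          0x01
#define PROC_IN_VACUUM              0x02
#define PROC_IN_ANALYZE             0x04
#define PROC_VACUUM_FOR_WRAPAROUND  0x08
#define PROC_IN_LOGICAL_DECODING    0x10
#define PROC_IN_CIC                 0x40    /* the new flag */

/* The vacuum-state mask must now include PROC_IN_CIC, or code that
 * copies these bits in bulk would silently drop the new flag. */
#define PROC_VACUUM_STATE_MASK \
    (PROC_IN_VACUUM | PROC_IN_ANALYZE | PROC_VACUUM_FOR_WRAPAROUND | PROC_IN_CIC)
```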

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



PROC_IN_ANALYZE stillborn 13 years ago

Alvaro Herrera-9
Back in the 8.3 cycle (2007) when the autovacuum launcher/worker split
was done, we annoyed people because it blocked DDL.  That led to an
effort to cancel autovac automatically when that was detected, by Simon
Riggs.
https://postgr.es/m/1191526327.4223.204.camel@...
https://postgr.es/m/1192129897.4233.433.camel@...

I was fixated on only cancelling when it was ANALYZE, to avoid losing
any VACUUM work.
https://postgr.es/m/20071025164150.GF23566@...
That turned into some flags for PGPROC to detect whether a process was
ANALYZE, and cancel only those.
https://postgr.es/m/20071024151328.GG6559@...
Commit:
https://postgr.es/m/20071024205536.CB425754229@...
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=745c1b2c2ab

However, I was outvoted, so we do not limit cancellation to analyze.
Patch and discussion: https://postgr.es/m/20071025164150.GF23566@...
Commit:
https://postgr.es/m/20071026204510.AA02E754229@...
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=acac68b2bca

... which means the flag I had added two days earlier has never been
used for anything.  We've carried the flag forward to this day for
almost 13 years, dutifully turning it on and off ... but never checking
it anywhere.

I propose to remove it, as in the attached patch.

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

0001-Remove-PROC_IN_ANALYZE.patch (3K)

Re: PROC_IN_ANALYZE stillborn 13 years ago

Andres Freund
Hi,

On 2020-08-05 19:55:49 -0400, Alvaro Herrera wrote:

> Back in the 8.3 cycle (2007) when the autovacuum launcher/worker split
> was done, we annoyed people because it blocked DDL.  That led to an
> effort to cancel autovac automatically when that was detected, by Simon
> Riggs.
> https://postgr.es/m/1191526327.4223.204.camel@...
> https://postgr.es/m/1192129897.4233.433.camel@...
>
> I was fixated on only cancelling when it was ANALYZE, to avoid losing
> any VACUUM work.
> https://postgr.es/m/20071025164150.GF23566@...
> That turned into some flags for PGPROC to detect whether a process was
> ANALYZE, and cancel only those.
> https://postgr.es/m/20071024151328.GG6559@...
> Commit:
> https://postgr.es/m/20071024205536.CB425754229@...
> https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=745c1b2c2ab
>
> However, I was outvoted, so we do not limit cancellation to analyze.
> Patch and discussion: https://postgr.es/m/20071025164150.GF23566@...
> Commit:
> https://postgr.es/m/20071026204510.AA02E754229@...
> https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=acac68b2bca
>
> ... which means the flag I had added two days earlier has never been
> used for anything.  We've carried the flag forward to this day for
> almost 13 years, dutifully turning it on and off ... but never checking
> it anywhere.
>
> I propose to remove it, as in the attached patch.

I'm mildly against that, because I'd really like to start making use of
the flag. Not so much for cancellations, but to avoid the drastic impact
analyze has on bloat.  In OLTP workloads with big tables, and without
disabled cost limiting for analyze (or slow IO), the snapshot that
analyze holds is often by far the transaction with the oldest xmin.

It's not entirely trivial to fix (just ignoring it could lead to
detoasting issues), but also not that hard.

Only mildly against because it'd not be hard to reintroduce once we need
it.

Greetings,

Andres Freund



Re: PROC_IN_ANALYZE stillborn 13 years ago

Simon Riggs
On Thu, 6 Aug 2020 at 02:07, Andres Freund <[hidden email]> wrote:

On 2020-08-05 19:55:49 -0400, Alvaro Herrera wrote: 
> ... which means the flag I had added two days earlier has never been
> used for anything.  We've carried the flag forward to this day for
> almost 13 years, dutifully turning it on and off ... but never checking
> it anywhere.
>
> I propose to remove it, as in the attached patch.

I'm mildly against that, because I'd really like to start making use of
the flag. Not so much for cancellations, but to avoid the drastic impact
analyze has on bloat.  In OLTP workloads with big tables, and without
disabled cost limiting for analyze (or slow IO), the snapshot that
analyze holds is often by far the transaction with the oldest xmin.

It's not entirely trivial to fix (just ignoring it could lead to
detoasting issues), but also not that hard.

Only mildly against because it'd not be hard to reintroduce once we need
it.

Good points, both.

The most obvious way to avoid long analyze snapshots is to have the analysis take multiple snapshots as it runs, rather than try to invent some clever way of ignoring the analyze snapshots (which, as Alvaro points out, we never did). All we need to do is have an analyze snapshot last for at most N rows, but keep scanning until we have the desired sample size. Doing that would mean the analyze sample wouldn't come from a single snapshot, but then who cares? There is no requirement for consistency - the sample would arguably be *more* stable because it comes from multiple points in time, not just one.
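Mechanically, the proposal is just bookkeeping layered on the existing sample scan. A toy sketch of the idea (hypothetical, no real PostgreSQL APIs): swap in a fresh snapshot every N rows instead of pinning one xmin for the whole scan.

```c
#include <assert.h>

/* Toy model of the proposal: while gathering the sample, replace the
 * snapshot every rows_per_snapshot rows rather than holding a single
 * one for the entire scan.  Returns the number of snapshots taken. */
static int sample_scan_snapshots(long total_rows, long rows_per_snapshot)
{
    int  snapshots = 1;          /* the initial snapshot */
    long rows_under_current = 0;

    for (long row = 0; row < total_rows; row++) {
        if (rows_under_current == rows_per_snapshot) {
            snapshots++;         /* "take a new snapshot" here */
            rows_under_current = 0;
        }
        rows_under_current++;    /* "examine one row for the sample" */
    }
    return snapshots;
}
```

The xmin any one snapshot holds back is then bounded by the time needed to scan N rows, not by the whole (cost-limited, possibly slow) scan.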

--
Simon Riggs                http://www.2ndQuadrant.com/
Mission Critical Databases

Re: PROC_IN_ANALYZE stillborn 13 years ago

Robert Haas
In reply to this post by Andres Freund
On Wed, Aug 5, 2020 at 9:07 PM Andres Freund <[hidden email]> wrote:

> I'm mildly against that, because I'd really like to start making use of
> the flag. Not so much for cancellations, but to avoid the drastic impact
> analyze has on bloat.  In OLTP workloads with big tables, and without
> disabled cost limiting for analyze (or slow IO), the snapshot that
> analyze holds is often by far the transaction with the oldest xmin.
>
> It's not entirely trivial to fix (just ignoring it could lead to
> detoasting issues), but also not that hard.
>
> Only mildly against because it'd not be hard to reintroduce once we need
> it.

I think we should nuke it. It's trivial to reintroduce the flag if we
need it later, if and when somebody's willing to do the associated
work. In the meantime, it adds confusion.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: PROC_IN_ANALYZE stillborn 13 years ago

Tom Lane-2
Robert Haas <[hidden email]> writes:
> On Wed, Aug 5, 2020 at 9:07 PM Andres Freund <[hidden email]> wrote:
>> Only mildly against because it'd not be hard to reintroduce once we need
>> it.

> I think we should nuke it. It's trivial to reintroduce the flag if we
> need it later, if and when somebody's willing to do the associated
> work. In the meantime, it adds confusion.

+1 for removal.  It's not clear to me that we'd ever put it back.
Long-running ANALYZE snapshots are indeed a problem, but Simon's proposal
upthread to just take a new one every so often seems like a much cleaner
and simpler answer than having onlookers assume that it's safe to ignore
ANALYZE processes.  (Given that ANALYZE can invoke user-defined functions,
and can be invoked from inside user transactions, any such assumption
seems horribly dangerous.)

                        regards, tom lane



Re: PROC_IN_ANALYZE stillborn 13 years ago

Robert Haas
On Thu, Aug 6, 2020 at 2:37 PM Tom Lane <[hidden email]> wrote:
> +1 for removal.  It's not clear to me that we'd ever put it back.
> Long-running ANALYZE snapshots are indeed a problem, but Simon's proposal
> upthread to just take a new one every so often seems like a much cleaner
> and simpler answer than having onlookers assume that it's safe to ignore
> ANALYZE processes.  (Given that ANALYZE can invoke user-defined functions,
> and can be invoked from inside user transactions, any such assumption
> seems horribly dangerous.)

Not to get too far from the proposal on the table of just removing
something that's been unused for a really long time, which stands on
its own merits, but if a particular ANALYZE doesn't invoke any
user-defined functions and isn't run inside a transaction, could we
skip acquiring a snapshot altogether? That's an extremely common case,
though by no means universal.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: PROC_IN_ANALYZE stillborn 13 years ago

Tom Lane-2
Robert Haas <[hidden email]> writes:
> Not to get too far from the proposal on the table of just removing
> something that's been unused for a really long time, which stands on
> its own merits, but if a particular ANALYZE doesn't invoke any
> user-defined functions and isn't run inside a transaction, could we
> skip acquiring a snapshot altogether? That's an extremely common case,
> though by no means universal.

I'm inclined to think not.

(1) Without a snapshot it's hard to make any non-bogus decisions about
which tuples are live and which are dead.  Admittedly, with Simon's
proposal the final totals would be spongy anyhow, but at least the
individual decisions produce meaningful answers.

(2) I'm pretty sure there are places in the system that assume that any
reader of a table is using an MVCC snapshot.  For instance, didn't you
introduce some such assumptions along with or just after getting rid of
SnapshotNow for catalog scans?

                        regards, tom lane



Re: PROC_IN_ANALYZE stillborn 13 years ago

Robert Haas
On Thu, Aug 6, 2020 at 3:11 PM Tom Lane <[hidden email]> wrote:
> (1) Without a snapshot it's hard to make any non-bogus decisions about
> which tuples are live and which are dead.  Admittedly, with Simon's
> proposal the final totals would be spongy anyhow, but at least the
> individual decisions produce meaningful answers.

I don't think I believe this. It's impossible to make *consistent*
decisions, but it's not difficult to make *non-bogus* decisions.
HeapTupleSatisfiesVacuum() and HeapTupleSatisfiesUpdate() both make
such decisions, and neither takes a snapshot argument.
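A toy model of that kind of snapshot-free decision (purely illustrative; the real HeapTupleSatisfiesVacuum() consults clog, hint bits, and the removal horizon): each tuple's fate is decided from committed/aborted transaction state alone, with no snapshot in sight.

```c
#include <assert.h>

/* Illustrative only -- not PostgreSQL's actual visibility code. */
typedef enum { TUPLE_LIVE, TUPLE_DEAD, TUPLE_IN_PROGRESS } TupleState;

typedef struct ToyTuple {
    int xmin_committed;    /* inserting transaction committed? */
    int xmin_aborted;      /* inserting transaction aborted? */
    int has_xmax;          /* was a deleting transaction recorded? */
    int xmax_committed;    /* deleting transaction committed? */
} ToyTuple;

/* Non-bogus, snapshot-free liveness decision: look only at the
 * committed/aborted state of the inserting and deleting transactions. */
static TupleState toy_satisfies(const ToyTuple *t)
{
    if (t->xmin_aborted)
        return TUPLE_DEAD;          /* never successfully inserted */
    if (!t->xmin_committed)
        return TUPLE_IN_PROGRESS;   /* inserter still running */
    if (t->has_xmax && t->xmax_committed)
        return TUPLE_DEAD;          /* deleted by a committed xact */
    return TUPLE_LIVE;
}
```

The point of the example: the decisions are individually meaningful, just not mutually consistent the way an MVCC snapshot would make them.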

> (2) I'm pretty sure there are places in the system that assume that any
> reader of a table is using an MVCC snapshot.  For instance, didn't you
> introduce some such assumptions along with or just after getting rid of
> SnapshotNow for catalog scans?

SnapshotSelf still exists and is still used, and IIRC, it has very
similar semantics to the old SnapshotNow, so I don't think that we
introduced any really general assumptions of this sort. I think the
important part of those changes was that all the code that had
previously used SnapshotNow to examine system catalog tuples for DDL
purposes and catcache lookups and so forth started using an MVCC scan,
which removed one (of many) impediments to concurrent DDL. I think the
fact that we removed SnapshotNow outright rather than just ceasing to
use it for that purpose was mostly so that nobody would accidentally
reintroduce code that used it for the sorts of purposes for which it
had been used previously, and secondarily for code cleanliness.
There's nothing wrong with it fundamentally AFAIK.

It's worth mentioning, I think, that the main problem with SnapshotNow
was that it provided no particular stability. If you did an index scan
under SnapshotNow you might find two copies or no copies of a row
being concurrently updated, rather than exactly one. And that in turn
could cause problems like failure to build a relcache entry. Now, how
important is stability to ANALYZE? If you *either* retake your MVCC
snapshots periodically as you re-scan the table *or* use a non-MVCC
snapshot for the scan, you can get those same kinds of artifacts: you
might see two copies of a just-updated row, or none. Maybe this would
actually *break* something - e.g. could there be code that would get
confused if we sample multiple rows for the same value in a column
that has a UNIQUE index? But I think mostly the consequences would be
that you might get somewhat different results from the statistics.

It's not clear to me that it would even be correct to categorize those
somewhat-different results as "less accurate." Tuples that are
invisible to a query often have performance consequences very similar
to visible tuples, in terms of the query run time.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: PROC_IN_ANALYZE stillborn 13 years ago

Andres Freund
In reply to this post by Robert Haas
Hi,

On 2020-08-06 14:48:52 -0400, Robert Haas wrote:

> On Thu, Aug 6, 2020 at 2:37 PM Tom Lane <[hidden email]> wrote:
> > +1 for removal.  It's not clear to me that we'd ever put it back.
> > Long-running ANALYZE snapshots are indeed a problem, but Simon's proposal
> > upthread to just take a new one every so often seems like a much cleaner
> > and simpler answer than having onlookers assume that it's safe to ignore
> > ANALYZE processes.  (Given that ANALYZE can invoke user-defined functions,
> > and can be invoked from inside user transactions, any such assumption
> > seems horribly dangerous.
>
> Not to get too far from the proposal on the table of just removing
> something that's been unused for a really long time, which stands on
> its own merits, but if a particular ANALYZE doesn't invoke any
> user-defined functions and isn't run inside a transaction, could we
> skip acquiring a snapshot altogether? That's an extremely common case,
> though by no means universal.

I don't think so, at least not in very common situations. E.g. as long
as there's a toast table we need to hold a snapshot to ensure that we
don't get failures looking up toasted datums. IIRC there were some other
similar issues that I can't quite recall right now.

Greetings,

Andres Freund



Re: PROC_IN_ANALYZE stillborn 13 years ago

Tom Lane-2
In reply to this post by Robert Haas
Robert Haas <[hidden email]> writes:
> ... how
> important is stability to ANALYZE? If you *either* retake your MVCC
> snapshots periodically as you re-scan the table *or* use a non-MVCC
> snapshot for the scan, you can get those same kinds of artifacts: you
> might see two copies of a just-updated row, or none. Maybe this would
> actually *break* something - e.g. could there be code that would get
> confused if we sample multiple rows for the same value in a column
> that has a UNIQUE index? But I think mostly the consequences would be
> that you might get somewhat different results from the statistics.

Yeah, that's an excellent point.  I can imagine somebody complaining
"this query clearly matches a unique index, why is the planner estimating
multiple rows out?".  But most of the time it wouldn't matter much.
(And I think you can get cases like that anyway today.)

> It's not clear to me that it would even be correct to categorize those
> somewhat-different results as "less accurate."

Estimating two rows where the correct answer is one row is clearly
"less accurate".  But I suspect you'd have to be quite unlucky to
get such a result in practice from Simon's proposal, as long as we
weren't super-aggressive about changing ANALYZE's snapshot a lot.

                        regards, tom lane



Re: PROC_IN_ANALYZE stillborn 13 years ago

Andres Freund
In reply to this post by Robert Haas
Hi,

On 2020-08-06 16:22:23 -0400, Robert Haas wrote:

> On Thu, Aug 6, 2020 at 3:11 PM Tom Lane <[hidden email]> wrote:
> > (1) Without a snapshot it's hard to make any non-bogus decisions about
> > which tuples are live and which are dead.  Admittedly, with Simon's
> > proposal the final totals would be spongy anyhow, but at least the
> > individual decisions produce meaningful answers.
>
> I don't think I believe this. It's impossible to make *consistent*
> decisions, but it's not difficult to make *non-bogus* decisions.
> HeapTupleSatisfiesVacuum() and HeapTupleSatisfiesUpdate() both make
> such decisions, and neither takes a snapshot argument.

Yea, I don't think that's a big problem for the main table. As I just
mentioned in an email a few minutes ago, toast is a bit of a different
topic.
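The kind of snapshot-free liveness decision Robert refers to can be caricatured in a few lines. This is a gross simplification in the spirit of HeapTupleSatisfiesVacuum; the hypothetical helper below ignores hint bits, in-progress and aborted transactions, multixacts, and much else. The point is just that the decision needs only the tuple's own xmin/xmax state plus the global oldest xmin, not an MVCC snapshot:

```python
# Gross simplification of a snapshot-free liveness check in the spirit
# of HeapTupleSatisfiesVacuum (hypothetical helper, not the real code):
def tuple_state(xmin_committed, xmax, oldest_xmin):
    if not xmin_committed:
        return "INSERT_IN_PROGRESS"
    if xmax is None:
        return "LIVE"                  # never deleted
    # Deleter committed: dead only once no running xact can still see it.
    return "DEAD" if xmax < oldest_xmin else "RECENTLY_DEAD"

assert tuple_state(True, None, 100) == "LIVE"
assert tuple_state(True, 50, 100) == "DEAD"           # safely removable
assert tuple_state(True, 150, 100) == "RECENTLY_DEAD"
```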

In fact, conceptually using something like a new snapshot for each
sample tuple actually seems like it'd be somewhat of an improvement
over using a single snapshot. Given that it's a sample, we don't have
very precise expectations about its exact contents, and publishing one
that solely consists of pretty old rows by the time we're done doesn't
seem like a meaningful improvement.  I guess there's some danger of
distinctness estimates getting worse, by seeing multiple versions of
the same tuple multiple times - but they're notoriously inaccurate
already; I don't think this changes much.
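To make the distinctness worry concrete: ANALYZE uses a Haas-Stokes-style ("Duj1") estimator, roughly D = n*d / (n - f1 + f1*n/N). The sketch below is a simplified restatement, not the actual analyze.c code; it shows how sampling two versions of one updated row turns a "seen once" value into a "seen twice" value, shrinking f1 and pulling the estimate down:

```python
# Sketch of the Haas-Stokes "Duj1" distinct-value estimator
# (simplified restatement, not the actual analyze.c code):
#   D = n*d / (n - f1 + f1*n/N)
# n = sample size, N = table size, d = distinct values in the sample,
# f1 = values seen exactly once in the sample.
from collections import Counter

def duj1(sample, table_size):
    n = len(sample)
    counts = Counter(sample)
    d = len(counts)
    f1 = sum(1 for c in counts.values() if c == 1)
    return n * d / (n - f1 + f1 * n / table_size)

# A per-tuple snapshot can sample both versions of an updated row;
# the duplicate shrinks f1 and lowers the estimate.
base = list(range(100))        # 100 singleton values in the sample
with_dup = base + [99]         # row 99 sampled in two versions
assert duj1(base, 10_000) > duj1(with_dup, 10_000)
```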


> > (2) I'm pretty sure there are places in the system that assume that any
> > reader of a table is using an MVCC snapshot.  For instance, didn't you
> > introduce some such assumptions along with or just after getting rid of
> > SnapshotNow for catalog scans?
>
> SnapshotSelf still exists and is still used, and IIRC, it has very
> similar semantics to the old SnapshotNow, so I don't think that we
> introduced any really general assumptions of this sort. I think the
> important part of those changes was that all the code that had
> previously used SnapshotNow to examine system catalog tuples for DDL
> purposes and catcache lookups and so forth started using an MVCC scan,
> which removed one (of many) impediments to concurrent DDL. I think the
> fact that we removed SnapshotNow outright rather than just ceasing to
> use it for that purpose was mostly so that nobody would accidentally
> reintroduce code that used it for the sorts of purposes for which it
> had been used previously, and secondarily for code cleanliness.
> There's nothing wrong with it fundamentally AFAIK.

Some preaching to the choir:

IDK, there's not really much it (along with Self, Any, ...) can safely
be used for, unless you have pretty heavyweight additional locking, or
look explicitly at exactly one tuple version.  Were it not that banning
them is probably unnecessary, and that there are some disaster-recovery
benefits, I'd be in favor of prohibiting most snapshot types for
[sys]table scans.


I'm doubtful that using the term "snapshot" for any of these is a good
choice, and I don't think there's benefit in actually going through the
snapshot APIs.  Especially not when, like *Dirty, they abuse fields
inside SnapshotData to return data that can't be returned through the
normal API.  It'd probably be better to have more explicit APIs for
these, rather than going through snapshots.


> It's not clear to me that it would even be correct to categorize those
> somewhat-different results as "less accurate." Tuples that are
> invisible to a query often have performance consequences very similar
> to visible tuples, in terms of the query run time.

+1

Greetings,

Andres Freund



Re: PROC_IN_ANALYZE stillborn 13 years ago

Tom Lane-2
Andres Freund <[hidden email]> writes:
> In fact, conceptually using something like a new snapshot for each
> sample tuple actually seems like it'd be somewhat of an improvement
> over using a single snapshot.

Dunno, that feels like a fairly bad idea to me.  It seems like it would
overemphasize the behavior of whatever queries happened to be running
concurrently with the ANALYZE.  I do follow the argument that using a
single snapshot for the whole ANALYZE overemphasizes a single instant
in time, but I don't think that leads to the conclusion that we shouldn't
use a snapshot at all.

Another angle that would be worth considering, aside from the issue
of whether the sample used for pg_statistic becomes more or less
representative, is what impact all this would have on the tuple count
estimates that go to the stats collector and pg_class.reltuples.
Right now, we don't have a great story at all on how the stats collector's
count is affected by combining VACUUM/ANALYZE table-wide counts with
the incremental deltas reported by transactions happening concurrently
with VACUUM/ANALYZE.  Would changing this behavior make that better,
or worse, or about the same?
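The table-wide side of that combination can be sketched roughly along the lines of vac_estimate_reltuples. The hypothetical function below is simplified and omits the real code's clamping and smoothing; it just shows the blend of a fresh count for scanned pages with the previous density for the rest:

```python
# Rough model of how VACUUM/ANALYZE extrapolate a new reltuples value
# (patterned after vac_estimate_reltuples; hypothetical and simplified,
# without the real code's clamping/smoothing):
def estimate_reltuples(old_rel_pages, old_rel_tuples,
                       total_pages, scanned_pages, scanned_tuples):
    if scanned_pages == 0:
        return float(old_rel_tuples)     # nothing new to learn from
    if old_rel_pages == 0:
        # No previous estimate: extrapolate the scanned density.
        return scanned_tuples / scanned_pages * total_pages
    old_density = old_rel_tuples / old_rel_pages
    unscanned_pages = total_pages - scanned_pages
    # Fresh count for scanned pages, old density for the rest.
    return old_density * unscanned_pages + scanned_tuples

assert estimate_reltuples(100, 10_000, 100, 50, 4_000) == 9_000.0
```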

                        regards, tom lane



Re: PROC_IN_ANALYZE stillborn 13 years ago

Simon Riggs
In reply to this post by Tom Lane-2
On Thu, 6 Aug 2020 at 22:35, Tom Lane <[hidden email]> wrote:
> Robert Haas <[hidden email]> writes:
> > ... how
> > important is stability to ANALYZE? If you *either* retake your MVCC
> > snapshots periodically as you re-scan the table *or* use a non-MVCC
> > snapshot for the scan, you can get those same kinds of artifacts: you
> > might see two copies of a just-updated row, or none. Maybe this would
> > actually *break* something - e.g. could there be code that would get
> > confused if we sample multiple rows for the same value in a column
> > that has a UNIQUE index? But I think mostly the consequences would be
> > that you might get somewhat different results from the statistics.
>
> Yeah, that's an excellent point.  I can imagine somebody complaining
> "this query clearly matches a unique index, why is the planner estimating
> multiple rows out?".  But most of the time it wouldn't matter much.
> (And I think you can get cases like that anyway today.)
>
> > It's not clear to me that it would even be correct to categorize those
> > somewhat-different results as "less accurate."
>
> Estimating two rows where the correct answer is one row is clearly
> "less accurate".  But I suspect you'd have to be quite unlucky to
> get such a result in practice from Simon's proposal, as long as we
> weren't super-aggressive about changing ANALYZE's snapshot a lot.

Seems like we're agreed we can use more than one snapshot; the only discussion is "how many?"

The more you take, the more weirdness you will see, so adopting an approach of one snapshot per row seems like the worst case for accuracy, even if it does make ANALYZE faster.

(If we do want to speed up ANALYZE, we should use the system block-sampling approach, but the argument against that is less independence between rows.)

Keeping the discussion on reducing the impact of a Bernoulli-sampled ANALYZE, I was imagining we would take one snapshot for each block of rows at the default statistics_target, so that default behavior would be unaffected. Larger settings would be chunked according to the default, so stats_target = 10k (the maximum) would take 10000/100 = 100 snapshots. That approach allows people to vary the snapshot count using an existing parameter if needed.
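The arithmetic of that chunking scheme, as a sketch (constant and function names hypothetical): one snapshot per default-target-sized chunk of the sample, so the default setting keeps today's single-snapshot behavior.

```python
# One snapshot per default-target-sized chunk of the sample
# (names hypothetical; illustrates the proposal's arithmetic only).
import math

DEFAULT_STATS_TARGET = 100

def snapshots_needed(stats_target):
    return max(1, math.ceil(stats_target / DEFAULT_STATS_TARGET))

assert snapshots_needed(100) == 1        # default: one snapshot, as today
assert snapshots_needed(10_000) == 100   # max target: 10000/100 snapshots
```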

--
Simon Riggs                http://www.2ndQuadrant.com/
Mission Critical Databases

Re: PROC_IN_ANALYZE stillborn 13 years ago

Robert Haas
In reply to this post by Tom Lane-2
On Thu, Aug 6, 2020 at 5:35 PM Tom Lane <[hidden email]> wrote:
> > It's not clear to me that it would even be correct to categorize those
> > somewhat-different results as "less accurate."
>
> Estimating two rows where the correct answer is one row is clearly
> "less accurate" [ than estimating one row ].

That's a tautology, so I can't argue with it as far as it goes.
Thinking about it more, there are really two ways to think about an
estimated row count.

On the one hand, if you think of the row count estimate as the number
of rows that are going to pop out of a node, then it's always right to
think of a unique index as limiting the number of occurrences of a
given value to 1. But, if you think of the row count estimate as a way
of estimating the amount of work that the node has to do to produce
that output, then it isn't.

If a table has a lot of inserts and deletes, or a lot of updates,
index scans might have to do a lot of extra work chasing down index
pointers to tuples that end up being invisible to our scan. The scan
may not have any filter quals at all, and even if it does, they are
likely cheap to evaluate compared to the cost of finding and locking
buffers and checking visibility, so the dominant cost of the scan is
really based on the total number of rows that are present, not the
number that are visible. Ideally, the presence of those rows would
affect the cost estimate for the node in a way very similar to
expecting to find more rows. At the same time, it doesn't work to just
bump up the row count estimate for the node, because then you'll think
more rows will be output, which might cause poor planning decisions at
higher levels.
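A toy cost function (purely illustrative, not PostgreSQL's costing code) makes the distinction concrete: the work done tracks every tuple version the scan must visit, while the row count reported to upper plan nodes tracks visible rows only.

```python
# Toy cost sketch (illustrative only, not PostgreSQL's cost model):
# work scales with all tuple versions visited; the reported row count
# covers only visible rows.
def index_scan(visible_rows, dead_rows, per_tuple_cost=1.0):
    work = (visible_rows + dead_rows) * per_tuple_cost
    rows_out = visible_rows
    return work, rows_out

work, rows = index_scan(visible_rows=1, dead_rows=99)
assert rows == 1       # a unique-index lookup still outputs one row
assert work == 100.0   # but pays for chasing 100 index pointers
```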

It doesn't seem easy to get this 100% right. Tuple visibility can
change very quickly, much faster than the inter-ANALYZE interval. And
sometimes tuples can be pruned away very quickly, too, and the index
pointers may be opportunistically removed very quickly, too.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

