Hi,
On 2020-04-15 21:44:48 -0400, James Coleman wrote:
> On Wed, Apr 15, 2020 at 6:31 PM Andres Freund <[hidden email]> wrote:
> > If it's about the xmin horizon for vacuum: I think we could probably
> > avoid that using the same flag. As vacuum cannot be run against a table
> > that has a CIC running (although it'd theoretically be possible to allow
> > that), it should be safe to ignore PROC_IN_CIC backends in vacuum's
> > GetOldestXmin() call. That might not be true for system relations, but
> > we don't allow CIC on those.
>
> Yeah, I mean that if I have a CIC running on table X then vacuum can't
> remove dead tuples (from after the CIC's snapshot) on table Y.

For me "blocking" evokes waiting for a lock, which is why I thought
you'd not mean that issue.

> That's a pretty significant danger, given the combination of:
> 1. Index builds on very large tables can take many days, and

We at least don't hold a single snapshot over the multiple phases...

> 2. The well understood problems of high update tables with dead tuples
> and poor plans.

Which specific problem are you referring to? The planner probing the
end of the index for values outside of the histogram? I'd hope
3ca930fc39ccf987c1c22fd04a1e7463b5dd0dfd improved the situation there a
bit?

> > [description why we could ignore CIC for vacuum horizon on other tables]

> I've previously discussed this with other hackers and the reasoning
> they'd understood was that we couldn't always safely ignore
> PROC_IN_CIC backends in the vacuum's oldest xmin call because of
> function indexes, and the fact that (despite clear recommendations to
> the contrary), there's nothing actually preventing someone from adding
> a function index on table X that queries table Y.

Well, even if we consider this an actual problem, we could still use
PROC_IN_CIC for non-expression non-partial indexes (index operators
themselves had better ensure this isn't a problem, or they're
ridiculously broken already - they can get called during vacuum).

Even when expressions are involved, I don't think that necessarily
would have to mean that we need to use the same snapshot to run
expressions in for the whole scan. So we could occasionally take a new
snapshot for the purpose of computing expressions.

The hard part presumably would be that we'd need to advertise one xmin
for the expression snapshot to protect tuples potentially accessed from
being removed, but at the same time we also need to advertise the xmin
of the snapshot used by CIC, to avoid HOT pruning in other sessions
from removing tuple versions from the table the index is being created
on.

There's not really infrastructure for doing so. I think we'd basically
have to start publicizing multiple xmin values (as long as PGXACT->xmin
is <= the new xmin for expressions, only GetOldestXmin() would need to
care, and it's not that performance critical). Not pretty.

> I'm not sure I buy that we should care about people doing something
> clearly so dangerous, but...I grant that it'd be nice not to cause new
> crashes.

I don't think it's just dangerous expressions that would be affected.
Normal expression indexes need to be able to do syscache lookups etc,
and they can't safely do so if tuple versions can be removed in the
middle of a scan. We could avoid that by not ignoring PROC_IN_CIC
backends in GetOldestXmin() calls for catalog tables (yuck).

Regards,

Andres
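A minimal sketch in C of the horizon computation being discussed - this
is not PostgreSQL's actual GetOldestXmin() from procarray.c (which also
considers running xids, replication slots, and more); the simplified
loop, locking, and field access here are assumptions:

    /*
     * Sketch: compute the oldest advertised xmin, skipping backends
     * that have flagged themselves as running CREATE INDEX CONCURRENTLY.
     */
    TransactionId
    GetOldestXminSketch(void)
    {
        TransactionId result = InvalidTransactionId;

        LWLockAcquire(ProcArrayLock, LW_SHARED);
        for (int i = 0; i < procArray->numProcs; i++)
        {
            PGXACT *pgxact = &allPgXact[procArray->pgprocnos[i]];

            /* proposal: a CIC backend must not hold back the horizon */
            if (pgxact->vacuumFlags & PROC_IN_CIC)
                continue;

            if (TransactionIdIsValid(pgxact->xmin) &&
                (!TransactionIdIsValid(result) ||
                 TransactionIdPrecedes(pgxact->xmin, result)))
                result = pgxact->xmin;
        }
        LWLockRelease(ProcArrayLock);

        return result;
    }

As the discussion above notes, skipping the flag would only be safe
when computing the horizon for non-system tables other than the one the
CIC is building on.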
On Thu, Apr 16, 2020 at 6:12 PM Andres Freund <[hidden email]> wrote:
>
> Hi,
>
> On 2020-04-15 21:44:48 -0400, James Coleman wrote:
> > On Wed, Apr 15, 2020 at 6:31 PM Andres Freund <[hidden email]> wrote:
> > > If it's about the xmin horizon for vacuum: I think we could probably
> > > avoid that using the same flag. As vacuum cannot be run against a table
> > > that has a CIC running (although it'd theoretically be possible to allow
> > > that), it should be safe to ignore PROC_IN_CIC backends in vacuum's
> > > GetOldestXmin() call. That might not be true for system relations, but
> > > we don't allow CIC on those.
> >
> > Yeah, I mean that if I have a CIC running on table X then vacuum can't
> > remove dead tuples (from after the CIC's snapshot) on table Y.
>
> For me "blocking" evokes waiting for a lock, which is why I thought
> you'd not mean that issue.

It was a sloppy choice of language on my part; for better or worse, at
work we've taken to talking about "blocking vacuum" when that's really
shorthand for "blocking [or, if you'd prefer, preventing] vacuuming
dead tuples".

> > That's a pretty significant danger, given the combination of:
> > 1. Index builds on very large tables can take many days, and
>
> We at least don't hold a single snapshot over the multiple phases...

For sure. And text sorting improvements have made this better also;
still, as you often point out re: xid size, databases are only getting
larger (and doing more TPS).

> > 2. The well understood problems of high update tables with dead tuples
> > and poor plans.
>
> Which specific problem are you referring to? The planner probing the
> end of the index for values outside of the histogram? I'd hope
> 3ca930fc39ccf987c1c22fd04a1e7463b5dd0dfd improved the situation there a
> bit?

Yes, and other commits too, IIRC from the time we spent debugging
exactly the scenario mentioned in that commit. But by "poor plans" I
don't mean specifically "poor planning time" but that we can still end
up choosing the "wrong" plan, right? And dead tuples can make an index
scan significantly worse than it would otherwise be. Same for a seq
scan: you can end up looking at millions of dead tuples in a nominally
500 row table.

> > > [description why we could ignore CIC for vacuum horizon on other tables]
> >
> > I've previously discussed this with other hackers and the reasoning
> > they'd understood was that we couldn't always safely ignore
> > PROC_IN_CIC backends in the vacuum's oldest xmin call because of
> > function indexes, and the fact that (despite clear recommendations to
> > the contrary), there's nothing actually preventing someone from adding
> > a function index on table X that queries table Y.
>
> Well, even if we consider this an actual problem, we could still use
> PROC_IN_CIC for non-expression non-partial indexes (index operators
> themselves had better ensure this isn't a problem, or they're
> ridiculously broken already - they can get called during vacuum).

Agreed. It'd be unfortunate to have to limit it though.

> Even when expressions are involved, I don't think that necessarily
> would have to mean that we need to use the same snapshot to run
> expressions in for the whole scan. So we could occasionally take a new
> snapshot for the purpose of computing expressions.
>
> The hard part presumably would be that we'd need to advertise one xmin
> for the expression snapshot to protect tuples potentially accessed from
> being removed, but at the same time we also need to advertise the xmin
> of the snapshot used by CIC, to avoid HOT pruning in other sessions
> from removing tuple versions from the table the index is being created
> on.
>
> There's not really infrastructure for doing so. I think we'd basically
> have to start publicizing multiple xmin values (as long as PGXACT->xmin
> is <= the new xmin for expressions, only GetOldestXmin() would need to
> care, and it's not that performance critical). Not pretty.

In other words, pretty invasive.

> > I'm not sure I buy that we should care about people doing something
> > clearly so dangerous, but...I grant that it'd be nice not to cause new
> > crashes.
>
> I don't think it's just dangerous expressions that would be
> affected. Normal expression indexes need to be able to do syscache
> lookups etc, and they can't safely do so if tuple versions can be
> removed in the middle of a scan. We could avoid that by not ignoring
> PROC_IN_CIC backends in GetOldestXmin() calls for catalog tables (yuck).

At first glance this sounds a lot less invasive, but I also agree it's
gross.

James
The following review has been posted through the commitfest application:
make installcheck-world:  not tested
Implements feature:       not tested
Spec compliant:           not tested
Documentation:            tested, passed

James,

I'm on board with pointing out explicitly that "concurrent index builds
on multiple tables at the same time will not return on any one table
until all have completed", with back-patching. I do not believe the new
paragraph is necessary though. I'd suggest trying to weave it into the
existing paragraph ending "Even then, however, the index may not be
immediately usable for queries: in the worst case, it cannot be used as
long as transactions exist that predate the start of the index build."
Adding "Notably, " in front of the existing sentence fragment above and
tacking it onto the end probably suffices.

I don't actually know whether this is true behavior though. Is it
something our tests do, or could, demonstrate?

It is sorta weird to say "one will not return until all have
completed", though, since usually people think "return" means
"completed".

That whole paragraph is a bit unclear for the inexperienced DBA, in
particular the idea of an index being marked ready to use but not
actually usable. That isn't really on this patch to fix though, and the
clarity around concurrent CIC seems worthwhile to add, even if
imprecise - IMO it doesn't make that whole section any less clear, and
it points out what seems to be a unique dynamic.

IOW I would send the simple fix (inline, not a new paragraph) to a
committer. The bigger doc reworking or actual behavioral improvements
shouldn't hold up such a simple improvement.

David J.

The new status of this patch is: Waiting on Author
On Thu, Jul 16, 2020 at 7:34 PM David Johnston
<[hidden email]> wrote:
>
> The following review has been posted through the commitfest application:
> make installcheck-world:  not tested
> Implements feature:       not tested
> Spec compliant:           not tested
> Documentation:            tested, passed
>
> James,
>
> I'm on board with pointing out explicitly that "concurrent index
> builds on multiple tables at the same time will not return on any one
> table until all have completed", with back-patching. I do not believe
> the new paragraph is necessary though. I'd suggest trying to weave it
> into the existing paragraph ending "Even then, however, the index may
> not be immediately usable for queries: in the worst case, it cannot be
> used as long as transactions exist that predate the start of the index
> build." Adding "Notably, " in front of the existing sentence fragment
> above and tacking it onto the end probably suffices.

I'm not sure that existing sentence is really accurate/sufficient: it
seems to imply the CREATE INDEX has returned but for some reason the
index isn't yet valid. The issue I'm trying to describe here is that
the CREATE INDEX query itself will not return until all preceding
queries have completed *including* concurrent index creations on
unrelated tables.

> I don't actually know whether this is true behavior though. Is it
> something our tests do, or could, demonstrate?

It'd take tests that exercise parallelism, but it's pretty simple to
demonstrate (though you do have to catch the first index build in a
scan phase, so you either need lots of data or a hack). Here's an
example that uses a bit of a hack to simulate a slow scan phase:

Setup:

create table items(i int);
create table others(i int);
create function slow_expr() returns text as $$
  select pg_sleep(15);
  select '5';
$$ language sql immutable;
insert into items(i) values (1), (2);
insert into others(i) values (1), (2);

Then the following in order:
1. In session A: create index concurrently on items((i::text || slow_expr()));
2. In session B (at the same time): create index concurrently on others(i);

You'll notice that the 2nd command, which should be practically
instantaneous, waits on the first ~30s scan phase of (1) before it
returns. The same is true if after (2) completes you immediately run it
again -- it waits on the second ~30s scan phase of (1).

That does reveal a bit of complexity though that the current patch
doesn't address, which is that this can be phase dependent (and that
complexity gets a lot more non-obvious when there's real live activity,
particularly long-running transactions, in the system as well).

I've attached a new patch series with two items:
1. A simpler (and I believe more correct) doc change for "cic blocks
cic on other tables".
2. A patch to document that all index builds can prevent tuples from
being vacuumed away on other tables.

If it's preferable we could commit the first and discuss the second
separately, but since that limitation was also discussed up-thread, I
decided to include them both here for now.

James
On 2020-Mar-25, Andres Freund wrote:
> What I was thinking of was a new flag, with a distinct value from
> PROC_IN_VACUUM. It'd currently just be specified in the
> GetCurrentVirtualXIDs() calls in WaitForOlderSnapshots(). That'd avoid
> needing to wait for other CICs on different relations. Since CIC is not
> permitted on system tables, and CIC doesn't do DML on normal tables, it
> seems fairly obviously correct to exclude other CICs.

Hmm, that does work, and seems a pretty small patch -- attached. Of
course, some more commentary is necessary, but the theory of operation
is as Andres says. (It does not solve the vacuuming problem I was
describing in the other thread, only the spurious waiting that James is
complaining about in this thread.)

I'm going to try and poke holes in this now ... (Expression indexes
with falsely immutable functions?)

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
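For concreteness, a sketch of the two pieces such a patch needs; the
call shape below follows the description above, but the exact
flag-setting site in DefineIndex() and the need for an exclusive
ProcArrayLock there are assumptions:

    /* In DefineIndex(), before starting the concurrent build: advertise
     * that this backend is running CREATE INDEX CONCURRENTLY. */
    LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
    MyPgXact->vacuumFlags |= PROC_IN_CIC;
    LWLockRelease(ProcArrayLock);

    /* In WaitForOlderSnapshots(): exclude CIC backends in addition to
     * the existing lazy-vacuum and analyze exclusions. */
    old_snapshots = GetCurrentVirtualXIDs(limitXmin, true, false,
                                          PROC_IN_VACUUM | PROC_IN_ANALYZE |
                                          PROC_IN_CIC,
                                          &n_old_snapshots);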
On 2020-Aug-04, Alvaro Herrera wrote:
> diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
> index b20e2ad4f6..43c8ea3e31 100644
> --- a/src/include/storage/proc.h
> +++ b/src/include/storage/proc.h
> @@ -53,6 +53,8 @@ struct XidCache
>  #define PROC_IS_AUTOVACUUM        0x01    /* is it an autovac worker? */
>  #define PROC_IN_VACUUM            0x02    /* currently running lazy vacuum */
>  #define PROC_IN_ANALYZE           0x04    /* currently running analyze */
> +#define PROC_IN_CIC               0x40    /* currently running CREATE INDEX
> +                                             CONCURRENTLY */
>  #define PROC_VACUUM_FOR_WRAPAROUND 0x08   /* set by autovac only */
>  #define PROC_IN_LOGICAL_DECODING  0x10    /* currently doing logical
>                                             * decoding outside xact */

Hah, I missed adding the new bit to PROC_VACUUM_STATE_MASK here.

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
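The omission is a one-line fix; a sketch of the corrected mask,
assuming the 0x40 flag value from the diff above:

    #define PROC_VACUUM_STATE_MASK \
        (PROC_IN_VACUUM | PROC_IN_ANALYZE | PROC_VACUUM_FOR_WRAPAROUND | \
         PROC_IN_CIC)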
Back in the 8.3 cycle (2007) when the autovacuum launcher/worker split
was done, we annoyed people because it blocked DDL. That led to an
effort by Simon Riggs to cancel autovac automatically when that was
detected:
https://postgr.es/m/1191526327.4223.204.camel@...
https://postgr.es/m/1192129897.4233.433.camel@...

I was fixated on only cancelling when it was ANALYZE, to avoid losing
any VACUUM work:
https://postgr.es/m/20071025164150.GF23566@...

That turned into some flags for PGPROC to detect whether a process was
running ANALYZE, and cancel only those:
https://postgr.es/m/20071024151328.GG6559@...
Commit:
https://postgr.es/m/20071024205536.CB425754229@...
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=745c1b2c2ab

However, I was outvoted, so we do not limit cancellation to ANALYZE.
Patch and discussion:
https://postgr.es/m/20071025164150.GF23566@...
Commit:
https://postgr.es/m/20071026204510.AA02E754229@...
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=acac68b2bca

... which means the flag I had added two days earlier has never been
used for anything. We've carried the flag forward to this day for
almost 13 years, dutifully turning it on and off ... but never checking
it anywhere.

I propose to remove it, as in the attached patch.

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hi,
On 2020-08-05 19:55:49 -0400, Alvaro Herrera wrote:
> Back in the 8.3 cycle (2007) when the autovacuum launcher/worker split
> was done, we annoyed people because it blocked DDL. That led to an
> effort by Simon Riggs to cancel autovac automatically when that was
> detected:
> https://postgr.es/m/1191526327.4223.204.camel@...
> https://postgr.es/m/1192129897.4233.433.camel@...
>
> I was fixated on only cancelling when it was ANALYZE, to avoid losing
> any VACUUM work:
> https://postgr.es/m/20071025164150.GF23566@...
>
> That turned into some flags for PGPROC to detect whether a process was
> running ANALYZE, and cancel only those:
> https://postgr.es/m/20071024151328.GG6559@...
> Commit:
> https://postgr.es/m/20071024205536.CB425754229@...
> https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=745c1b2c2ab
>
> However, I was outvoted, so we do not limit cancellation to ANALYZE.
> Patch and discussion:
> https://postgr.es/m/20071025164150.GF23566@...
> Commit:
> https://postgr.es/m/20071026204510.AA02E754229@...
> https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=acac68b2bca
>
> ... which means the flag I had added two days earlier has never been
> used for anything. We've carried the flag forward to this day for
> almost 13 years, dutifully turning it on and off ... but never checking
> it anywhere.
>
> I propose to remove it, as in the attached patch.

I'm mildly against that, because I'd really like to start making use of
the flag. Not so much for cancellations, but to avoid the drastic
impact analyze has on bloat. In OLTP workloads with big tables, and
without disabled cost limiting for analyze (or slow IO), the snapshot
that analyze holds is often by far the transaction with the oldest
xmin.

It's not entirely trivial to fix (just ignoring it could lead to
detoasting issues), but also not that hard.

Only mildly against because it'd not be hard to reintroduce once we
need it.

Greetings,

Andres Freund
On Thu, 6 Aug 2020 at 02:07, Andres Freund <[hidden email]> wrote:
Good points, both.

The most obvious way to avoid long analyze snapshots is to make the
analysis take multiple snapshots as it runs, rather than try to invent
some clever way of ignoring the analyze snapshots (which, as Alvaro
points out, we never did).

All we need to do is have an analyze snapshot last for at most N rows,
but keep scanning until we have the desired sample size. Doing that
would mean the analyze sample wouldn't come from a single snapshot, but
then who cares? There is no requirement for consistency - the sample
would arguably be *more* stable because it comes from multiple points
in time, not just one.
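A rough sketch of what that could look like in ANALYZE's sampling loop;
the helper names and the refresh threshold are hypothetical (this is
not the actual code in analyze.c), and it relies on
GetTransactionSnapshot() returning a fresh snapshot per call under READ
COMMITTED:

    #define ROWS_PER_SNAPSHOT 100000    /* assumed value of "N" */

    int         numrows = 0;
    int         rows_since_refresh = 0;
    HeapTuple   tup;

    PushActiveSnapshot(GetTransactionSnapshot());
    while (numrows < targrows && next_sample_tuple(&tup)) /* hypothetical */
    {
        rows[numrows++] = heap_copytuple(tup);

        if (++rows_since_refresh >= ROWS_PER_SNAPSHOT)
        {
            /* Let the advertised xmin advance: drop the old snapshot
             * and take a new one before continuing the scan. */
            PopActiveSnapshot();
            PushActiveSnapshot(GetTransactionSnapshot());
            rows_since_refresh = 0;
        }
    }
    PopActiveSnapshot();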
On Wed, Aug 5, 2020 at 9:07 PM Andres Freund <[hidden email]> wrote:
> I'm mildly against that, because I'd really like to start making use of
> the flag. Not so much for cancellations, but to avoid the drastic
> impact analyze has on bloat. In OLTP workloads with big tables, and
> without disabled cost limiting for analyze (or slow IO), the snapshot
> that analyze holds is often by far the transaction with the oldest
> xmin.
>
> It's not entirely trivial to fix (just ignoring it could lead to
> detoasting issues), but also not that hard.
>
> Only mildly against because it'd not be hard to reintroduce once we
> need it.

I think we should nuke it. It's trivial to reintroduce the flag if we
need it later, if and when somebody's willing to do the associated
work. In the meantime, it adds confusion.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <[hidden email]> writes:
> On Wed, Aug 5, 2020 at 9:07 PM Andres Freund <[hidden email]> wrote:
>> Only mildly against because it'd not be hard to reintroduce once we
>> need it.

> I think we should nuke it. It's trivial to reintroduce the flag if we
> need it later, if and when somebody's willing to do the associated
> work. In the meantime, it adds confusion.

+1 for removal. It's not clear to me that we'd ever put it back.
Long-running ANALYZE snapshots are indeed a problem, but Simon's
proposal upthread to just take a new one every so often seems like a
much cleaner and simpler answer than having onlookers assume that it's
safe to ignore ANALYZE processes. (Given that ANALYZE can invoke
user-defined functions, and can be invoked from inside user
transactions, any such assumption seems horribly dangerous.)

            regards, tom lane
On Thu, Aug 6, 2020 at 2:37 PM Tom Lane <[hidden email]> wrote:
> +1 for removal. It's not clear to me that we'd ever put it back.
> Long-running ANALYZE snapshots are indeed a problem, but Simon's
> proposal upthread to just take a new one every so often seems like a
> much cleaner and simpler answer than having onlookers assume that it's
> safe to ignore ANALYZE processes. (Given that ANALYZE can invoke
> user-defined functions, and can be invoked from inside user
> transactions, any such assumption seems horribly dangerous.)

Not to get too far from the proposal on the table of just removing
something that's been unused for a really long time, which stands on
its own merits, but if a particular ANALYZE doesn't invoke any
user-defined functions and isn't run inside a transaction, could we
skip acquiring a snapshot altogether? That's an extremely common case,
though by no means universal.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <[hidden email]> writes:
> Not to get too far from the proposal on the table of just removing
> something that's been unused for a really long time, which stands on
> its own merits, but if a particular ANALYZE doesn't invoke any
> user-defined functions and isn't run inside a transaction, could we
> skip acquiring a snapshot altogether? That's an extremely common case,
> though by no means universal.

I'm inclined to think not.

(1) Without a snapshot it's hard to make any non-bogus decisions about
which tuples are live and which are dead. Admittedly, with Simon's
proposal the final totals would be spongy anyhow, but at least the
individual decisions produce meaningful answers.

(2) I'm pretty sure there are places in the system that assume that any
reader of a table is using an MVCC snapshot. For instance, didn't you
introduce some such assumptions along with or just after getting rid of
SnapshotNow for catalog scans?

            regards, tom lane
On Thu, Aug 6, 2020 at 3:11 PM Tom Lane <[hidden email]> wrote:
> (1) Without a snapshot it's hard to make any non-bogus decisions about
> which tuples are live and which are dead. Admittedly, with Simon's
> proposal the final totals would be spongy anyhow, but at least the
> individual decisions produce meaningful answers.

I don't think I believe this. It's impossible to make *consistent*
decisions, but it's not difficult to make *non-bogus* decisions.
HeapTupleSatisfiesVacuum() and HeapTupleSatisfiesUpdate() both make
such decisions, and neither takes a snapshot argument.

> (2) I'm pretty sure there are places in the system that assume that any
> reader of a table is using an MVCC snapshot. For instance, didn't you
> introduce some such assumptions along with or just after getting rid of
> SnapshotNow for catalog scans?

SnapshotSelf still exists and is still used, and IIRC, it has very
similar semantics to the old SnapshotNow, so I don't think that we
introduced any really general assumptions of this sort. I think the
important part of those changes was that all the code that had
previously used SnapshotNow to examine system catalog tuples for DDL
purposes and catcache lookups and so forth started using an MVCC scan,
which removed one (of many) impediments to concurrent DDL. I think the
fact that we removed SnapshotNow outright rather than just ceasing to
use it for that purpose was mostly so that nobody would accidentally
reintroduce code that used it for the sorts of purposes for which it
had been used previously, and secondarily for code cleanliness. There's
nothing wrong with it fundamentally AFAIK.

It's worth mentioning, I think, that the main problem with SnapshotNow
was that it provided no particular stability. If you did an index scan
under SnapshotNow you might find two copies or no copies of a row being
concurrently updated, rather than exactly one. And that in turn could
cause problems like failure to build a relcache entry. Now, how
important is stability to ANALYZE? If you *either* retake your MVCC
snapshots periodically as you re-scan the table *or* use a non-MVCC
snapshot for the scan, you can get those same kinds of artifacts: you
might see two copies of a just-updated row, or none. Maybe this would
actually *break* something - e.g. could there be code that would get
confused if we sample multiple rows for the same value in a column that
has a UNIQUE index? But I think mostly the consequences would be that
you might get somewhat different results from the statistics. It's not
clear to me that it would even be correct to categorize those
somewhat-different results as "less accurate." Tuples that are
invisible to a query often have performance consequences very similar
to visible tuples, in terms of the query run time.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hi,
On 2020-08-06 14:48:52 -0400, Robert Haas wrote:
> On Thu, Aug 6, 2020 at 2:37 PM Tom Lane <[hidden email]> wrote:
> > +1 for removal. It's not clear to me that we'd ever put it back.
> > Long-running ANALYZE snapshots are indeed a problem, but Simon's
> > proposal upthread to just take a new one every so often seems like a
> > much cleaner and simpler answer than having onlookers assume that it's
> > safe to ignore ANALYZE processes. (Given that ANALYZE can invoke
> > user-defined functions, and can be invoked from inside user
> > transactions, any such assumption seems horribly dangerous.)
>
> Not to get too far from the proposal on the table of just removing
> something that's been unused for a really long time, which stands on
> its own merits, but if a particular ANALYZE doesn't invoke any
> user-defined functions and isn't run inside a transaction, could we
> skip acquiring a snapshot altogether? That's an extremely common case,
> though by no means universal.

I don't think so, at least not in very common situations. E.g. as long
as there's a toast table we need to hold a snapshot to ensure that we
don't get failures looking up toasted datums. IIRC there were some
other similar issues that I can't quite recall right now.

Greetings,

Andres Freund
Robert Haas <[hidden email]> writes:
> ... how important is stability to ANALYZE? If you *either* retake your
> MVCC snapshots periodically as you re-scan the table *or* use a
> non-MVCC snapshot for the scan, you can get those same kinds of
> artifacts: you might see two copies of a just-updated row, or none.
> Maybe this would actually *break* something - e.g. could there be code
> that would get confused if we sample multiple rows for the same value
> in a column that has a UNIQUE index? But I think mostly the
> consequences would be that you might get somewhat different results
> from the statistics.

Yeah, that's an excellent point. I can imagine somebody complaining
"this query clearly matches a unique index, why is the planner
estimating multiple rows out?". But most of the time it wouldn't matter
much. (And I think you can get cases like that anyway today.)

> It's not clear to me that it would even be correct to categorize those
> somewhat-different results as "less accurate."

Estimating two rows where the correct answer is one row is clearly
"less accurate". But I suspect you'd have to be quite unlucky to get
such a result in practice from Simon's proposal, as long as we weren't
super-aggressive about changing ANALYZE's snapshot a lot.

            regards, tom lane
Hi,
On 2020-08-06 16:22:23 -0400, Robert Haas wrote:
> On Thu, Aug 6, 2020 at 3:11 PM Tom Lane <[hidden email]> wrote:
> > (1) Without a snapshot it's hard to make any non-bogus decisions about
> > which tuples are live and which are dead. Admittedly, with Simon's
> > proposal the final totals would be spongy anyhow, but at least the
> > individual decisions produce meaningful answers.
>
> I don't think I believe this. It's impossible to make *consistent*
> decisions, but it's not difficult to make *non-bogus* decisions.
> HeapTupleSatisfiesVacuum() and HeapTupleSatisfiesUpdate() both make
> such decisions, and neither takes a snapshot argument.

Yea, I don't think that's a big problem for the main table. As I just
mentioned in an email a few minutes ago, toast is a bit of a different
topic.

In fact, using conceptually something like a new snapshot for each
sample tuple actually seems like it'd be somewhat of an improvement
over using a single snapshot. Given that it's a sample, it's not like
we have very precise expectations of the precise sample, and publishing
one that solely consists of pretty old rows by the time we're done
doesn't seem like it's a meaningful improvement.

I guess there's some danger of distinctness estimates getting worse, by
seeing multiple versions of the same tuple multiple times - but they're
notoriously inaccurate already; I don't think this changes much.

> > (2) I'm pretty sure there are places in the system that assume that any
> > reader of a table is using an MVCC snapshot. For instance, didn't you
> > introduce some such assumptions along with or just after getting rid of
> > SnapshotNow for catalog scans?
>
> SnapshotSelf still exists and is still used, and IIRC, it has very
> similar semantics to the old SnapshotNow, so I don't think that we
> introduced any really general assumptions of this sort. I think the
> important part of those changes was that all the code that had
> previously used SnapshotNow to examine system catalog tuples for DDL
> purposes and catcache lookups and so forth started using an MVCC scan,
> which removed one (of many) impediments to concurrent DDL. I think the
> fact that we removed SnapshotNow outright rather than just ceasing to
> use it for that purpose was mostly so that nobody would accidentally
> reintroduce code that used it for the sorts of purposes for which it
> had been used previously, and secondarily for code cleanliness.
> There's nothing wrong with it fundamentally AFAIK.

Some preaching to the choir: IDK, there's not really much it (along
with Self, Any, ...) can safely be used for, unless you have pretty
heavyweight additional locking, or look explicitly at exactly one tuple
version. Except that it's probably unnecessary, and that there's some
disaster recovery benefit, I'd be in favor of prohibiting most snapshot
types for [sys]table scans.

I'm doubtful that using the term "snapshot" for any of these is a good
choice, and I don't think there's benefit in actually going through the
snapshot APIs. Especially not when, like *Dirty, they abuse fields
inside SnapshotData to return data that can't be returned through the
normal API. It'd probably be better to have more explicit APIs for
these, rather than going through snapshot.

> It's not clear to me that it would even be correct to categorize those
> somewhat-different results as "less accurate." Tuples that are
> invisible to a query often have performance consequences very similar
> to visible tuples, in terms of the query run time.

+1

Greetings,

Andres Freund
Andres Freund <[hidden email]> writes:
> In fact, using conceptually something like a new snapshot for each
> sample tuple actually seems like it'd be somewhat of an improvement
> over using a single snapshot.

Dunno, that feels like a fairly bad idea to me. It seems like it would
overemphasize the behavior of whatever queries happened to be running
concurrently with the ANALYZE. I do follow the argument that using a
single snapshot for the whole ANALYZE overemphasizes a single instant
in time, but I don't think that leads to the conclusion that we
shouldn't use a snapshot at all.

Another angle that would be worth considering, aside from the issue of
whether the sample used for pg_statistic becomes more or less
representative, is what impact all this would have on the tuple count
estimates that go to the stats collector and pg_class.reltuples. Right
now, we don't have a great story at all on how the stats collector's
count is affected by combining VACUUM/ANALYZE table-wide counts with
the incremental deltas reported by transactions happening concurrently
with VACUUM/ANALYZE. Would changing this behavior make that better, or
worse, or about the same?

            regards, tom lane
On Thu, 6 Aug 2020 at 22:35, Tom Lane <[hidden email]> wrote:
> Robert Haas <[hidden email]> writes:

Seems like we're agreed that we can use more than one snapshot; the
only question is "how many?" The more you take, the more weirdness you
will see, so adopting an approach of one-snapshot-per-row seems like
the worst case for accuracy, even if it does make analyze faster.

Keeping the discussion on reducing the impact of Bernoulli-sampled
analyze, I was imagining we would do one snapshot for each block of
rows with default statistics_target, so that default behavior would be
unaffected. Larger settings would be chunked according to the default,
so stats_target=10k (the max) would take 10000/100 = 100 snapshots.
That approach allows people to vary that using an existing parameter if
needed.
On Thu, Aug 6, 2020 at 5:35 PM Tom Lane <[hidden email]> wrote:
> > It's not clear to me that it would even be correct to categorize those
> > somewhat-different results as "less accurate."
>
> Estimating two rows where the correct answer is one row is clearly
> "less accurate" [ than estimating one row ].

That's a tautology, so I can't argue with it as far as it goes.

Thinking about it more, there are really two ways to think about an
estimated row count. On the one hand, if you think of the row count
estimate as the number of rows that are going to pop out of a node,
then it's always right to think of a unique index as limiting the
number of occurrences of a given value to 1. But if you think of the
row count estimate as a way of estimating the amount of work that the
node has to do to produce that output, then it isn't. If a table has a
lot of inserts and deletes, or a lot of updates, index scans might have
to do a lot of extra work chasing down index pointers to tuples that
end up being invisible to our scan. The scan may not have any filter
quals at all, and even if it does, they are likely cheap to evaluate
compared to the cost of finding and locking buffers and checking
visibility, so the dominant cost of the scan is really based on the
total number of rows that are present, not the number that are visible.

Ideally, the presence of those rows would affect the cost estimate for
the node in a way very similar to expecting to find more rows. At the
same time, it doesn't work to just bump up the row count estimate for
the node, because then you'll think more rows will be output, which
might cause poor planning decisions at higher levels.

It doesn't seem easy to get this 100% right. Tuple visibility can
change very quickly, much faster than the inter-ANALYZE interval. And
sometimes tuples can be pruned away very quickly, and the index
pointers may be opportunistically removed very quickly as well.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company