POC: Cleaning up orphaned files using undo logs


Re: POC: Cleaning up orphaned files using undo logs

Robert Haas
On Thu, Jun 20, 2019 at 6:48 AM Amit Kapila <[hidden email]> wrote:

> > for (;;)
> > {
> >     UnpackedUndoRecord *uur = UndoFetchRecord(urp);
> >     if (i like this one)
> >         break;
> >     urp = uur->uur_blkprev; // should be renamed, since zedstore +
> > probably others will have tuple chains not block chains
> ..
>
> +1 for renaming this variable.  How about uur_prev_ver or uur_prevver
> or uur_verprev?  Any other suggestions?

Maybe just uur_previous or uur_prevundo or something like that.  We've
already got a uur_prevurp, but that's really pretty misnamed and IMHO
it doesn't belong in this structure anyway.  (uur_next is also a bad
name and also doesn't belong in this structure.)

I don't think we want to use 'ver' because that supposes that undo is
being used to track tuple versions, which is a likely use but perhaps
not the only one.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: POC: Cleaning up orphaned files using undo logs

Robert Haas
In reply to this post by akapila
On Thu, Jun 20, 2019 at 6:44 AM Amit Kapila <[hidden email]> wrote:

> BTW, while looking at the code of UndoFetchRecord, I see some problem.
> There is a coding pattern like
> if()
> {
> }
> else
> {
>    LWLockAcquire()
>   ..
>   ..
> }
>
> LWLockRelease().
>
> I think this is not correct.
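
Right -- the acquire and the release need to happen on the same branch.
A minimal sketch of the shape I'd expect (names invented for
illustration, not taken from the patch):

if (InHotStandby)
{
    /* standby path: no lock was taken, so there is none to release */
    discarded = UndoDiscardCheckNoLock(urp);
}
else
{
    LWLockAcquire(SomeUndoDiscardLock, LW_SHARED);
    discarded = UndoDiscardCheckLocked(urp);
    LWLockRelease(SomeUndoDiscardLock); /* release on the acquiring branch */
}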

Independently of that problem, I think it's probably bad that we're
not maintaining the same shared memory state on the master and the
standby.  Doing the same check in one way on the master and in a
different way on the standby is a recipe for surprising and probably
bad behavior differences between master and standby servers.  Those
could be simple things like lock acquire/release not matching, but
they could also be things like performance or correctness differences
that only materialize under certain scenarios.

This is not the only place in the patch set where we have this kind of
thing, and I hate them all.  I don't exactly know what the solution
is, either, but I suspect it will involve either having the recovery
process do a more thorough job updating the shared memory state when
it does undo-related stuff, or running some of the undo-specific
processes on the standby just for the purpose of getting these updates
done.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: POC: Cleaning up orphaned files using undo logs

akapila
In reply to this post by Robert Haas
On Thu, Jun 20, 2019 at 8:01 PM Robert Haas <[hidden email]> wrote:

>
> On Thu, Jun 20, 2019 at 2:42 AM Amit Kapila <[hidden email]> wrote:
> > Okay, one reason that comes to mind is we don't want to choke the
> > system as applying undo can consume CPU and generate a lot of I/O.  Is
> > that you have in mind or something else?
>
> Yeah, mainly that, but also things like log spam, and even pressure on
> the lock table.  If we are trying over and over again to take useless
> locks, it can affect other things on the system. The main thing,
> however, is the CPU and I/O consumption.
>
> > I see an advantage in having some sort of throttling here, so we can
> > have some wait time (say 100ms) between processing requests.  Do we
> > see any need of guc here?
>
> I don't think that is the right approach. As I said in my previous
> reply, we need a way of holding off the retry of the same error for a
> certain amount of time, probably measured in seconds or tens of
> seconds. Introducing a delay before processing every request is an
> inferior alternative:
>

This delay is for *not* choking the system by constantly performing
undo requests that consume a lot of CPU and I/O, as discussed in the
point above.  For holding off retries of the same failed request, we
need a next_retry_time type of method, as discussed below.

> if there are a lot of rollbacks, it can cause
> the system to lag; and in the case where there's just one rollback
> that's failing, it will still be way too much log spam (and probably
> CPU time too).  Nobody wants 10 failure messages per second in the
> log.
>
> > > It seems to me that thinking of this in terms of what the undo worker
> > > does and what the undo launcher does is probably not the right
> > > approach. We need to think of it more as an integrated system. Instead
> > > of storing a failure_count with each request in the error queue, how
> > > about storing a next retry time?
> >
> > I think both failure_count and next_retry_time can work in a similar way.
> >
> > I think incrementing next retry time in multiples will be a bit
> > tricky.  Say first-time error occurs at X hours.  We can say that
> > next_retry_time will X+10s=Y and error_occured_at will be X.  The
> > second time it again failed, how will we know that we need set
> > next_retry_time as Y+20s, maybe we can do something like Y-X and then
> > add 10s to it and add the result to the current time.  Now whenever
> > the worker or launcher finds this request, they can check if the
> > current_time is greater than or equal to next_retry_time, if so they
> > can pick that request, otherwise, they check request in next queue.
> >
> > The failure_count can also work in a somewhat similar fashion.
> > Basically, we can use error_occurred at and failure_count to compute
> > the required time.  So, if error is occurred at say X hours and
> > failure count is 3, then we can check if current_time is greater than
> > X+(3 * 10s), then we will allow the entry to be processed, otherwise,
> > it will check other queues for work.
>
> Meh.  Don't get stuck on one particular method of calculating the next
> retry time.  We want to be able to change that easily if whatever we
> try first doesn't work out well.  I am not convinced that we need
> anything more complex than a fixed retry time, probably controlled by
> a GUC (undo_failure_retry_time = 10s?).
>

IIRC, at that point you had only suggested that we need a kind of
back-off algorithm that gradually increases the retry time up to some
maximum [1].  I think that is a good way to de-prioritize requests
that are repeatedly failing.  Say there is a request that has already
failed 5 times and the worker queues it to be executed after 10s.
Immediately after that, another new request fails for the first time
for the same database and is also queued to be executed after 10s.  In
this scheme, the request that has already failed 5 times will get a
chance before the request that has failed for the first time.


[1] - https://www.postgresql.org/message-id/CA%2BTgmoYHBkm7M8tNk6Z9G_aEOiw3Bjdux7v9%2BUzmdNTdFmFzjA%40mail.gmail.com

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: POC: Cleaning up orphaned files using undo logs

Robert Haas
On Thu, Jun 20, 2019 at 11:35 AM Amit Kapila <[hidden email]> wrote:
> This delay is for *not* choking the system by constantly performing
> undo requests that consume a lot of CPU and I/O as discussed in above
> point.  For holding off the same error request to be re-tried, we need
> next_retry_time type of method as discussed below.

Oh.  That's not what I thought we were talking about.  It's not
unreasonable to think about trying to rate limit undo application just
like we do for vacuum, but a fixed delay between requests would be a
completely inadequate way of attacking that problem.  If the
individual requests are short, it will create too much delay, and if
they are long, it will not create enough.  We would need delays within
a transaction, not just between transactions, similar to how the
vacuum cost delay stuff works.
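
To make that concrete, the vacuum-style approach amounts to doing
something like the following inside the undo-apply loop itself (a
sketch only; the undo_cost_* counters and constants are hypothetical,
modeled on the VacuumCost* machinery rather than taken from the patch):

/* after applying each undo record (or small batch of records) */
undo_cost_balance += UNDO_APPLY_COST_PER_RECORD;
if (undo_cost_balance >= undo_cost_limit)
{
    pg_usleep(undo_cost_delay * 1000L);     /* ms -> us; sleep mid-transaction */
    undo_cost_balance = 0;
}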

I suggest that we leave that to one side for now.  It seems like
something that could be added later, maybe in a more general way, and
not something that needs to be or should be closely connected to the
logic for deciding the order in which we're going to process different
transactions in undo.

> > Meh.  Don't get stuck on one particular method of calculating the next
> > retry time.  We want to be able to change that easily if whatever we
> > try first doesn't work out well.  I am not convinced that we need
> > anything more complex than a fixed retry time, probably controlled by
> > a GUC (undo_failure_retry_time = 10s?).
>
> IIRC, then you only seem to have suggested that we need a kind of
> back-off algorithm that gradually increases the retry time up to some
> maximum [1].  I think that is a good way to de-prioritize requests
> that are repeatedly failing.  Say, there is a request that has already
> failed for 5 times and the worker queues it to get executed after 10s.
> Immediately after that, another new request has failed for the first
> time for the same database and it also got queued to get executed
> after 10s.  In this scheme the request that has already failed for 5
> times will get a chance before the request that has failed for the
> first time.

Sure, that's an advantage of increasing back-off times -- you can keep
the stuff that looks hopeless from interfering too much with the stuff
that is more likely to work out. However, I don't think we've actually
done enough testing to know for sure what algorithm will work out
best.  Do we want linear back-off (10s, 20s, 30s, ...)?  Exponential
back-off (1s, 2s, 4s, 8s, ...)?  No back-off (10s, 10s, 10s, 10s)?
Some algorithm that depends on the size of the failed transaction, so
that big things get retried less often? I think it's important to
design the code in such a way that the algorithm can be changed easily
later, because I don't think we can be confident that whatever we pick
for the first attempt will prove to be best.  I'm pretty sure that
storing the failure count INSTEAD OF the next retry time is going to
make it harder to experiment with different algorithms later.
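
To be concrete about keeping the algorithm swappable, the requeue path
could be as simple as the sketch below, with the policy isolated in one
place.  The names are hypothetical, not from the patch, and the GUC is
the one I suggested above, assumed to be in seconds:

static void
RescheduleFailedRequest(RollbackRequest *req)
{
    /*
     * Fixed delay as a first cut; a linear or exponential back-off can
     * be dropped in here later without touching the queue code, because
     * the queues only ever look at next_retry_at.
     */
    req->next_retry_at =
        TimestampTzPlusMilliseconds(GetCurrentTimestamp(),
                                    undo_failure_retry_time * 1000);
}

A worker then simply skips any queued request whose next_retry_at is
still in the future.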

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: POC: Cleaning up orphaned files using undo logs

Robert Haas
In reply to this post by Dilip Kumar-2
On Thu, Jun 20, 2019 at 4:54 AM Dilip Kumar <[hidden email]> wrote:
> The idea behind having the loop inside the undo machinery was that
> while traversing the blkprev chain, we can read all the undo records
> on the same undo page under one buffer lock.

That's not a bad goal, although invoking a user-supplied callback
while holding a buffer lock is a little scary.  If we stick with that,
it had better be clearly documented.  Perhaps worth noting:
ReadBuffer() is noticeably more expensive in terms of CPU consumption
than LockBuffer().  So it may work out that we keep a pin to avoid
redoing that and worry less about retaking the buffer locks.  But I'm
not sure: avoiding buffer locks is clearly good, too.

I have a couple of observations after poking at this some more.  One
is that it's not necessarily adequate to have an interface that
iterates backward through the undo chain until the callback returns
true.  Here's an example: suppose you want to walk backward through
the undo chain until you find the first undo record that corresponds
to a change you can see, and then return the undo record immediately
prior to that.  zheap doesn't really need this, because it (mostly?)
stores the XID of the next record we're going to look up in the
current record, and the XID of the first record we're going to look up
in the chain -- so it can tell from the record it's found so far
whether it should bother looking up the next record. That, however,
would not necessarily be true for some other AM.

Suppose you just store an undo pointer in the tuple header, as Heikki
proposed to do for zedstore.  Suppose further that each record has the
undo pointer for the previous record that modified that TID but not
necessarily the XID.  Then imagine a TID where we do an insert and a
bunch of in-place updates.  Then, a scan comes along with an old
snapshot.  It seems to me that what you need to do is walk backward
through the chain of undo records until you see one that has an XID
that is visible to your snapshot, and then the version of the tuple
that you want is in the payload of the next-newer undo record. So what
you want to do is something like this:

look up the undo pointer in the tuple.  call that the current undo record.
loop:
- look up the undo pointer in the current undo record.  call that the
previous undo record.
- if the XID from the previous undo record is visible, then stop; use
the tuple version from the current undo record.
- release the current undo record and let the new current undo record
be the previous undo record.

I'm not sure if this is actually a reasonable design from a
performance point of view, but it doesn't sound totally crazy, and
it's just to illustrate the point that there might be cases that are
too complicated for a loop-until-true model.  In this kind of loop, at
any given time you are holding onto two undo records, working your way
back through the undo log, and you just can't make this work with the
UndoFetchRecord callback interface. Possibly you could have a context
object that could hold onto one or a few buffer pins:

BeginUndoFetch(&cxt);
uur = UndoFetchRecord(&cxt, urp); // maybe do this a bunch of times
FinishUndoFetch(&cxt);

...but I'm not sure if that's exactly what we want either. Still, if
there is a significant savings from avoiding repinning and relocking
the buffer, we want to make it easy for people to get that advantage
as often as possible.
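
For instance, the zedstore-style walk above might come out roughly like
this with such a context object (a sketch only; the visibility test,
the release call, and the field names are stand-ins, not a real API):

UndoFetchContext cxt;           /* holds a buffer pin across calls */
UnpackedUndoRecord *cur;
UnpackedUndoRecord *prev;

BeginUndoFetch(&cxt);
cur = UndoFetchRecord(&cxt, tuple_undo_ptr);
for (;;)
{
    prev = UndoFetchRecord(&cxt, cur->uur_prevundo);
    if (prev == NULL || XidIsVisibleToSnapshot(prev->uur_xid, snapshot))
        break;                  /* the version we want is in 'cur' */
    ReleaseUndoRecord(cur);     /* stand-in for whatever frees a record */
    cur = prev;
}
FinishUndoFetch(&cxt);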

Another point that has occurred to me is that it is probably
impossible to avoid a fairly large number of duplicate undo fetches.
For instance, suppose somebody runs an UPDATE on a tuple that has been
recently updated.  The tuple_update method just gets a TID + snapshot,
so the AM basically has to go look up the tuple all over again,
including checking whether the latest version of the tuple is the one
visible to our snapshot.  So that means repinning and relocking the
same buffers and decoding the same undo record all over again.  I'm
not exactly sure what to do about this, but it seems like a potential
performance problem. I wonder if it's feasible to cache undo lookups
so that in common cases we can just reuse the result of a previous
lookup instead of doing a new one, and I wonder whether it's possible
to make that fast enough that it actually helps...

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: POC: Cleaning up orphaned files using undo logs

Tom Lane-2
Robert Haas <[hidden email]> writes:
> On Thu, Jun 20, 2019 at 4:54 AM Dilip Kumar <[hidden email]> wrote:
>> The idea behind having the loop inside the undo machinery was that
>> while traversing the blkprev chain, we can read all the undo records
>> on the same undo page under one buffer lock.

> That's not a bad goal, although invoking a user-supplied callback
> while holding a buffer lock is a little scary.

I nominate Robert for Understater of the Year.  I think there's pretty
much 0 chance of that working reliably.

                        regards, tom lane



Re: POC: Cleaning up orphaned files using undo logs

Robert Haas
On Fri, Jun 21, 2019 at 6:54 PM Tom Lane <[hidden email]> wrote:
> > That's not a bad goal, although invoking a user-supplied callback
> > while holding a buffer lock is a little scary.
>
> I nominate Robert for Understater of the Year.  I think there's pretty
> much 0 chance of that working reliably.

It's an honor to be nominated, although I am pretty sure this is not
my best work in category, even for 2019.  There are certainly useful
things that could be done by such a callback without doing anything
that touches shared memory and without doing anything that consumes
more than a handful of CPU cycles, so it doesn't seem utterly crazy to
think that such a design might survive. However, the constraints we'd
have to impose might chafe.

I am more inclined to ditch the callback model altogether in favor of
putting any necessary looping logic on the caller side. That seems a
lot more flexible, and the only trick is figuring out how to keep it
cheap. Providing some kind of context object that can hold onto one or
more pins seems like the most reasonable approach. Last week it seemed
to me that we would need several, but at the moment I can't think of a
reason why we would need more than one. I think we just want to
optimize the case where several undo lookups in quick succession are
actually reading from the same page, and we don't want to go to the
expense of looking that page up multiple times. It doesn't seem at all
likely that we would have a chain of undo records that leaves a
certain page and then comes back to it later, because this is a log
that grows forward, not some kind of random-access thing. So a cache
of size >1 probably wouldn't help.

Unless I'm still confused.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: POC: Cleaning up orphaned files using undo logs

akapila
In reply to this post by Robert Haas
On Thu, Jun 20, 2019 at 9:56 PM Robert Haas <[hidden email]> wrote:

> On Thu, Jun 20, 2019 at 11:35 AM Amit Kapila <[hidden email]> wrote:
> >
> > IIRC, then you only seem to have suggested that we need a kind of
> > back-off algorithm that gradually increases the retry time up to some
> > maximum [1].  I think that is a good way to de-prioritize requests
> > that are repeatedly failing.  Say, there is a request that has already
> > failed for 5 times and the worker queues it to get executed after 10s.
> > Immediately after that, another new request has failed for the first
> > time for the same database and it also got queued to get executed
> > after 10s.  In this scheme the request that has already failed for 5
> > times will get a chance before the request that has failed for the
> > first time.
>
> Sure, that's an advantage of increasing back-off times -- you can keep
> the stuff that looks hopeless from interfering too much with the stuff
> that is more likely to work out. However, I don't think we've actually
> done enough testing to know for sure what algorithm will work out
> best.  Do we want linear back-off (10s, 20s, 30s, ...)?  Exponential
> back-off (1s, 2s, 4s, 8s, ...)?  No back-off (10s, 10s, 10s, 10s)?
> Some algorithm that depends on the size of the failed transaction, so
> that big things get retried less often? I think it's important to
> design the code in such a way that the algorithm can be changed easily
> later, because I don't think we can be confident that whatever we pick
> for the first attempt will prove to be best.  I'm pretty sure that
> storing the failure count INSTEAD OF the next retry time is going to
> make it harder to experiment with different algorithms later.
>
Fair enough.  I have implemented it based on next_retry_at and use a
constant time of 10s for the next retry.  I have used a #define instead
of a GUC, as all the other constants for similar things are defined
that way as of now.  One thing to note is that we want the linger time
(defined as UNDO_WORKER_LINGER_MS) for an undo worker to be more than
the failure retry time (defined as UNDO_FAILURE_RETRY_DELAY_MS)
because, otherwise, the undo worker can exit before retrying the failed
requests.  The changes for this are in patches
0011-Infrastructure-to-register-and-fetch-undo-action-req.patch and
0014-Allow-execution-and-discard-of-undo-by-background-wo.patch.

Apart from these, there are a few other changes in the patch series:

0014-Allow-execution-and-discard-of-undo-by-background-wo.patch:
1. Allow the undo workers to respond to a cancel command from the user.
A CHECK_FOR_INTERRUPTS call was missing while the worker was checking
for the next undo request in a loop.
2. Change the value of UNDO_WORKER_LINGER_MS to 20s, so that it is
more than UNDO_FAILURE_RETRY_DELAY_MS.
3. Handled the SIGTERM signal for the undo launcher and workers.
4. Fixed a bug to avoid calling CommitTransaction when one of the
workers fails to register, since there is no matching StartTransaction.
This was left over from the previous approach.

0012-Infrastructure-to-execute-pending-undo-actions.patch
1. Fixed a compiler warning.

0007-Provide-interfaces-to-store-and-fetch-undo-records.patch
1. Fixed a bug to unlock the buffer while resetting the unpacked undo record.
2. Fixed a spurious release of the lock in UndoFetchRecord.
3. Removed the pointer to previous undo in a different log from the
UndoRecordTransaction structure.  Now, a separate log_switch header
contains the same.

0007-Provide-interfaces-to-store-and-fetch-undo-records.patch is
Dilip's patch and he has modified it, but the changes were small, so
there was not much sense in posting it separately.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachments:
  0001-Add-SmgrId-to-smgropen-and-BufferTag.patch (90K)
  0002-Move-tablespace-dir-creation-from-smgr.c-to-md.c.patch (3K)
  0004-Allow-WAL-record-data-on-first-modification-after-a-.patch (2K)
  0005-Add-prefetch-support-for-the-undo-log.patch (10K)
  0003-Add-undo-log-manager.patch (238K)
  0006-Defect-and-enhancement-in-multi-log-support.patch (5K)
  0008-Test-module-for-undo-api.patch (12K)
  0009-undo-page-consistency-checker.patch (15K)
  0007-Provide-interfaces-to-store-and-fetch-undo-records.patch (122K)
  0010-Extend-binary-heap-functionality.patch (7K)
  0013-Allow-foreground-transactions-to-perform-undo-action.patch (69K)
  0011-Infrastructure-to-register-and-fetch-undo-action-req.patch (92K)
  0012-Infrastructure-to-execute-pending-undo-actions.patch (47K)
  0014-Allow-execution-and-discard-of-undo-by-background-wo.patch (125K)

Re: POC: Cleaning up orphaned files using undo logs

Robert Haas
On Tue, Jun 25, 2019 at 4:00 PM Amit Kapila <[hidden email]> wrote:
> [ new patches ]

I happened to open up 0001 from this series, which is from Thomas, and
I do not think that the pg_buffercache changes are correct.  The idea
here is that the customer might install version 1.3 or any prior
version on an old release, then upgrade to PostgreSQL 13. When they
do, they will be running with the old SQL definitions and the new
binaries.  At that point, it sure looks to me like the code in
pg_buffercache_pages.c is going to do the Wrong Thing.  There's some
existing code in there to omit the pinning_backends column if the SQL
definitions don't know about it; instead of adding similar code for
the newly-added smgrid column, this patch rips out the existing
backward-compatibility code.  I think that's a double fail.

At some later point, the customer is going to run ALTER EXTENSION
pg_buffercache UPDATE. At that point they are going to be expecting
that their SQL definitions now match the state that would have been
created had they installed pg_buffercache for the first time on
PG13, starting with pg_buffercache v1.4. Since the view is now going
to have a new column, that would seem to require dropping and
recreating the view, but pg_buffercache--1.3--1.4.sql doesn't do that.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: POC: Cleaning up orphaned files using undo logs

akapila
In reply to this post by Robert Haas
On Sat, Jun 22, 2019 at 2:51 AM Robert Haas <[hidden email]> wrote:

>
> On Thu, Jun 20, 2019 at 4:54 AM Dilip Kumar <[hidden email]> wrote:
> > The idea behind having the loop inside the undo machinery was that
> > while traversing the blkprev chain, we can read all the undo records
> > on the same undo page under one buffer lock.
>
> That's not a bad goal, although invoking a user-supplied callback
> while holding a buffer lock is a little scary.  If we stick with that,
> it had better be clearly documented.  Perhaps worth noting:
> ReadBuffer() is noticeably more expensive in terms of CPU consumption
> than LockBuffer().  So it may work out that we keep a pin to avoid
> redoing that and worry less about retaking the buffer locks.  But I'm
> not sure: avoiding buffer locks is clearly good, too.
>
> I have a couple of observations after poking at this some more.  One
> is that it's not necessarily adequate to have an interface that
> iterates backward through the undo chain until the callback returns
> true.  Here's an example: suppose you want to walk backward through
> the undo chain until you find the first undo record that corresponds
> to a change you can see, and then return the undo record immediately
> prior to that.  zheap doesn't really need this, because it (mostly?)
> stores the XID of the next record we're going to look up in the
> current record, and the XID of the first record we're going to look up
> in the chain -- so it can tell from the record it's found so far
> whether it should bother looking up the next record. That, however,
> would not necessarily be true for some other AM.
>
> Suppose you just store an undo pointer in the tuple header, as Heikki
> proposed to do for zedstore.  Suppose further that each record has the
> undo pointer for the previous record that modified that TID but not
> necessarily the XID.  Then imagine a TID where we do an insert and a
> bunch of in-place updates.  Then, a scan comes along with an old
> snapshot.  It seems to me that what you need to do is walk backward
> through the chain of undo records until you see one that has an XID
> that is visible to your snapshot, and then the version of the tuple
> that you want is in the payload of the next-newer undo record. So what
> you want to do is something like this:
>
> look up the undo pointer in the tuple.  call that the current undo record.
> loop:
> - look up the undo pointer in the current undo record.  call that the
> previous undo record.
> - if the XID from the previous undo record is visible, then stop; use
> the tuple version from the current undo record.
> - release the current undo record and let the new current undo record
> be the previous undo record.
>
> I'm not sure if this is actually a reasonable design from a
> performance point of view, but it doesn't sound totally crazy, and
> it's just to illustrate the point that there might be cases that are
> too complicated for a loop-until-true model.  In this kind of loop, at
> any given time you are holding onto two undo records, working your way
> back through the undo log, and you just can't make this work with the
> UndoFetchRecord callback interface. Possibly you could have a context
> object that could hold onto one or a few buffer pins:
>
> BeginUndoFetch(&cxt);
> uur = UndoFetchRecord(&cxt, urp); // maybe do this a bunch of times
> FinishUndoFetch(&cxt);
>
> ...but I'm not sure if that's exactly what we want either. Still, if
> there is a significant savings from avoiding repinning and relocking
> the buffer, we want to make it easy for people to get that advantage
> as often as possible.
>

IIUC, you are proposing to retain the pin and then lock/unlock the
buffer each time in this API.  I think there is no harm in trying out
something along these lines to see if there is any impact on some of the
common scenarios.

> Another point that has occurred to me is that it is probably
> impossible to avoid a fairly large number of duplicate undo fetches.
> For instance, suppose somebody runs an UPDATE on a tuple that has been
> recently updated.  The tuple_update method just gets a TID + snapshot,
> so the AM basically has to go look up the tuple all over again,
> including checking whether the latest version of the tuple is the one
> visible to our snapshot.  So that means repinning and relocking the
> same buffers and decoding the same undo record all over again.  I'm
> not exactly sure what to do about this, but it seems like a potential
> performance problem. I wonder if it's feasible to cache undo lookups
> so that in common cases we can just reuse the result of a previous
> lookup instead of doing a new one, and I wonder whether it's possible
> to make that fast enough that it actually helps...
>

I think it will be helpful if we can have such a cache, but OTOH, we
can also try out such optimizations after the first version by
analyzing their benefit.  For zheap, in many cases, the version in the
heap itself is all-visible or is visible as per the current snapshot,
and that can be detected by looking at the transaction slot; however,
it might be tricky for zedstore.


--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: POC: Cleaning up orphaned files using undo logs

Thomas Munro-5
In reply to this post by Robert Haas
On Fri, Jun 28, 2019 at 6:09 AM Robert Haas <[hidden email]> wrote:
> I happened to open up 0001 from this series, which is from Thomas, and
> I do not think that the pg_buffercache changes are correct.  The idea
> here is that the customer might install version 1.3 or any prior
> version on an old release, then upgrade to PostgreSQL 13. When they
> do, they will be running with the old SQL definitions and the new
> binaries.  At that point, it sure looks to me like the code in
> pg_buffercache_pages.c is going to do the Wrong Thing.  [...]

Yep, that was completely wrong.  Here's a new version.  I tested that
I can install 1.3 in an older release, then pg_upgrade to master, then
look at the view without the new column, then UPGRADE the extension to
1.4, and then the new column appears.

Other new stuff in this tarball (and also at
https://github.com/EnterpriseDB/zheap/tree/undo):

Based on hallway track discussions at PGCon, I have made a few
modifications to the undo log storage and record layer to support
"shared" record sets.  They are groups of records can be used for
temporary storage space for anything that needs to outlive a whole set
of transactions.  The intended usage is extra transaction slots for
updaters and lockers when there isn't enough space on a zheap (or
other AM) page.  The idea is to avoid the need to have in-heap
overflow pages for transient transaction management data, and instead
put that stuff on the conveyor belt of perfectly timed doom[1] along
with old tuple versions.

"Shared" undo records are never executed (that is, they don't really
represent rollback actions), they are just used for storage space that
is eventually discarded.  (I experimented with a way to use these also
to perform rollback actions to clean up stuff like the junk left
behind by aborted CREATE INDEX CONCURRENTLY commands, which seemed
promising, but it turned out to be quite tricky so I abandoned that
for now).

Details:

1.  Renamed UndoPersistence to UndoLogCategory everywhere, and added a
fourth category UNDO_SHARED where transactions can write 'out of band'
data that relates to more than one transaction.

2.  Introduced a new RMGR callback rm_undo_status.  It is used to
decide when record sets in the UNDO_SHARED category should be
discarded (instead of the usual single xid-based rules).  The possible
answers are "discard me now!", "ask me again when a given XID is all
visible", and "ask me again when a given XID is no longer running".

3.  Recognise UNDO_SHARED record set boundaries differently.  Whereas
undolog.c recognises transaction boundaries automatically for the
other categories (UNDO_PERMANENT, UNDO_UNLOGGED, UNDO_TEMP), for
UNDO_SHARED the

4.  Add some quick-and-dirty throw-away test stuff to demonstrate
that.  SELECT test_multixact([1234, 2345]) will create a new record
set that will survive until the given array of transactions is no
longer running, and then it'll be discarded.  You can see that with
SELECT * FROM undoinspect('shared').  Or look at SELECT
pg_stat_undo_logs.  This test simply writes all the xids into its
payload, and then has an rm_undo_status function that returns the
first xid it finds in the list that is still running, or if none are
running returns UNDO_STATUS_DISCARD.

Currently you can only return UNDO_STATUS_WAIT_XMIN, that is, wait for
an xid to be older than the oldest xmin; presumably it'd be useful to
be able to discard as soon as an xid is no longer active, which could
be a bit sooner.
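
Roughly speaking, the status callback for that test amounts to the
sketch below (simplified; the enum values are the ones mentioned above,
but take the callback signature and payload layout as illustrative
rather than exact):

static UndoStatus
test_multixact_undo_status(UnpackedUndoRecord *record,
                           TransactionId *wait_xid)
{
    TransactionId *xids = (TransactionId *) record->uur_payload.data;
    int         nxids = record->uur_payload.len / sizeof(TransactionId);
    int         i;

    for (i = 0; i < nxids; i++)
    {
        if (TransactionIdIsInProgress(xids[i]))
        {
            *wait_xid = xids[i];
            return UNDO_STATUS_WAIT_XMIN;   /* ask again later */
        }
    }
    return UNDO_STATUS_DISCARD;             /* nothing still running */
}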

Another small change: several people commented that
UndoLogIsDiscarded(ptr) ought to have some kind of fast path that
doesn't acquire locks, since it'll surely be hammered.  Here's an
attempt at that: it provides an inlined function that uses a
per-backend recent_discard to avoid doing more work in the (hopefully)
common case that you mostly encounter discarded undo pointers.  I hope
this change will show up in profilers in some zheap workloads, but this
hasn't been tested yet.
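
In other words, approximately this shape (a simplified sketch that
ignores the per-undo-log bookkeeping the real thing needs; the
slow-path function name is invented):

extern UndoRecPtr recent_discard;       /* this backend's cached view of
                                         * the newest discard pointer */

static inline bool
UndoLogIsDiscarded(UndoRecPtr ptr)
{
    /* Fast path: no locks if we already know this much is discarded. */
    if (ptr < recent_discard)
        return true;

    /* Slow path takes the locks and refreshes recent_discard. */
    return UndoLogIsDiscardedSlow(ptr);
}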

Another small change/review: the function UndoLogGetNextInsertPtr()
previously took a transaction ID, but I'm not sure if that made sense;
I need to think about it some more.

I pulled in the latest patches from the "undoprocessing" branch
as of late last week, and most of the above is implemented as fixup
commits on top of that.

Next I'm working on DBA facilities for forcing undo records to be
discarded (which consists mostly of sorting out the interlocking to
make that work safely).  And also testing facilities for simulating
undo log switching (when you fill up each log and move to another one,
which exercises rarely run code paths, so we need a good way to make
them not rare).

[1] https://speakerdeck.com/macdice/transactions-in-postgresql-and-other-animals?slide=23

--
Thomas Munro
https://enterprisedb.com

Attachment: undo-20190701.tgz (254K)

Re: POC: Cleaning up orphaned files using undo logs

Thomas Munro-5
On Mon, Jul 1, 2019 at 7:53 PM Thomas Munro <[hidden email]> wrote:
> 3.  Recognise UNDO_SHARED record set boundaries differently.  Whereas
> undolog.c recognises transaction boundaries automatically for the
> other categories (UNDO_PERMANENT, UNDO_UNLOGGED, UNDO_TEMP), for
> UNDO_SHARED the

... set of records inserted between calls to
BeginUndoRecordInsert() and FinishUndoRecordInsert() is
eventually discarded as a unit, and the rm_undo_status() callback for
the calling AM decides when that is allowed.  In contrast, for the
other categories there may be records from any number of undo-aware AMs
that are entirely unaware of each other, and they must all be discarded
together if the transaction commits and becomes all-visible, so
undolog.c automatically manages the boundaries to make that work when
inserting.

--
Thomas Munro
https://enterprisedb.com



Re: POC: Cleaning up orphaned files using undo logs

akapila
In reply to this post by Thomas Munro-5
On Mon, Jul 1, 2019 at 1:24 PM Thomas Munro <[hidden email]> wrote:
>
> Another small change/review: the function UndoLogGetNextInsertPtr()
> previously took a transaction ID, but I'm not sure if that made sense,
> I need to think about it some more.
>

The changes you have made related to UndoLogGetNextInsertPtr() don't
seem correct to me.

@@ -854,7 +854,9 @@ FindUndoEndLocationAndSize(UndoRecPtr start_urecptr,
  * has already started in this log then lets re-fetch the undo
  * record.
  */
- next_insert = UndoLogGetNextInsertPtr(slot->logno, uur->uur_xid);
+ next_insert = UndoLogGetNextInsertPtr(slot->logno);
+
+ /* TODO this can't happen */
  if (!UndoRecPtrIsValid(next_insert))

I think this is a possible case.  Say the discard worker tries to
register the rollback request from some log, and after it fetches the
undo record corresponding to the start location in this function,
another backend adds new transaction undo.  The same is mentioned in
the comments as well.  Can you explain what makes you think that this
can't happen?  If we don't want to pass the xid to
UndoLogGetNextInsertPtr, then I think we need to get the insert
location before fetching the record.  I will think more about it to see
if there is any other problem with this approach.

2.
@@ -167,25 +205,14 @@ UndoDiscardOneLog(UndoLogSlot *slot,
TransactionId xmin, bool *hibernate)
+ if (!TransactionIdIsValid(wait_xid) && !pending_abort)
  {
  UndoRecPtr next_insert = InvalidUndoRecPtr;
- /*
- * If more undo has been inserted since we checked last, then
- * we can process that as well.
- */
- next_insert = UndoLogGetNextInsertPtr(logno, undoxid);
- if (!UndoRecPtrIsValid(next_insert))
- continue;
+ next_insert = UndoLogGetNextInsertPtr(logno);

This change is also not safe.  It can lead to discarding the undo of
some random transaction, because new undo records from some other
transaction could have been added since we last fetched the undo
record.  This can be fixed by just removing the call to
UndoLogGetNextInsertPtr.  I have done so in the undoprocessing branch
and added a comment as well.

I think the common problem with the above changes is that they assume
that new undo can't be added to the undo logs while the discard worker
is processing them.


--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: POC: Cleaning up orphaned files using undo logs

akapila
On Thu, Jul 4, 2019 at 5:23 PM Amit Kapila <[hidden email]> wrote:

>
> On Mon, Jul 1, 2019 at 1:24 PM Thomas Munro <[hidden email]> wrote:
> >
> > Another small change/review: the function UndoLogGetNextInsertPtr()
> > previously took a transaction ID, but I'm not sure if that made sense,
> > I need to think about it some more.
> >
>
> The changes you have made related to UndoLogGetNextInsertPtr() doesn't
> seem correct to me.
>
> @@ -854,7 +854,9 @@ FindUndoEndLocationAndSize(UndoRecPtr start_urecptr,
>   * has already started in this log then lets re-fetch the undo
>   * record.
>   */
> - next_insert = UndoLogGetNextInsertPtr(slot->logno, uur->uur_xid);
> + next_insert = UndoLogGetNextInsertPtr(slot->logno);
> +
> + /* TODO this can't happen */
>   if (!UndoRecPtrIsValid(next_insert))
>
> I think this is a possible case.  Say while the discard worker tries
> to register the rollback request from some log and after it fetches
> the undo record corresponding to start location in this function,
> another backend adds the new transaction undo.  The same is mentioned
> in comments as well.   Can you explain what makes you think that this
> can't happen?  If we don't want to pass the xid to
> UndoLogGetNextInsertPtr, then I think we need to get the insert
> location before fetching the record.  I will think more on it to see
> if there is any other problem with the same.
>

Pushed the fix along the above lines to the undoprocessing branch.  It
will be available in the next set of patches we post.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: POC: Cleaning up orphaned files using undo logs

akapila
On Fri, Jul 5, 2019 at 5:20 PM Amit Kapila <[hidden email]> wrote:

>
> On Thu, Jul 4, 2019 at 5:23 PM Amit Kapila <[hidden email]> wrote:
> >
> > On Mon, Jul 1, 2019 at 1:24 PM Thomas Munro <[hidden email]> wrote:
> > >
> > > Another small change/review: the function UndoLogGetNextInsertPtr()
> > > previously took a transaction ID, but I'm not sure if that made sense,
> > > I need to think about it some more.
> > >
> >
> > The changes you have made related to UndoLogGetNextInsertPtr() doesn't
> > seem correct to me.
> >
> > @@ -854,7 +854,9 @@ FindUndoEndLocationAndSize(UndoRecPtr start_urecptr,
> >   * has already started in this log then lets re-fetch the undo
> >   * record.
> >   */
> > - next_insert = UndoLogGetNextInsertPtr(slot->logno, uur->uur_xid);
> > + next_insert = UndoLogGetNextInsertPtr(slot->logno);
> > +
> > + /* TODO this can't happen */
> >   if (!UndoRecPtrIsValid(next_insert))
> >
> > I think this is a possible case.  Say while the discard worker tries
> > to register the rollback request from some log and after it fetches
> > the undo record corresponding to start location in this function,
> > another backend adds the new transaction undo.  The same is mentioned
> > in comments as well.   Can you explain what makes you think that this
> > can't happen?  If we don't want to pass the xid to
> > UndoLogGetNextInsertPtr, then I think we need to get the insert
> > location before fetching the record.  I will think more on it to see
> > if there is any other problem with the same.
> >
>
> Pushed the fixed on above lines in the undoprocessing branch.
>

Just in case anyone wants to look at the undoprocessing branch, it is
available at https://github.com/EnterpriseDB/zheap/tree/undoprocessing

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: POC: Cleaning up orphaned files using undo logs

Robert Haas
In reply to this post by akapila
On Tue, Jun 25, 2019 at 4:00 PM Amit Kapila <[hidden email]> wrote:
> Fair enough.  I have implemented it based on next_retry_at and use
> constant time 10s for the next retry.  I have used define instead of a
> GUC as all the other constants for similar things are defined as of
> now.  One thing to note is that we want the linger time (defined as
> UNDO_WORKER_LINGER_MS) for a undo worker to be more than failure retry
> time (defined as UNDO_FAILURE_RETRY_DELAY_MS) as, otherwise, the undo
> worker can exit before retrying the failed requests.

Uh, I think we want exactly the opposite.  We want the workers to exit
before retrying, so that there's a chance for other databases to get
processed, I think.  Am I confused?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: POC: Cleaning up orphaned files using undo logs

Robert Haas
In reply to this post by Thomas Munro-5
On Mon, Jul 1, 2019 at 3:54 AM Thomas Munro <[hidden email]> wrote:
> [ new patches ]

I took a look at 0012 today, Amit's patch for extending the binary
heap machinery, and 0013, Amit's patch for "Infrastructure to register
and fetch undo action requests."

I don't think that binaryheap_allocate_shm() is a good design.  It
presupposes that we want to store the binary heap as its own chunk of
shared memory allocated via ShmemInitStruct(), but we might want to do
something else, like embed in another structure, store it in a DSM or
DSA, etc., and this function can't do any of that.  I think we should
have something more like:

extern Size binaryheap_size(int capacity);
extern void binaryheap_initialize(binaryheap *, int capacity,
binaryheap_comparator compare, void *arg);

Then the caller can do something like:

sz = binaryheap_size(capacity);
bh = ShmemInitStruct(name, sz, &found);
if (!found)
    binaryheap_initialize(bh, capacity, comparator, whatever);

If it wants to get the memory in some other way, it just needs to
initialize bh differently; the rest is the same.  Note that there is
no need, in this design, for binaryheap_size/initialize to make use of
"shared" memory.  They could equally well be used on backend-local
memory.  They do not need to care. You just provide the memory, and
they do their thing.
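
Given that the existing binaryheap struct already ends in a flexible
array of Datum nodes, those two functions could be about this simple (a
sketch, not tested):

Size
binaryheap_size(int capacity)
{
    return offsetof(binaryheap, bh_nodes) + sizeof(Datum) * capacity;
}

void
binaryheap_initialize(binaryheap *heap, int capacity,
                      binaryheap_comparator compare, void *arg)
{
    heap->bh_space = capacity;
    heap->bh_size = 0;
    heap->bh_has_heap_property = true;
    heap->bh_compare = compare;
    heap->bh_arg = arg;
}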

I wasn't very happy about binaryheap_nth(), binaryheap_remove_nth(),
and binaryheap_remove_nth_unordered() and started looking at how they
are used to try to see if there might be a better way.  That led me to
look at 0013.  Unfortunately, I find it really hard to understand what
this code is actually doing.  There's a lot of redundant and
badly-written stuff in here.  As a general principle, if you have two
or three data structures of some particular type, you don't write a
separate family of functions for manipulating each one.  You write one
function for each operation, and you pass the particular copy of the
data structure with which you are working as an argument.

In the lengthy section of macros definitions at the top of
undorequest.c, we have macros InitXidQueue, XidQueueIsEmpty,
GetXidQueueSize, GetXidQueueElem, GetXidQueueTopElem,
GetXidQueueNthElem, and SetXidQueueElem.  Several of these are used in
only one place or are not used anywhere at all; those should be
removed altogether and inlined into the single call site if there is
one.  Now, then after this, there is a matching set of macros,
InitSizeQueue, SizeQueueIsEmpty, GetSizeQueueSize, GetSizeQueueElem,
GetSizeQueueTopElem, GetSizeQueueNthElem, and  SetSizeQueueElem.  Many
of these macros are exactly the same as the previous set of macros
except that they operate on a different queue, which as I mentioned in
the previous paragraph, is not a good design.  It leads to extensive
code duplication.

Look, for example, at RemoveOldElemsFromSizeQueue and
RemoveOldElemsFromXidQueue.  They are basically identical except for
s/Size/Xid/g and s/SIZE/XID/g, but you can't unify them easily because
they are calling different functions.  However, if you didn't have one
function called GetSizeQueueSize and another called GetXidQueueSize,
but just had a pointer to the relevant binary heap, then both
functions could just call binaryheap_empty() on it, which would be
better style, use fewer macros, generate less machine code, and be
easier to read.  Ideally, you'd get to the point where you could just
have one function rather than two, and pass the queue upon which it
should operate as an argument.  There seems to be a good deal of this
kind of duplication in this file and it really needs to be cleaned up.

Now, one objection to the above line of attack is the different queues
actually contain different types of elements.  Apparently, the XID
queue contains elements of type UndoXidQueue and the size queue
contains elements of type UndoSizeQueue.  It is worth noting here that
these are bad type names, because they sound like they are describing
a type of queue, but it seems that they are actually describing an
element in the queue. However, there are two larger problems:

1. I don't think we should have three different kinds of objects for
each of the three different queues.  It seems like it would be much
simpler and easier to just have one kind of object that stores all the
information we need (full_xid, start_urec_ptr, dbid, request_size,
next_retry_at, error_ocurred_at) and use that everywhere. You could
object that this would increase the storage space requirement, but it
wouldn't be enough to make any real difference and it probably would
be well worth it for the avoidance of complexity.

2. However, I don't think we should have a separate request object for
each queue anyway.  We should insert pointers to the same objects in
all the relevant queue (either size + XID, or else error).  So instead
of having three sets of objects, one for each queue, we'd just have
one set of objects and point to them with as many as two pointers.
We'd therefore need LESS memory than we're using today, because we
wouldn't have separate arrays for XID, size, and error queue elements.

In fact, it seems to me that we shouldn't have any such thing as
"queue entries" at all.  The queues should just be pointing to
RollbackHashEntry *, and we should add all the fields there that are
present in any of the "queue entry" structures.  This would use less
memory still.

I also think we should be using simplehash rather than dynahash.  I'm
not sure that I would really say that simplehash is "simple," but it
does have a nicer API and simpler memory management. There's just a
big old block of memory, and there's no incremental allocation.  That
keeps things simple for the code that wants to go through the queues
and removing dangling pointers.  I think that the way this should work
is that each RollbackHashEntry * should contain a field "bool active."
 Then:

1. When we pull an item out of one of the binary heaps, we check the
active flag.  If it's clear, we ignore the entry and pull the next
item.  If it's set, we  clear the flag and process the item, so that
if it's subsequently pulled from the other queue it will be ignored.

2. If a binary heap is full when we need to insert into it, we can
iterate over all of the elements and throw away any that are !active.
They've already been dequeued and processed from some other queue, so
they're not "really" in this queue any more, even though we haven't
gone to the trouble of actually kicking them out yet.
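
To spell that out, the consolidated entry and the dequeue loop could
look roughly like this (a sketch; the field and function names are
illustrative, not from the patch):

typedef struct RollbackHashEntry
{
    FullTransactionId full_xid;
    UndoRecPtr  start_urec_ptr;
    Oid         dbid;
    Size        request_size;
    TimestampTz next_retry_at;
    TimestampTz error_occurred_at;
    bool        active;         /* cleared once dequeued from any queue */
} RollbackHashEntry;

static RollbackHashEntry *
GetNextRequest(binaryheap *queue)
{
    while (!binaryheap_empty(queue))
    {
        RollbackHashEntry *entry = (RollbackHashEntry *)
            DatumGetPointer(binaryheap_remove_first(queue));

        if (entry->active)
        {
            entry->active = false;  /* so the other queue will skip it */
            return entry;
        }
        /* already handled via another queue; just drop the stale pointer */
    }
    return NULL;
}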

On another note, UNDO_PEEK_DEPTH is bogus.  It's used in UndoGetWork()
and it passes the depth argument down to GetRollbackHashKeyFromQueue,
which then does binaryheap_nth() on the relevant queue.  Note that
this function is another place that ends up duplicating code
because of the questionable decision to have separate types of queue
entries for each different queue; otherwise, it could probably just
take the binary heap into which it's peeking as an argument instead of
having three different cases. But that's not the main point here.  The
main point is that it calls a function for whichever type of queue
we've got and gets some kind of queue entry using binaryheap_nth().
But binaryheap_nth(whatever, 2) does not give you the third-smallest
element in the binary heap.  It gives you the third entry in the
array, which may or may not have the heap property, but even if it
does, the third element could be huge.  Consider this binary heap:

0 1 100000 2 3 100001 100002 4 5 6 7 100003 100004 100005 100006

This satisfies the binary heap property, because the element at
position n is always smaller than the elements at positions 2n+1 and
2n+2 (assuming 0-based indexing). But if you want to look at the
smallest three elements in the heap, you can't just look at indexes
0..2.  The second-smallest element must be at index 1 or 2, but it
could be either place.  The third-smallest element could be the other
of 1 and 2, or it could be either child of the smaller one, so there
are three places it might be.  In general, a binary heap is not a good
data structure for finding the smallest N elements of a collection
unless N is 1, and what's going to happen with what you've got here is
that we'll sometimes prioritize an item that would not have been
pulled from the queue for a long time over one that would have
otherwise been processed much sooner.  I'm not sure that's a
show-stopper, but it doesn't seem good, and the current patch doesn't
seem to have any comments justifying it, or at least not in the places
nearby to where this is actually happening.

I think there are more problems here, too.  Let's suppose that we
fixed the problem described in the previous paragraph somehow, or
decided that it won't actually make a big difference and just ignored
it.  Suppose further that we have N active databases which are
generating undo requests.  Luckily, we happen to also have N undo
workers available, and let's suppose that as of a certain moment in
time there is exactly one worker in each database.  Think about what
will happen when one of those workers goes to look for the next undo
request.  It's likely that the first request in the queue will be for
some other database, so it's probably going to have to peek ahead to
find a request for the database to which it's connected -- let's just
assume that there is one.  How far will it have to peek ahead?  Well,
if the requests are uniformly distributed across databases, each
request has a 1-in-N chance of being the right one.  I wrote a little
Perl program to estimate the probability that we won't find the next
request for our database within 10 requests as a function of the
number of databases:

1 databases => failure chance with 10 lookahead is 0.00%
2 databases => failure chance with 10 lookahead is 0.10%
3 databases => failure chance with 10 lookahead is 1.74%
4 databases => failure chance with 10 lookahead is 5.66%
5 databases => failure chance with 10 lookahead is 10.74%
6 databases => failure chance with 10 lookahead is 16.18%
7 databases => failure chance with 10 lookahead is 21.45%
8 databases => failure chance with 10 lookahead is 26.31%
9 databases => failure chance with 10 lookahead is 30.79%
10 databases => failure chance with 10 lookahead is 34.91%
11 databases => failure chance with 10 lookahead is 38.58%
12 databases => failure chance with 10 lookahead is 41.85%
13 databases => failure chance with 10 lookahead is 44.91%
14 databases => failure chance with 10 lookahead is 47.69%
15 databases => failure chance with 10 lookahead is 50.12%
16 databases => failure chance with 10 lookahead is 52.34%
17 databases => failure chance with 10 lookahead is 54.53%
18 databases => failure chance with 10 lookahead is 56.39%
19 databases => failure chance with 10 lookahead is 58.18%
20 databases => failure chance with 10 lookahead is 59.86%

Assuming my script (attached) doesn't have a bug, with only 8
databases, there's better than a 1-in-4 chance that we'll fail to find
the next entry for the current database within the lookahead window.
That's bad, because then the worker will be sitting around waiting
when it should be doing stuff.  Maybe it will even exit, even though
there's work to be done, and even though all the other databases have
their own workers already.  You can construct way worse examples than
this one, too: imagine that there are two databases, each with a
worker, and one has 99% of the requests and the other one has 1% of
the requests.  It's really unlikely that there's going to be an entry
for the second database within the lookahead window.  And note that
increasing the window doesn't really help either: you just need more
databases than the size of the lookahead window, or even almost as
many as the lookahead window, and things are going to stop working
properly.
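
For what it's worth, those simulated numbers track the closed form
(1 - 1/N)^10, i.e. the chance that ten consecutive entries, each
independently and uniformly distributed across N databases, all belong
to somebody else.  A trivial C equivalent of the script:

#include <math.h>
#include <stdio.h>

int
main(void)
{
    int     n;

    for (n = 1; n <= 20; n++)
        printf("%d databases => failure chance with 10 lookahead is %.2f%%\n",
               n, pow(1.0 - 1.0 / n, 10) * 100.0);
    return 0;
}

Either way, the failure probability climbs quickly with the number of
databases.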

On the other hand, suppose that you have 10 databases and one undo
worker.  One database is pretty active and generates a continuous
stream of undo requests at exactly the same speed we can process them.
The others all have 1 pending undo request.  Now, what's going to
happen is that you'll always find the undo request for the current
database within the lookahead window.  So, you'll never exit.  But
that means the undo requests in the other 9 databases will just sit
there for all eternity, because there's no other worker to process
them.  On the other hand, if you had 11 databases, there's a good
chance it would work fine, because the new request for the active
database would likely be outside the lookahead window, and so you'd
find no work to do and exit, allowing a worker to be started up in
some other database.  It would in turn exit and so on and you'd clear
the backlog for the other databases at least for a while, until you
picked the active database again.  Actually, I haven't looked at the
whole patch set, so perhaps there is some solution to this problem
contemplated somewhere, but I consider this argument to be pretty good
evidence that a fixed lookahead distance is probably the wrong thing.

The right things to do about these problems probably need some
discussion, but here's the best idea I have off-hand: instead of just
having 3 binary heaps (size, XID, error), have N+1 "undo work trackers",
each of which contains 3 binary heaps (size, XID, error).  Undo work
tracker #0 contains all requests that are not assigned to any other
undo work tracker.  Each of undo work trackers #1..N contain all the
requests for one particular database, but they all start out unused.
Before launching an undo worker for a particular database, the
launcher must check whether it has an undo work tracker allocated to
that database.  If not, it allocates one and moves all the work for
that database out of tracker #0 and into the newly-allocated tracker.
If there are none free, it must first deallocate an undo work tracker,
moving any remaining work for that tracker back into tracker #0.  With
this approach, there's no need for lookahead, because every worker is
always pulling from a queue that is database-specific, so the next
entry is always guaranteed to be relevant.  And you choose N to be
equal to the number of workers, so that even if every worker is in a
separate database there will be enough trackers for all workers to
have one, plus tracker #0 for whatever's left.
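
In code terms, each tracker could be as simple as this (a sketch with
invented names):

typedef struct UndoWorkTracker
{
    Oid         dbid;           /* InvalidOid for tracker #0 (unassigned) */
    binaryheap *xid_queue;
    binaryheap *size_queue;
    binaryheap *error_queue;
} UndoWorkTracker;

/*
 * trackers[0] holds requests not yet assigned to a per-database tracker;
 * trackers[1..N] are claimed by the launcher, one per database with an
 * active worker, and their contents are folded back into trackers[0]
 * when a tracker is reclaimed.
 */
UndoWorkTracker trackers[N_UNDO_WORKERS + 1];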

There still remains the problem of figuring out when a worker should
terminate to allow for new workers to be launched, which is a fairly
complex problem that deserves its own discussion, but I think this
design helps. At the very least, you can see whether tracker #0 is
empty.  If it is, you might still want to rebalance workers between
databases, but you don't really need to worry about databases getting
starved altogether, because you know that you can run a worker for
every database that has any pending undo.  If tracker #0 is non-empty
but you have unused workers, you can just allocate trackers for the
databases in tracker #0 and move stuff over there to be processed.  If
tracker #0 is non-empty and all workers are allocated, you are going
to need to ask one of them to exit at some point, to avoid starvation.
I don't know exactly what the algorithm for that should be; I do have
some ideas.  I'm not going to include them in this email though,
because this email is already long and I don't have time to make it
longer right now.
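
That said, just to pin down in code the part I did describe (and leaving
the exit policy blank), here is a hypothetical launcher-side sketch that
builds on the UndoWorkTrackerSet sketch above; every helper name is a
placeholder, not a function from the patch set:

extern bool TrackerIsEmpty(UndoWorkTracker *tracker);
extern bool HaveUnusedWorker(void);
extern Oid  FirstDatabaseInTracker(UndoWorkTracker *tracker);
extern UndoWorkTracker *AllocateTrackerFor(UndoWorkTrackerSet *set, Oid dbid);
extern void MoveRequestsForDatabase(UndoWorkTracker *from,
                                    UndoWorkTracker *to, Oid dbid);
extern void LaunchUndoWorker(Oid dbid);

static void
LauncherConsiderTrackers(UndoWorkTrackerSet *set)
{
    if (TrackerIsEmpty(&set->trackers[0]))
        return;     /* every database with pending undo can have its own
                     * worker, so starvation is not an immediate concern */

    if (HaveUnusedWorker())
    {
        /* Give a database from tracker #0 its own tracker and worker. */
        Oid         dbid = FirstDatabaseInTracker(&set->trackers[0]);
        UndoWorkTracker *t = AllocateTrackerFor(set, dbid);

        MoveRequestsForDatabase(&set->trackers[0], t, dbid);
        LaunchUndoWorker(dbid);
    }
    else
    {
        /*
         * Tracker #0 has work but all workers are allocated: eventually we
         * must ask one of them to exit to avoid starvation.  The policy
         * for choosing which one is deliberately not sketched here.
         */
    }
}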

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

lookahead-failure-chance.pl (516 bytes) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: POC: Cleaning up orphaned files using undo logs

akapila
In reply to this post by Robert Haas
On Fri, Jul 5, 2019 at 7:39 PM Robert Haas <[hidden email]> wrote:

>
> On Tue, Jun 25, 2019 at 4:00 PM Amit Kapila <[hidden email]> wrote:
> > Fair enough.  I have implemented it based on next_retry_at and use a
> > constant time of 10s for the next retry.  I have used a #define instead
> > of a GUC, as all the other constants for similar things are defined
> > that way as of now.  One thing to note is that we want the linger time
> > (defined as UNDO_WORKER_LINGER_MS) for an undo worker to be more than
> > the failure retry time (defined as UNDO_FAILURE_RETRY_DELAY_MS), as,
> > otherwise, the undo worker can exit before retrying the failed
> > requests.
>
> Uh, I think we want exactly the opposite.  We want the workers to exit
> before retrying, so that there's a chance for other databases to get
> processed, I think.
>

The workers will exit if there is any chance for other databases to
get processed.  Basically, we linger only when we find there is no
work in other databases.  Moreover, even if some new work is added to
the queues for some other database, we stop the lingering worker when
there is no worker available for the newly arrived request.
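
To spell that rule out, here is a rough sketch (the constant values are
illustrative and the helper names are hypothetical, not taken from the
patch set):

#include "postgres.h"
#include "utils/timestamp.h"

/* Illustrative values only; the patch defines the real constants. */
#define UNDO_FAILURE_RETRY_DELAY_MS     10000
#define UNDO_WORKER_LINGER_MS           20000   /* kept > retry delay */

/* Hypothetical helpers, for illustration only. */
extern bool PendingWorkInOtherDatabase(Oid my_dbid);
extern bool HaveFreeUndoWorker(void);

static bool
UndoWorkerShouldKeepLingering(Oid my_dbid, TimestampTz linger_start)
{
    /* Never linger past the linger timeout. */
    if (TimestampDifferenceExceeds(linger_start, GetCurrentTimestamp(),
                                   UNDO_WORKER_LINGER_MS))
        return false;

    /*
     * Stop lingering whenever some other database has pending work and
     * no free worker is available to pick it up.
     */
    if (PendingWorkInOtherDatabase(my_dbid) && !HaveFreeUndoWorker())
        return false;

    return true;
}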

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Reply | Threaded
Open this post in threaded view
|

Re: POC: Cleaning up orphaned files using undo logs

Dilip Kumar-2
In reply to this post by akapila
On Thu, Jul 4, 2019 at 5:24 PM Amit Kapila <[hidden email]> wrote:
>
PFA, the latest version of the undo interface and undo processing patches.

Summary of the changes in the patch set

1. Undo Interface
- Rebased over latest undo storage code
- Implemented undo page compression (we don't store the common fields in
every record; instead, we get them from the first complete record of
the page).
- As per Robert's comment, UnpackedUndoRecord is divided into two parts:
  a) All fields which are set by the caller.
  b) Pointers to structures which are set internally.
- Epoch and the transaction id are unified as a full transaction id.
- Fixed handling of dbid during recovery (TODO in PrepareUndoInsert).

Pending:
- Move the loop in UndoFetchRecord outside the function and test the
performance of keeping a pin vs. pin+lock across undo records.  This
will be done after testing performance with the zheap code.
- I need to investigate whether the discard check can be unified
between master and hot standby in the UndoFetchRecord function.

2. Undo Processing
- Defect fix in multi-log rollback for subtransaction.
- Assorted defect fixes.

Others
   - Fixup for undo log code to handle full transaction id in
UndoLogSlot for discard and other bug fixes in undo log.
   - Fixup for Orphan file cleanup to pass dbid in PrepareUndoInsert

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

undo_20190706.tar.gz (123K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: POC: Cleaning up orphaned files using undo logs

akapila
In reply to this post by Robert Haas
On Sat, Jul 6, 2019 at 1:47 AM Robert Haas <[hidden email]> wrote:
>
> On Mon, Jul 1, 2019 at 3:54 AM Thomas Munro <[hidden email]> wrote:
> > [ new patches ]
>
> I took a look at 0012 today, Amit's patch for extending the binary
> heap machinery, and 0013, Amit's patch for "Infrastructure to register
> and fetch undo action requests."
>

Thanks for looking into the patches.

> I don't think that binaryheap_allocate_shm() is a good design.  It
> presupposes that we want to store the binary heap as its own chunk of
> shared memory allocated via ShmemInitStruct(), but we might want to do
> something else, like embed in another structure, store it in a DSM or
> DSA, etc., and this function can't do any of that.  I think we should
> have something more like:
>
> extern Size binaryheap_size(int capacity);
> extern void binaryheap_initialize(binaryheap *, int capacity,
> binaryheap_comparator compare, void *arg);
>
> Then the caller can do something like:
>
> sz = binaryheap_size(capacity);
> bh = ShmemInitStruct(name, sz, &found);
> if (!found)
>     binaryheap_initialize(bh, capacity, comparator, whatever);
>
> If it wants to get the memory in some other way, it just needs to
> initialize bh differently; the rest is the same.  Note that there is
> no need, in this design, for binaryheap_size/initialize to make use of
> "shared" memory.  They could equally well be used on backend-local
> memory.  They do not need to care. You just provide the memory, and
> they do their thing.
>

I didn't have other use cases in mind, and I think to some extent this
argument holds true for the existing binaryheap_allocate as well.  If
we want to make it more generic, shouldn't we also change the existing
binaryheap_allocate to use this new model, so that the whole binary
heap allocation API becomes more generic?
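
For example, something like the following minimal sketch, reusing the
existing binaryheap fields (bh_space, bh_size, bh_nodes, ...); this is
only an illustration of the idea, not the actual patch:

#include "postgres.h"
#include "lib/binaryheap.h"

Size
binaryheap_size(int capacity)
{
    return offsetof(binaryheap, bh_nodes) + sizeof(Datum) * capacity;
}

void
binaryheap_initialize(binaryheap *heap, int capacity,
                      binaryheap_comparator compare, void *arg)
{
    heap->bh_space = capacity;
    heap->bh_size = 0;
    heap->bh_has_heap_property = true;
    heap->bh_compare = compare;
    heap->bh_arg = arg;
}

/* The existing backend-local API then becomes a thin wrapper. */
binaryheap *
binaryheap_allocate(int capacity, binaryheap_comparator compare, void *arg)
{
    binaryheap *heap = (binaryheap *) palloc(binaryheap_size(capacity));

    binaryheap_initialize(heap, capacity, compare, arg);
    return heap;
}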

..
..

>
> Now, one objection to the above line of attack is the different queues
> actually contain different types of elements.  Apparently, the XID
> queue contains elements of type UndoXidQueue and the size queue
> contains elements of type UndoSizeQueue.  It is worth noting here that
> these are bad type names, because they sound like they are describing
> a type of queue, but it seems that they are actually describing an
> element in the queue. However, there are two larger problems:
>
> 1. I don't think we should have three different kinds of objects for
> each of the three different queues.  It seems like it would be much
> simpler and easier to just have one kind of object that stores all the
> information we need (full_xid, start_urec_ptr, dbid, request_size,
> next_retry_at, error_ocurred_at) and use that everywhere. You could
> object that this would increase the storage space requirement,

Yes, this was the reason to keep them separate, but I see your point.

> but it
> wouldn't be enough to make any real difference and it probably would
> be well worth it for the avoidance of complexity.
>

Okay, will give it a try and see if it can avoid some code complexity.
Along with this, I will investigate your other suggestions related to
code improvements as well.


>
> On another note, UNDO_PEEK_DEPTH is bogus.  It's used in UndoGetWork()
> and it passes the depth argument down to GetRollbackHashKeyFromQueue,
> which then does binaryheap_nth() on the relevant queue.  Note that
> this function is another place that ends up duplicating code
> because of the questionable decision to have separate types of queue
> entries for each different queue; otherwise, it could probably just
> take the binary heap into which it's peeking as an argument instead of
> having three different cases. But that's not the main point here.  The
> main point is that it calls a function for whichever type of queue
> we've got and gets some kind of queue entry using binaryheap_nth().
> But binaryheap_nth(whatever, 2) does not give you the third-smallest
> element in the binary heap.  It gives you the third entry in the
> array, which may or may not have the heap property, but even if it
> does, the third element could be huge.  Consider this binary heap:
>
> 0 1 100000 2 3 100001 100002 4 5 6 7 100003 100004 100005 100006
>
> This satisfies the binary heap property, because the element at
> position n is always smaller than the elements at positions 2n+1 and
> 2n+2 (assuming 0-based indexing). But if you want to look at the
> smallest three elements in the heap, you can't just look at indexes
> 0..2.  The second-smallest element must be at index 1 or 2, but it
> could be either place.  The third-smallest element could be the other
> of 1 and 2, or it could be either child of the smaller one, so there
> are three places it might be.  In general, a binary heap is not a good
> data structure for finding the smallest N elements of a collection
> unless N is 1, and what's going to happen with what you've got here is
> that we'll sometimes prioritize an item that would not have been
> pulled from the queue for a long time over one that would have
> otherwise been processed much sooner.
>

You are right that it won't be the nth-smallest element from the
queue, and we don't even care about that here.  The peeking logic is
not there to find the next prioritized element, but to check whether we
can find some element for the same database in the next few entries, so
as to avoid frequent undo worker restarts.
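
As a concrete (and purely illustrative) sketch of what the peek is for:
the element type and the function below are hypothetical, and
binaryheap_nth() is the function from the patch set, not from core
PostgreSQL:

#include "postgres.h"
#include "access/transam.h"
#include "lib/binaryheap.h"

#define UNDO_PEEK_DEPTH 10

/* Hypothetical unified element type (the patch has one type per queue). */
typedef struct UndoRequestInfo
{
    FullTransactionId full_xid;
    Oid         dbid;
    /* start_urec_ptr, request_size, next_retry_at, ... */
} UndoRequestInfo;

/* From the patch set, not core: returns the nth slot of the heap array. */
extern Datum binaryheap_nth(binaryheap *heap, int n);

/*
 * Return true if any of the first few array slots of the queue holds a
 * request for the given database.  Note that these are just the first few
 * slots of the underlying array, not the few smallest elements.
 */
static bool
UndoQueueHasRequestForDatabase(binaryheap *queue, Oid dbid)
{
    int         depth = Min(UNDO_PEEK_DEPTH, queue->bh_size);

    for (int i = 0; i < depth; i++)
    {
        UndoRequestInfo *rinfo =
            (UndoRequestInfo *) DatumGetPointer(binaryheap_nth(queue, i));

        if (rinfo->dbid == dbid)
            return true;
    }

    return false;
}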

>  I'm not sure that's a
> show-stopper, but it doesn't seem good, and the current patch doesn't
> seem to have any comments justifying it, or at least not near the
> places where this is actually happening.
>

I agree that we should add more comments explaining this.

> I think there are more problems here, too.  Let's suppose that we
> fixed the problem described in the previous paragraph somehow, or
> decided that it won't actually make a big difference and just ignored
> it.  Suppose further that we have N active databases which are
> generating undo requests.  Luckily, we happen to also have N undo
> workers available, and let's suppose that as of a certain moment in
> time there is exactly one worker in each database.  Think about what
> will happen when one of those workers goes to look for the next undo
> request.  It's likely that the first request in the queue will be for
> some other database, so it's probably going to have to peek ahead to
> find a request for the database to which it's connected -- let's just
> assume that there is one.  How far will it have to peek ahead?  Well,
> if the requests are uniformly distributed across databases, each
> request has a 1-in-N chance of being the right one.  I wrote a little
> Perl program to estimate the probability that we won't find the next
> request for our databases within 10 requests as a function of the
> number of databases:
>
> 1 databases => failure chance with 10 lookahead is 0.00%
> 2 databases => failure chance with 10 lookahead is 0.10%
> 3 databases => failure chance with 10 lookahead is 1.74%
> 4 databases => failure chance with 10 lookahead is 5.66%
> 5 databases => failure chance with 10 lookahead is 10.74%
> 6 databases => failure chance with 10 lookahead is 16.18%
> 7 databases => failure chance with 10 lookahead is 21.45%
> 8 databases => failure chance with 10 lookahead is 26.31%
> 9 databases => failure chance with 10 lookahead is 30.79%
> 10 databases => failure chance with 10 lookahead is 34.91%
> 11 databases => failure chance with 10 lookahead is 38.58%
> 12 databases => failure chance with 10 lookahead is 41.85%
> 13 databases => failure chance with 10 lookahead is 44.91%
> 14 databases => failure chance with 10 lookahead is 47.69%
> 15 databases => failure chance with 10 lookahead is 50.12%
> 16 databases => failure chance with 10 lookahead is 52.34%
> 17 databases => failure chance with 10 lookahead is 54.53%
> 18 databases => failure chance with 10 lookahead is 56.39%
> 19 databases => failure chance with 10 lookahead is 58.18%
> 20 databases => failure chance with 10 lookahead is 59.86%
>
> Assuming my script (attached) doesn't have a bug, with only 8
> databases, there's better than a 1-in-4 chance that we'll fail to find
> the next entry for the current database within the lookahead window.
>

This is a good test scenario, but I think it doesn't take into account
that there are multiple queues and we peek into each one.

> That's bad, because then the worker will be sitting around waiting
> when it should be doing stuff.  Maybe it will even exit, even though
> there's work to be done, and even though all the other databases have
> their own workers already.
>

I think we should first try this scenario with an actual test program
against the undo patches before reaching any conclusion.  I or one of
my colleagues will work on this and report back the results.

>  You can construct way worse examples than
> this one, too: imagine that there are two databases, each with a
> worker, and one has 99% of the requests and the other one has 1% of
> the requests.  It's really unlikely that there's going to be an entry
> for the second database within the lookahead window.
>

I am not sure that is the case, because as soon as the request from
the other database gets prioritized (say because its XID becomes older
than the others) and appears as the first request in one of the queues,
the undo worker will exit (provided it has worked for some threshold
time, say 10s, in that database) and allow the request from the other
database to be processed.

>  And note that
> increasing the window doesn't really help either: you just need more
> databases than the size of the lookahead window, or even almost as
> many as the lookahead window, and things are going to stop working
> properly.
>
> On the other hand, suppose that you have 10 databases and one undo
> worker.  One database is pretty active and generates a continuous
> stream of undo requests at exactly the same speed we can process them.
> The others all have 1 pending undo request.  Now, what's going to
> happen is that you'll always find the undo request for the current
> database within the lookahead window.  So, you'll never exit.
>

Following the logic given above, I think here also the worker will
exit as soon as the request from the other database gets prioritized.

>  But
> that means the undo requests in the other 9 databases will just sit
> there for all eternity, because there's no other worker to process
> them.  On the other hand, if you had 11 databases, there's a good
> chance it would work fine, because the new request for the active
> database would likely be outside the lookahead window, and so you'd
> find no work to do and exit, allowing a worker to be started up in
> some other database.
>

As explained above, I think it will work the same way for both 10 and
11 databases.  Note that we don't always try to look ahead; we look
ahead only when we have not worked on the current database for some
threshold amount of time.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

