POC: Cleaning up orphaned files using undo logs

Re: POC: Cleaning up orphaned files using undo logs

Asim R P
My understanding is that the smgr pendingDeletes infrastructure will be replaced by these patches.  I still see CommitTransaction() calling smgrDoPendingDeletes() in the latest patch set.  Am I missing something?

Asim

Re: POC: Cleaning up orphaned files using undo logs

Thomas Munro-5
On Mon, Jun 10, 2019 at 5:35 AM Asim R P <[hidden email]> wrote:
> My understanding is smgr pendingDeletes infrastructure will be replaced by these patches.  I still see CommitTransaction() calling smgrDoPendingDeletes() in the latest patch set.  Am I missing something?

Hi Asim,

Thanks for looking at the patch.

The pendingDeletes list is used both for files that should be deleted
if we commit and files that should be deleted if we abort.  This patch
deals only with the abort case, using the undo log instead of
pendingDeletes.  That is the file leak scenario that has an
arbitrarily wide window controlled by the user and is probably the
source of almost all cases that you hear of disks filling up with
orphaned junk AFAICS.

There could in theory be a persistent stuff-to-do-if-we-commit system
exactly unlike undo logs (records to be discarded on abort, executed
on commit).  I haven't thought much about how it'd work, but Andres
did suggest something like that for another purpose just the other
day, and although it's hard to think of a name for it, it doesn't seem
crazy as long as it doesn't add overheads when you're not using it.
Without such a mechanism, you can probably leak files belonging to
tables that you have dropped in a committed transaction, if you die in
CommitTransaction() after it has called RecordTransactionCommit() but
before it reaches smgrDoPendingDeletes(), and even then probably only
if there is super well-timed checkpoint so that you recover without
replaying the drop.  I'm not trying to tackle that today.

BTW, there is yet another kind of deferred unlinking going on.  In
SyncPostCheckpoint() (formerly known as mdpostckpt()) we defer the
last bit of the job until after the next checkpoint.  At that point we
only expect the first segment to exist and we expect it to be empty.
That's a mechanism introduced by commit 6cc4451b5c47 to make sure that
we don't reuse relfilenode numbers too soon in some crash scenarios.
That means there is another very narrow window there to leak a file
(though these ones are empty): you die after the checkpoint is logged
but before SyncPostCheckpoint() is run, or even after that but before
the operating system has flushed the directory.

--
Thomas Munro
https://enterprisedb.com



Re: POC: Cleaning up orphaned files using undo logs

Robert Haas
In reply to this post by akapila
On Mon, May 27, 2019 at 5:59 AM Amit Kapila <[hidden email]> wrote:
> Apart from fixing the above comments, the patch is rebased on latest
> undo patchset.  As of now, I have split the binaryheap.c changes into
> a separate patch.  We are still enhancing the patch to compute
> oldestXidHavingUnappliedUndo which touches various parts of patch, so
> splitting further without completing that can make it a bit difficult
> to work on that.

Some review comments around execute_undo_actions:

The 'nopartial' argument to execute_undo_actions is confusing.  First,
it would probably be worth spelling it out instead of abbreviation:
not_partial_transaction rather than nopartial.  Second, it is usually
better to phrase parameter names in terms of what they are rather than
in terms of what they are not: complete_transaction rather than
not_partial_transaction.  Third, it's unclear from these comments why
we'd be undoing something other than a complete transaction.  It looks
as though the answer is that this flag will be false when we're
undoing a subxact -- in which case, why not invert the sense of the
flag and call it 'bool subxact'?  I might be wrong, but it seems like
that would be a whole lot clearer. Fourth, the block at the top of
this function, guarded by nopartial, seems like it must be vulnerable
to race conditions.  If we're undoing the complete transaction, then
it checks whether UndoFetchRecord() still returns anything.  But that
could change not just at the beginning of the function, but also any
time in the middle, or so it seems to me.  I doubt that this is the
right level at which to deal with this sort of interlocking. I think
there should be some higher-level mechanism that prevents two
processes from trying to undo the same transaction at the same time,
like a heavyweight lock or some kind of flag in the shared memory data
structure that keeps track of pending undo, so that we never even
reach this code unless we know that this XID needs undo work and no
other process is already doing it.  If you're the only one undoing XID
123456, then there shouldn't be any chance of the undo disappearing
from underneath you.  And we definitely want to guarantee that only
one process is undoing any given XID at a time.
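
As a rough illustration of that kind of interlock (purely a sketch of
the idea, not something the current patches do), the undo-applying
process could take the aborted transaction's heavyweight lock before
doing any work, where "xid" below is the XID being undone:

LOCKTAG     tag;

SET_LOCKTAG_TRANSACTION(tag, xid);
if (LockAcquire(&tag, ExclusiveLock, false, true) == LOCKACQUIRE_NOT_AVAIL)
    return;                 /* someone else is already undoing this XID */

/* ... apply all undo actions for xid ... */

LockRelease(&tag, ExclusiveLock, false);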

The 'blk_chain_complete' variable which is set in this function and
passed down to execute_undo_actions_page() and then to the rmgr's
rm_undo callback also looks problematic.  First, not every AM that
uses undo may even have the notion of a 'block chain'; zedstore for
example uses TIDs as a 48-bit integer, not a block + offset number, so
it's really not going to have a 'block chain.'  Second, even in
zheap's usage, it seems to me that the block chain could be complete
even when this gets set to false. It gets set to true when we're
undoing a toplevel transaction (not a subxact) and we were able to
fetch all of the undo for that toplevel transaction. But even if
that's not true, the chain for an individual block could still be
complete, because all the remaining undo for the block at issue
might've been in the chunk of undo we already read; the remaining undo
could be for other blocks.  For that reason, I can't see how the zheap
code that relies on this value can be correct; it uses this value to
decide whether to stick zeroes in the transaction slot, but if the
scenario described above happened, then I suppose the XID would not
get cleared from the slot during undo.  Maybe zheap is just relying on
that being harmless, since if all of the undo actions have been
correctly executed for the page, the fact that the transaction slot is
still bogusly used by an aborted xact won't matter; nothing will refer
to it. However, it seems to me that it would be better for zheap to
set things up so that the first undo record for a particular txn/page
combination is flagged in some way (in the payload!) so that undo can
zero the slot if the action being undone is the one that claimed the
slot.  That seems cleaner on principle, and it also avoids having
supposedly AM-independent code pass down details that are driven by
zheap's particular needs.

While it's probably moot since I think this code should go away
anyway, I find it poor style to write something like:

+ if (nopartial && !UndoRecPtrIsValid(urec_ptr))
+ blk_chain_complete = true;
+ else
+ blk_chain_complete = false;

"if (x) y = true; else y = false;" can be more compactly written as "y
= x;", like this:

blk_chain_complete = nopartial && !UndoRecPtrIsValid(urec_ptr);

I think that the signature for rm_undo can be simplified considerably.
I think blk_chain_complete should go away for the reasons discussed
above.  Also, based on our conversations with Heikki at PGCon, we
decided that we should not presume that the AM wants the records
grouped by block, so the blkno argument should go away.  In addition,
I don't see much reason to have a first_idx argument. Instead of
passing a pointer to the caller's entire array and telling the
callback where to start looking, couldn't we just pass a pointer to
the first record the callback should examine, i.e. instead of passing
urp_array, pass urp_array + first_idx.  Then instead of having a
last_idx argument, have an argument for the number of entries in the
array, computed as last_idx - first_idx + 1.  With those changes,
rm_undo would look like this:

bool (*rm_undo) (UndoRecInfo *urp_array, int count, Oid reloid,
FullTransactionId full_xid);

Now for the $10m question: why even pass reloid and full_xid?  Aren't
those values going to be present inside every UnpackedUndoRecord?  Why
not just let the callback get them from the first record (or however
it wants to do things)?  Perhaps there is some documentation value
here in that it implies that the value will be the same for every
record, but we could also handle that by just documenting in the
appropriate places that undo is done by transaction and relation and
therefore the callback is entitled to assume that the same value will
be present in every record.  Then again, I am not sure we really want
the callback to assume that reloid doesn't change.  I don't see a
reason offhand not to just pass as many records as we have for a given
transaction and let the callback do what it likes.  So maybe that's
another reason to get rid of the reloid argument, at least.  And then
we could document that all the record will have the same full_xid
(unless we decide that we don't want to guarantee that either).

Additionally, it strikes me that urp_array is not the greatest name.
Generally, putting _array into the name of the variable to indicate
that it's an array doesn't seem all that great from a coding-style
perspective.  I mean, sometimes it's the best you can do, but it's not
amazing.  And urp seems like it's using an abbreviation without any
real reason.  For contrast, consider this existing precedent:

extern SysScanDesc systable_beginscan_ordered(Relation heapRelation,
                                                   Relation indexRelation,
                                                   Snapshot snapshot,
                                                   int nkeys, ScanKey key);

Or this one:

extern TupleDesc CreateTupleDesc(int natts, Form_pg_attribute *attrs);

Notice that in each case the array parameter (which is the last one)
is named based on what data it contains rather than on the fact that
it is an array.

Finally, I observe that rm_undo returns a Boolean, but it's not used
for anything.  The only call to rm_undo in the current patch set is in
execute_undo_actions_page, which returns that value to the caller, but
the callers just discard it.  I suppose maybe this was intended to
report success or failure, but I think the way that rm_undo will
report failure is to ERROR.  Or, if we want to allow a fail-soft
behavior for some reason, then the callers all need to check the
value.  I'm not sure whether there's a use case for that or not.

Putting all that together, I suggest a signature like this:

void (*rm_undo) (int nrecords, UndoRecInfo *records);

Or if we decide we need to have a fail-soft behavior, then like this:

bool (*rm_undo) (int nrecords, UndoRecInfo *records);
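
Just to illustrate the shape, a callback under the void-returning
signature might end up looking something like this (the UndoRecInfo
layout and the uur_* fields used here are assumptions for the sake of
the example, as are the AM-specific names):

static void
myam_rm_undo(int nrecords, UndoRecInfo *records)
{
    for (int i = 0; i < nrecords; i++)
    {
        /* assumed: each UndoRecInfo carries the unpacked record */
        UnpackedUndoRecord *uur = records[i].uur;

        switch (uur->uur_type)          /* assumed AM-defined type field */
        {
            case MYAM_UNDO_INSERT:
                /* undo the insert described by this record */
                break;
            default:
                elog(ERROR, "unrecognized undo record type %d",
                     uur->uur_type);
        }
    }
}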

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: POC: Cleaning up orphaned files using undo logs

Thomas Munro-5
In reply to this post by Thomas Munro-5
On Mon, Jun 10, 2019 at 3:00 PM Thomas Munro <[hidden email]> wrote:
> On Mon, Jun 10, 2019 at 5:35 AM Asim R P <[hidden email]> wrote:
> > My understanding is smgr pendingDeletes infrastructure will be replaced by these patches.  I still see CommitTransaction() calling smgrDoPendingDeletes() in the latest patch set.  Am I missing something?

> Thanks for looking at the patch.

Hello,

Here is a new rebased version of the full patch set for orphaned file
cleanup.  The orphaned file cleanup code itself hasn't changed, but
there are some changes in the patches lower down in the stack:

* getting rid of more global variables, instead using eg
CurrentSession->attached_undo_logs (the session.h infrastructure that
is intended to avoid creating more multithreading-hostile code)

* using undo log "slots" in various APIs to make it clearer that slots
can be recycled, which has locking implications, plus several locking
bug fixes that motivated that change

* current versions of the record and worker code discussed upthread by
Amit and others

The code is also at https://github.com/EnterpriseDB/zheap/tree/undo
and includes patches from
https://github.com/EnterpriseDB/zheap/tree/undoprocessing and
https://github.com/EnterpriseDB/zheap/tree/undo_interface_v1 where
some parts of this stack (workers etc) are being developed.

--
Thomas Munro
https://enterprisedb.com

Attachment: undo-20190614.tgz (233K)

Re: POC: Cleaning up orphaned files using undo logs

akapila
On Fri, Jun 14, 2019 at 8:26 AM Thomas Munro <[hidden email]> wrote:
>
> * current versions of the record and worker code discussed upthread by
> Amit and others
>

Thanks for posting the complete patchset.

Last time, I mentioned the remaining work in undo-processing patchset,
the status of which is as follows:
1. Enhance uur_progress so that it updates undo action apply progress
at regular intervals.  This has been done.  The idea is that we update
the transaction's undo apply progress at regular intervals so that
after a crash we can skip already applied undo.   The undo apply
progress is updated in terms of the number of blocks processed.
I think it is better to change the name of uur_progress to something
like uur_apply_progress.  Any suggestions?

2. Enhance to support oldestXidHavingUnappliedUndo, more on that
later.  This has been done.  The idea here is that we register all the
undo apply (transaction abort) requests in the hash table (referred to
as Rollback Hash Table in the patch) and we have a hard limit (after
that we won't allow new transactions to write undo) on how many such
requests can be pending.  So scanning this table gives us the value of
oldestXidHavingUnappliedUndo (actually the value for this will be the
smaller of 'the oldest xid having pending undo' and 'oldestXmin').  As
this rollback hash table is not persistent, after a restart we need to
take a pass over the undo logs to register all the pending abort
requests in the rollback hash table.  This value serves two main
purposes: (a) any XID below it is all-visible, so it can help in
visibility checks; (b) it can help us implement the rule that "No
aborted XID with an age >2^31 can have unapplied undo."  This helps us
decide when to truncate the clog, because we can't truncate the clog
for transactions that still have unapplied undo.

3. Split the patch.  The patch is split into five patches.  I will
give a brief description of each patch which to a good extent is
mentioned in the commit message for each patch as well:
0010-Extend-binary-heap-functionality -  This patch adds routines
to allocate a binary heap in shared memory and to remove the nth
element from a binary heap.  These routines will be used by a later
patch that provides an efficient way to process the pending rollback
requests.

0011-Infrastructure-to-register-and-fetch-undo-action-req - This patch
provides an infrastructure to register and fetch undo action requests,
which gives us a way to ensure that undo actions get executed.  One
might think that we can always execute undo actions on error or
explicit rollback by the user; however, there are cases when that is
not possible.  For example, (a) if the system crashes while doing the
operation, then after startup we need a way to perform the undo
actions; (b) if we get an error while performing the undo actions
themselves, they need to be retried later.

Apart from this, when there are large rollback requests, it is quite
inefficient to perform all the undo actions before returning control
to the user.

0012-Infrastructure-to-execute-pending-undo-actions - This provides an
infrastructure to execute pending undo actions.  To apply the undo
actions, we collect the undo records in bulk and try to process them
together.  We make sure to update the transaction's progress at
regular intervals so that after a crash we can skip already applied
undo.  This needs some more work to generalize the processing of undo
records so that this infrastructure can be used by other AMs as well.

0013-Allow-foreground-transactions-to-perform-undo-action - This patch
allows foreground transactions to perform undo actions on abort.  We
always perform rollback actions after cleaning up the current
(sub)transaction.  This will ensure that we perform the actions
immediately after an error (and release the locks) rather than when
the user issues a ROLLBACK command at some later point in time.  We
release the locks after the undo actions are applied.  The reason to
delay lock release is that if we release locks before applying undo
actions, then a parallel session can acquire the lock before us, which
can lead to deadlock.

0014-Allow-execution-and-discard-of-undo-by-background-wo-  - This
patch allows execution and discard of undo by background workers.
The undo launcher is responsible for launching workers only if there
is some work available in one of the work queues and more workers are
available.  A worker is launched to handle requests for a particular
database.

The discard worker is responsible for discarding the undo log of
transactions that are committed and all-visible or are rolled back.
It also registers requests for aborted transactions in the work
queues.  It iterates through all the active logs one by one and tries
to discard the transactions that are old enough to matter.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: POC: Cleaning up orphaned files using undo logs

akapila
In reply to this post by Robert Haas
On Thu, Jun 13, 2019 at 3:13 AM Robert Haas <[hidden email]> wrote:

>
> On Mon, May 27, 2019 at 5:59 AM Amit Kapila <[hidden email]> wrote:
> > Apart from fixing the above comments, the patch is rebased on latest
> > undo patchset.  As of now, I have split the binaryheap.c changes into
> > a separate patch.  We are still enhancing the patch to compute
> > oldestXidHavingUnappliedUndo which touches various parts of patch, so
> > splitting further without completing that can make it a bit difficult
> > to work on that.
>
> Some review comments around execute_undo_actions:
>
> The 'nopartial' argument to execute_undo_actions is confusing.  First,
> it would probably be worth spelling it out instead of abbreviation:
> not_partial_transaction rather than nopartial.  Second, it is usually
> better to phrase parameter names in terms of what they are rather than
> in terms of what they are not: complete_transaction rather than
> not_partial_transaction.  Third, it's unclear from these comments why
> we'd be undoing something other than a complete transaction.  It looks
> as though the answer is that this flag will be false when we're
> undoing a subxact -- in which case, why not invert the sense of the
> flag and call it 'bool subxact'?  I might be wrong, but it seems like
> that would be a whole lot clearer.
>

The idea was that it could be used for multiple purposes like 'rolling
back complete xact', 'rolling back subxact', 'rollback at page-level'
or any similar future need even though not all code paths use that
function.  I am not wedded to any particular name here, but among your
suggestions complete_transaction sounds better to me.  Are you okay
going with that?

> Fourth, the block at the top of
> this function, guarded by nopartial, seems like it must be vulnerable
> to race conditions.  If we're undoing the complete transaction, then
> it checks whether UndoFetchRecord() still returns anything.  But that
> could change not just at the beginning of the function, but also any
> time in the middle, or so it seems to me.
>

It won't change in between because we have ensured at the top level
that no two processes can start executing pending undo at the same
time.  Basically, anyone who wants to execute the undo actions will
have an entry in the rollback hash table, and that entry will be
marked as in-progress.  As mentioned in the comments, the race is only
"after the discard worker fetches the record and finds that this
transaction needs to be rolled back, a backend might concurrently
execute the actions and remove the request from the rollback hash
table."
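
In pseudo-code, the interlock is roughly the following (simplified,
with invented names for the hash table and entry fields):

LWLockAcquire(RollbackRequestLock, LW_EXCLUSIVE);
entry = hash_search(RollbackHT, &fxid, HASH_FIND, NULL);
if (entry == NULL || entry->in_progress)
{
    /* nothing to do, or someone else is already undoing this xact */
    LWLockRelease(RollbackRequestLock);
    return;
}
entry->in_progress = true;
LWLockRelease(RollbackRequestLock);

/* ... execute the undo actions, then remove the entry ... */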

>  I doubt that this is the
> right level at which to deal with this sort of interlocking. I think
> there should be some higher-level mechanism that prevents two
> processes from trying to undo the same transaction at the same time,
> like a heavyweight lock or some kind of flag in the shared memory data
> structure that keeps track of pending undo, so that we never even
> reach this code unless we know that this XID needs undo work
>

Introducing a heavyweight lock can create a different sort of problem
because we would need to hold it until all the actions are applied to
avoid what I mentioned above.  The problem is that the discard worker
would be blocked until the backend/undo worker has applied all the
actions, unless we take this lock only conditionally in the discard
worker.

Another way could be that we re-fetch the undo record when we are
registering the undo request under RollbackRequestLock and check its
status again, because in that case the backend or another undo worker
won't be able to remove the request from the hash table concurrently.
However, the advantage of checking it in execute_undo_actions is that
we can optimize it in the future to avoid re-fetching this record when
actually fetching the records to apply undo actions.

> and no
> other process is already doing it.
>

This part is already ensured in the current code.

>
> The 'blk_chain_complete' variable which is set in this function and
> passed down to execute_undo_actions_page() and then to the rmgr's
> rm_undo callback also looks problematic.
>

I agree this parameter should go away from the generic interface
considering the requirements from zedstore.

>  First, not every AM that
> uses undo may even have the notion of a 'block chain'; zedstore for
> example uses TIDs as a 48-bit integer, not a block + offset number, so
> it's really not going to have a 'block chain.'  Second, even in
> zheap's usage, it seems to me that the block chain could be complete
> even when this gets set to false. It gets set to true when we're
> undoing a toplevel transaction (not a subxact) and we were able to
> fetch all of the undo for that toplevel transaction. But even if
> that's not true, the chain for an individual block could still be
> complete, because all the remaining undo for the block at issue
> might've been in the chunk of undo we already read; the remaining undo
> could be for other blocks.  For that reason, I can't see how the zheap
> code that relies on this value can be correct; it uses this value to
> decide whether to stick zeroes in the transaction slot, but if the
> scenario described above happened, then I suppose the XID would not
> get cleared from the slot during undo.  Maybe zheap is just relying on
> that being harmless, since if all of the undo actions have been
> correctly executed for the page, the fact that the transaction slot is
> still bogusly used by an aborted xact won't matter; nothing will refer
> to it. However, it seems to me that it would be better for zheap to
> set things up so that the first undo record for a particular txn/page
> combination is flagged in some way (in the payload!) so that undo can
> zero the slot if the action being undone is the one that claimed the
> slot.  That seems cleaner on principle, and it also avoids having
> supposedly AM-independent code pass down details that are driven by
> zheap's particular needs.
>

Yeah, we can do what you are suggesting for zheap, or in many cases we
should be able to detect it via uur_blkprev of the last record for the
page.  An invalid value will indicate that the chain for the page is
complete.
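
In other words, something like this (assuming uur_blkprev links to the
previous undo record for the same page, with an invalid pointer meaning
there is none):

bool
page_chain_is_complete(UnpackedUndoRecord *last_uur_for_page)
{
    return !UndoRecPtrIsValid(last_uur_for_page->uur_blkprev);
}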

>
> I think that the signature for rm_undo can be simplified considerably.
> I think blk_chain_complete should go away for the reasons discussed
> above.  Also, based on our conversations with Heikki at PGCon, we
> decided that we should not presume that the AM wants the records
> grouped by block, so the blkno argument should go away.  In addition,
> I don't see much reason to have a first_idx argument. Instead of
> passing a pointer to the caller's entire array and telling the
> callback where to start looking, couldn't we just pass a pointer to
> the first record the callback should examine, i.e. instead of passing
> urp_array, pass urp_array + first_idx.  Then instead of having a
> last_idx argument, have an argument for the number of entries in the
> array, computed as last_idx - first_idx + 1.  With those changes,
> rm_undo would look like this:
>
> bool (*rm_undo) (UndoRecInfo *urp_array, int count, Oid reloid,
> FullTransactionId full_xid);
>

I agree.

> Now for the $10m question: why even pass reloid and full_xid?  Aren't
> those values going to be present inside every UnpackedUndoRecord?  Why
> not just let the callback get them from the first record (or however
> it wants to do things)?  Perhaps there is some documentation value
> here in that it implies that the value will be the same for every
> record, but we could also handle that by just documenting in the
> appropriate places that undo is done by transaction and relation and
> therefore the callback is entitled to assume that the same value will
> be present in every record.  Then again, I am not sure we really want
> the callback to assume that reloid doesn't change.  I don't see a
> reason offhand not to just pass as many records as we have for a given
> transaction and let the callback do what it likes.  So maybe that's
> another reason to get rid of the reloid argument, at least.  And then
> we could document that all the record will have the same full_xid
> (unless we decide that we don't want to guarantee that either).
>
> Additionally, it strikes me that urp_array is not the greatest name.
> Generally, putting _array into the name of the variable to indicate
> that it's an array doesn't seem all that great from a coding-style
> perspective.  I mean, sometimes it's the best you can do, but it's not
> amazing.  And urp seems like it's using an abbreviation without any
> real reason.  For contrast, consider this existing precedent:
>
> extern SysScanDesc systable_beginscan_ordered(Relation heapRelation,
>                                                    Relation indexRelation,
>                                                    Snapshot snapshot,
>                                                    int nkeys, ScanKey key);
>
> Or this one:
>
> extern TupleDesc CreateTupleDesc(int natts, Form_pg_attribute *attrs);
>
> Notice that in each case the array parameter (which is the last one)
> is named based on what data it contains rather than on the fact that
> it is an array.
>

Agreed, will change accordingly.

> Finally, I observe that rm_undo returns a Boolean, but it's not used
> for anything.  The only call to rm_undo in the current patch set is in
> execute_undo_actions_page, which returns that value to the caller, but
> the callers just discard it.  I suppose maybe this was intended to
> report success or failure, but I think the way that rm_undo will
> report failure is to ERROR.
>

For the error case, it is fine to report failure, but there can be
cases where we don't need to apply undo actions, e.g. when the relation
has been dropped/truncated or the undo actions have already been
applied.  The original idea was to cover such cases with the return
value.  I agree that currently the caller ignores this value, but there
is some value in keeping it.  So, I am in favor of a signature with
bool as the return value.


--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: POC: Cleaning up orphaned files using undo logs

Robert Haas
On Mon, Jun 17, 2019 at 6:03 AM Amit Kapila <[hidden email]> wrote:
> The idea was that it could be use for multiple purposes like 'rolling
> back complete xact', 'rolling back subxact', 'rollback at page-level'
> or any similar future need even though not all code paths use that
> function.  I am not wedded to any particular name here, but among your
> suggestions complete_transaction sounds better to me.  Are you okay
> going with that?

Sure, let's try that for now and see how it looks.  We can always
change it again if it seems to be a good idea later.

> It won't change in between because we have ensured at top-level that
> no two processes can start executing pending undo at the same time.
> Basically, anyone wants to execute the undo actions will have an entry
> in rollback hash table and that will be marked as in-progress.  As
> mentioned in comments, the race is only "after discard worker
> fetches the record and found that this transaction need to be rolled
> back, backend might concurrently execute the actions and remove the
> request from rollback hash table."
>
> [ discussion of alternatives ]

I'm not precisely sure what the best thing to do here is, but I'm
skeptical that the code in question belongs in this function.  There
are two separate things going on here: one is this revalidation that
the undo hasn't been discarded, and the other is executing the undo
actions. Those are clearly separate tasks, and they are not tasks that
always get done together: sometimes we do only one, and sometimes we
do both.  Any function that looks like this is inherently suspicious:

whatever(....., bool flag)
{
    if (flag)
    {
         // lengthy block of code
    }

    // another lengthy block of code
}

There has to be a reason not to just split this into two functions and
let the caller decide whether to call one or both.
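
That is, continuing the pseudocode above, the shape I'd expect is more
like this, with the names left as placeholders:

whatever_check(.....)
{
     // lengthy block of code
}

whatever_execute(.....)
{
     // another lengthy block of code
}

and then callers do whatever_check() and/or whatever_execute() as
appropriate, instead of passing a flag down into one combined function.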

> For Error case, it is fine to report failure, but there can be cases
> where we don't need to apply undo actions like when the relation is
> dropped/truncated, undo actions are already applied.  The original
> idea was to cover such cases by the return value.  I agree that
> currently, caller ignores this value, but there is some value in
> keeping it.  So, I am in favor of a signature with bool as the return
> value.

OK.  So then the callers can't keep ignoring it... and there should be
some test framework that verifies the behavior when the return value
is false.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: POC: Cleaning up orphaned files using undo logs

akapila
On Mon, Jun 17, 2019 at 7:30 PM Robert Haas <[hidden email]> wrote:

>
> On Mon, Jun 17, 2019 at 6:03 AM Amit Kapila <[hidden email]> wrote:
>
> I'm not precisely sure what the best thing to do here is, but I'm
> skeptical that the code in question belongs in this function.  There
> are two separate things going on here: one is this revalidation that
> the undo hasn't been discarded, and the other is executing the undo
> actions. Those are clearly separate tasks, and they are not tasks that
> always get done together: sometimes we do only one, and sometimes we
> do both.  Any function that looks like this is inherently suspicious:
>
> whatever(....., bool flag)
> {
>     if (flag)
>     {
>          // lengthy block of code
>     }
>
>     // another lengthy block of code
> }
>
> There has to be a reason not to just split this into two functions and
> let the caller decide whether to call one or both.
>
Yeah, because some of the information required to perform the
necessary steps (in the code under the flag) is quite central to this
function (see the undo apply progress update part) and is used in more
than one place in this function.  I have refactored the code in this
function; see if it makes sense now.  You need to check patch
0012-Infrastructure-to-execute-pending-undo-actions.patch for these
changes.

> > For Error case, it is fine to report failure, but there can be cases
> > where we don't need to apply undo actions like when the relation is
> > dropped/truncated, undo actions are already applied.  The original
> > idea was to cover such cases by the return value.  I agree that
> > currently, caller ignores this value, but there is some value in
> > keeping it.  So, I am in favor of a signature with bool as the return
> > value.
>
> OK.  So then the callers can't keep ignoring it...
>
I again thought about this but couldn't come up with anything
meaningful.  The idea is to ignore some undo records if they belong to
a relation that is already gone.  I think we can do something about it
in zheap-specific code and make the generic code return void.

I have fixed the other comments raised by you.  See
0012-Infrastructure-to-execute-pending-undo-actions.patch

Apart from the changes related to undo apply, this patch series
contains changes to place the transaction header immediately after
UndoRecordHeader, which makes it easier to update.  The changes are in
patches 0007-Provide-interfaces-to-store-and-fetch-undo-records.patch
and 0012-Infrastructure-to-execute-pending-undo-actions.patch.

There are no changes in undo log module patches.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachments:
0005-Add-prefetch-support-for-the-undo-log.patch (10K)
0001-Add-SmgrId-to-smgropen-and-BufferTag.patch (90K)
0004-Allow-WAL-record-data-on-first-modification-after-a-.patch (2K)
0002-Move-tablespace-dir-creation-from-smgr.c-to-md.c.patch (3K)
0003-Add-undo-log-manager.patch (238K)
0006-Defect-and-enhancement-in-multi-log-support.patch (5K)
0008-Test-module-for-undo-api.patch (12K)
0007-Provide-interfaces-to-store-and-fetch-undo-records.patch (122K)
0010-Extend-binary-heap-functionality.patch (7K)
0009-undo-page-consistency-checker.patch (15K)
0012-Infrastructure-to-execute-pending-undo-actions.patch (47K)
0011-Infrastructure-to-register-and-fetch-undo-action-req.patch (90K)
0013-Allow-foreground-transactions-to-perform-undo-action.patch (69K)
0014-Allow-execution-and-discard-of-undo-by-background-wo.patch (124K)

Re: POC: Cleaning up orphaned files using undo logs

Robert Haas
On Tue, Jun 18, 2019 at 7:31 AM Amit Kapila <[hidden email]> wrote:
> [ new patches ]

I tried writing some code that throws an error from an undo log
handler and the results were not good.  It appears that the code will
retry in a tight loop:

2019-06-18 13:58:53.262 EDT [42803] ERROR:  robert_undo
2019-06-18 13:58:53.263 EDT [42803] ERROR:  robert_undo
2019-06-18 13:58:53.263 EDT [42803] ERROR:  robert_undo
2019-06-18 13:58:53.263 EDT [42803] ERROR:  robert_undo
2019-06-18 13:58:53.263 EDT [42803] ERROR:  robert_undo
2019-06-18 13:58:53.263 EDT [42803] ERROR:  robert_undo
2019-06-18 13:58:53.263 EDT [42803] ERROR:  robert_undo
2019-06-18 13:58:53.263 EDT [42803] ERROR:  robert_undo
2019-06-18 13:58:53.263 EDT [42803] ERROR:  robert_undo
2019-06-18 13:58:53.263 EDT [42803] ERROR:  robert_undo
2019-06-18 13:58:53.263 EDT [42803] ERROR:  robert_undo
2019-06-18 13:58:53.263 EDT [42803] ERROR:  robert_undo
2019-06-18 13:58:53.263 EDT [42803] ERROR:  robert_undo
2019-06-18 13:58:53.263 EDT [42803] ERROR:  robert_undo
2019-06-18 13:58:53.263 EDT [42803] ERROR:  robert_undo
2019-06-18 13:58:53.263 EDT [42803] ERROR:  robert_undo
2019-06-18 13:58:53.263 EDT [42803] ERROR:  robert_undo
2019-06-18 13:58:53.263 EDT [42803] ERROR:  robert_undo
2019-06-18 13:58:53.263 EDT [42803] ERROR:  robert_undo
2019-06-18 13:58:53.263 EDT [42803] ERROR:  robert_undo
2019-06-18 13:58:53.263 EDT [42803] ERROR:  robert_undo
2019-06-18 13:58:53.263 EDT [42803] ERROR:  robert_undo
2019-06-18 13:58:53.263 EDT [42803] ERROR:  robert_undo
2019-06-18 13:58:53.263 EDT [42803] ERROR:  robert_undo
2019-06-18 13:58:53.263 EDT [42803] ERROR:  robert_undo
2019-06-18 13:58:53.263 EDT [42803] ERROR:  robert_undo
2019-06-18 13:58:53.263 EDT [42803] ERROR:  robert_undo
2019-06-18 13:58:53.263 EDT [42803] ERROR:  robert_undo
2019-06-18 13:58:53.263 EDT [42803] ERROR:  robert_undo
2019-06-18 13:58:53.263 EDT [42803] ERROR:  robert_undo
2019-06-18 13:58:53.263 EDT [42803] ERROR:  robert_undo
2019-06-18 13:58:53.263 EDT [42803] ERROR:  robert_undo
2019-06-18 13:58:53.263 EDT [42803] ERROR:  robert_undo
2019-06-18 13:58:53.263 EDT [42803] ERROR:  robert_undo
2019-06-18 13:58:53.263 EDT [42803] ERROR:  robert_undo
2019-06-18 13:58:53.263 EDT [42803] ERROR:  robert_undo
2019-06-18 13:58:53.263 EDT [42803] ERROR:  robert_undo
2019-06-18 13:58:53.263 EDT [42803] ERROR:  robert_undo
2019-06-18 13:58:53.263 EDT [42803] ERROR:  robert_undo
2019-06-18 13:58:53.263 EDT [42803] ERROR:  robert_undo
2019-06-18 13:58:53.263 EDT [42803] ERROR:  robert_undo
2019-06-18 13:58:53.264 EDT [42803] ERROR:  robert_undo

It seems clear that the error-handling aspect of this patch has not
been given enough thought.  It's debatable what strategy should be
used when undo fails, but retrying 40 times per millisecond isn't the
right answer. I assume we want some kind of cool-down between retries.
10 seconds?  A minute?  Some kind of back-off algorithm that gradually
increases the retry time up to some maximum?  Should there be one or
more GUCs?

Another thing that is not very nice is that when I tried to shut down
the server via 'pg_ctl stop' while the above was happening, it did not
shut down.  I had to use an immediate shutdown.  That's clearly not
OK.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: POC: Cleaning up orphaned files using undo logs

Robert Haas
On Tue, Jun 18, 2019 at 2:07 PM Robert Haas <[hidden email]> wrote:
> On Tue, Jun 18, 2019 at 7:31 AM Amit Kapila <[hidden email]> wrote:
> > [ new patches ]
>
> I tried writing some code that throws an error from an undo log
> handler and the results were not good.

I discovered another bothersome thing here: if you have a single
transaction that generates a bunch of undo records, the first one has
uur_dbid set correctly and the remaining records have uur_dbid set to
0.  That means if you try to write a sanity check like if
(record->uur_dbid != MyDatabaseId) elog(ERROR, "the undo system messed
up") it fails.

The original idea of UnpackedUndoRecord was this: you would put a
bunch of data into an UnpackedUndoRecord that you wanted written to
undo, and the undo system would find ways to compress stuff out of the
on-disk representation by e.g. omitting the fork number if it's
MAIN_FORKNUM. Then, when you read an undo record, it would decompress
so that you ended up with the same UnpackedUndoRecord that you had at
the beginning. However, the inclusion of transaction headers has made
this a bit confusing: that stuff isn't being added by the user but by
the undo system itself. It's not very clear from the comments what the
contract is around these things: do you need to set uur_dbid to
MyDatabaseId when preparing to insert an undo record? Or can you just
leave it unset and then it'll possibly be set at decoding time?  The
comments for the UnpackedUndoRecord structure don't explain this.

I'd really like to see this draw a cleaner distinction between the
stuff that the user is expected to set and the other stuff we deal
with internally to the undo subsystem.  For example, suppose that
UnpackedUndoRecord didn't include any of the fields that are only
present in the transaction header.  Maybe there's another structure,
like UndoTransactionHeader, that includes those fields.  The client of
the undo subsystem creates a bunch of UnpackedUndoRecords and inserts
them.  At undo time, the callback gets back an identical set of
UnpackedUndoRecords.  And maybe it also gets a pointer to the
UndoTransactionHeader which contains all of the system-generated
stuff. Under this scheme, uur_xid, uur_xidepoch (which still need to
be combined into uur_fxid), uur_progress, uur_dbid, uur_next,
uur_prevlogstart, and uur_prevurp would all move out of the
UnpackedUndoRecord and into the UndoTransactionHeader. The user would
supply none of those things when inserting undo records, but the
rm_undo callback could examine those values if it wished.
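
Sketched out (field names as listed above, types guessed purely for
illustration), the split might look like:

typedef struct UndoTransactionHeader
{
    FullTransactionId uur_fxid;        /* replaces uur_xid + uur_xidepoch */
    Oid               uur_dbid;
    uint32            uur_progress;    /* undo apply progress */
    UndoRecPtr        uur_next;        /* next transaction's undo start */
    UndoRecPtr        uur_prevlogstart;
    UndoRecPtr        uur_prevurp;
} UndoTransactionHeader;

with the rm_undo callback optionally receiving a pointer to this
structure alongside the array of UnpackedUndoRecords.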

A weaker approach would be to at least clean up the structure
definition so that the transaction-header fields set by the system are
clearly segregated from the per-record fields set by the
undo-inserter, with comments explaining that those fields don't need
to be set but will (or may?) be set at undo time. That would be better
than what we have right now - because it would hopefully make it much
more clear which fields need to be set on insert and which fields can
be expected to be set when decoding - but I think it's probably not
going far enough.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: POC: Cleaning up orphaned files using undo logs

akapila
In reply to this post by Robert Haas
On Tue, Jun 18, 2019 at 11:37 PM Robert Haas <[hidden email]> wrote:

>
> On Tue, Jun 18, 2019 at 7:31 AM Amit Kapila <[hidden email]> wrote:
> > [ new patches ]
>
> I tried writing some code that throws an error from an undo log
> handler and the results were not good.  It appears that the code will
> retry in a tight loop:
>
> 2019-06-18 13:58:53.262 EDT [42803] ERROR:  robert_undo
> 2019-06-18 13:58:53.263 EDT [42803] ERROR:  robert_undo
> 2019-06-18 13:58:53.263 EDT [42803] ERROR:  robert_undo
..
>
> It seems clear that the error-handling aspect of this patch has not
> been given enough thought.  It's debatable what strategy should be
> used when undo fails, but retrying 40 times per millisecond isn't the
> right answer.
>

The reason for that is that currently the undo worker keeps on
executing requests as long as there are any.  I think this is good
when there are different requests, but getting the same request from
the error queue and doing it again and again doesn't seem good, and I
don't think it will help either.

> I assume we want some kind of cool-down between retries.
> 10 seconds?  A minute?  Some kind of back-off algorithm that gradually
> increases the retry time up to some maximum?
>

Yeah, something along these lines would be good.  How about if we add
a failure_count to each request in the error queue?  It would get
incremented on each retry and we could wait in proportion to it: say
10s after the first retry, 20s after the second, and so on, with a
maximum of 10 retries (100s) allowed, after which the worker would
exit, considering it has no more work to do.

Actually, we also need to think about what we should do with such
requests, because even if the undo worker exits after retrying some
threshold number of times, the undo launcher will again launch a new
worker for this request unless we have some special handling for it.

We can issue a WARNING once a particular request has reached the
maximum number of retries, but I'm not sure that is enough because the
user might not notice it or might not take any action.  Do we want to
PANIC at some point in time, and if so, when?  Or, as an alternative,
we can keep trying at regular intervals until we succeed.

>  Should there be one or
> more GUCs?
>

Yeah, we can do that, something like undo_apply_error_retry_count, but
I am not completely sure about this; maybe some pre-defined number,
say 10 or 20, would be enough.  However, I am fine with it if you or
others think that a GUC can help users in this case.

> Another thing that is not very nice is that when I tried to shut down
> the server via 'pg_ctl stop' while the above was happening, it did not
> shut down.  I had to use an immediate shutdown.  That's clearly not
> OK.
>

CHECK_FOR_INTERRUPTS is missing at one place, will fix.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: POC: Cleaning up orphaned files using undo logs

Dilip Kumar-2
In reply to this post by Robert Haas
On Wed, Jun 19, 2019 at 2:40 AM Robert Haas <[hidden email]> wrote:

>
> On Tue, Jun 18, 2019 at 2:07 PM Robert Haas <[hidden email]> wrote:
> > On Tue, Jun 18, 2019 at 7:31 AM Amit Kapila <[hidden email]> wrote:
> > > [ new patches ]
> >
> > I tried writing some code that throws an error from an undo log
> > handler and the results were not good.
>
> I discovered another bothersome thing here: if you have a single
> transaction that generates a bunch of undo records, the first one has
> uur_dbid set correctly and the remaining records have uur_dbid set to
> 0.  That means if you try to write a sanity check like if
> (record->uur_dbid != MyDatabaseId) elog(ERROR, "the undo system messed
> up") it fails.
>
> The original idea of UnpackedUndoRecord was this: you would put a
> bunch of data into an UnpackedUndoRecord that you wanted written to
> undo, and the undo system would find ways to compress stuff out of the
> on-disk representation by e.g. omitting the fork number if it's
> MAIN_FORKNUM. Then, when you read an undo record, it would decompress
> so that you ended up with the same UnpackedUndoRecord that you had at
> the beginning. However, the inclusion of transaction headers has made
> this a bit confusing: that stuff isn't being added by the user but by
> the undo system itself. It's not very clear from the comments what the
> contract is around these things: do you need to set uur_dbid to
> MyDatabaseId when preparing to insert an undo record? Or can you just
> leave it unset and then it'll possibly be set at decoding time?  The
> comments for the UnpackedUndoRecord structure don't explain this.
>
> I'd really like to see this draw a cleaner distinction between the
> stuff that the user is expected to set and the other stuff we deal
> with internally to the undo subsystem.  For example, suppose that
> UnpackedUndoRecord didn't include any of the fields that are only
> present in the transaction header.  Maybe there's another structure,
> like UndoTransactionHeader, that includes those fields.  The client of
> the undo subsystem creates a bunch of UnpackedUndoRecords and inserts
> them.  At undo time, the callback gets back an identical set of
> UnpackedUndoRecords.  And maybe it also gets a pointer to the
> UndoTransactionHeader which contains all of the system-generated
> stuff. Under this scheme, uur_xid, uur_xidepoch (which still need to
> be combined into uur_fxid), uur_progress, uur_dbid, uur_next,
> uur_prevlogstart, and uur_prevurp would all move out of the
> UnpackedUndoRecord and into the UndoTransactionHeader. The user would
> supply none of those things when inserting undo records, but the
> rm_undo callback could examine those values if it wished.
>
> A weaker approach would be to at least clean up the structure
> definition so that the transaction-header fields set by the system are
> clearly segregated from the per-record fields set by the
> undo-inserter, with comments explaining that those fields don't need
> to be set but will (or may?) be set at undo time. That would be better
> than what we have right now - because it would hopefully make it much
> more clear which fields need to be set on insert and which fields can
> be expected to be set when decoding - but I think it's probably not
> going far enough.

I think it's a fair point.  We can keep pointers to
UndoRecordTransaction (urec_progress, dbid, uur_next) and
UndoRecordLogSwitch (urec_prevurp, urec_prevlogstart) in
UnpackedUndoRecord and include them whenever an undo record contains
these headers: the transaction header in the first record of the
transaction, and the log-switch header in the first record after an
undo-log switch during a transaction.  IMHO we can keep uur_fxid as
part of the main UnpackedUndoRecord, because as part of the other work
("Compression for undo records to consider rmgrid, xid, cid, reloid
for each record") the FullTransactionId will be present in every
UnpackedUndoRecord (although it will not be stored in every undo
record).


--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: POC: Cleaning up orphaned files using undo logs

Robert Haas
In reply to this post by akapila
On Wed, Jun 19, 2019 at 2:45 AM Amit Kapila <[hidden email]> wrote:
> The reason for the same is that currently, the undo worker keep on
> executing the requests if there are any.  I think this is good when
> there are different requests, but getting the same request from error
> queue and doing it, again and again, doesn't seem to be good and I
> think it will not help either.

Even if there are multiple requests involved, you don't want a tight
loop like this.

> > I assume we want some kind of cool-down between retries.
> > 10 seconds?  A minute?  Some kind of back-off algorithm that gradually
> > increases the retry time up to some maximum?
>
> Yeah, something on these lines would be good.  How about if we add
> failure_count with each request in error queue?   Now, it will get
> incremented on each retry and we can wait in proportion to that, say
> 10s after the first retry, 20s after second and so on and maximum up
> to 10 failure_count (100s) will be allowed after which worker will
> exit considering it has no more work to do.
>
> Actually, we also need to think about what we should with such
> requests because even if undo worker exits after retrying for some
> threshold number of times, undo launcher will again launch a new
> worker for this request unless we have some special handling for the
> same.
>
> We can issue some WARNING once any particular request reached the
> maximum number of retries but not sure if that is enough because the
> user might not notice the same or didn't take any action.  Do we want
> to PANIC at some point of time, if so, when or the other alternative
> is we can try at regular intervals till we succeed?

PANIC is a terrible idea.  How would that fix anything?  You'll very
possibly still have the same problem after restarting, and so you'll
just keep on hitting the PANIC. That will mean that in addition to
whatever problem with undo you already had, you now have a system that
you can't use for anything at all, because it keeps restarting.

The design goal here should be that if undo for a transaction fails,
we keep retrying periodically, but with minimal adverse impact on the
rest of the system.  That means you can't retry in a loop. It also
means that the system needs to provide fairness: that is, it shouldn't
be possible to create a system where one or more transactions for
which undo keeps failing cause other transactions that could have been
undone to get starved.

It seems to me that thinking of this in terms of what the undo worker
does and what the undo launcher does is probably not the right
approach. We need to think of it more as an integrated system. Instead
of storing a failure_count with each request in the error queue, how
about storing a next retry time?  I think the error queue needs to be
ordered by database_id, then by next_retry_time, and then by order of
insertion.  (The last part is important because next_retry_time is
going to be prone to having ties, and we need to break those ties in
the right way.) So, when a per-database worker starts up, it's pulling
from the queues in alternation, ignoring items that are not for the
current database.  When it pulls from the error queue, it looks at the
item for the current database that has the lowest retry time - if
that's still in the future, then it ignores the queue until something
new (perhaps with a lower retry_time) is added, or until the first
next_retry_time arrives.  If the item that it pulls again fails, it
gets inserted back into the error queue but with a higher next retry
time.

This might not be exactly right, but the point is that there should
probably be NO logic that causes a worker to retry the same
transaction immediately afterward, with or without a delay. It should
be all be driven off what gets pulled out of the error queue.  In the
above sketch, if a worker gets to the point where there's nothing in
the error queue for the current database with a timestamp that is <=
the current time, then it can't pull anything else from that queue; if
there's no other work to do, it exits.  If there is other work to do,
it does that and then maybe enough time will have passed to allow
something to be pulled from the error queue, or maybe not.  Meanwhile
some other worker running in the same database might pull the item
before the original worker gets back to it.  Meanwhile if the worker
exits because there's nothing more to do in that database, the
launcher can also see the error queue.  When enough time has passed,
it can notice that there is an item (or items) that could be pulled
from the error queue for that database and launch a worker for that
database if necessary (or else let an existing worker take care of
it).
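
To illustrate the re-queueing part of that sketch (the request struct,
queue function, and constants here are invented; the queue ordering
would be database_id, then next_retry_time, then insertion order as
described above):

#define UNDO_RETRY_MIN_DELAY_MS     (10 * 1000)         /* 10 seconds */
#define UNDO_RETRY_MAX_DELAY_MS     (10 * 60 * 1000)    /* 10 minutes */

static void
requeue_failed_undo_request(RollbackRequest *req)
{
    int     shift = Min(req->retry_count, 6);
    int     delay_ms = Min(UNDO_RETRY_MIN_DELAY_MS << shift,
                           UNDO_RETRY_MAX_DELAY_MS);

    req->retry_count++;
    req->next_retry_time =
        TimestampTzPlusMilliseconds(GetCurrentTimestamp(), delay_ms);

    /*
     * Put the request back in the error queue; workers won't pull it
     * again until next_retry_time has passed.
     */
    error_queue_insert(req);
}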

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: POC: Cleaning up orphaned files using undo logs

Robert Haas
In reply to this post by Robert Haas
On Tue, Jun 18, 2019 at 2:07 PM Robert Haas <[hidden email]> wrote:
> On Tue, Jun 18, 2019 at 7:31 AM Amit Kapila <[hidden email]> wrote:
> > [ new patches ]
>
> I tried writing some code [ to use these patches ].

I spent some more time experimenting with this patch set today and I
think that the UndoFetchRecord interface is far too zheap-centric.  I
expected that I would be able to do this:

UnpackedUndoRecord *uur = UndoFetchRecord(urp);
// do stuff with uur
UndoRecordRelease(uur);

But I can't, because the UndoFetchRecord API requires me to pass not
only an undo record but also a block number, an offset number, an XID,
and a callback.  I think I could get the effect that I want by
defining a callback that always returns true.  Then I could do:

UndoRecPtr junk;
UnpackedUndoRecord *uur = UndoFetchRecord(urp, InvalidBlockNumber,
InvalidOffsetNumber, &junk, always_returns_true);
// do stuff with uur
UndoRecordRelease(uur);

That seems ridiculously baroque.  I think the most common thing that
an AM will want to do with an UndoRecPtr is look up that exact record;
that is, for example, what zedstore will want to do. However, even if
some AMs, like zheap, want to search backward through a chain of
records, there's no real reason to suppose that all of them will want
to search by block number + offset.  They might want to search by some
bit of data buried in the payload, for example.

I think the basic question here is whether we really need anything
more complicated than:

extern UnpackedUndoRecord *UndoFetchRecord(UndoRecPtr urp);

I mean, if you had that, the caller can implement looping easily
enough, and insert any test they want:

for (;;)
{
    UnpackedUndoRecord *uur = UndoFetchRecord(urp);
    if (i like this one)
        break;
    urp = uur->uur_blkprev; // should be renamed, since zedstore +
probably others will have tuple chains not block chains
    UndoRecordRelease(uur);
}

The question in my mind is whether there's some performance advantage
of having the undo layer manage the looping rather than the caller do
it.  If there is, then there's a lot of zheap code that ought to be
changed to use it, because it's just using the same satisfies-callback
everywhere.  If there's not, we should just simplify the undo record
lookup along the lines mentioned above and put all the looping into
the callers.  zheap could provide a wrapper around UndoFetchRecord
that does a search by block and offset, so that we don't have to
repeat that logic in multiple places.

BTW, an actually generic iterator interface would probably look more like this:

typedef bool (*SatisfyUndoRecordCallback)(void *callback_data,
UnpackedUndoRecord *uur);
extern UnpackedUndoRecord *UndoFetchRecord(UndoRecPtr urp, UndoRecPtr
*found, SatisfyUndoRecordCallback callback, void *callback_data);

Now we're not assuming anything about what parts of the record the
callback wants to examine.  It can do whatever it likes.  Typically
with this sort of interface a caller will define a file-private struct
that is known to both the callback and the caller of UndoFetchRecord,
but not elsewhere.
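
For example, with the interface above it might look like this (struct
and function names invented; uur_block/uur_offset are assumed record
fields):

typedef struct ZHeapUndoSearch
{
    BlockNumber  blkno;
    OffsetNumber offnum;
} ZHeapUndoSearch;

static bool
zheap_undo_record_matches(void *callback_data, UnpackedUndoRecord *uur)
{
    ZHeapUndoSearch *search = (ZHeapUndoSearch *) callback_data;

    return uur->uur_block == search->blkno &&
           uur->uur_offset == search->offnum;
}

/* in the caller */
ZHeapUndoSearch search = { blkno, offnum };
UndoRecPtr      found;
UnpackedUndoRecord *uur = UndoFetchRecord(urp, &found,
                                          zheap_undo_record_matches,
                                          &search);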

If we decide we need an iterator within the undo machinery itself,
then I think it should look like the above, and I think it should
accept NULL for found, callback, and callback_data, so that somebody
who wants to just look up a record, full stop, can do just:

UnpackedUndoRecord *uur = UndoFetchRecord(urp, NULL, NULL, NULL);

which seems tolerable.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: POC: Cleaning up orphaned files using undo logs

Robert Haas
In reply to this post by Dilip Kumar-2
On Wed, Jun 19, 2019 at 9:13 AM Dilip Kumar <[hidden email]> wrote:

> I think it's a fair point.   We can keep pointer to
> UndoRecordTransaction(urec_progress, dbid, uur_next) and
> UndoRecordLogSwitch(urec_prevurp, urec_prevlogstart) in
> UnpackedUndoRecord and include them whenever undo record contain these
> headers.  Transaction header in the first record of the transaction
> and log-switch header in the first record after undo-log switch during
> a transaction.  IMHO uur_fxid, we can keep as part of the main
> UnpackedUndoRecord, because as part of the other work  "Compression
> for undo records to consider rmgrid, xid,cid,reloid for each record",
> the FullTransactionId, will be present in every UnpackedUndoRecord
> (although it will not be stored in every undo record).

I agree that fxid needs to be set all the time.

I'm not sure I'm entirely following the rest of what you are saying
here, but let me say again that I don't think UnpackedUndoRecord
should include a bunch of stuff that callers (1) don't need to set
when inserting and (2) can't count on having set when fetching.  Stuff
of that type should be handled in some way that spares clients of the
undo system from having to worry about it.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: POC: Cleaning up orphaned files using undo logs

akapila
In reply to this post by Robert Haas
On Wed, Jun 19, 2019 at 8:25 PM Robert Haas <[hidden email]> wrote:

>
> On Wed, Jun 19, 2019 at 2:45 AM Amit Kapila <[hidden email]> wrote:
> > The reason for the same is that currently, the undo worker keep on
> > executing the requests if there are any.  I think this is good when
> > there are different requests, but getting the same request from error
> > queue and doing it, again and again, doesn't seem to be good and I
> > think it will not help either.
>
> Even if there are multiple requests involved, you don't want a tight
> loop like this.
>

Okay, one reason that comes to mind is that we don't want to choke the
system, as applying undo can consume CPU and generate a lot of I/O.  Is
that what you have in mind, or something else?

I see an advantage in having some sort of throttling here, so we can
have some wait time (say 100ms) between processing requests.  Do we
see any need for a guc here?

On one hand, it seems a good idea to have multiple guc's for tuning
the undo worker machinery, because whatever default values we pick
might not be good for some users.  OTOH, giving too many guc's can
also make the system difficult to understand and can confuse users, or
at the least they won't know exactly how to use them.  It seems to me
that we should first complete the entire patch and then decide which
things need a separate guc.

> > > I assume we want some kind of cool-down between retries.
> > > 10 seconds?  A minute?  Some kind of back-off algorithm that gradually
> > > increases the retry time up to some maximum?
> >
> > Yeah, something on these lines would be good.  How about if we add
> > failure_count with each request in error queue?   Now, it will get
> > incremented on each retry and we can wait in proportion to that, say
> > 10s after the first retry, 20s after second and so on and maximum up
> > to 10 failure_count (100s) will be allowed after which worker will
> > exit considering it has no more work to do.
> >
> > Actually, we also need to think about what we should with such
> > requests because even if undo worker exits after retrying for some
> > threshold number of times, undo launcher will again launch a new
> > worker for this request unless we have some special handling for the
> > same.
> >
> > We can issue some WARNING once any particular request reached the
> > maximum number of retries but not sure if that is enough because the
> > user might not notice the same or didn't take any action.  Do we want
> > to PANIC at some point of time, if so, when or the other alternative
> > is we can try at regular intervals till we succeed?
>
> PANIC is a terrible idea.  How would that fix anything?  You'll very
> possibly still have the same problem after restarting, and so you'll
> just keep on hitting the PANIC. That will mean that in addition to
> whatever problem with undo you already had, you now have a system that
> you can't use for anything at all, because it keeps restarting.
>
> The design goal here should be that if undo for a transaction fails,
> we keep retrying periodically, but with minimal adverse impact on the
> rest of the system.  That means you can't retry in a loop. It also
> means that the system needs to provide fairness: that is, it shouldn't
> be possible to create a system where one or more transactions for
> which undo keeps failing cause other transactions that could have been
> undone to get starved.
>

Agreed.

> It seems to me that thinking of this in terms of what the undo worker
> does and what the undo launcher does is probably not the right
> approach. We need to think of it more as an integrated system. Instead
> of storing a failure_count with each request in the error queue, how
> about storing a next retry time?
>

I think both failure_count and next_retry_time can work in a similar way.

I think incrementing the next retry time in multiples will be a bit
tricky.  Say the first error occurs at X hours.  We can say that
next_retry_time will be X+10s=Y and error_occurred_at will be X.  When
it fails a second time, how will we know that we need to set
next_retry_time to Y+20s?  Maybe we can compute Y-X, add 10s to it,
and add the result to the current time.  Now whenever the worker or
launcher finds this request, it can check whether the current_time is
greater than or equal to next_retry_time; if so, it can pick that
request, otherwise it checks the request in the next queue.

The failure_count can also work in a somewhat similar fashion.
Basically, we can use error_occurred_at and failure_count to compute
the required time.  So, if the error occurred at, say, X hours and the
failure count is 3, then we can check whether current_time is greater
than X+(3 * 10s); if so, we will allow the entry to be processed,
otherwise it will check other queues for work.
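As a rough sketch (the function and variable names here are just for
illustration, not from the patch), the failure_count variant boils
down to a check like this when deciding whether to pick an entry from
the error queue:

static bool
error_request_is_due(TimestampTz error_occurred_at, int failure_count,
                     TimestampTz current_time)
{
    /*
     * Allow processing once current_time >= error_occurred_at +
     * failure_count * 10s; otherwise the caller moves on to the
     * other queues.
     */
    return current_time >=
        TimestampTzPlusMilliseconds(error_occurred_at,
                                    failure_count * 10 * 1000L);
}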

>  I think the error queue needs to be
> ordered by database_id, then by next_retry_time, and then by order of
> insertion.  (The last part is important because next_retry_time is
> going to be prone to having ties, and we need to break those ties in
> the right way.)
>

I think it makes sense to order requests by next_retry_time,
error_occurred_at (this will ensure the order of insertion).  However,
I am not sure if there is a need to club the requests w.r.t database
id.  It can starve the error requests from other databases.  Moreover,
we already have a functionality wherein if the undo worker doesn't
encounter the next request from the same database on which it is
operating for a certain amount of time, then it will peek ahead (few
entries) in each queue to get the request for the same database.  We
don't sort by db_id in other queues as well, so it will be consistent
for this queue if we just sort by next_retry_time and
error_occurred_at.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: POC: Cleaning up orphaned files using undo logs

Dilip Kumar-2
In reply to this post by Robert Haas
On Wed, Jun 19, 2019 at 11:35 PM Robert Haas <[hidden email]> wrote:

>
> On Tue, Jun 18, 2019 at 2:07 PM Robert Haas <[hidden email]> wrote:
> > On Tue, Jun 18, 2019 at 7:31 AM Amit Kapila <[hidden email]> wrote:
> > > [ new patches ]
> >
> > I tried writing some code [ to use these patches ].
>
> I spent some more time experimenting with this patch set today and I
> think that the UndoFetchRecord interface is far too zheap-centric.  I
> expected that I would be able to do this:
>
> UnpackedUndoRecord *uur = UndoFetchRecord(urp);
> // do stuff with uur
> UndoRecordRelease(uur);
>
> But I can't, because the UndoFetchRecord API requires me to pass not
> only an undo record but also a block number, an offset number, an XID,
> and a callback.  I think I could get the effect that I want by
> defining a callback that always returns true.  Then I could do:
>
> UndoRecPtr junk;
> UnpackedUndoRecord *uur = UndoFetchRecord(urp, InvalidBlockNumber,
> InvalidOffsetNumber, &junk, always_returns_true);
> // do stuff with uur
> UndoRecordRelease(uur);
>
> That seems ridiculously baroque.  I think the most common thing that
> an AM will want to do with an UndoRecPtr is look up that exact record;
> that is, for example, what zedstore will want to do. However, even if
> some AMs, like zheap, want to search backward through a chain of
> records, there's no real reason to suppose that all of them will want
> to search by block number + offset.  They might want to search by some
> bit of data buried in the payload, for example.
>
> I think the basic question here is whether we really need anything
> more complicated than:
>
> extern UnpackedUndoRecord *UndoFetchRecord(UndoRecPtr urp);
>
> I mean, if you had that, the caller can implement looping easily
> enough, and insert any test they want:
>
> for (;;)
> {
>     UnpackedUndoRecord *uur = UndoFetchRecord(urp);
>     if (i like this one)
>         break;
>     urp = uur->uur_blkprev; // should be renamed, since zedstore +
> probably others will have tuple chains not block chains
>     UndoRecordRelease(uur);
> }

The idea behind having the loop inside the undo machinery was that
while traversing the blkprev chain, we can read all the undo records
on the same undo page under one buffer lock.

>
> The question in my mind is whether there's some performance advantage
> of having the undo layer manage the looping rather than the caller do
> it.  If there is, then there's a lot of zheap code that ought to be
> changed to use it, because it's just using the same satisfies-callback
> everywhere.  If there's not, we should just simplify the undo record
> lookup along the lines mentioned above and put all the looping into
> the callers.  zheap could provide a wrapper around UndoFetchRecord
> that does a search by block and offset, so that we don't have to
> repeat that logic in multiple places.
>
> BTW, an actually generic iterator interface would probably look more like this:
>
> typedef bool (*SatisfyUndoRecordCallback)(void *callback_data,
> UnpackedUndoRecord *uur);
Right, it should be this way.

> extern UnpackedUndoRecord *UndoFetchRecord(UndoRecPtr urp, UndoRecPtr
> *found, SatisfyUndoRecordCallback callback, void *callback_data);
>
> Now we're not assuming anything about what parts of the record the
> callback wants to examine.  It can do whatever it likes.  Typically
> with this sort of interface a caller will define a file-private struct
> that is known to both the callback and the caller of UndoFetchRecord,
> but not elsewhere.
>
> If we decide we need an iterator within the undo machinery itself,
> then I think it should look like the above, and I think it should
> accept NULL for found, callback, and callback_data, so that somebody
> who wants to just look up a record, full stop, can do just:
>
> UnpackedUndoRecord *uur = UndoFetchRecord(urp, NULL, NULL, NULL);
>
> which seems tolerable.
>
I agree with this.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: POC: Cleaning up orphaned files using undo logs

akapila
On Thu, Jun 20, 2019 at 2:24 PM Dilip Kumar <[hidden email]> wrote:

>
> On Wed, Jun 19, 2019 at 11:35 PM Robert Haas <[hidden email]> wrote:
> >
> > On Tue, Jun 18, 2019 at 2:07 PM Robert Haas <[hidden email]> wrote:
> > > On Tue, Jun 18, 2019 at 7:31 AM Amit Kapila <[hidden email]> wrote:
> > > > [ new patches ]
> > >
> > > I tried writing some code [ to use these patches ].
> >
> > I spent some more time experimenting with this patch set today and I
> > think that the UndoFetchRecord interface is far too zheap-centric.  I
> > expected that I would be able to do this:
> >
> > UnpackedUndoRecord *uur = UndoFetchRecord(urp);
> > // do stuff with uur
> > UndoRecordRelease(uur);
> >
> > But I can't, because the UndoFetchRecord API requires me to pass not
> > only an undo record but also a block number, an offset number, an XID,
> > and a callback.  I think I could get the effect that I want by
> > defining a callback that always returns true.  Then I could do:
> >
> > UndoRecPtr junk;
> > UnpackedUndoRecord *uur = UndoFetchRecord(urp, InvalidBlockNumber,
> > InvalidOffsetNumber, &junk, always_returns_true);
> > // do stuff with uur
> > UndoRecordRelease(uur);
> >
> > That seems ridiculously baroque.  I think the most common thing that
> > an AM will want to do with an UndoRecPtr is look up that exact record;
> > that is, for example, what zedstore will want to do. However, even if
> > some AMs, like zheap, want to search backward through a chain of
> > records, there's no real reason to suppose that all of them will want
> > to search by block number + offset.  They might want to search by some
> > bit of data buried in the payload, for example.
> >
> > I think the basic question here is whether we really need anything
> > more complicated than:
> >
> > extern UnpackedUndoRecord *UndoFetchRecord(UndoRecPtr urp);
> >
> > I mean, if you had that, the caller can implement looping easily
> > enough, and insert any test they want:
> >
> > for (;;)
> > {
> >     UnpackedUndoRecord *uur = UndoFetchRecord(urp);
> >     if (i like this one)
> >         break;
> >     urp = uur->uur_blkprev; // should be renamed, since zedstore +
> > probably others will have tuple chains not block chains
> >     UndoRecordRelease(uur);
> > }
>
> The idea behind having the loop inside the undo machinery was that
> while traversing the blkprev chain, we can read all the undo records
> on the same undo page under one buffer lock.
>

I think if we want, we can hold this buffer and allow it to be
released in UndoRecordRelease.  However, this buffer needs to be
stored in some common structure which can then be handed over to
UndoRecordRelease.  Another thing is that as of now the API allocates
the memory just once for UnpackedUndoRecord, whereas in the new scheme
it needs to be allocated again and again.  I think this is a
relatively minor thing, but it might be better if we can avoid
palloc'ing again and again.

BTW, while looking at the code of UndoFetchRecord, I see a problem.
There is a coding pattern like

if ()
{
}
else
{
    LWLockAcquire()
    ..
    ..
}

LWLockRelease()

I think this is not correct: the lock is acquired only in the else
branch, but it is released unconditionally afterwards.
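To illustrate the shape of the problem (this is just the pattern, not
the actual UndoFetchRecord code), the release has to live on the same
path that acquired the lock, something like

if (fast_path_possible)
{
    /* no lock taken on this path, so nothing to release */
}
else
{
    LWLockAcquire(lock, LW_SHARED);
    /* ... read the shared state ... */
    LWLockRelease(lock);    /* released on the path that acquired it */
}

otherwise the fast path ends up releasing a lock it never acquired.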

> >
> > BTW, an actually generic iterator interface would probably look more like this:
> >
> > typedef bool (*SatisfyUndoRecordCallback)(void *callback_data,
> > UnpackedUndoRecord *uur);
> Right, it should be this way.
>
> > extern UnpackedUndoRecord *UndoFetchRecord(UndoRecPtr urp, UndoRecPtr
> > *found, SatisfyUndoRecordCallback callback, void *callback_data);
> >
> > Now we're not assuming anything about what parts of the record the
> > callback wants to examine.  It can do whatever it likes.  Typically
> > with this sort of interface a caller will define a file-private struct
> > that is known to both the callback and the caller of UndoFetchRecord,
> > but not elsewhere.
> >
> > If we decide we need an iterator within the undo machinery itself,
> > then I think it should look like the above, and I think it should
> > accept NULL for found, callback, and callback_data, so that somebody
> > who wants to just look up a record, full stop, can do just:
> >
> > UnpackedUndoRecord *uur = UndoFetchRecord(urp, NULL, NULL, NULL);
> >
> > which seems tolerable.
> >
> I agree with this.
>

+1.


--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: POC: Cleaning up orphaned files using undo logs

akapila
In reply to this post by Robert Haas
On Wed, Jun 19, 2019 at 11:35 PM Robert Haas <[hidden email]> wrote:

>
> On Tue, Jun 18, 2019 at 2:07 PM Robert Haas <[hidden email]> wrote:
> > On Tue, Jun 18, 2019 at 7:31 AM Amit Kapila <[hidden email]> wrote:
> > > [ new patches ]
> >
> > I tried writing some code [ to use these patches ].
>
>
> for (;;)
> {
>     UnpackedUndoRecord *uur = UndoFetchRecord(urp);
>     if (i like this one)
>         break;
>     urp = uur->uur_blkprev; // should be renamed, since zedstore +
> probably others will have tuple chains not block chains
..

+1 for renaming this variable.  How about uur_prev_ver or uur_prevver
or uur_verprev?  Any other suggestions?

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: POC: Cleaning up orphaned files using undo logs

Robert Haas
In reply to this post by akapila
On Thu, Jun 20, 2019 at 2:42 AM Amit Kapila <[hidden email]> wrote:
> Okay, one reason that comes to mind is we don't want to choke the
> system as applying undo can consume CPU and generate a lot of I/O.  Is
> that you have in mind or something else?

Yeah, mainly that, but also things like log spam, and even pressure on
the lock table.  If we are trying over and over again to take useless
locks, it can affect other things on the system. The main thing,
however, is the CPU and I/O consumption.

> I see an advantage in having some sort of throttling here, so we can
> have some wait time (say 100ms) between processing requests.  Do we
> see any need of guc here?

I don't think that is the right approach. As I said in my previous
reply, we need a way of holding off the retry of the same error for a
certain amount of time, probably measured in seconds or tens of
seconds. Introducing a delay before processing every request is an
inferior alternative: if there are a lot of rollbacks, it can cause
the system to lag; and in the case where there's just one rollback
that's failing, it will still be way too much log spam (and probably
CPU time too).  Nobody wants 10 failure messages per second in the
log.

> > It seems to me that thinking of this in terms of what the undo worker
> > does and what the undo launcher does is probably not the right
> > approach. We need to think of it more as an integrated system. Instead
> > of storing a failure_count with each request in the error queue, how
> > about storing a next retry time?
>
> I think both failure_count and next_retry_time can work in a similar way.
>
> I think incrementing next retry time in multiples will be a bit
> tricky.  Say first-time error occurs at X hours.  We can say that
> next_retry_time will X+10s=Y and error_occured_at will be X.  The
> second time it again failed, how will we know that we need set
> next_retry_time as Y+20s, maybe we can do something like Y-X and then
> add 10s to it and add the result to the current time.  Now whenever
> the worker or launcher finds this request, they can check if the
> current_time is greater than or equal to next_retry_time, if so they
> can pick that request, otherwise, they check request in next queue.
>
> The failure_count can also work in a somewhat similar fashion.
> Basically, we can use error_occurred at and failure_count to compute
> the required time.  So, if error is occurred at say X hours and
> failure count is 3, then we can check if current_time is greater than
> X+(3 * 10s), then we will allow the entry to be processed, otherwise,
> it will check other queues for work.

Meh.  Don't get stuck on one particular method of calculating the next
retry time.  We want to be able to change that easily if whatever we
try first doesn't work out well.  I am not convinced that we need
anything more complex than a fixed retry time, probably controlled by
a GUC (undo_failure_retry_time = 10s?).  An escalating time between
retries would be important and advantageous if we expected the sizes
of these queues to grow into the millions, but the current design
seems to be contemplating something more in the tens-of-thousands
range and I am not sure we're going to need it at that level.  We
should try simple things first and then see where we need to make it
more complex.

At some basic level, the queue needs to be ordered by increasing retry
time.  You can do that with your design, but you have to recompute the
next retry time from the error_occurred_at and failure_count values
every time you examine an entry.  It's almost certainly better to
store the next_retry_time explicitly.  That way, if for example we
change the logic for computing the next_retry_time to something really
complicated, it doesn't have any effect on the code that keeps the
queue in order -- it just looks at the computed value.  If we end up
with something very simple, like error_occurred_at + constant, it may
end up seeming a little silly, but I think that's a price well worth
paying for code maintainability.  If we end up with error_occurred_at
+ Min(failure_count * 10, 100) or something of that sort, then we can
also store failure_count in each record, but it will just be part of
the payload, not the sort key, so adding it or removing it won't
affect the code that maintains the queue ordering.
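To make that split concrete (all names here are invented for
illustration), something like this keeps the policy in one place and
leaves the queue ordering oblivious to it:

typedef struct UndoErrorQueueEntry
{
    Oid         dbid;
    TimestampTz next_retry_time;    /* sort key for the error queue */
    TimestampTz error_occurred_at;  /* payload only */
    int         failure_count;      /* payload only */
} UndoErrorQueueEntry;

static TimestampTz
compute_next_retry_time(TimestampTz error_occurred_at, int failure_count)
{
    int     delay_secs = Min(failure_count * 10, 100);

    return TimestampTzPlusMilliseconds(error_occurred_at,
                                       delay_secs * 1000L);
}

Whether compute_next_retry_time() stays that simple or grows more
complicated, the code that keeps the queue sorted only ever looks at
next_retry_time.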

> >  I think the error queue needs to be
> > ordered by database_id, then by next_retry_time, and then by order of
> > insertion.  (The last part is important because next_retry_time is
> > going to be prone to having ties, and we need to break those ties in
> > the right way.)
>
> I think it makes sense to order requests by next_retry_time,
> error_occurred_at (this will ensure the order of insertion).  However,
> I am not sure if there is a need to club the requests w.r.t database
> id.  It can starve the error requests from other databases.  Moreover,
> we already have a functionality wherein if the undo worker doesn't
> encounter the next request from the same database on which it is
> operating for a certain amount of time, then it will peek ahead (few
> entries) in each queue to get the request for the same database.  We
> don't sort by db_id in other queues as well, so it will be consistent
> for this queue if we just sort by next_retry_time and
> error_occurred_at.

You're misunderstanding my point. We certainly do not wish to always
pick the request from the database with the lowest OID, or anything
like that.  However, we do need a worker for a particular database to
find the work pending for that database efficiently.  Peeking ahead a
few requests is a version of that, but I'm not sure it's going to be
good enough.  Suppose we look ahead 3 requests but there are 10
databases. Then, if all 10 databases have requests pending, it is
likely that we won't find the next request for our particular database
even though it exists -- the first 3 may easily be for all other
databases.  If you look ahead more requests, that doesn't really fix
it - it just means you need more databases for the problem to become
likely.  And note that this problem happens even if every database
contains a worker.  Some of those workers will erroneously think that
they should exit.

I'm not sure exactly what to do about this.  My first thought was that
for all of the queues we might need to have a queue per database (or
something equivalent) rather than just one big queue.  But that has
problems too: it will mean that a database worker will never exit as
long as there is any work at all to be done in that database, even if
some other database is getting starved. Somehow we need to balance the
efficiency of having a worker for a particular database process many
requests before exiting against the need to ensure fairness across
databases, and it doesn't sound to me like we quite know what exactly
we ought to do there.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

