POC: Cleaning up orphaned files using undo logs


POC: Cleaning up orphaned files using undo logs

Thomas Munro-3
Hello hackers,

The following sequence creates an orphaned file:

BEGIN;
CREATE TABLE t ();
<kill -9 this backend>

Occasionally there are reports of systems that have managed to produce
a lot of them, perhaps through ENOSPC-induced panics, OOM signals or
buggy/crashing extensions etc.  The most recent example I found in the
archives involved 1.7TB of unexpected files and some careful cleanup
work.

Relation files are created eagerly, and rollback is handled by pushing
PendingRelDelete objects onto the pendingDeletes list, to be discarded
on commit or processed on abort.  That's effectively a kind of
specialised undo log, but it's in memory only, so it's less persistent
than the effects it is supposed to undo.
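To make the gap concrete, here is a tiny Python sketch (hypothetical names, not PostgreSQL code) of that in-memory scheme; a kill -9 wipes the list, but not the file it was supposed to clean up:

```python
# Hypothetical sketch of the pendingDeletes pattern: the cleanup
# instruction ("unlink this file on abort") lives only in memory.
class Backend:
    def __init__(self):
        self.pending_deletes = []   # in-memory only
        self.files = set()          # stands in for the filesystem

    def create_relation(self, relfilenode):
        self.files.add(relfilenode)               # file created eagerly
        self.pending_deletes.append(relfilenode)  # undo info, memory only

    def commit(self):
        self.pending_deletes.clear()  # discard; keep the files

    def abort(self):
        for f in self.pending_deletes:
            self.files.discard(f)     # roll back: unlink the files
        self.pending_deletes.clear()

    def kill_minus_9(self):
        self.pending_deletes.clear()  # memory gone; files remain

b = Backend()
b.create_relation("base/12655/24591")
b.kill_minus_9()
assert "base/12655/24591" in b.files  # orphaned: file exists, undo info lost
```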

Here's a proof-of-concept patch that plugs the gap using the undo log
technology we're developing as part of the zHeap project.  Note that
zHeap is not involved here: the SMGR module is acting as a direct
client of the undo machinery.  Example:

postgres=# begin;
BEGIN
postgres=# create table t1 ();
CREATE TABLE
postgres=# create table t2 ();
CREATE TABLE

... now we can see that this transaction has some undo data (discard < insert):

postgres=# select logno, discard, insert, xid, pid from pg_stat_undo_logs;
 logno |     discard      |      insert      | xid |  pid
-------+------------------+------------------+-----+-------
     0 | 00000000000021EF | 0000000000002241 | 581 | 18454
(1 row)

... and, if the test_undorecord module is installed, we can inspect
the records it holds:

postgres=# call dump_undo_records(0);
NOTICE:  0000000000002224: Storage: CREATE dbid=12655, tsid=1663, relfile=24594
NOTICE:  00000000000021EF: Storage: CREATE dbid=12655, tsid=1663, relfile=24591
CALL

If we COMMIT, the undo data is discarded by advancing the discard
pointer (tail) to match the insert pointer (head).  If we ROLLBACK,
either explicitly or automatically by crashing and recovering, then
the files will be unlinked and the insert pointer will be rewound;
either way the undo log eventually finishes up "empty" again (discard
== insert).  This is done with a system of per-rmgr-ID record types
and callbacks, similar to redo.  The rollback actions are either
executed immediately or offloaded to an undo worker process, depending
on simple heuristics.
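For illustration only, the pointer movement can be sketched like this (plain Python; the real machinery is the C undo log code, and the 41-byte record size is made up to match the example numbers above):

```python
# Toy model of one undo log's discard (tail) and insert (head)
# pointers; "empty" means discard == insert.
class UndoLog:
    def __init__(self, start=0):
        self.discard = start   # tail: data before this is discarded
        self.insert = start    # head: next record is written here

    def push(self, nbytes):
        ptr = self.insert
        self.insert += nbytes
        return ptr

    def commit(self):
        # COMMIT: discard the undo data by advancing the tail
        self.discard = self.insert

    def rollback(self, execute_actions):
        # ROLLBACK (explicit, or after crash recovery): execute the
        # pending actions, e.g. unlink files, then rewind the head
        execute_actions(self.discard, self.insert)
        self.insert = self.discard

    def is_empty(self):
        return self.discard == self.insert

# mirrors the pg_stat_undo_logs example: two records, then COMMIT
log = UndoLog(0x21EF)
log.push(41)                 # assuming two 41-byte records, for show
log.push(41)
assert log.insert == 0x2241  # discard < insert: undo data pending
log.commit()
assert log.is_empty()
```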

Of course this isn't free, and the current patch makes table creation
slower.  The goal is to make sure that there is no scenario (kill -9,
power cut etc) in which there can be a new relation file on disk, but
not a corresponding undo record that would unlink that file if the
transaction later has to roll back.  Currently, that means that we
need to flush the WAL record that will create the undo record that
will unlink the file *before* we create the relation file.  I suspect
that could be mitigated quite easily, by deferring file creation in a
backend-local queue until forced by access or commit.  I didn't try to
do that in this basic version.
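The invariant is just an ordering rule; here is a minimal sketch of it (hypothetical helper names, Python standing in for the C code), in which a crash at any point leaves either no file, or a file plus a durable record that would unlink it:

```python
import os, tempfile

wal = []  # stands in for the WAL; "flushed" entries are durable

def flush_wal():
    for rec in wal:
        rec["durable"] = True

def create_relation_file(path):
    # 1. Insert the undo record that would unlink the file on abort.
    wal.append({"type": "UNDO_CREATE", "path": path, "durable": False})
    # 2. Flush WAL *before* touching the filesystem, so a crash at any
    #    later point still finds a durable "unlink me on abort" record.
    flush_wal()
    # 3. Only now create the file itself.
    open(path, "wb").close()

path = os.path.join(tempfile.mkdtemp(), "24591")
create_relation_file(path)
assert os.path.exists(path) and wal[0]["durable"]
```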

There are probably other ways to solve the specific problem of
orphaned files, but this approach is built on a general reusable
facility and I think it is a nice way to show the undo concepts, and
how they are separate from zheap.  Specifically, it demonstrates the
more traditional of the two uses for undo logs: a reliable way to
track actions that must be performed on rollback.  (The other use is:
seeing past versions of data, for vacuumless MVCC; that's a topic for
later).

Patches 0001-0006 are development snapshots of material posted on
other threads already[1][2], hacked around by me to make this possible
(see those threads for further developments in those patches including
some major strengthening work, coming soon).  The subject of this
thread is 0007, the core of which is just a couple of hundred lines
written by me, based on an idea from Robert Haas.

Personally I think it'd be a good feature to get into PostgreSQL 12,
and I will add it to the CF that is about to start to seek feedback.
It passes make check on Unix and Windows, though currently it's
failing some of the TAP tests for reasons I'm looking into (possibly
due to bugs in the lower level patches, not sure).

Thanks for reading,

[1] https://www.postgresql.org/message-id/flat/CAEepm=2EqROYJ_xYz4v5kfr4b0qw_Lq_6Pe8RTEC8rx3upWsSQ@...
[2] https://www.postgresql.org/message-id/flat/CAFiTN-sYQ8r8ANjWFYkXVfNxgXyLRfvbX9Ee4SxO9ns-OBBgVA@...

--
Thomas Munro
http://www.enterprisedb.com

undo-smgr-v1.tgz (152K)

Re: POC: Cleaning up orphaned files using undo logs

Robert Haas
While I've been involved in the design discussions for this patch set,
I haven't looked at any of the code personally in a very long time.  I
certainly don't claim to be an independent reviewer, and I encourage
others to review this work also.  That said, here are some review
comments.

I decided to start with 0005, as that has the user-facing
documentation for this feature. There is a spurious whitespace-only
hunk in monitoring.sgml.

+     <entry>Process ID of the backend currently attached to this undo log
+      for writing.</entry>

or NULL/0/something if none?

+   each undo log that exists.  Undo logs are extents within a contiguous
+   addressing space that have their own head and tail pointers.

This sentence seems to me to have so little detail that it's not going
to help anyone, and it also seems somewhat out-of-place here.  I think
it would be better to link to the longer explanation in the new
storage section instead.

+   Each backend that has written undo data is associated with one or more undo

extra space

+<para>
+Undo logs hold data that is used for rolling back and for implementing
+MVCC in access managers that are undo-aware (currently "zheap").  The storage
+format of undo logs is optimized for reusing existing files.
+</para>

I think the mention of zheap should be removed here since the hope is
that the undo stuff can be committed independently of and prior to
zheap.

I think you mean access methods, not access managers.  I suggest
making that an xref.

Maybe add a little more detail, e.g.

Undo logs provide a place for access methods to store data that can be
used to perform necessary cleanup operations after a transaction
abort.  The data will be retained after a transaction abort until the
access method successfully performs the required cleanup operations.
After a transaction commit, undo data will be retained until the
transaction is all-visible.  This makes it possible for access
methods to use undo data to implement MVCC.  Since in most cases undo
data is discarded very quickly, the undo system has been optimized to
minimize writes to disk and to reuse existing files efficiently.

+<para>
+Undo data exists in a 64 bit address space broken up into numbered undo logs
+that represent 1TB extents, for efficient management.  The space is further
+broken up into 1MB segment files, for physical storage.  The name of each file
+is the address of of the first byte in the file, with a period inserted after
+the part that indicates the undo log number.
+</para>

I cannot read this section and know what an undo filename is going to
look like.  Also, the remarks about efficient management seem like
they might be unclear to someone not already familiar with how this works.
Maybe something like:

Undo data exists in a 64-bit address space divided into 2^24 undo
logs, each with a theoretical capacity of 1TB.  The first time a
backend writes undo, it attaches to an existing undo log whose
capacity is not yet exhausted and which is not currently being used by
any other backend; or if no suitable undo log already exists, it
creates a new one.  To avoid wasting space, each undo log is further
divided into 1MB segment files, so that segments which are no longer
needed can be removed (possibly recycling the underlying file by
renaming it) and segments which are not yet needed do not need to be
physically created on disk.  An undo segment file has a name like
<example>, where <thing> is the undo log number and <thang> is the
segment number.

I think it's good to spell out the part about attaching to undo logs
here, because when people look at pg_undo, the number of files will be
roughly proportional to the number of backends, and we should try to
help them understand - at least in general terms - why that happens.
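For what it's worth, the layout can be sketched in a few lines of Python.  The 40-bit/24-bit split and the exact filename field widths here are my assumptions, chosen only to be consistent with a 64-bit space of 1TB logs and with the "address with a period after the log number" description:

```python
# Assumed layout: low 40 bits address within a 1TB undo log, leaving
# 24 bits of log number; 1MB segment files.  Field widths hypothetical.
OFFSET_BITS = 40        # 1TB per undo log
SEGMENT_SIZE = 1 << 20  # 1MB segment files

def undo_ptr_parts(ptr):
    logno = ptr >> OFFSET_BITS
    offset = ptr & ((1 << OFFSET_BITS) - 1)
    return logno, offset

def segment_file_name(ptr):
    # The address of the first byte in the file, with a period
    # inserted after the log-number part.
    logno, offset = undo_ptr_parts(ptr)
    seg_start = offset - (offset % SEGMENT_SIZE)
    return "%06X.%010X" % (logno, seg_start)

assert undo_ptr_parts((3 << OFFSET_BITS) | 0x21EF) == (3, 0x21EF)
assert segment_file_name(0x21EF) == "000000.0000000000"
```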

+<para>
+Just as relations can have one of the three persistence levels permanent,
+unlogged or temporary, the undo data that is generated by modifying them must
+be stored in an undo log of the same persistence level.  This enables the
+undo data to be discarded at appropriate times along with the relations that
+reference it.
+</para>

This is not quite general, because we're not necessarily talking about
modifications to the files.  In fact, in this POC, we're explicitly
talking about the cleanup of the files themselves.  Also, it's not
technically correct to say that the persistence level has to match.
You could put everything in permanent undo logs.  It would just suck.

Moving on to 0003, the developer documentation:

+The undo log subsystem provides a way to store data that is needed for
+a limited time.  Undo data is generated whenever zheap relations are
+modified, but it is only useful until (1) the generating transaction
+is committed or rolled back and (2) there is no snapshot that might
+need it for MVCC purposes.  See src/backend/access/zheap/README for
+more information on zheap.  The undo log subsystem is concerned with

Again, I think this should be rewritten to make it independent of
zheap.  We hope that this facility is not only usable by but will
actually be used by other AMs.

+their location within a 64 bit address space.  Unlike redo data, the
+addressing space is internally divided up unto multiple numbered logs.

Except it's not totally unlike; cf. the log and seg arguments to
XLogFileNameById.  The xlog division is largely a historical accident
of having to support systems with 32-bit arithmetic and has minimal
consequences in practice, and it's a lot less noticeable now than it
used to be, but it does still kinda exist.  I would try to sharpen
this wording a bit to de-emphasize the contrast over whether a log/seg
distinction exists and instead just contrast multiple insertion points
vs. a single one.

+level code (zheap) is largely oblivious to this internal structure and

Another zheap reference.

+eviction provoked by memory pressure, then no disk IO is generated.

I/O?

+Keeping the undo data physically separate from redo data and accessing
+it though the existing shared buffers mechanism allows it to be
+accessed efficiently for MVCC purposes.

And also non-MVCC purposes.  I mean, it's not very feasible to do
post-abort cleanup driven solely off the WAL, because the WAL segments
might've been archived or recycled and there's no easy way to access
the bits we want.  Saying this is for MVCC purposes specifically seems
misleading.

+shared memory and can be inspected in the pg_stat_undo_logs view.  For

Replace "in" with "via" or "through" or something?

+shared memory and can be inspected in the pg_stat_undo_logs view.  For
+each undo log, a set of properties called the undo log's meta-data are
+tracked:

"called the undo log's meta-data" seems a bit awkward.

+* the "discard" pointer; data before this point has been discarded
+* the "insert" pointer: new data will be written here
+* the "end" pointer: a new undo segment file will be needed at this point

why ; for the first and : for the others?

+The three pointers discard, insert and end move strictly forwards
+until the whole undo log has been exhausted.  At all times discard <=
+insert <= end.  When discard == insert, the undo log is empty

I think you should either remove "discard, insert and end" from this
sentence, relying on people to remember the list they just read, or
else punctuate it like this: The three pointers -- discard, insert,
and end -- move...

+logs are held in a fixed-sized pool in shared memory.  The size of
+the array is a multiple of max_connections, and limits the total size of
+transactions.

I think you should elaborate on "limits the total size of transactions."

+The meta-data for all undo logs is written to disk at every
+checkpoint.  It is stored in files under PGDATA/pg_undo/, using the

Even unlogged and temporary undo logs?

+level of the relation being modified and the current value of the GUC

Suggest: the corresponding relation

+suitable undo log must be either found or created.  The system should
+stabilize on one undo log per active writing backend (or more if
+different tablespaces are persistence levels are used).

Won't edge effects drive the number up considerably?

+and they cannot be accessed by other backend including undo workers.

Grammar.  Also, begs the question "so how does this work if the undo
workers are frozen out?"

+Responsibility for WAL-logging the contents of the undo log lies with
+client code (ie zheap).  While undolog.c WAL-logs all meta-data

Another zheap reference.

+hard coded to use md.c unconditionally, PostgreSQL 12 routes IO for the undo

Suggest I/O rather than IO.

I'll see if I can find time to actually review some of this code at
some point.  Regarding 0006, I can't help but notice that it is
completely devoid of documentation and README updates, which will not
do.  Regarding 0007, that's an impressively small patch.

...Robert


Re: POC: Cleaning up orphaned files using undo logs

Kuntal Ghosh
In reply to this post by Thomas Munro-3
Hello Thomas,

On Thu, Nov 1, 2018 at 8:53 AM Thomas Munro
<[hidden email]> wrote:
>
> It passes make check on Unix and Windows, though currently it's
> failing some of the TAP tests for reasons I'm looking into (possibly
> due to bugs in the lower level patches, not sure).
>
I looked into the regression failures when the tap-tests are enabled.
It seems that we're not estimating and allocating the shared memory
for rollback-hash tables correctly. I've added a patch to fix the
same.


--
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com

0001-Fix-shared-memory-size-for-rollback-hash-table.patch (1K)

Re: POC: Cleaning up orphaned files using undo logs

Dilip Kumar-2
On Mon, Nov 5, 2018 at 5:13 PM Kuntal Ghosh <[hidden email]> wrote:

>
> Hello Thomas,
>
> On Thu, Nov 1, 2018 at 8:53 AM Thomas Munro
> <[hidden email]> wrote:
> >
> > It passes make check on Unix and Windows, though currently it's
> > failing some of the TAP tests for reasons I'm looking into (possibly
> > due to bugs in the lower level patches, not sure).
> >
> I looked into the regression failures when the tap-tests are enabled.
> It seems that we're not estimating and allocating the shared memory
> for rollback-hash tables correctly. I've added a patch to fix the
> same.
>

I have included your fix in the latest version of the undo-worker patch[1]

[1] https://www.postgresql.org/message-id/flat/CAFiTN-sYQ8r8ANjWFYkXVfNxgXyLRfvbX9Ee4SxO9ns-OBBgVA%40mail.gmail.com

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


Re: POC: Cleaning up orphaned files using undo logs

Thomas Munro-3
In reply to this post by Kuntal Ghosh
On Tue, Nov 6, 2018 at 12:42 AM Kuntal Ghosh <[hidden email]> wrote:

> On Thu, Nov 1, 2018 at 8:53 AM Thomas Munro
> <[hidden email]> wrote:
> > It passes make check on Unix and Windows, though currently it's
> > failing some of the TAP tests for reasons I'm looking into (possibly
> > due to bugs in the lower level patches, not sure).
> >
> I looked into the regression failures when the tap-tests are enabled.
> It seems that we're not estimating and allocating the shared memory
> for rollback-hash tables correctly. I've added a patch to fix the
> same.

Thanks Kuntal.

--
Thomas Munro
http://www.enterprisedb.com


Re: POC: Cleaning up orphaned files using undo logs

Dmitry Dolgov
> On Thu, Nov 8, 2018 at 4:03 AM Thomas Munro <[hidden email]> wrote:
>
> On Tue, Nov 6, 2018 at 12:42 AM Kuntal Ghosh <[hidden email]> wrote:
> > On Thu, Nov 1, 2018 at 8:53 AM Thomas Munro
> > <[hidden email]> wrote:
> > > It passes make check on Unix and Windows, though currently it's
> > > failing some of the TAP tests for reasons I'm looking into (possibly
> > > due to bugs in the lower level patches, not sure).
> > >
> > I looked into the regression failures when the tap-tests are enabled.
> > It seems that we're not estimating and allocating the shared memory
> > for rollback-hash tables correctly. I've added a patch to fix the
> > same.
>
> Thanks Kuntal.

Thanks for the patch,

Unfortunately, cfbot complains about these patches and can't apply them for
some reason, so I did this manually to check it out. All of them (including the
fix from Kuntal) were applied without conflicts, but compilation stopped here

undoinsert.c: In function ‘UndoRecordAllocateMulti’:
undoinsert.c:547:18: error: ‘urec’ may be used uninitialized in this
function [-Werror=maybe-uninitialized]
   urec->uur_info = 0;  /* force recomputation of info bits */
   ~~~~~~~~~~~~~~~^~~

Could you please post a fixed version of the patch?


Re: POC: Cleaning up orphaned files using undo logs

Thomas Munro-3
On Sat, Dec 1, 2018 at 5:12 AM Dmitry Dolgov <[hidden email]> wrote:

> > On Thu, Nov 8, 2018 at 4:03 AM Thomas Munro <[hidden email]> wrote:
> > On Tue, Nov 6, 2018 at 12:42 AM Kuntal Ghosh <[hidden email]> wrote:
> > > On Thu, Nov 1, 2018 at 8:53 AM Thomas Munro
> > > <[hidden email]> wrote:
> > > > It passes make check on Unix and Windows, though currently it's
> > > > failing some of the TAP tests for reasons I'm looking into (possibly
> > > > due to bugs in the lower level patches, not sure).
> > > >
> > > I looked into the regression failures when the tap-tests are enabled.
> > > It seems that we're not estimating and allocating the shared memory
> > > for rollback-hash tables correctly. I've added a patch to fix the
> > > same.
> >
> > Thanks Kuntal.
>
> Thanks for the patch,
>
> Unfortunately, cfbot complains about these patches and can't apply them for
> some reason, so I did this manually to check it out. All of them (including the
> fix from Kuntal) were applied without conflicts, but compilation stopped here
>
> undoinsert.c: In function ‘UndoRecordAllocateMulti’:
> undoinsert.c:547:18: error: ‘urec’ may be used uninitialized in this
> function [-Werror=maybe-uninitialized]
>    urec->uur_info = 0;  /* force recomputation of info bits */
>    ~~~~~~~~~~~~~~~^~~
>
> Could you please post a fixed version of the patch?

Sorry for my silence... I got stuck on a design problem with the lower
level undo log management code that I'm now close to having figured
out.  I'll have a new patch soon.

--
Thomas Munro
http://www.enterprisedb.com


Re: POC: Cleaning up orphaned files using undo logs

Andres Freund
Hi,

On 2018-12-03 18:43:04 +1300, Thomas Munro wrote:

> On Sat, Dec 1, 2018 at 5:12 AM Dmitry Dolgov <[hidden email]> wrote:
> > > On Thu, Nov 8, 2018 at 4:03 AM Thomas Munro <[hidden email]> wrote:
> > > On Tue, Nov 6, 2018 at 12:42 AM Kuntal Ghosh <[hidden email]> wrote:
> > > > On Thu, Nov 1, 2018 at 8:53 AM Thomas Munro
> > > > <[hidden email]> wrote:
> > > > > It passes make check on Unix and Windows, though currently it's
> > > > > failing some of the TAP tests for reasons I'm looking into (possibly
> > > > > due to bugs in the lower level patches, not sure).
> > > > >
> > > > I looked into the regression failures when the tap-tests are enabled.
> > > > It seems that we're not estimating and allocating the shared memory
> > > > for rollback-hash tables correctly. I've added a patch to fix the
> > > > same.
> > >
> > > Thanks Kuntal.
> >
> > Thanks for the patch,
> >
> > Unfortunately, cfbot complains about these patches and can't apply them for
> > some reason, so I did this manually to check it out. All of them (including the
> > fix from Kuntal) were applied without conflicts, but compilation stopped here
> >
> > undoinsert.c: In function ‘UndoRecordAllocateMulti’:
> > undoinsert.c:547:18: error: ‘urec’ may be used uninitialized in this
> > function [-Werror=maybe-uninitialized]
> >    urec->uur_info = 0;  /* force recomputation of info bits */
> >    ~~~~~~~~~~~~~~~^~~
> >
> > Could you please post a fixed version of the patch?
>
> Sorry for my silence... I got stuck on a design problem with the lower
> level undo log management code that I'm now close to having figured
> out.  I'll have a new patch soon.

Given this patch has been in waiting for author for ~two months, I'm
unfortunately going to have to mark it as returned with feedback. Please
resubmit once refreshed.

Greetings,

Andres Freund


Re: POC: Cleaning up orphaned files using undo logs

Thomas Munro-5
On Sun, Feb 3, 2019 at 11:09 PM Andres Freund <[hidden email]> wrote:
> On 2018-12-03 18:43:04 +1300, Thomas Munro wrote:
> > Sorry for my silence... I got stuck on a design problem with the lower
> > level undo log management code that I'm now close to having figured
> > out.  I'll have a new patch soon.

Hello all,

Here's a new WIP version of this patch set.  It builds on a fairly
deep stack of patches being developed by several people.  As mentioned
before, it's a useful crash-test dummy for a whole stack of technology
we're working on, but it's also aiming to solve a real problem.

It currently fails in one regression test for a well-understood
reason (fix on the way; see end), and there are some other stability
problems being worked on.

Here's a quick tour of the observable behaviour, having installed the
pg_buffercache and test_undorecord extensions:

==================

postgres=# begin;
BEGIN
postgres=# create table foo ();
CREATE TABLE

Check if our transaction has generated undo data:

postgres=# select logno, discard, insert, xid, pid from pg_stat_undo_logs ;
 logno |     discard      |      insert      | xid |  pid
-------+------------------+------------------+-----+-------
     0 | 0000000000002CD9 | 0000000000002D1A | 476 | 39169
(1 row)

Here, we see that undo log number 0 has some undo data because discard
< insert.  We can find out what it says:

postgres=# call dump_undo_records(0);
NOTICE:  0000000000002CD9: Storage: CREATE dbid=12916, tsid=1663,
relfile=16386; xid=476, next xact=0
CALL

The undo record shown there lives in shared buffers, and we can see
that it's in there with pg_buffercache (the new column smgrid 1 means
undo data; 0 is regular relation data):

postgres=# select bufferid, smgrid, relfilenode, relblocknumber,
isdirty, usagecount from pg_buffercache where smgrid = 1;
 bufferid | smgrid | relfilenode | relblocknumber | isdirty | usagecount
----------+--------+-------------+----------------+---------+------------
        3 |      1 |           0 |              1 | t       |          5
(1 row)

Even though that's just a dirty page in shared buffers, if we crash
now and recover, it'll be recreated by a new WAL record that was
flushed *before* creating the relation file.  We can see that with
pg_waldump:

rmgr: Storage ... PRECREATE base/12916/16384, blkref #0: smgr 1 rel
1663/0/0 blk 1 FPW
rmgr: Storage ... CREATE base/12916/16384

The PRECREATE record dirtied block 1 of undo log 0.  In this case it
happened to include a FPW of the undo log page too, following the
usual rules.  FPWs are rare for undo pages because of the
REGBUF_WILL_INIT optimisation that applies to the zeroed out pages
(which is most undo pages, due to the append-mostly access pattern).

Finally, if we commit we see the undo data is discarded by a
background worker, and if we roll back explicitly or crash and run
recovery, the file is unlinked.  Here's an example of the crash case:

postgres=# begin;
BEGIN
postgres=# create table foo ();
CREATE TABLE
postgres=# select relfilenode from pg_class where relname = 'foo';
 relfilenode
-------------
       16395
(1 row)

postgres=# select pg_backend_pid();
 pg_backend_pid
----------------
          39169
(1 row)

$ kill -9 39169

... server restarts, recovers ...

$ ls pgdata/base/12916/16395
pgdata/base/12916/16395

It's still there, though it's been truncated by an undo worker (see
end of email).  And finally, after the next checkpoint:

$ ls pgdata/base/12916/16395
ls: pgdata/base/12916/16395: No such file or directory

That's the end of the quick tour.

Most of these patches should probably be discussed in other threads,
but I'm posting a snapshot of the full stack here anyway.  Here's a
patch-by-patch summary:

=== 0001 "Refactor the fsync mechanism to support future SMGR
implementations." ===

The 0001 patch has its own CF thread
https://commitfest.postgresql.org/22/1829/ and is from Shawn Debnath
(based on earlier work by me), but I'm including a copy here for
convenience/cfbot.

=== 0002 "Add SmgrId to smgropen() and BufferTag." ===

This is new, and is based on the discussion from another recent
thread[1] about how we should identify buffers belonging to different
storage managers.  In earlier versions of the patch-set I had used a
special reserved DB OID for undo data.  Tom Lane didn't like that idea
much, and Anton Shyrabokau (via Shawn Debnath) suggested making
ForkNumber narrower so we can add a new field to BufferTag, and Andres
Freund +1'd my proposal to add the extra value as a parameter to
smgropen().  So, here is a patch that tries those ideas.

Another way to do this would be to widen RelFileNode instead, to avoid
having to pass around the SMGR ID separately in various places.
Looking at the number of places that have to change, you can probably
see why we wanted to use a magic DB OID instead, and I'm not entirely
convinced that it wasn't better that way, or that I've found all the
places that need to carry an smgrid alongside a RelFileNode.

Archeological note: smgropen() was like that ~15 years ago before
commit 87bd9563, but buffer tags didn't include the SMGR ID.

I decided to call md.c's ID "SMGR_RELATION", describing what it really
holds -- regular relations -- rather than perpetuating the doubly
anachronistic "magnetic disk" name.

While here, I resurrected the ancient notion of a per-SMGR 'open'
routine, so that a small amount of md.c-specific stuff could be kicked
out of smgr.c and future implementations can do their own thing here
too.

While doing that work I realised that at least pg_rewind needs to
learn about how different storage managers map blocks to files, so
that's a new TODO item requiring more thought.  I wonder what other
places know how to map { RelFileNode, ForkNumber, BlockNumber } to a
path + offset, and I wonder what to think about the fact that some of
them may be non-backend code...

=== 0003 "Add undo log manager." ===

This and the next couple of patches live in CF thread
https://commitfest.postgresql.org/22/1649/ but here's a much newer
snapshot that hasn't been posted there yet.

Manages a set of undo logs in shared memory, manages undo segment
files, tracks discard, insert, end pointers visible in
pg_stat_undo_logs.  With this patch you can allocate and discard space
in undo logs using the UndoRecPtr type to refer to addresses, but
there is no access to the data yet.  Improvements since the last
version are not requiring DSM segments, proper FPW support and reduced
WAL traffic.  Previously there were extra per-xact and per-checkpoint
records requiring retry-loops in code that inserted undo data.

=== 0004 "Provide access to undo log data via the buffer manager." ===

Provide SMGR_UNDO.  While the 0003 patch deals with allocating and
discarding undo address space and makes sure that backing files exist,
this patch lets you read and write buffered data in them.

=== 0005 "Allow WAL record data on first modification after a checkpoint." ===

Provide a way for data to be attached to a WAL-registered block that
is only included if this turns out to be the first WAL record that
touches the block after a checkpoint.  This is a bit like FPW images,
except that it's arbitrary extra data and happens even if FPW is off.
This is used to capture a copy of the (tiny) undo log meta-data
(primarily the insertion pointer) to fix a consistency problem when
recovering from an online checkpoint.
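A sketch of the inclusion test, on the assumption (mine, not from the patch) that it mirrors the usual full-page-write condition of comparing the page's LSN against the checkpoint's redo pointer:

```python
# Hypothetical sketch: attach extra payload to a registered block only
# when this is the first record touching it since the last checkpoint.
def needs_extra_data(page_lsn, redo_ptr):
    # the page was last stamped at or before the redo pointer, so no
    # record since the checkpoint has modified it yet
    return page_lsn <= redo_ptr

def assemble_record(page_lsn, redo_ptr, payload, extra):
    data = [payload]
    if needs_extra_data(page_lsn, redo_ptr):
        data.append(extra)   # e.g. a copy of the undo log meta-data
    return data

# first touch after a checkpoint at 0x500: extra data included
assert assemble_record(0x400, 0x500, "rec", "meta") == ["rec", "meta"]
# later touches of the same page: extra data omitted
assert assemble_record(0x600, 0x500, "rec", "meta") == ["rec"]
```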

=== 0006 + 0007 "Provide interfaces to store and fetch undo records." ===

This is a snapshot of work by my colleagues Dilip, Rafia and others
based on earlier prototyping by Robert.  While the earlier patches
give you buffered binary undo data, this patch introduces the concept
of high level undo records that can be inserted, and read back given
an UndoRecPtr.  This is a version presented on another thread already;
here it's lightly changed due to rebasing by me.

Undo-aware modules should design a set of undo record types, and
insert exactly the same ones at do and undo time.

The 0007 patch is fixups from me to bring that code into line with
changes to the lower level patches.  Future versions will be squashed
and tidied up; still working on that.

=== 0008 + 0009 "Undo worker and transaction rollback" ===

This has a CF thread at https://commitfest.postgresql.org/22/1828/ and
again this is a snapshot of work from Dilip, Rafia and others, with a
fixup from me.  Still working on coordinating that for the next
version.

This provides a way for RMGR modules to register a callback function
that will receive all the undo records they inserted during a given
[sub]transaction if it rolls back.  It also provides a system of
background workers that can execute those undo records in case the
rollback happens after crash recovery, or in case the work can be
usefully pushed into the background during a regular online rollback.
This is a complex topic and I'm not attempting to explain it here.
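Without attempting that explanation, the dispatch in spirit looks like this (a Python sketch with hypothetical names; the real registration goes through rmgrlist.h in C), including the detail that a handler receives all of a transaction's records for its rmgr in one call:

```python
# Hypothetical sketch: per-rmgr undo callbacks, each handed the whole
# group of a transaction's undo records rather than one at a time.
undo_handlers = {}  # rmgr id -> callback

def register_undo_handler(rmgr_id, fn):
    undo_handlers[rmgr_id] = fn

def rollback_transaction(undo_records):
    by_rmgr = {}
    for rec in undo_records:
        by_rmgr.setdefault(rec["rmgr"], []).append(rec)
    for rmgr_id, recs in by_rmgr.items():
        undo_handlers[rmgr_id](recs)  # e.g. unlink the listed files

unlinked = []
register_undo_handler("storage",
                      lambda recs: unlinked.extend(r["relfile"] for r in recs))
rollback_transaction([
    {"rmgr": "storage", "relfile": 24591},
    {"rmgr": "storage", "relfile": 24594},
])
assert unlinked == [24591, 24594]
```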

There are a few known problems with this and Dilip is working on a
more sophisticated worker management system, but I'll let him write
about that, over in that other thread.

I think it'd probably be a good idea to split this patch into two or
three; the RMGR undo support, the xact.c integration and the worker
machinery.  But maybe that's just me.

Archeological note: XXXX_undo() callback functions registered via
rmgrlist.h a bit like this originally appeared in the work by Vadim
Mikheev (author of WAL) in commit b58c0411bad4, but that was
apparently never completed once people figured out that you can make a
force, steal, redo, no-undo database work (curiously I saw a slide
from a university lecture somewhere saying that would be impossible).
The stub functions were removed from the tree in 4c8495a1.  Our new
work differs from Vadim's original vision by putting undo data in a
separate place from the WAL, and accessing it via shared buffers.  I
guess that might be because Vadim planned to use undo for rollback
only, not for MVCC (but I might be wrong about that).  That difference
might explain why eg Vadim's function heap_undo() took an XLogRecord,
whereas our proposal takes a different type.  Our proposal also passes
more than one record at a time to the undo handler; in future this
will allow us to collect up all undo records relating to a page of
(eg) zheap, and process them together for mechanical sympathy.

=== 0010 "Add developer documentation for the undo log storage subsystem." ===

Updated based on Robert's review up-thread.  No coverage of background
workers yet -- that is under development.

=== 0011 "Add user-facing documentation for undo logs." ===

Updated based on Robert's review up-thread.

=== 0012 "Add test_undorecord test module." ===

Provides quick and dirty dump_undo_records() procedure for testing.

=== 0013 "Use undo-based rollback to clean up files on abort." ===

Finally, this is the actual feature that this CF item is about.  The
main improvement here is that the previous version unlinked files
immediately when executing undo actions, which broke the protocol
established by commit 6cc4451b, namely that you can't reuse a
relfilenode until after the next checkpoint, and the existence of an
(empty) first relation segment in the filesystem is the only thing
preventing that.  That is fixed in this version (but see problem 2
below).

Known problems:

1.  A couple of tests fail with "ERROR:  buffer is pinned in
InvalidateBuffer".  That's because ROLLBACK TO SAVEPOINT is executing
the undo actions that drop the buffers for a newly created table
before the subtransaction has been cleaned up.  Amit is working on a
solution to that.  More soon.

2.  There are two levels of deferment of file unlinking in current
PostgreSQL.  First, when you create a new relation, it is pushed on
pendingDeletes; this patch-set replaces that in-memory list with
persistent undo records as discussed.  There is a second level of
deferment: we unlink all the segments of the file except the first
one, which we truncate, and then finally the zero-length file is
unlinked after the next checkpoint; this is an important part of
PostgreSQL's protocol for not reusing relfilenodes too soon.  That
means that there is still a very narrow window after the checkpoint is
logged but before we've unlinked that file where you could still crash
and leak a zero-length file.  I've thought about a couple of solutions
to close that window, including a file renaming scheme where .zombie
files get cleaned up on crash, but that seemed like something that
could be improved later.

There is something else that goes wrong under parallel make check,
which I must have introduced recently but haven't tracked down yet.  I
wanted to post a snapshot version for discussion anyway.  More soon.

This code is available at https://github.com/EnterpriseDB/zheap/tree/undo.

[1] https://www.postgresql.org/message-id/flat/CA%2BhUKG%2BDE0mmiBZMtZyvwWtgv1sZCniSVhXYsXkvJ_Wo%2B83vvw%40mail.gmail.com

--
Thomas Munro
https://enterprisedb.com

undo-smgr-v2.tgz (209K)

Re: POC: Cleaning up orphaned files using undo logs

Dilip Kumar-2
On Tue, Mar 12, 2019 at 6:51 PM Thomas Munro <[hidden email]> wrote:

>
> On Sun, Feb 3, 2019 at 11:09 PM Andres Freund <[hidden email]> wrote:
> > On 2018-12-03 18:43:04 +1300, Thomas Munro wrote:
> > > Sorry for my silence... I got stuck on a design problem with the lower
> > > level undo log management code that I'm now close to having figured
> > > out.  I'll have a new patch soon.
>
> Hello all,
>
> Here's a new WIP version of this patch set.  It builds on a fairly
> deep stack of patches being developed by several people.  As mentioned
> before, it's a useful crash-test dummy for a whole stack of technology
> we're working on, but it's also aiming to solve a real problem.
>
> It currently fails in one regression test for a well understood
> reason, fix on the way (see end), and there are some other stability
> problems being worked on.
>
> Here's a quick tour of the observable behaviour, having installed the
> pg_buffercache and test_undorecord extensions:
>
> ==================
>
> postgres=# begin;
> BEGIN
> postgres=# create table foo ();
> CREATE TABLE
>
> Check if our transaction has generated undo data:
>
> postgres=# select logno, discard, insert, xid, pid from pg_stat_undo_logs ;
>  logno |     discard      |      insert      | xid |  pid
> -------+------------------+------------------+-----+-------
>      0 | 0000000000002CD9 | 0000000000002D1A | 476 | 39169
> (1 row)
>
> Here, we see that undo log number 0 has some undo data because discard
> < insert.  We can find out what it says:
>
> postgres=# call dump_undo_records(0);
> NOTICE:  0000000000002CD9: Storage: CREATE dbid=12916, tsid=1663,
> relfile=16386; xid=476, next xact=0
> CALL
>
> The undo record shown there lives in shared buffers, and we can see
> that it's in there with pg_buffercache (the new column smgrid 1 means
> undo data; 0 is regular relation data):
>
> postgres=# select bufferid, smgrid, relfilenode, relblocknumber,
> isdirty, usagecount from pg_buffercache where smgrid = 1;
>  bufferid | smgrid | relfilenode | relblocknumber | isdirty | usagecount
> ----------+--------+-------------+----------------+---------+------------
>         3 |      1 |           0 |              1 | t       |          5
> (1 row)
>
> Even though that's just a dirty page in shared buffers, if we crash
> now and recover, it'll be recreated by a new WAL record that was
> flushed *before* creating the relation file.  We can see that with
> pg_waldump:
>
> rmgr: Storage ... PRECREATE base/12916/16384, blkref #0: smgr 1 rel
> 1663/0/0 blk 1 FPW
> rmgr: Storage ... CREATE base/12916/16384
>
> The PRECREATE record dirtied block 1 of undo log 0.  In this case it
> happened to include a FPW of the undo log page too, following the
> usual rules.  FPWs are rare for undo pages because of the
> REGBUF_WILL_INIT optimisation that applies to the zeroed out pages
> (which is most undo pages, due to the append-mostly access pattern).
>
> Finally, if we commit we see that the undo data is discarded by a
> background worker, and if we roll back explicitly or crash and run
> recovery, the file is unlinked.  Here's an example of the crash case:
>
> postgres=# begin;
> BEGIN
> postgres=# create table foo ();
> CREATE TABLE
> postgres=# select relfilenode from pg_class where relname = 'foo';
>  relfilenode
> -------------
>        16395
> (1 row)
>
> postgres=# select pg_backend_pid();
>  pg_backend_pid
> ----------------
>           39169
> (1 row)
>
> $ kill -9 39169
>
> ... server restarts, recovers ...
>
> $ ls pgdata/base/12916/16395
> pgdata/base/12916/16395
>
> It's still there, though it's been truncated by an undo worker (see
> end of email).  And finally, after the next checkpoint:
>
> $ ls pgdata/base/12916/16395
> ls: pgdata/base/12916/16395: No such file or directory
>
> That's the end of the quick tour.
>
> Most of these patches should probably be discussed in other threads,
> but I'm posting a snapshot of the full stack here anyway.  Here's a
> patch-by-patch summary:
>
> === 0001 "Refactor the fsync mechanism to support future SMGR
> implementations." ===
>
> The 0001 patch has its own CF thread
> https://commitfest.postgresql.org/22/1829/ and is from Shawn Debnath
> (based on earlier work by me), but I'm including a copy here for
> convenience/cfbot.
>
> === 0002 "Add SmgrId to smgropen() and BufferTag." ===
>
> This is new, and is based on the discussion from another recent
> thread[1] about how we should identify buffers belonging to different
> storage managers.  In earlier versions of the patch-set I had used a
> special reserved DB OID for undo data.  Tom Lane didn't like that idea
> much, and Anton Shyrabokau (via Shawn Debnath) suggested making
> ForkNumber narrower so we can add a new field to BufferTag, and Andres
> Freund +1'd my proposal to add the extra value as a parameter to
> smgropen().  So, here is a patch that tries those ideas.
>
> Another way to do this would be to widen RelFileNode instead, to avoid
> having to pass around the SMGR ID separately in various places.
> Looking at the number of places that have to change, you can probably
> see why we wanted to use a magic DB OID instead, and I'm not entirely
> convinced that it wasn't better that way, or that I've found all the
> places that need to carry an smgrid alongside a RelFileNode.
>
> Archeological note: smgropen() was like that ~15 years ago before
> commit 87bd9563, but buffer tags didn't include the SMGR ID.
>
> I decided to call md.c's ID "SMGR_RELATION", describing what it really
> holds -- regular relations -- rather than perpetuating the doubly
> anachronistic "magnetic disk" name.
>
> While here, I resurrected the ancient notion of a per-SMGR 'open'
> routine, so that a small amount of md.c-specific stuff could be kicked
> out of smgr.c and future implementations can do their own thing here
> too.
>
> While doing that work I realised that at least pg_rewind needs to
> learn about how different storage managers map blocks to files, so
> that's a new TODO item requiring more thought.  I wonder what other
> places know how to map { RelFileNode, ForkNumber, BlockNumber } to a
> path + offset, and I wonder what to think about the fact that some of
> them may be non-backend code...
>
> === 0003 "Add undo log manager." ===
>
> This and the next couple of patches live in CF thread
> https://commitfest.postgresql.org/22/1649/ but here's a much newer
> snapshot that hasn't been posted there yet.
>
> Manages a set of undo logs in shared memory, manages undo segment
> files, tracks discard, insert, end pointers visible in
> pg_stat_undo_logs.  With this patch you can allocate and discard space
> in undo logs using the UndoRecPtr type to refer to addresses, but
> there is no access to the data yet.  Improvements since the last
> version are not requiring DSM segments, proper FPW support and reduced
> WAL traffic.  Previously there were extra per-xact and per-checkpoint
> records requiring retry-loops in code that inserted undo data.
>
> === 0004 "Provide access to undo log data via the buffer manager." ===
>
> Provide SMGR_UNDO.  While the 0003 patch deals with allocating and
> discarding undo address space and makes sure that backing files exist,
> this patch lets you read and write buffered data in them.
>
> === 0005 "Allow WAL record data on first modification after a checkpoint." ===
>
> Provide a way for data to be attached to a WAL-registered block that
> is only included if this turns out to be the first WAL record that
> touches the block after a checkpoint.  This is a bit like FPW images,
> except that it's arbitrary extra data and happens even if FPW is off.
> This is used to capture a copy of the (tiny) undo log meta-data
> (primarily the insertion pointer) to fix a consistency problem when
> recovering from an online checkpoint.
>
> === 0006 + 0007 "Provide interfaces to store and fetch undo records." ===
>
> This is a snapshot of work by my colleagues Dilip, Rafia and others
> based on earlier prototyping by Robert.  While the earlier patches
> give you buffered binary undo data, this patch introduces the concept
> of high level undo records that can be inserted, and read back given
> an UndoRecPtr.  This is a version presented on another thread already;
> here it's lightly changed due to rebasing by me.
>
> Undo-aware modules should design a set of undo record types, and
> insert exactly the same ones at do and undo time.
>
> The 0007 patch is fixups from me to bring that code into line with
> changes to the lower level patches.  Future versions will be squashed
> and tidied up; still working on that.
Currently, the undo branch[1] contains an older version of the (undo
interface + some fixup).  Now, I have merged the latest changes from
the zheap branch[2] to the undo branch[1],
which can be applied on top of the undo storage commit[3].  For
merging those changes, I needed to add some changes to the undo log
storage patch as well for handling multi-log transactions.  So I
have attached two patches: 1) an improvement to undo log storage, and
2) the complete undo interface patch, which includes 0006+0007 from
the undo branch[1] + new changes on the zheap branch.

[1] https://github.com/EnterpriseDB/zheap/tree/undo
[2] https://github.com/EnterpriseDB/zheap
[3] https://github.com/EnterpriseDB/zheap/tree/undo
(b397d96176879ed5b09cf7322b8d6f2edd8043a5)

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

0001-Enhance-multi-log-handling-in-undo-log.patch (10K)
0002-Provide-interfaces-to-store-and-fetch-undo-records.patch (97K)

Re: POC: Cleaning up orphaned files using undo logs

Robert Haas
On Fri, Apr 19, 2019 at 6:16 AM Dilip Kumar <[hidden email]> wrote:
> Currently, undo branch[1] contain an older version of the (undo
> interface + some fixup).  Now, I have merged the latest changes from
> the zheap branch[2] to the undo branch[1]
> which can be applied on top of the undo storage commit[3].  For
> merging those changes, I need to add some changes to the undo log
> storage patch as well for handling the multi log transaction.  So I
> have attached two patches, 1) improvement to undo log storage 2)
> complete undo interface patch which include 0006+0007 from undo
> branch[1] + new changes on the zheap branch.

Some review comments:

+#define AtAbort_ResetUndoBuffers() ResetUndoBuffers()

I don't think this really belongs in xact.c.  Seems like we ought to
declare it in the appropriate header file.  Perhaps we also ought to
consider using a static inline function rather than a macro, although
I guess it doesn't really matter.

+void
+SetCurrentUndoLocation(UndoRecPtr urec_ptr)
+{
+ UndoLogControl *log = UndoLogGet(UndoRecPtrGetLogNo(urec_ptr), false);
+ UndoPersistence upersistence = log->meta.persistence;

Could we arrange things so that the caller passes the persistence
level, instead of having to recalculate it here?  This doesn't seem to
be a particularly cheap operation.  UndoLogGet() will call
get_undo_log() which I guess will normally just do a hash table lookup
but which in the worst case will take an LWLock and perform a linear
search of a shared memory array.  But at PrepareUndoInsert time we
knew the persistence level already, so I don't see why
InsertPreparedUndo should have to force it to be looked up all over
again -- and while in a critical section, no less.

Another thing that is strange about this patch is that it adds these
start_urec_ptr and latest_urec_ptr arrays and then uses them for
absolutely nothing. I think that's a pretty clear sign that the
division of this work into multiple patches is not correct.  We
shouldn't have one patch that tracks some information that is used
nowhere for anything and then another patch that adds a user of that
information -- the two should go together.

Incidentally, wouldn't StartTransaction() need to reinitialize these fields?

+ * When the undorecord for a transaction gets inserted in the next log then we

undo record

+ * insert a transaction header for the first record in the new log and update
+ * the transaction header with this new logs location.  We will also keep

This appears to be nonsensical.  You're saying that you add a
transaction header to the new log and then update it with its own
location.  That can't be what you mean.

+ * Incase, this is not the first record in new log (aka new log already

"Incase," -> "If"
"aka" -> "i.e."

Overall this whole paragraph is a bit hard to understand.

+ * same transaction spans across multiple logs depending on which log is

delete "across"

+ * processed first by the discard worker.  If it processes the first log which
+ * contains the transactions first record, then it can get the last record
+ * of that transaction even if it is in different log and then processes all
+ * the undo records from last to first.  OTOH, if the next log get processed

Not sure how that would work if the number of logs is >2.

This whole paragraph is also hard to understand.

+static UndoBuffers def_buffers[MAX_UNDO_BUFFERS];
+static int buffer_idx;

This is a file-global variable with a very generic name that is very
similar to a local variable name used by multiple functions within the
file (bufidx) and no comments.  Ugh.

+UndoRecordIsValid(UndoLogControl * log, UndoRecPtr urp)

The locking regime for this function is really confusing.  It requires
that the caller hold discard_lock on entry, and on exit the lock will
still be held if the return value is true but will no longer be held
if the return value is false.  Yikes!  Anybody who reads code that
uses this function is not going to guess that it has such strange
behavior.  I'm not exactly sure how to redesign this, but I think it's
not OK the way you have it. One option might be to inline the logic
into each call site.

+/*
+ * Overwrite the first undo record of the previous transaction to update its
+ * next pointer.  This will just insert the already prepared record by
+ * UndoRecordPrepareTransInfo.  This must be called under the critical section.
+ * This will just overwrite the undo header not the data.
+ */
+static void
+UndoRecordUpdateTransInfo(int idx)

It bugs me that this function goes back in to reacquire the discard
lock for the purpose of preventing a concurrent undo discard.
Apparently, if the other transaction's undo has been discarded between
the prepare phase and where we are now, we're OK with that and just
exit without doing anything; otherwise, we update the previous
transaction header.  But that seems wrong.  When we enter a critical
section, I think we should aim to know exactly what modifications we
are going to make within that critical section.

I also wonder how the concurrent discard could really happen.  We must
surely be holding exclusive locks on the relevant buffers -- can undo
discard really discard undo when the relevant buffers are x-locked?

It seems to me that remaining_bytes is a crock that should be ripped
out entirely, both here and in InsertUndoRecord.  It seems to me that
UndoRecordUpdateTransInfo can just contrive to set remaining_bytes
correctly.  e.g.

do
{
    // stuff
    if (!BufferIsValid(buffer))
    {
        Assert(InRecovery);
        already_written += (BLCKSZ - starting_byte);
        done = (already_written >= undo_len);
    }
    else
    {
        page = BufferGetPage(buffer);
        done = InsertUndoRecord(...);
        MarkBufferDirty(buffer);
    }
} while (!done);

InsertPreparedUndo needs similar treatment.

To make this work, I guess the long string of assignments in
InsertUndoRecord will need to be made unconditional, but that's
probably pretty cheap.  As a fringe benefit, all of those work_blah
variables that are currently referencing file-level globals can be
replaced with something local to this function, which will surely
please the coding style police.

+ * In recovery, 'xid' refers to the transaction id stored in WAL, otherwise,
+ * it refers to the top transaction id because undo log only stores mapping
+ * for the top most transactions.
+ */
+UndoRecPtr
+PrepareUndoInsert(UnpackedUndoRecord *urec, FullTransactionId fxid,

xid vs fxid

+ urec->uur_xidepoch = EpochFromFullTransactionId(fxid);

We need to make some decisions about how we're going to handle 64-bit
XIDs vs. 32-bit XIDs in undo. This doesn't look like a well-considered
scheme.  In general, PrepareUndoInsert expects the caller to have
populated the UnpackedUndoRecord, but here, uur_xidepoch is getting
overwritten with the high bits of the caller-specified XID.  The
low-order bits aren't stored anywhere by this function, but the caller
is presumably expected to have placed them inside urec->uur_xid.  And
it looks like the low-order bits (urec->uur_xid) get stored for every
undo record, but the high-order bits (urec->xidepoch) only get stored
when we emit a transaction header.  This all seems very confusing.

I would really like to see us replace uur_xid and uur_xidepoch with a
FullTransactionId; now that we have that concept, it seems like bad
practice to break what is really a FullTransactionId into two halves
and store them separately.  However, it would be a bit unfortunate to
store an additional 4 bytes of mostly-static data in every undo
record.  What if we went the other way?   That is, remove urec_xid
from UndoRecordHeader and from UnpackedUndoRecord.  Make it the
responsibility of whoever is looking up an undo record to know which
transaction's undo they are searching.  zheap, at least, generally
does know this: if it's starting from a page, then it has the XID +
epoch available from the transaction slot, and if it's starting from
an undo record, you need to know the XID for which you are searching,
I guess from uur_prevxid.

I also think that we need to remove uur_prevxid.  That field does not
seem to be properly a general-purpose part of the undo machinery, but
a zheap-specific consideration.  I think it's job is to tell you which
transaction last modified the current tuple, but zheap can put that
data in the payload if it likes.  It is a waste to store it in every
undo record, because it's not needed if the older undo has already
been discarded or if the operation is an insert.

+ * Insert a previously-prepared undo record.  This will write the actual undo

Looks like this now inserts all previously-prepared undo records
(rather than just a single one).

+ * in page.  We start writting immediately after the block header.

Spelling.

+ * Helper function for UndoFetchRecord.  It will fetch the undo record pointed
+ * by urp and unpack the record into urec.  This function will not release the
+ * pin on the buffer if complete record is fetched from one buffer, so caller
+ * can reuse the same urec to fetch the another undo record which is on the
+ * same block.  Caller will be responsible to release the buffer inside urec
+ * and set it to invalid if it wishes to fetch the record from another block.
+ */
+UnpackedUndoRecord *
+UndoGetOneRecord(UnpackedUndoRecord *urec, UndoRecPtr urp, RelFileNode rnode,
+ UndoPersistence persistence)

I don't really understand why uur_buffer is part of an
UnpackedUndoRecord.  It doesn't look like the other fields, which tell
you about the contents of an undo record that you have created or that
you have parsed.  Instead, it's some kind of scratch space for keeping
track of a buffer that we're using in the process of reading an undo
record.  It looks like it should be an argument to UndoGetOneRecord()
and ResetUndoRecord().

I also wonder whether it's really a good design to make the caller
responsible for invalidating the buffer before accessing another
block.  Maybe it would be simpler to have this function just check
whether the buffer it's been given is the right one; if not, unpin it
and pin the new one instead.  But I'm not really sure...

+ /* If we already have a buffer pin then no need to allocate a new one. */
+ if (!BufferIsValid(buffer))
+ {
+ buffer = ReadBufferWithoutRelcache(SMGR_UNDO,
+    rnode, UndoLogForkNum, cur_blk,
+    RBM_NORMAL, NULL,
+    RelPersistenceForUndoPersistence(persistence));
+
+ urec->uur_buffer = buffer;
+ }

I think you should move this code inside the loop that follows.  Then
at the bottom of that loop, instead of making a similar call, just set
buffer = InvalidBuffer.  Then when you loop around it'll do the right
thing and you'll need less code.

Notice that having both the local variable buffer and the structure
member urec->uur_buffer is actually making this code more complex. You
are already setting urec->uur_buffer = InvalidBuffer when you do
UnlockReleaseBuffer().  If you didn't have a separate 'buffer'
variable you wouldn't need to clear them both here.  In fact I think
what you should have is an argument Buffer *curbuf, or something like
that, and no uur_buffer at all.

+ /*
+ * If we have copied the data then release the buffer, otherwise, just
+ * unlock it.
+ */
+ if (is_undo_rec_split)
+ UnlockReleaseBuffer(buffer);
+ else
+ LockBuffer(buffer, BUFFER_LOCK_UNLOCK);

Ugh.  I think what's going on here is: UnpackUndoRecord copies the
data if it switches buffers, but not otherwise.  So if the record is
split, we can release the lock and pin, but otherwise we have to keep
the pin to avoid having the data get clobbered.  But having this code
know about what UnpackUndoRecord does internally seems like an
abstraction violation.  It's also probably not right if we ever want
to fetch undo records in bulk, as I see that the latest code in zheap
master does.  I think we should switch UnpackUndoRecord over to always
copying the data and just avoid all this.

(To make that cheaper, we may want to teach UnpackUndoRecord to store
data into scratch space provided by the caller rather than using
palloc to get its own space, but I'm not actually sure that's (a)
worth it or (b) actually better.)

[ Skipping over some things ]

+bool
+UnpackUndoRecord(UnpackedUndoRecord *uur, Page page, int starting_byte,
+ int *already_decoded, bool header_only)

I think we should split this function into three functions that use a
context object, call it say UnpackUndoContext.  The caller will do:

BeginUnpackUndo(&ucontext);  // just once
UnpackUndoData(&ucontext, page, starting_byte);  // any number of times
FinishUnpackUndo(&uur, &ucontext);  // just once

The undo context will store an enum value that tells us the "stage" of decoding:

- UNDO_DECODE_STAGE_HEADER: We have not yet decoded even the record
header; we need to do that next.
- UNDO_DECODE_STAGE_RELATION_DETAILS: The next thing to be decoded is
the relation details, if present.
- UNDO_DECODE_STAGE_BLOCK: The next thing to be decoded is the block
details, if present.
- UNDO_DECODE_STAGE_TRANSACTION: The next thing to be decoded is the
transaction details, if present.
- UNDO_DECODE_STAGE_PAYLOAD: The next thing to be decoded is the
payload details, if present.
- UNDO_DECODE_STAGE_DONE: Decoding is complete.

It will also store the number of bytes that have already been
copied as part of whichever stage is current.  A caller who wants only
part of the record can stop when ucontext.stage > desired_stage; e.g.
the current header_only flag corresponds to stopping when
ucontext.stage > UNDO_DECODE_STAGE_HEADER, and the potential
optimization mentioned in UndoGetOneRecord could be done by stopping
when ucontext.stage > UNDO_DECODE_STAGE_BLOCK (although I don't know
if that's worth doing).

In this scheme, BeginUnpackUndo just needs to set the stage to
UNDO_DECODE_STAGE_HEADER and the number of bytes copied to 0.  The
context object contains all the work_blah things (no more global
variables!), but BeginUnpackUndo does not need to initialize them,
since they will be overwritten before they are examined.  And
FinishUnpackUndo just needs to copy all of the fields from the
work_blah things into the UnpackedUndoRecord.  The tricky part is
UnpackUndoData itself, which I propose should look like a big switch
where all branches fall through.  Roughly:

switch (ucontext->stage)
{
case UNDO_DECODE_STAGE_HEADER:
if (!ReadUndoBytes(...))
  return;
stage = UNDO_DECODE_STAGE_RELATION_DETAILS;
/* FALLTHROUGH */
case UNDO_DECODE_STAGE_RELATION_DETAILS:
if ((uur->uur_info & UREC_INFO_RELATION_DETAILS) != 0)
{
  if (!ReadUndoBytes(...))
    return;
}
stage = UNDO_DECODE_STAGE_BLOCK;
/* FALLTHROUGH */
etc.

ReadUndoBytes would need some adjustments in this scheme; it wouldn't
need my_bytes_read any more since it would only get called for
structures that are not yet completely read.  (Regardless of whether
we adopt this idea, the nocopy flag to ReadUndoBytes appears to be
unused and can be removed.)

We could use a similar context object for InsertUndoRecord.
BeginInsertUndoRecord(&ucontext, &uur) would initialize all of the
work_blah structures within the context object.  InsertUndoData will
be a big switch.  Maybe no need for a "finish" function here.  There
can also be a SkipInsertingUndoData function that can be called
instead of InsertUndoData if the page is discarded.  I think this
would be more elegant than what we've got now.

This is not a complete review, but I'm out of time and energy for the moment...

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: POC: Cleaning up orphaned files using undo logs

Shawn Debnath
In reply to this post by Thomas Munro-5
On Wed, Mar 13, 2019 at 02:20:29AM +1300, Thomas Munro wrote:
 

> === 0002 "Add SmgrId to smgropen() and BufferTag." ===
>
> This is new, and is based on the discussion from another recent
> thread[1] about how we should identify buffers belonging to different
> storage managers.  In earlier versions of the patch-set I had used a
> special reserved DB OID for undo data.  Tom Lane didn't like that idea
> much, and Anton Shyrabokau (via Shawn Debnath) suggested making
> ForkNumber narrower so we can add a new field to BufferTag, and Andres
> Freund +1'd my proposal to add the extra value as a parameter to
> smgropen().  So, here is a patch that tries those ideas.
>
> Another way to do this would be to widen RelFileNode instead, to avoid
> having to pass around the SMGR ID separately in various places.
> Looking at the number of places that have to change, you can probably
> see why we wanted to use a magic DB OID instead, and I'm not entirely
> convinced that it wasn't better that way, or that I've found all the
> places that need to carry an smgrid alongside a RelFileNode.
>
> Archeological note: smgropen() was like that ~15 years ago before
> commit 87bd9563, but buffer tags didn't include the SMGR ID.
>
> I decided to call md.c's ID "SMGR_RELATION", describing what it really
> holds -- regular relations -- rather than perpetuating the doubly
> anachronistic "magnetic disk" name.
>
> While here, I resurrected the ancient notion of a per-SMGR 'open'
> routine, so that a small amount of md.c-specific stuff could be kicked
> out of smgr.c and future implementations can do their own thing here
> too.
>
> While doing that work I realised that at least pg_rewind needs to
> learn about how different storage managers map blocks to files, so
> that's a new TODO item requiring more thought.  I wonder what other
> places know how to map { RelFileNode, ForkNumber, BlockNumber } to a
> path + offset, and I wonder what to think about the fact that some of
> them may be non-backend code...

Given the scope of this patch, it might be prudent to start a separate
thread for it. So far, this discussion has been burried within other
discussions and I want to ensure folks don't miss this.

Thanks.

--
Shawn Debnath
Amazon Web Services (AWS)



Re: POC: Cleaning up orphaned files using undo logs

Dilip Kumar-2
In reply to this post by Robert Haas
On Fri, Apr 19, 2019 at 10:13 PM Robert Haas <[hidden email]> wrote:

>
> On Fri, Apr 19, 2019 at 6:16 AM Dilip Kumar <[hidden email]> wrote:
> > Currently, undo branch[1] contain an older version of the (undo
> > interface + some fixup).  Now, I have merged the latest changes from
> > the zheap branch[2] to the undo branch[1]
> > which can be applied on top of the undo storage commit[3].  For
> > merging those changes, I need to add some changes to the undo log
> > storage patch as well for handling the multi log transaction.  So I
> > have attached two patches, 1) improvement to undo log storage 2)
> > complete undo interface patch which include 0006+0007 from undo
> > branch[1] + new changes on the zheap branch.
Thanks for the review, Robert.  Please find my reply inline.
>
> Some review comments:
>
> +#define AtAbort_ResetUndoBuffers() ResetUndoBuffers()
>
> I don't think this really belongs in xact.c.  Seems like we ought to
> declare it in the appropriate header file.  Perhaps we also ought to
> consider using a static inline function rather than a macro, although
> I guess it doesn't really matter.

Moved to undoinsert.h

>
> +void
> +SetCurrentUndoLocation(UndoRecPtr urec_ptr)
> +{
> + UndoLogControl *log = UndoLogGet(UndoRecPtrGetLogNo(urec_ptr), false);
> + UndoPersistence upersistence = log->meta.persistence;
>
Right.  This should not be part of this patch, so I have removed it.

>
> + * When the undorecord for a transaction gets inserted in the next log then we
>
> undo record
Changed
>
> + * insert a transaction header for the first record in the new log and update
> + * the transaction header with this new logs location.  We will also keep
>
> This appears to be nonsensical.  You're saying that you add a
> transaction header to the new log and then update it with its own
> location.  That can't be what you mean.
Actually, what I meant is that we update the transaction header that is
in the old log.  I have changed the comment.

>
> + * Incase, this is not the first record in new log (aka new log already
>
> "Incase," -> "If"
> "aka" -> "i.e."
>
Done

> Overall this whole paragraph is a bit hard to understand.
I have tried to improve it in the newer version.

> + * same transaction spans across multiple logs depending on which log is
>
> delete "across"
Fixed
>
> + * processed first by the discard worker.  If it processes the first log which
> + * contains the transactions first record, then it can get the last record
> + * of that transaction even if it is in different log and then processes all
> + * the undo records from last to first.  OTOH, if the next log get processed
>
> Not sure how that would work if the number of logs is >2.
> This whole paragraph is also hard to understand.
Actually, what I meant is that if the transaction is spread across
multiple logs, for example three logs (1, 2, 3), and the discard worker
checks log 1 first, then for an aborted transaction it will follow the
chain of undo headers, register a complete rollback request, and apply
all the undo actions in logs 1, 2 and 3 together.  Whereas if it
encounters log 2 first, it will register a request for the undo actions
in logs 2 and 3, and similarly if it encounters log 3 first it will
only process that log.  We have done it that way so that we can
maintain the order in which undo is applied.  However, there is the
possibility of always collecting all the undo and applying it together,
but for that we would need to add one more pointer to the transaction
header (pointing to the transaction's undo header in the previous log).
Maybe we could keep the next-log pointer in a separate header instead
of in the transaction header, so that it only occupies space on a log
switch.

I think this comment doesn't belong here; it is more related to undo
discard processing, so I have removed it.
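The log-ordering rule described above can be sketched as a toy
simulation (plain C; the chain layout, log numbers, and function names
are illustrative assumptions, not the actual discard-worker code):

```c
#include <assert.h>

#define NLOGS 3

/* next_log[i] simulates the pointer from the transaction's undo in
 * log i to its continuation in the next log; -1 means the chain ends.
 * Purely illustrative, not the patch's data structures. */
static const int next_log[NLOGS] = {1, 2, -1};

/* Starting from the first log the discard worker happens to encounter,
 * collect every log of the aborted transaction reachable from there by
 * following the chain.  Returns the number of logs collected. */
static int
collect_rollback_logs(int first, int out[NLOGS])
{
    int n = 0;
    int log;

    for (log = first; log != -1; log = next_log[log])
        out[n++] = log;
    return n;
}
```

So a worker that sees log 0 first registers one request covering all
three logs, while one that sees log 2 first covers only log 2, which is
how the apply order is preserved without a previous-log pointer.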

>
> +static UndoBuffers def_buffers[MAX_UNDO_BUFFERS];
> +static int buffer_idx;
>
> This is a file-global variable with a very generic name that is very
> similar to a local variable name used by multiple functions within the
> file (bufidx) and no comments.  Ugh.
>
Comment added.

> +UndoRecordIsValid(UndoLogControl * log, UndoRecPtr urp)
>
> The locking regime for this function is really confusing.  It requires
> that the caller hold discard_lock on entry, and on exit the lock will
> still be held if the return value is true but will no longer be held
> if the return value is false.  Yikes!  Anybody who reads code that
> uses this function is not going to guess that it has such strange
> behavior.  I'm not exactly sure how to redesign this, but I think it's
> not OK the way you have it. One option might be to inline the logic
> into each call site.
I think the simple solution is that inside the UndoRecordIsValid
function we can directly check UndoLogIsDiscarded if oldest_data is not
yet initialized; I don't think we need to release the discard lock for
that.  So I have changed it like that.

>
> +/*
> + * Overwrite the first undo record of the previous transaction to update its
> + * next pointer.  This will just insert the already prepared record by
> + * UndoRecordPrepareTransInfo.  This must be called under the critical section.
> + * This will just overwrite the undo header not the data.
> + */
> +static void
> +UndoRecordUpdateTransInfo(int idx)
>
> It bugs me that this function goes back in to reacquire the discard
> lock for the purpose of preventing a concurrent undo discard.
> Apparently, if the other transaction's undo has been discarded between
> the prepare phase and where we are now, we're OK with that and just
> exit without doing anything; otherwise, we update the previous
> transaction header.  But that seems wrong.  When we enter a critical
> section, I think we should aim to know exactly what modifications we
> are going to make within that critical section.
>
> I also wonder how the concurrent discard could really happen.  We must
> surely be holding exclusive locks on the relevant buffers -- can undo
> discard really discard undo when the relevant buffers are x-locked?
>
> It seems to me that remaining_bytes is a crock that should be ripped
> out entirely, both here and in InsertUndoRecord.  It seems to me that
> UndoRecordUpdateTransInfo can just contrive to set remaining_bytes
> correctly.  e.g.
>
> do
> {
>     // stuff
>     if (!BufferIsValid(buffer))
>     {
>         Assert(InRecovery);
>         already_written += (BLCKSZ - starting_byte);
>         done = (already_written >= undo_len);
>     }
>     else
>     {
>         page = BufferGetPage(buffer);
>         done = InsertUndoRecord(...);
>         MarkBufferDirty(buffer);
>     }
> } while (!done);
>
> InsertPreparedUndo needs similar treatment.
>
> To make this work, I guess the long string of assignments in
> InsertUndoRecord will need to be made unconditional, but that's
> probably pretty cheap.  As a fringe benefit, all of those work_blah
> variables that are currently referencing file-level globals can be
> replaced with something local to this function, which will surely
> please the coding style police.
This got fixed as part of the previous comment's fix, where we
introduced SkipInsertingUndoData and moved the globals into the
context.

>
> + * In recovery, 'xid' refers to the transaction id stored in WAL, otherwise,
> + * it refers to the top transaction id because undo log only stores mapping
> + * for the top most transactions.
> + */
> +UndoRecPtr
> +PrepareUndoInsert(UnpackedUndoRecord *urec, FullTransactionId fxid,
>
> xid vs fxid
>
> + urec->uur_xidepoch = EpochFromFullTransactionId(fxid);
>
> We need to make some decisions about how we're going to handle 64-bit
> XIDs vs. 32-bit XIDs in undo. This doesn't look like a well-considered
> scheme.  In general, PrepareUndoInsert expects the caller to have
> populated the UnpackedUndoRecord, but here, uur_xidepoch is getting
> overwritten with the high bits of the caller-specified XID.  The
> low-order bits aren't stored anywhere by this function, but the caller
> is presumably expected to have placed them inside urec->uur_xid.  And
> it looks like the low-order bits (urec->uur_xid) get stored for every
> undo record, but the high-order bits (urec->xidepoch) only get stored
> when we emit a transaction header.  This all seems very confusing.
Yeah, it is a bit confusing.  Actually, the discard worker processes
the transaction chain from one transaction header to the next, so we
need the epoch only in the first record of the transaction, and
currently we set all the header information inside PrepareUndoInsert.
The xid is stored by the caller, as the caller needs it for MVCC
purposes.  I think the caller can always set it; if a transaction
header gets added, then it will be stored, otherwise not.  So I think
we can remove setting it here.

>
> I would really like to see us replace uur_xid and uur_xidepoch with a
> FullTransactionId; now that we have that concept, it seems like bad
> practice to break what is really a FullTransactionId into two halves
> and store them separately.  However, it would be a bit unfortunate to
> store an additional 4 bytes of mostly-static data in every undo
> record.  What if we went the other way?   That is, remove urec_xid
> from UndoRecordHeader and from UnpackedUndoRecord.  Make it the
> responsibility of whoever is looking up an undo record to know which
> transaction's undo they are searching.  zheap, at least, generally
> does know this: if it's starting from a page, then it has the XID +
> epoch available from the transaction slot, and if it's starting from
> an undo record, you need to know the XID for which you are searching,
> I guess from uur_prevxid.
Right, from uur_prevxid we would know which xid's undo we are looking
for, but without uur_xid in the undo record itself, how would we know
which undo record was inserted by the xid we are looking for?  In
zheap, while following the undo chain, if the slot got switched there
is a possibility (because of slot reuse) that we might get some other
transaction's undo record for the same zheap tuple, but we want to
traverse back because we want to find the record inserted by
uur_prevxid.  So don't we need uur_xid as well, to tell who the
inserter of this undo record is?

>
> I also think that we need to remove uur_prevxid.  That field does not
> seem to be properly a general-purpose part of the undo machinery, but
> a zheap-specific consideration.  I think it's job is to tell you which
> transaction last modified the current tuple, but zheap can put that
> data in the payload if it likes.  It is a waste to store it in every
> undo record, because it's not needed if the older undo has already
> been discarded or if the operation is an insert.
Done
>
> + * Insert a previously-prepared undo record.  This will write the actual undo
>
> Looks like this now inserts all previously-prepared undo records
> (rather than just a single one).
Fixed
>
> + * in page.  We start writting immediately after the block header.
>
> Spelling.
Done

>
> + * Helper function for UndoFetchRecord.  It will fetch the undo record pointed
> + * by urp and unpack the record into urec.  This function will not release the
> + * pin on the buffer if complete record is fetched from one buffer, so caller
> + * can reuse the same urec to fetch the another undo record which is on the
> + * same block.  Caller will be responsible to release the buffer inside urec
> + * and set it to invalid if it wishes to fetch the record from another block.
> + */
> +UnpackedUndoRecord *
> +UndoGetOneRecord(UnpackedUndoRecord *urec, UndoRecPtr urp, RelFileNode rnode,
> + UndoPersistence persistence)
>
> I don't really understand why uur_buffer is part of an
> UnpackedUndoRecord.  It doesn't look like the other fields, which tell
> you about the contents of an undo record that you have created or that
> you have parsed.  Instead, it's some kind of scratch space for keeping
> track of a buffer that we're using in the process of reading an undo
> record.  It looks like it should be an argument to UndoGetOneRecord()
> and ResetUndoRecord().
>
> I also wonder whether it's really a good design to make the caller
> responsible for invalidating the buffer before accessing another
> block.  Maybe it would be simpler to have this function just check
> whether the buffer it's been given is the right one; if not, unpin it
> and pin the new one instead.  But I'm not really sure...
I am not sure what would be better here, but I thought that since the
caller has to release the last buffer anyway, why not make it
responsible for keeping track of the first buffer of the undo record?
The caller understands better that it needs to hold the first buffer of
the undo record, because it hopes that the previous undo record in the
chain might fall in the same buffer.
Maybe we can make the caller completely responsible for reading the
buffer for the first block of the undo record: it will always pass a
valid buffer, and UndoGetOneRecord only needs to read additional
buffers if the undo record is split, and it can release those right
there.  So the caller will always keep track of the first buffer where
the undo record starts, and whenever the undo record pointer changes it
will be responsible for changing the buffer.

> + /* If we already have a buffer pin then no need to allocate a new one. */
> + if (!BufferIsValid(buffer))
> + {
> + buffer = ReadBufferWithoutRelcache(SMGR_UNDO,
> +    rnode, UndoLogForkNum, cur_blk,
> +    RBM_NORMAL, NULL,
> +    RelPersistenceForUndoPersistence(persistence));
> +
> + urec->uur_buffer = buffer;
> + }
>
> I think you should move this code inside the loop that follows.  Then
> at the bottom of that loop, instead of making a similar call, just set
> buffer = InvalidBuffer.  Then when you loop around it'll do the right
> thing and you'll need less code.
Done

>
> Notice that having both the local variable buffer and the structure
> member urec->uur_buffer is actually making this code more complex. You
> are already setting urec->uur_buffer = InvalidBuffer when you do
> UnlockReleaseBuffer().  If you didn't have a separate 'buffer'
> variable you wouldn't need to clear them both here.  In fact I think
> what you should have is an argument Buffer *curbuf, or something like
> that, and no uur_buffer at all.
Done

>
> + /*
> + * If we have copied the data then release the buffer, otherwise, just
> + * unlock it.
> + */
> + if (is_undo_rec_split)
> + UnlockReleaseBuffer(buffer);
> + else
> + LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
>
> Ugh.  I think what's going on here is: UnpackUndoRecord copies the
> data if it switches buffers, but not otherwise.  So if the record is
> split, we can release the lock and pin, but otherwise we have to keep
> the pin to avoid having the data get clobbered.  But having this code
> know about what UnpackUndoRecord does internally seems like an
> abstraction violation.  It's also probably not right if we ever want
> to fetch undo records in bulk, as I see that the latest code in zheap
> master does.  I think we should switch UnpackUndoRecord over to always
> copying the data and just avoid all this.
Done

>
> (To make that cheaper, we may want to teach UnpackUndoRecord to store
> data into scratch space provided by the caller rather than using
> palloc to get its own space, but I'm not actually sure that's (a)
> worth it or (b) actually better.)
>
> [ Skipping over some things ]
>
> +bool
> +UnpackUndoRecord(UnpackedUndoRecord *uur, Page page, int starting_byte,
> + int *already_decoded, bool header_only)
>
> I think we should split this function into three functions that use a
> context object, call it say UnpackUndoContext.  The caller will do:
>
> BeginUnpackUndo(&ucontext);  // just once
> UnpackUndoData(&ucontext, page, starting_byte);  // any number of times
> FinishUnpackUndo(&uur, &ucontext);  // just once
>
> The undo context will store an enum value that tells us the "stage" of decoding:
>
> - UNDO_DECODE_STAGE_HEADER: We have not yet decoded even the record
> header; we need to do that next.
> - UNDO_DECODE_STAGE_RELATION_DETAILS: The next thing to be decoded is
> the relation details, if present.
> - UNDO_DECODE_STAGE_BLOCK: The next thing to be decoded is the block
> details, if present.
> - UNDO_DECODE_STAGE_TRANSACTION: The next thing to be decoded is the
> transaction details, if present.
> - UNDO_DECODE_STAGE_PAYLOAD: The next thing to be decoded is the
> payload details, if present.
> - UNDO_DECODE_STAGE_DONE: Decoding is complete.
>
> It will also store the number of bytes that have been already been
> copied as part of whichever stage is current.  A caller who wants only
> part of the record can stop when ucontext.stage > desired_stage; e.g.
> the current header_only flag corresponds to stopping when
> ucontext.stage > UNDO_DECODE_STAGE_HEADER, and the potential
> optimization mentioned in UndoGetOneRecord could be done by stopping
> when ucontext.stage > UNDO_DECODE_STAGE_BLOCK (although I don't know
> if that's worth doing).
>
> In this scheme, BeginUnpackUndo just needs to set the stage to
> UNDO_DECODE_STAGE_HEADER and the number of bytes copied to 0.  The
> context object contains all the work_blah things (no more global
> variables!), but BeginUnpackUndo does not need to initialize them,
> since they will be overwritten before they are examined.  And
> FinishUnpackUndo just needs to copy all of the fields from the
> work_blah things into the UnpackedUndoRecord.  The tricky part is
> UnpackUndoData itself, which I propose should look like a big switch
> where all branches fall through.  Roughly:
>
> switch (ucontext->stage)
> {
> case UNDO_DECODE_STAGE_HEADER:
> if (!ReadUndoBytes(...))
>   return;
> stage = UNDO_DECODE_STAGE_RELATION_DETAILS;
> /* FALLTHROUGH */
> case UNDO_DECODE_STAGE_RELATION_DETAILS:
> if ((uur->uur_info & UREC_INFO_RELATION_DETAILS) != 0)
> {
>   if (!ReadUndoBytes(...))
>     return;
> }
> stage = UNDO_DECODE_STAGE_BLOCK;
> /* FALLTHROUGH */
> etc.
>
> ReadUndoBytes would need some adjustments in this scheme; it wouldn't
> need my_bytes_read any more since it would only get called for
> structures that are not yet completely read.
Yeah, so we can jump directly to the header that is not yet completely
read, but if any header is partially read then we need to maintain some
kind of partial-read variable; otherwise, from 'already read' we
wouldn't be able to tell how many bytes of the header were read in the
last call, unless we calculate that from uur_info or maintain
partial_read in the context, as I have done in the new version.
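The stage-plus-partial_read idea can be shown with a self-contained toy
decoder (plain C; the two fixed-size "parts", the struct layout, and
the names are illustrative assumptions, not the patch's actual
UnpackUndoContext):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Toy decode context: 'stage' says which part of the record we are on,
 * and 'partial_read' says how many bytes of that part earlier calls
 * already copied.  Here a record is just a 4-byte header followed by a
 * 4-byte payload. */
typedef enum { STAGE_HEADER, STAGE_PAYLOAD, STAGE_DONE } Stage;

typedef struct
{
    Stage   stage;
    int     partial_read;
    uint8_t header[4];
    uint8_t payload[4];
} DecodeCtx;

/* Copy up to 'avail' input bytes into the current part; return the
 * number of bytes consumed.  A part may be split across any number of
 * calls, which is what partial_read tracks. */
static int
read_part(uint8_t *dst, int size, DecodeCtx *ctx,
          const uint8_t *in, int avail)
{
    int want = size - ctx->partial_read;
    int n = avail < want ? avail : want;

    memcpy(dst + ctx->partial_read, in, n);
    ctx->partial_read += n;
    if (ctx->partial_read == size)
    {
        ctx->partial_read = 0;  /* part complete: reset for next stage */
        ctx->stage++;
    }
    return n;
}

/* Feed one chunk of input ("one page's worth") to the decoder. */
static void
unpack_chunk(DecodeCtx *ctx, const uint8_t *in, int len)
{
    while (len > 0 && ctx->stage != STAGE_DONE)
    {
        int n = (ctx->stage == STAGE_HEADER)
            ? read_part(ctx->header, 4, ctx, in, len)
            : read_part(ctx->payload, 4, ctx, in, len);

        in += n;
        len -= n;
    }
}
```

The point is that a caller can hand the decoder arbitrary chunk
boundaries (as page splits do) and the context alone remembers where
decoding stopped, mid-part or between parts.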

> (Regardless of whether
> we adopt this idea, the nocopy flag to ReadUndoBytes appears to be
> unused and can be removed.)

Yup.

>
> We could use a similar context object for InsertUndoRecord.
> BeginInsertUndoRecord(&ucontext, &uur) would initialize all of the
> work_blah structures within the context object.  InsertUndoData will
> be a big switch.  Maybe no need for a "finish" function here.  There
> can also be a SkipInsertingUndoData function that can be called
> instead of InsertUndoData if the page is discarded.  I think this
> would be more elegant than what we've got now.

Done.

Note:
- I think the ucontext->stage values are the same for insert and
decode.  Can we declare just one enum and give it some generic name,
e.g. UNDO_PROCESS_STAGE_HEADER?
- In SkipInsertingUndoData I also have to go through all the stages, so
that if we find some valid block then the stage is right for inserting
the partial record.  Do you think I could have avoided that?

Apart from these changes, I have also included UndoRecordBulkFetch in
undoinsert.c.

I have tested this patch with my local test modules, which basically
insert, fetch and bulk-fetch multiple records and compare the contents.
My test patch is still not in good shape, so I will post the test
module later.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments:
0001-Enhance-multi-log-handling-in-undo-log.patch (10K)
0002-Add-prefetch-support-for-the-undo-log.patch (10K)
0003-Provide-interfaces-to-store-and-fetch-undo-records_v2.patch (113K)

Re: POC: Cleaning up orphaned files using undo logs

Robert Haas
On Thu, Apr 25, 2019 at 7:15 AM Dilip Kumar <[hidden email]> wrote:
> > +static UndoBuffers def_buffers[MAX_UNDO_BUFFERS];
> > +static int buffer_idx;
> >
> > This is a file-global variable with a very generic name that is very
> > similar to a local variable name used by multiple functions within the
> > file (bufidx) and no comments.  Ugh.
> >
> Comment added.

The variable name is still bad, and the comment isn't very helpful
either.  First, you can't tell by looking at the name that it has
anything to do with the undo_buffers variable, because undo_buffers
and buffer_idx are not obviously related names.  Second, it's not an
index; it's a count.  A count tells you how many of something you
have; an index tells you which one of those you are presently thinking
about.  Third, the undo_buffer array is itself poorly named, because
it's not an array of all the undo buffers in the world or anything
like that, but rather an array of undo buffers for some particular
purpose.  "static UndoBuffers *undo_buffer" is about as helpful as
"int integer" and I hope I don't need to explain why that isn't
usually a good thing to write.  Maybe prepared_undo_buffer for the
array and nprepared_undo_buffer for the count, or something like that.

> I think the simple solution will be that inside UndoRecordIsValid
> function we can directly check UndoLogIsDiscarded if oldest_data is
> not yet initialized, I think we don't need to release the discard lock
> for that.  So I have changed it like that.

Can we instead eliminate the special case?  It seems like the if
(log->oldest_data == InvalidUndoRecPtr) case will be taken very
rarely, so if it's buggy, we might not notice.

> Right, from uur_prevxid we would know that for which xid's undo we are
> looking for but without having uur_xid in undo record it self how we
> would know which undo record is inserted by the xid we are looking
> for?  Because in zheap while following the undo chain and if slot got
> switch, then there is possibility (because of slot reuse) that we
> might get some other transaction's undo record for the same zheap
> tuple, but we want to traverse back as we want to find the record
> inserted by uur_prevxid.  So we need uur_xid as well to tell who is
> inserter of this undo record?

It seems desirable to me to make this the caller's problem.  When we
are processing undo a transaction at a time, we'll know the XID
because it will be available from the transaction header.  If a system
like zheap maintains a pointer to an undo record someplace in the
middle of a transaction, it can also store the XID if it needs it.
The thing is, the zheap code almost works that way already.
Transaction slots within a page store both the undo record pointer and
the XID.  The only case where zheap doesn't store the undo record
pointer and the XID is when a slot switch occurs, but that could be
changed.

If we moved urec_xid into UndoRecordTransaction, we'd save 4 bytes per
undo record across the board.  When zheap emits undo records for
updates or deletes, they would need to store an UndoRecPtr (8 bytes) +
FullTransactionId (8 bytes) in the payload unless the previous change
to that TID is all-visible or the previous change to that TID was made
by the same transaction.  Also, zheap would no longer need to store
the slot number in the payload in any case, because this would
substitute for that (and permit more efficient lookups, to boot).  So
the overall impact on zheap update and delete records would be
somewhere between -4 bytes (when we save the space used by XID and
incur no other cost) and +12 bytes (when we lose the XID but gain the
UndoRecPtr + FullTransactionId).

That worst case could be further optimized.  For example, instead of
storing a FullTransactionId, zheap could store the difference between
the XID to which the current record pertains (which in this model the
caller is required to know) and the XID of whoever last modified the
tuple. That difference certainly can't exceed 4 billion (or even 2
billion) so 4 bytes is enough.  That reduces the worst-case difference
to +8 bytes.  Probably zheap could use payloads with some kind of
variable-length encoding and squeeze out even more in common cases,
but I'm not sure that's necessary or worth the complexity.
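The XID-difference idea above can be sketched in a few lines (a hedged
illustration only: the function names and the flat 64-bit
FullTransactionId model are assumptions for the example, not zheap's
actual code):

```c
#include <assert.h>
#include <stdint.h>

/* Model a FullTransactionId as a plain 64-bit value: epoch in the
 * high 32 bits, xid in the low 32 bits. */
typedef uint64_t FullXid;

/* Encode the previous modifier's XID relative to the XID the caller
 * already knows.  Because the difference can't exceed ~2 billion, it
 * always fits in 4 bytes, even when the two XIDs straddle an epoch
 * boundary. */
static uint32_t
encode_prev_xid(FullXid current, FullXid prev_modifier)
{
    assert(current >= prev_modifier);
    assert(current - prev_modifier < UINT32_MAX);
    return (uint32_t) (current - prev_modifier);
}

/* Reconstruct the full 64-bit XID from the known current XID and the
 * stored 4-byte delta. */
static FullXid
decode_prev_xid(FullXid current, uint32_t delta)
{
    return current - delta;
}
```

Storing this 4-byte delta instead of an 8-byte FullTransactionId is
what reduces the worst-case payload growth from +12 to +8 bytes in the
accounting above.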

Let's also give uur_blkprev its own UREC_INFO_* constant and omit it
when this is the first time we're touching this block in this
transaction and thus the value is InvalidUndoRecPtr.  In the
pretty-common case where a transaction updates one tuple on the page
and never comes back, this - together with the optimization in the
previous paragraph - will cause zheap to come out even on undo,
because it'll save 4 bytes by omitting urec_xid and 8 bytes by
omitting uur_blkprev, and it'll lose 8 bytes storing an UndoRecPtr in
the payload and 4 bytes storing an XID-difference.

Even with those changes, zheap's update and delete could still come
out a little behind on undo volume if hitting many tuples on the same
page, because for every tuple they hit after the first, we'll still
need the UndoRecPtr for the previous change to that page (uur_blkprev)
and we'll also have the UndoRecPtr extracted from the tuple's previous
slot, stored in the payload.  So we'll end up +8 bytes in this case.  I
think that's acceptable, because it often won't happen, it's hardly
catastrophic if it does, and saving 4 bytes on every insert, and on
every update or delete where the old undo is already discarded is
pretty sweet.

> Yeah so we can directly jump to the header which is not yet completely
> read but if any of the header is partially read then we need to
> maintain some kind of partial read variable otherwise from 'already
> read' we wouldn't be able to know how many bytes of the header got
> read in last call unless we calculate that from uur_info or maintain
> the partial_read in context like I have done in the new version.

Right, we need to know the bytes already read for the next header.

> Note:
> - I think the ucontext->stage are same for the insert and DECODE can
> we just declare only
>   one enum and give some generic name e.g. UNDO_PROCESS_STAGE_HEADER ?

I agree.  Maybe UNDO_PACK_STAGE_WHATEVER or, more briefly, UNDO_PACK_WHATEVER.

> - In SkipInsertingUndoData also I have to go through all the stages so
> that if we find some valid
> block then stage is right for inserting the partial record?  Do you
> think I could have avoided that?

Hmm, I didn't foresee that, but no, I don't think you can avoid that.
That problem wouldn't occur before we added the stage stuff, since
we'd just go through all the stages every time and each one would know
its own size and do nothing if that number of bytes had already been
passed, but with this design there seems to be no way around it.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: POC: Cleaning up orphaned files using undo logs

Robert Haas
Replying to myself to resend to the list, since my previous attempt
seems to have been eaten by a grue.

On Tue, Apr 30, 2019 at 11:14 AM Robert Haas <[hidden email]> wrote:

>
> On Tue, Apr 30, 2019 at 2:16 AM Dilip Kumar <[hidden email]> wrote:
> > Like previous version these patch set also applies on:
> > https://github.com/EnterpriseDB/zheap/tree/undo
> > (b397d96176879ed5b09cf7322b8d6f2edd8043a5)
>
> Some more review of 0003:
>
> There is a whitespace-only hunk in xact.c.
>
> It would be nice if AtAbort_ResetUndoBuffers didn't need to exist at
> all.  Then, this patch would make no changes whatsoever to xact.c.
> We'd still need such changes in other patches, because the whole idea
> of undo is tightly bound up with the concept of transactions.  Yet,
> this particular patch wouldn't touch that file, and that would be
> nice.  In general, the reason why we need AtCommit/AtAbort/AtEOXact
> callbacks is to adjust the values of global variables (or the data
> structures to which they point) at commit or abort time.  And that is
> also the case here.  The thing is, I'm not sure why we need these
> particular global variables.  Is there some way that we can get by
> without them? The obvious thing to do seems to be to invent yet
> another context object, allocated via a new function, which can then
> be passed to PrepareUndoInsert, InsertPreparedUndo,
> UndoLogBuffersSetLSN, UnlockReleaseUndoBuffers, etc.  This would
> obsolete UndoSetPrepareSize, since you'd instead pass the size to the
> context allocator function.
>
> UndoRecordUpdateTransInfo should declare a local variable
> XactUndoRecordInfo *something = &xact_urec_info[idx] and use that
> variable wherever possible.
>
> It should also probably use while (1) { ... } rather than do { ... }
> while (true).
>
> In UndoBufferGetSlot you could replace 'break;' with 'return i;' and
> then more than half the function would need one less level of
> indentation.  This function should also declare PreparedUndoBuffer
> *something and set that variable equal to &prepared_undo_buffers[i] at
> the top of the loop and again after the loop, and that variable should
> then be used whenever possible.
>
> UndoRecordRelationDetails seems to need renaming now that it's down to
> a single member.
>
> The comment for UndoRecordBlock needs updating, as it still refers to blkprev.
>
> The comment for UndoRecordBlockPrev refers to "Identifying
> information" but it's not identifying anything.  I think you could
> just delete "Identifying information for" from this sentence and not
> lose anything.  And I think several of the nearby comments that refer
> to "Identifying information" could more simply and more correctly just
> refer to "Information".
>
> I don't think that SizeOfUrecNext needs to exist.
>
> xact.h should not add an include for undolog.h.  There are no other
> changes to this header, so presumably the header does not depend on
> anything in undolog.h.  If .c files that include xact.h need
> undolog.h, then the header should be included in those files, not in
> the header itself.  That way, we avoid making partial recompiles more
> painful than necessary.
>
> UndoGetPrevUndoRecptr looks like a strange interface.  Why not just
> arrange not to call the function in the first place if prevurp is
> valid?
>
> Every use of palloc0 in this code should be audited to check whether
> it is really necessary to zero the memory before use.  If it will be
> initialized before it's used for anything anyway, it doesn't need to
> be pre-zeroed.
>
> FinishUnpackUndo looks nice!  But it is missing a blank line in one
> place, and 'if it presents' should be 'if it is present' in a whole
> bunch of places.
>
> BeginInsertUndo also looks to be missing a few blank lines.
>
> Still need to do some cleanup of the UnpackedUndoRecord, e.g. unifying
> uur_xid and uur_xidepoch into uur_fxid.
>
> InsertUndoData ends in an unnecessary return, as does SkipInsertingUndoData.
>
> I think the argument to SkipInsertingUndoData should be the number of
> bytes to skip, rather than the starting byte within the block.
>
> Function header comment formatting is not consistent.  Compare
> FinishUnpackUndo (function name recapitulated on first line of
> comment) to ReadUndoBytes (not recapitulated) to UnpackUndoData
> (entire header comment jammed into one paragraph with function name at
> start).  I prefer the style where the function name is not restated,
> but YMMV.  Anyway, it has to be consistent.
>
> UndoGetPrevRecordLen should not declare char *page instead of Page
> page, I think.
>
> UndoRecInfo looks a bit silly, I think.  Isn't index just the index of
> this entry in the array?  You can always figure that out by ptr -
> array_base. Instead of having UndoRecPtr urp in this structure, how
> about adding that to UnpackedUndoRecord?  When inserting, caller
> leaves it unset and InsertPreparedUndo sets it; when retrieving,
> UndoFetchRecord or UndoRecordBulkFetch sets it.  With those two
> changes, I think you can get rid of UndoRecInfo entirely and just
> return an array of UnpackedUndoRecords.  This might also eliminate the
> need for an 'urp' member in PreparedUndoSpace.
>
> I'd probably use UREC_INFO_BLKPREV rather than UREC_INFO_BLOCKPREV for
> consistency.
>
> Similarly UndoFetchRecord and UndoRecordBulkFetch seems a bit
> inconsistent.  Perhaps UndoBulkFetchRecord.
>
> In general, I find the code for updating transaction headers to be
> really hard to understand.  I'm not sure exactly what can be done
> about that.  Like, why is UndoRecordPrepareTransInfo unpacking undo?
> Why does it take two undo record pointers as arguments and how are
> they different?
>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company



--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: POC: Cleaning up orphaned files using undo logs

akapila
On Wed, May 1, 2019 at 6:02 AM Robert Haas <[hidden email]> wrote:

>
> Replying to myself to resend to the list, since my previous attempt
> seems to have been eaten by a grue.
>
> On Tue, Apr 30, 2019 at 11:14 AM Robert Haas <[hidden email]> wrote:
> >
> > On Tue, Apr 30, 2019 at 2:16 AM Dilip Kumar <[hidden email]> wrote:
> > > Like previous version these patch set also applies on:
> > > https://github.com/EnterpriseDB/zheap/tree/undo
> > > (b397d96176879ed5b09cf7322b8d6f2edd8043a5)
> >
> > Some more review of 0003:
> >

Another suggestion:

+/*
+ * Insert previously-prepared undo records.  This will write the actual undo
+ * record into the buffers already pinned and locked in PreparedUndoInsert,
+ * and mark them dirty.  This step should be performed after entering a
+ * critical section; it should never fail.
+ */
+void
+InsertPreparedUndo(void)
+{
..
..
+
+ /* Advance the insert pointer past this record. */
+ UndoLogAdvance(urp, size);
+ }
..
}

UndoLogAdvance internally takes an LWLock, which we don't recommend
doing inside a critical section; yet, as the comment above says, this
function is supposed to be invoked inside a critical section.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: POC: Cleaning up orphaned files using undo logs

akapila
In reply to this post by Dilip Kumar-2
Thomas told me offlist that this email of mine didn't hit
pgsql-hackers, so I am resending it.

On Mon, Apr 29, 2019 at 3:51 PM Amit Kapila <[hidden email]> wrote:

>
> On Fri, Apr 19, 2019 at 3:46 PM Dilip Kumar <[hidden email]> wrote:
> >
> > On Tue, Mar 12, 2019 at 6:51 PM Thomas Munro <[hidden email]> wrote:
> > >
> >
> > Currently, the undo branch[1] contains an older version of the (undo
> > interface + some fixups).  Now, I have merged the latest changes from
> > the zheap branch[2] into the undo branch[1],
> > which can be applied on top of the undo storage commit[3].  For
> > merging those changes, I needed to add some changes to the undo log
> > storage patch as well, to handle multi-log transactions.  So I
> > have attached two patches: 1) an improvement to undo log storage, and
> > 2) the complete undo interface patch, which includes 0006+0007 from
> > the undo branch[1] + new changes on the zheap branch.
> >
> > [1] https://github.com/EnterpriseDB/zheap/tree/undo
> > [2] https://github.com/EnterpriseDB/zheap
> > [3] https://github.com/EnterpriseDB/zheap/tree/undo
> > (b397d96176879ed5b09cf7322b8d6f2edd8043a5)
> >
> > []
>
> Dilip has posted the patch for "undo record interface", next in series
> is a patch that handles transaction rollbacks (the machinery to
> perform undo actions) and background workers to manage undo.
>
> Transaction Rollbacks
> ----------------------------------
> We always perform rollback actions after cleaning up the current
> (sub)transaction.  This ensures that we perform the actions
> immediately after an error (and release the locks) rather than when
> the user issues a Rollback command at some later point in time.  We
> release the locks only after the undo actions are applied.  The
> reason to delay lock release is that if we released locks before
> applying undo actions, a parallel session could acquire the lock
> before us, which could lead to a deadlock.
>
> We promote the error to a FATAL error if it occurs while applying
> undo for a subtransaction.  The reason we can't proceed without
> applying the subtransaction's undo is that the modifications made in
> that case must not be visible even if the main transaction commits.
> Normally, the backend that receives the request to perform Rollback
> (To Savepoint) applies the undo actions itself, but there are cases
> where it is preferable to push the requests to background workers.
> The main reasons to push requests to background workers are: (a) a
> request for a very large rollback; handling it in the background
> allows us to return control to the user quickly.  There is a GUC,
> rollback_overflow_size, which indicates that rollbacks larger than
> the configured size are performed lazily by background workers.
> (b) If there is an error while applying the undo actions, we push
> such a request to background workers.
>
> Undo Requests and Undo workers
> --------------------------------------------------
> To improve the efficiency of rollbacks, we create three queues and a
> hash table for the rollback requests.  An Xid-based priority queue
> allows us to process the requests of older transactions first and
> helps us move oldestXidHavingUndo (an xid-horizon below which all
> transactions are visible) forward.  A size-based queue helps us
> perform the rollbacks of larger aborts in a timely fashion, so that
> we don't get stuck on them while discarding the logs.  An error queue
> holds the requests for transactions whose undo failed to apply.  The
> rollback hash table is used to avoid duplicate undo requests from
> backends and the discard worker.
>
> The undo launcher is responsible for launching workers iff there is
> some work available in one of the work queues and more workers are
> available.  A worker is launched to handle requests for a particular
> database.  Each undo worker then starts reading requests for that
> particular database from one of the queues.  A worker peeks into each
> queue for requests from a particular database if it would need to
> switch databases within undo_worker_quantum ms (10s by default) of
> starting.  Also, if there is no work, it lingers for
> UNDO_WORKER_LINGER_MS (10s by default).  This avoids restarting the
> workers too frequently.
>
> The discard worker is responsible for discarding the undo log of
> transactions that are committed and all-visible or are rolled-back.
> It also registers the request for aborted transactions in the work
> queues.  It iterates through all the active logs one-by-one and tries
> to discard the transactions that are old enough to matter.
>
> The details of how all of this works are described in
> src/backend/access/undo/README.UndoProcessing.  The main idea behind
> keeping a README is to help reviewers understand this patch; later we
> can decide which parts of it to move into code comments and which
> into the main undo README.
>
> Question:  Currently, TwoPhaseFileHeader stores just TransactionId, so
> for the systems (like zheap) that support FullTransactionId, the
> two-phase transaction will be tricky to support as we need
> FullTransactionId during rollbacks.  Is it a good idea to store
> FullTransactionId in TwoPhaseFileHeader?
>
> Credits:
> --------------
> Designed by: Andres Freund, Amit Kapila, Robert Haas, and Thomas Munro
> Author: Amit Kapila, Dilip Kumar, Kuntal Ghosh, and Thomas Munro
>
> This patch is based on the latest Dilip's patch for undo record
> interface.  The branch can be accessed at
> https://github.com/EnterpriseDB/zheap/tree/undoprocessing
>
> Inputs on design/code are welcome.
>
> --
> With Regards,
> Amit Kapila.
> EnterpriseDB: http://www.enterprisedb.com


--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachment: undoprocessing_1.patch (265K)

Re: POC: Cleaning up orphaned files using undo logs

Dilip Kumar-2
In reply to this post by Robert Haas
On Tue, Apr 30, 2019 at 11:45 AM Dilip Kumar <[hidden email]> wrote:

The attached patch provides a mechanism for masking the necessary
bits in undo pages, to support consistency checking of undo pages.
Ideally we could merge this patch with the main interface patch, but
currently I have kept it separate, mainly because a) it is still a WIP
patch and b) it keeps review of the changes easy.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachment: 0005-undo-page-consistency-checker_WIP_v3.patch (15K)

Re: POC: Cleaning up orphaned files using undo logs

Dilip Kumar-2
In reply to this post by Robert Haas
On Tue, Apr 30, 2019 at 8:44 PM Robert Haas <[hidden email]> wrote:

> UndoRecInfo looks a bit silly, I think.  Isn't index just the index of
> this entry in the array?  You can always figure that out by ptr -
> array_base. Instead of having UndoRecPtr urp in this structure, how
> about adding that to UnpackedUndoRecord?  When inserting, caller
> leaves it unset and InsertPreparedUndo sets it; when retrieving,
> UndoFetchRecord or UndoRecordBulkFetch sets it.  With those two
> changes, I think you can get rid of UndoRecInfo entirely and just
> return an array of UnpackedUndoRecords.  This might also eliminate the
> need for an 'urp' member in PreparedUndoSpace.
>
Yeah, at least in this patch it looks silly.  Actually, I added that
index to make the qsort stable when execute_undo_action sorts the
records in block order.  But this patch doesn't include that
processing, so we can either remove the structure completely as you
suggested (in which case the undo processing patch will have to add it
back) or just add a comment explaining why we added this index field.

I am OK with the other comments and will work on them.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: POC: Cleaning up orphaned files using undo logs

Robert Haas
On Thu, May 2, 2019 at 5:32 AM Dilip Kumar <[hidden email]> wrote:
> Yeah, at least in this patch it looks silly.  Actually, I added that
> index to make the qsort stable when execute_undo_action sorts the
> records in block order.  But this patch doesn't include that
> processing, so we can either remove the structure completely as you
> suggested (in which case the undo processing patch will have to add
> it back) or just add a comment explaining why we added this index
> field.

Well, the qsort comparator could compute the index as ptr - array_base
just like any other code, couldn't it?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

