Undo logs


Undo logs

Thomas Munro-3
Hello hackers,

As announced elsewhere[1][2][3], at EnterpriseDB we are working on a
proposal to add in-place updates with undo logs to PostgreSQL.  The
goal is to improve performance and resource usage by recycling space
better.

The lowest level piece of this work is a physical undo log manager,
which I've personally been working on.  Later patches will build on
top, adding record-oriented access and then the main "zheap" access
manager and related infrastructure.  My colleagues will write about
those.

The README files[4][5] explain in more detail, but here is a
bullet-point description of what the attached patch set gives you:

1.  Efficient appending of new undo data from many concurrent
backends.  Like logs.
2.  Efficient discarding of old undo data that isn't needed anymore.
Like queues.
3.  Efficient buffered random reading of undo data.  Like relations.

A test module is provided that can be used to exercise the undo log
code paths without needing any of the later zheap patches.

This is work in progress.  A few aspects are under active development
and liable to change, as indicated by comments, and there are no doubt
bugs and room for improvement.  The code is also available at
github.com/EnterpriseDB/zheap (these patches are from the
undo-log-storage branch; see also the master branch, which has the full
zheap feature).  We'd be grateful for any questions, feedback or
ideas.

[1] https://amitkapila16.blogspot.com/2018/03/zheap-storage-engine-to-provide-better.html
[2] https://rhaas.blogspot.com/2018/01/do-or-undo-there-is-no-vacuum.html
[3] https://www.pgcon.org/2018/schedule/events/1190.en.html
[4] https://github.com/EnterpriseDB/zheap/tree/undo-log-storage/src/backend/access/undo
[5] https://github.com/EnterpriseDB/zheap/tree/undo-log-storage/src/backend/storage/smgr

--
Thomas Munro
http://www.enterprisedb.com

0001-Add-undo-log-manager-v1.patch (148K)
0002-Provide-access-to-undo-log-data-via-the-buffer-ma-v1.patch (51K)
0003-Add-developer-documentation-for-the-undo-log-stor-v1.patch (16K)
0004-Add-tests-for-the-undo-log-manager-v1.patch (56K)
0005-Add-user-facing-documentation-for-undo-logs-v1.patch (14K)

Re: Undo logs

James Sewell
Exciting stuff! Really looking forward to having a play with this.

James Sewell,
Chief Architect



Suite 112, Jones Bay Wharf, 26-32 Pirrama Road, Pyrmont NSW 2009
(+61) 2 8099 9000  W www.jirotech.com  (+61) 2 8099 9099


Re: Undo logs

Simon Riggs
In reply to this post by Thomas Munro-3
On 24 May 2018 at 23:22, Thomas Munro <[hidden email]> wrote:

> As announced elsewhere[1][2][3], at EnterpriseDB we are working on a
> proposal to add in-place updates with undo logs to PostgreSQL.  The
> goal is to improve performance and resource usage by recycling space
> better.

Cool

> The lowest level piece of this work is a physical undo log manager,

> 1.  Efficient appending of new undo data from many concurrent
> backends.  Like logs.
> 2.  Efficient discarding of old undo data that isn't needed anymore.
> Like queues.
> 3.  Efficient buffered random reading of undo data.  Like relations.

Like an SLRU?

> [4] https://github.com/EnterpriseDB/zheap/tree/undo-log-storage/src/backend/access/undo
> [5] https://github.com/EnterpriseDB/zheap/tree/undo-log-storage/src/backend/storage/smgr

I think there are quite a few design decisions there that need to be
discussed, so let's crack on and discuss them please.

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Undo logs

Thomas Munro-3
Hi Simon,

On Mon, May 28, 2018 at 11:40 PM, Simon Riggs <[hidden email]> wrote:

> On 24 May 2018 at 23:22, Thomas Munro <[hidden email]> wrote:
>> The lowest level piece of this work is a physical undo log manager,
>
>> 1.  Efficient appending of new undo data from many concurrent
>> backends.  Like logs.
>> 2.  Efficient discarding of old undo data that isn't needed anymore.
>> Like queues.
>> 3.  Efficient buffered random reading of undo data.  Like relations.
>
> Like an SLRU?

Yes, but with some differences (a rough sketch of the record pointer
layout follows this list):

1.  There is a variable number of undo logs.  Each one corresponds to
a range of the 64 bit address space, and has its own head and tail
pointers, so that concurrent writers don't contend for buffers when
appending data.  (Unlike SLRUs which are statically defined, one for
clog.c, one for commit_ts.c, ...).
2.  Undo logs use regular buffers instead of having their own mini
buffer pool, ad hoc search and reclamation algorithm etc.
3.  Undo logs support temporary, unlogged and permanent storage (=
local buffers and reset-on-crash-restart, for undo data relating to
relations of those persistence levels).
4.  Undo log storage files are preallocated (rather than being
extended block by block), and the oldest file is renamed to become the
newest file in common cases, like WAL.
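
To illustrate point 1, here is a sketch of how a 64-bit undo record
pointer can be split into an undo log number plus a byte offset within
that log.  The names and the exact bit split below are assumptions for
illustration only (the branch's real accessors are UndoRecPtrGetLogNo()
and friends); 40 offset bits match the 1TB-per-slot figure mentioned
later in this thread, and uint64/UINT64CONST come from c.h.

/*
 * Sketch only -- names and bit split are assumptions, not the patch's
 * definitions.  40 offset bits = 1TB addressable per undo log; the
 * remaining 24 bits hold the undo log number.
 */
typedef uint64 SketchUndoRecPtr;

#define SKETCH_UNDO_OFFSET_BITS 40
#define SKETCH_UNDO_OFFSET_MASK \
	((UINT64CONST(1) << SKETCH_UNDO_OFFSET_BITS) - 1)

#define SketchUndoRecPtrGetLogNo(urp)  ((urp) >> SKETCH_UNDO_OFFSET_BITS)
#define SketchUndoRecPtrGetOffset(urp) ((urp) & SKETCH_UNDO_OFFSET_MASK)
#define SketchMakeUndoRecPtr(logno, offset) \
	(((SketchUndoRecPtr) (logno) << SKETCH_UNDO_OFFSET_BITS) | (offset))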

>> [4] https://github.com/EnterpriseDB/zheap/tree/undo-log-storage/src/backend/access/undo
>> [5] https://github.com/EnterpriseDB/zheap/tree/undo-log-storage/src/backend/storage/smgr
>
> I think there are quite a few design decisions there that need to be
> discussed, so lets crack on and discuss them please.

What do you think about using the main buffer pool?

Best case: pgbench type workload, discard pointer following closely
behind insert pointer, we never write anything out to disk (except for
checkpoints when we write a few pages), never advance the buffer pool
clock hand, and we use and constantly recycle 1-2 pages per connection
via the free list (as can be seen by monitoring insert - discard in
the pg_stat_undo_logs view).

Worst case: someone opens a snapshot and goes out to lunch so we can't
discard old undo data, and then we start to compete with other stuff
for buffers, and we hope the buffer reclamation algorithm is good at
its job (or can be improved).

I just talked about this proposal at a pgcon unconference session.
Here's some of the feedback I got:

1.  Jeff Davis pointed out that I'm probably wrong about not needing
FPI, and there must at least be checksum problems with torn pages.  He
also gave me an idea on how to fix that very cheaply, and I'm still
processing that feedback.
2.  Andres Freund thought it seemed OK if we have smgr.c routing to
md.c for relations and undofile.c for undo, but if we're going to
generalise this technique to put other things into shared buffers
eventually too (like the SLRUs, as proposed by Shawn Debnath in
another unconf session) then it might be worth investigating how to
get md.c to handle all of their needs.  They'd all just use fd.c
files, after all, so it'd be weird if we had to maintain several
different similar things.
3.  Andres also suggested that high frequency free page list access
might be quite contended in the "best case" described above.  I'll look
into that.
4.  Someone said that segment sizes probably shouldn't be hard coded
(cf WAL experience).

I also learned in other sessions that there are other access managers
in development that need undo logs.  I'm hoping to find out more about
that.

--
Thomas Munro
http://www.enterprisedb.com


Re: Undo logs

Dilip Kumar-2
Hello hackers,

As Thomas has already mentioned upthread, we are working on undo-log
based storage, and he has posted the patch sets for the lowest layer,
called undo-log-storage.

This is the next layer, which sits on top of the undo log storage and
provides an interface to prepare, insert, and fetch undo records.  This
layer uses undo-log-storage to reserve space for the undo records, and
the buffer management routines to write and read them.

To prepare an undo record, the interface first allocates the required
space using the undo_log_storage module.  Next, it pins and locks the
required buffers and returns an undo record pointer where the record
will be inserted.  Finally, the caller invokes the insert routine for
the final insertion of the prepared record.  Additionally, there is a
multi-insert mechanism, wherein multiple records are prepared and
inserted at a time.

To fetch an undo record, a caller must provide a valid undo record
pointer.  Optionally, the caller can provide a callback function along
with the block and offset information, which helps in faster retrieval
of the undo record; otherwise, the undo chain has to be traversed.
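
For readers following along, here is a rough caller-side sketch of the
prepare/insert flow described above, pieced together from the function
signatures quoted later in this thread (PrepareUndoInsert(),
InsertPreparedUndo()).  How the UnpackedUndoRecord fields are filled in
is elided, the UNDO_PERMANENT constant name is an assumption, and the
caller's own WAL logging is only hinted at.

UnpackedUndoRecord undorecord;
UndoRecPtr	urecptr;

/* ... fill in undorecord with the payload for this modification ... */

/*
 * Reserve undo log space and pin/lock the needed undo buffers.  This can
 * fail (e.g. out of space), so it must happen before the critical section.
 */
urecptr = PrepareUndoInsert(&undorecord, UNDO_PERMANENT,
							GetTopTransactionId());

START_CRIT_SECTION();

/* Write the prepared record into the locked buffers; must not fail. */
InsertPreparedUndo();

/* ... XLogInsert() the WAL record describing the caller's own change ... */

END_CRIT_SECTION();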

These patch sets will apply on top of the undo-log-storage branch [1],
commit id fa3803a048955c4961581e8757fe7263a98fe6e6.

[1] https://github.com/EnterpriseDB/zheap/tree/undo-log-storage/


undo_interface_v1.patch is the main patch for providing the undo interface.
undo_interface_test_v1.patch is a simple test module to test the undo
interface layer.


On Thu, May 31, 2018 at 4:27 AM, Thomas Munro
<[hidden email]> wrote:

> Hi Simon,
>
> On Mon, May 28, 2018 at 11:40 PM, Simon Riggs <[hidden email]> wrote:
>> On 24 May 2018 at 23:22, Thomas Munro <[hidden email]> wrote:
>>> The lowest level piece of this work is a physical undo log manager,
>>
>>> 1.  Efficient appending of new undo data from many concurrent
>>> backends.  Like logs.
>>> 2.  Efficient discarding of old undo data that isn't needed anymore.
>>> Like queues.
>>> 3.  Efficient buffered random reading of undo data.  Like relations.
>>
>> Like an SLRU?
>
> Yes, but with some difference:
>
> 1.  There is a variable number of undo logs.  Each one corresponds to
> a range of the 64 bit address space, and has its own head and tail
> pointers, so that concurrent writers don't contend for buffers when
> appending data.  (Unlike SLRUs which are statically defined, one for
> clog.c, one for commit_ts.c, ...).
> 2.  Undo logs use regular buffers instead of having their own mini
> buffer pool, ad hoc search and reclamation algorithm etc.
> 3.  Undo logs support temporary, unlogged and permanent storage (=
> local buffers and reset-on-crash-restart, for undo data relating to
> relations of those persistence levels).
> 4.  Undo logs storage files are preallocated (rather than being
> extended block by block), and the oldest file is renamed to become the
> newest file in common cases, like WAL.
>
>>> [4] https://github.com/EnterpriseDB/zheap/tree/undo-log-storage/src/backend/access/undo
>>> [5] https://github.com/EnterpriseDB/zheap/tree/undo-log-storage/src/backend/storage/smgr
>>
>> I think there are quite a few design decisions there that need to be
>> discussed, so lets crack on and discuss them please.
>
> What do you think about using the main buffer pool?
>
> Best case: pgbench type workload, discard pointer following closely
> behind insert pointer, we never write anything out to disk (except for
> checkpoints when we write a few pages), never advance the buffer pool
> clock hand, and we use and constantly recycle 1-2 pages per connection
> via the free list (as can be seen by monitoring insert - discard in
> the pg_stat_undo_logs view).
>
> Worst case: someone opens a snapshot and goes out to lunch so we can't
> discard old undo data, and then we start to compete with other stuff
> for buffers, and we hope the buffer reclamation algorithm is good at
> its job (or can be improved).
>
> I just talked about this proposal at a pgcon unconference session.
> Here's some of the feedback I got:
>
> 1.  Jeff Davis pointed out that I'm probably wrong about not needing
> FPI, and there must at least be checksum problems with torn pages.  He
> also gave me an idea on how to fix that very cheaply, and I'm still
> processing that feedback.
> 2.  Andres Freund thought it seemed OK if we have smgr.c routing to
> md.c for relations and undofile.c for undo, but if we're going to
> generalise this technique to put other things into shared buffers
> eventually too (like the SLRUs, as proposed by Shawn Debnath in
> another unconf session) then it might be worth investigating how to
> get md.c to handle all of their needs.  They'd all just use fd.c
> files, after all, so it'd be weird if we had to maintain several
> different similar things.
> 3.  Andres also suggested that high frequency free page list access
> might be quite contended in the "best case" described above.  I'll look
> into that.
> 4.  Someone said that segment sizes probably shouldn't be hard coded
> (cf WAL experience).
>
> I also learned in other sessions that there are other access managers
> in development that need undo logs.  I'm hoping to find out more about
> that.
>
> --
> Thomas Munro
> http://www.enterprisedb.com
>


--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

undo_interface_v1.patch (88K)
undo_interface_test_v1.patch (7K)

Re: Undo logs

Dilip Kumar-3
On Fri, Aug 31, 2018 at 3:08 PM, Dilip Kumar <[hidden email]> wrote:

> Hello hackers,
>
> As Thomas has already mentioned upthread that we are working on an
> undo-log based storage and he has posted the patch sets for the lowest
> layer called undo-log-storage.
>
> This is the next layer which sits on top of the undo log storage,
> which will provide an interface for prepare, insert, or fetch the undo
> records. This layer will use undo-log-storage to reserve the space for
> the undo records and buffer management routine to write and read the
> undo records.
>
> To prepare an undo record, first, it will allocate required space
> using undo_log_storage module. Next, it will pin and lock the required
> buffers and return an undo record pointer where it will insert the
> record.  Finally, it calls the Insert routine for final insertion of
> prepared record. Additionally, there is a mechanism for multi-insert,
> wherein multiple records are prepared and inserted at a time.
>
> To fetch an undo record, a caller must provide a valid undo record
> pointer. Optionally, the caller can provide a callback function with
> the information of the block and offset, which will help in faster
> retrieval of undo record, otherwise, it has to traverse the undo-chain.
>
> These patch sets will apply on top of the undo-log-storage branch [1],
> commit id fa3803a048955c4961581e8757fe7263a98fe6e6.
>
> [1] https://github.com/EnterpriseDB/zheap/tree/undo-log-storage/
>
>
> undo_interface_v1.patch is the main patch for providing the undo interface.
> undo_interface_test_v1.patch is a simple test module to test the undo
> interface layer.
>

Thanks to Robert Haas for designing an early prototype for forming
undo records.  Later, I completed the remaining parts of the code,
including undo record prepare, insert, fetch and other related APIs,
with the help of Rafia Sabih.  Thanks to Amit Kapila for providing
valuable design inputs.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


Re: Undo logs

Thomas Munro-3
On Fri, Aug 31, 2018 at 10:24 PM Dilip Kumar
<[hidden email]> wrote:

> On Fri, Aug 31, 2018 at 3:08 PM, Dilip Kumar <[hidden email]> wrote:
> > As Thomas has already mentioned upthread that we are working on an
> > undo-log based storage and he has posted the patch sets for the lowest
> > layer called undo-log-storage.
> >
> > This is the next layer which sits on top of the undo log storage,
> > which will provide an interface for prepare, insert, or fetch the undo
> > records. This layer will use undo-log-storage to reserve the space for
> > the undo records and buffer management routine to write and read the
> > undo records.

I have also pushed a new WIP version of the lower level undo log
storage layer patch set to a public branch[1].  I'll leave the earlier
branch[2] there because the record-level patch posted by Dilip depends
on it for now.

The changes are mostly internal: it doesn't use DSM segments any more.
Originally I wanted to use DSM because I didn't want arbitrary limits,
but in fact DSM slots can run out in unpredictable ways, and unlike
parallel query the undo log subsystem doesn't have a plan B for when
it can't get the space it needs due to concurrent queries.  Instead,
this version uses a pool of size 4 * max_connections, fixed at startup
in regular shared memory.  This creates an arbitrary limit on
transaction size, but the limit is large at 1TB per slot, can be increased,
doesn't disappear unpredictably, is easy to monitor
(pg_stat_undo_logs), and is probably a useful brake on a system in
trouble.

More soon.

[1] https://github.com/EnterpriseDB/zheap/tree/undo-log-storage-v2
[2] https://github.com/EnterpriseDB/zheap/tree/undo-log-storage

--
Thomas Munro
http://www.enterprisedb.com


Re: Undo logs

Dilip Kumar-2
On Sun, Sep 2, 2018 at 12:18 AM, Thomas Munro
<[hidden email]> wrote:

> On Fri, Aug 31, 2018 at 10:24 PM Dilip Kumar
> <[hidden email]> wrote:
>> On Fri, Aug 31, 2018 at 3:08 PM, Dilip Kumar <[hidden email]> wrote:
>> > As Thomas has already mentioned upthread that we are working on an
>> > undo-log based storage and he has posted the patch sets for the lowest
>> > layer called undo-log-storage.
>> >
>> > This is the next layer which sits on top of the undo log storage,
>> > which will provide an interface for prepare, insert, or fetch the undo
>> > records. This layer will use undo-log-storage to reserve the space for
>> > the undo records and buffer management routine to write and read the
>> > undo records.
>
> I have also pushed a new WIP version of the lower level undo log
> storage layer patch set to a public branch[1].  I'll leave the earlier
> branch[2] there because the record-level patch posted by Dilip depends
> on it for now.
Rebased undo_interface patches on top of the new branch of undo-log-storage[1].

[1] https://github.com/EnterpriseDB/zheap/tree/undo-log-storage-v2

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

undo_interface_v2.patch (88K)
undo_interface_test_v2.patch (7K)

Re: Undo logs

akapila
In reply to this post by Thomas Munro-3
On Sun, Sep 2, 2018 at 12:19 AM Thomas Munro
<[hidden email]> wrote:

>
> On Fri, Aug 31, 2018 at 10:24 PM Dilip Kumar
> <[hidden email]> wrote:
> > On Fri, Aug 31, 2018 at 3:08 PM, Dilip Kumar <[hidden email]> wrote:
> > > As Thomas has already mentioned upthread that we are working on an
> > > undo-log based storage and he has posted the patch sets for the lowest
> > > layer called undo-log-storage.
> > >
> > > This is the next layer which sits on top of the undo log storage,
> > > which will provide an interface for prepare, insert, or fetch the undo
> > > records. This layer will use undo-log-storage to reserve the space for
> > > the undo records and buffer management routine to write and read the
> > > undo records.
>
> I have also pushed a new WIP version of the lower level undo log
> storage layer patch set to a public branch[1].  I'll leave the earlier
> branch[2] there because the record-level patch posted by Dilip depends
> on it for now.
>

I have started reading the patch and have a few assorted comments
which are mentioned below.  I have been involved in the high-level
design of this module and I have also given some suggestions
during development, but this is mainly Thomas's work with some help
from Dilip.  It would be good if other members of the community also
review the design or participate in the discussion.

Comments
------------------
undo/README
-----------------------
1.
+The undo log subsystem provides a way to store data that is needed for
+a limited time.  Undo data is generated whenever zheap relations are
+modified, but it is only useful until (1) the generating transaction
+is committed or rolled back and (2) there is no snapshot that might
+need it for MVCC purposes.

I think snapshots need it for MVCC purposes, and we need it until the
transaction is committed and all-visible.

2.
+* the tablespace that holds its segment files
+* the persistence level (permanent, unlogged, temporary)
+* the "discard" pointer; data before this point has been discarded
+* the "insert" pointer: new data will be written here
+* the "end" pointer: a new undo segment file will be needed at this point
+
+The three pointers discard, insert and end move strictly forwards
+until the whole undo log has been exhausted.  At all times discard <=
+insert <= end.  When discard == insert, the undo log is empty
+(everything that has ever been inserted has since been discarded).
+The insert pointer advances when regular backends allocate new space,
+and the discard pointer usually advances when an undo worker process
+determines that no session could need the data either for rollback or
+for finding old versions of tuples to satisfy a snapshot.  In some
+special cases including single-user mode and temporary undo logs the
+discard pointer might also be advanced synchronously by a foreground
+session.

Here, the use of the insert and discard pointers is explained nicely.  Can
you elaborate on the usage of the end pointer as well?

3.
+UndoLogControl objects corresponding to the current set of active undo
+logs are held in a fixed-sized pool in shared memory.  The size of
+the array is a multiple of max_connections, and limits the total size of
+transactions.

Here, it is mentioned that the array is a multiple of max_connections, but
the code uses MaxBackends.  Can you sync them?

4.
+The meta-data for all undo logs is written to disk at every
+checkpoint.  It is stored in files under PGDATA/pg_undo/, using the
+checkpoint's redo point (a WAL LSN) as its filename.  At startup time,
+the redo point's file can be used to restore all undo logs' meta-data
+as of the moment of the redo point into shared memory.  Changes to the
+discard pointer and end pointer are WAL-logged by undolog.c and will
+bring the in-memory meta-data up to date in the event of recovery
+after a crash.  Changes to insert pointers are included in other WAL
+records (see below).

I see one inconvenience in using the checkpoint's redo point for the meta
file name: what if someone uses pg_resetxlog to truncate the redo?  Is
there any reason we can't use a different name for the meta file?

5.
+stabilize on one undo log per active writing backend (or more if
+different tablespaces are persistence levels are used).

/tablespaces are persistence levels/tablespaces and persistence levels

I think due to the above design, we can now reach the maximum number
of undo logs quickly, as the patch now uses fixed shared memory to
represent them.  I am not sure if there is an easy way to avoid that.
Can we try to expose a GUC for the maximum number of undo slots, so that
instead of MaxBackends * 4 it could be MaxBackends * <new_guc>?
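
To make the suggestion concrete, a guc.c entry along the following lines
is what I have in mind; the GUC name, variable, and bounds here are
purely hypothetical sketches, not something from the patch:

/* Hypothetical sketch only; name, variable and limits are invented. */
{
	{"undo_log_slots_per_backend", PGC_POSTMASTER, RESOURCES_MEM,
		gettext_noop("Multiplier applied to max_connections to size the "
					 "undo log slot array."),
		NULL
	},
	&undo_log_slots_per_backend,
	4, 1, 1024,
	NULL, NULL, NULL
},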

undo-log-manager patch
------------------------------------
6.
@@ -127,6 +128,7 @@ CreateSharedMemoryAndSemaphores(bool makePrivate, int port)
  size = add_size(size, ProcGlobalShmemSize());
  size = add_size(size, XLOGShmemSize());
  size = add_size(size, CLOGShmemSize());
+ size = add_size(size, UndoLogShmemSize());
  size = add_size(size, CommitTsShmemSize());
  size = add_size(size, SUBTRANSShmemSize());
  size = add_size(size, TwoPhaseShmemSize());
@@ -219,6 +221,7 @@ CreateSharedMemoryAndSemaphores(bool makePrivate, int port)
  */
  XLOGShmemInit();
  CLOGShmemInit();
+ UndoLogShmemInit();

It seems to me that we will always allocate shared memory for undo
logs irrespective of whether someone wants to use them or not.  Am I
right?  If so, isn't it better if we find some way that this memory is
allocated only when someone has a need for it?

7.
+/*
+ * How many undo logs can be active at a time?  This creates a theoretical
+ * maximum transaction size, but it we set it to a factor the maximum number
+ * of backends it will be a very high limit.  Alternative designs involving
+ * demand paging or dynamic shared memory could remove this limit but
+ * introduce other problems.
+ */
+static inline size_t
+UndoLogNumSlots(void)
+{
+ return MaxBackends * 4;
+}

Seems like typos in above comment
/but it we/but if we
/factor the maximum number  -- the sentence is not completely clear.

8.
+ * Extra shared memory will be managed using DSM segments.
+ */
+Size
+UndoLogShmemSize(void)

You told in the email that the patch doesn't use DSM segments anymore,
but the comment seems to indicate otherwise.

9.
/*
+ * TODO: Should this function be usable in a critical section?
+ * Woudl it make sense to detect that we are in a critical

Typo
/Woudl/Would

10.
+static void
+undolog_xlog_attach(XLogReaderState *record)
+{
+ xl_undolog_attach *xlrec = (xl_undolog_attach *) XLogRecGetData(record);
+ UndoLogControl *log;
+
+ undolog_xid_map_add(xlrec->xid, xlrec->logno);
+
+ /*
+  * Whatever follows is the first record for this transaction.  Zheap will
+  * use this to add UREC_INFO_TRANSACTION.
+  */
+ log = get_undo_log(xlrec->logno, false);
+ /* TODO */

There are a lot of TODOs in the code; among them, the above one is not at all clear.

11.
+ UndoRecPtr oldest_data;
+
+} UndoLogControl;

Extra space after the last member looks odd.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: Undo logs

akapila
On Mon, Oct 15, 2018 at 6:09 PM Amit Kapila <[hidden email]> wrote:

>
> On Sun, Sep 2, 2018 at 12:19 AM Thomas Munro
> <[hidden email]> wrote:
> > I have also pushed a new WIP version of the lower level undo log
> > storage layer patch set to a public branch[1].  I'll leave the earlier
> > branch[2] there because the record-level patch posted by Dilip depends
> > on it for now.
> >
>
> I have started reading the patch and have a few assorted comments
> which are mentioned below.  I have been involved in the high-level
> design of this module and I have also shared given some suggestions
> during development, but this is mainly Thomas's work with some help
> from Dilip.  It would be good if other members of the community also
> review the design or participate in the discussion.
>
> Comments
> ------------------
>

Some more comments/questions on the design level choices you have made
in this patch and some general comments.

1.  To allocate an undo log (UndoLogAllocate()), it seems we first
create the shared memory state for an undo log and write a WAL record
for it, then create an actual file with a segment in it and write a
separate WAL record for that.  Now imagine the system crashes after
creating the shared memory state and before actually allocating an undo
log segment; then it is quite possible that after recovery we will block
multiple slots for undo logs without having actual undo logs for them.
Apart from that, writing separate WAL records for these steps doesn't
appear to be the best way to deal with it, considering that we also need
to write a third WAL record to attach an undo log.

Now, IIUC, one advantage of arranging things this way is that we avoid
dropping a tablespace while a particular undo log exists in it.  I
understand that this design kind of works, but I think we should try to
think of some alternatives here.  You might have already thought of
making it work similarly to how regular tables or temp_tablespaces
interact with dropping tablespaces, but decided to do something
different here.  Can you explain why you have made a different design
choice here?

2.
extend_undo_log()
{
..
+ /*
+  * Flush the parent dir so that the directory metadata survives a crash
+  * after this point.
+  */
+ UndoLogDirectory(log->meta.tablespace, dir);
+ fsync_fname(dir, true);
+
+ /*
+  * If we're not in recovery, we need to WAL-log the creation of the new
+  * file(s).  We do that after the above filesystem modifications, in
+  * violation of the data-before-WAL rule as exempted by
+  * src/backend/access/transam/README.  This means that it's possible for
+  * us to crash having made some or all of the filesystem changes but
+  * before WAL logging, but in that case we'll eventually try to create the
+  * same segment(s) again, which is tolerated.
+  */
+ if (!InRecovery)
+ {
+ xl_undolog_extend xlrec;
+ XLogRecPtr ptr;
..
}

I don't understand this WAL logging action.  If the crash happens
before or during syncing the file, then we anyway don't have WAL to
replay.  If it happens after WAL writing, then anyway we are sure that
the extended undo log segment must be there.  Can you explain how this
works?

3.
+static void
+allocate_empty_undo_segment(UndoLogNumber logno, Oid tablespace,
+                            UndoLogOffset end)
{
..
}

What will happen if the transaction creating undo log segment rolls
back?  Do we want to have pendingDeletes stuff as we have for normal
relation files?  This might also help in clearing the shared memory
state (undo log slots) if any.

4.
+choose_undo_tablespace(bool force_detach, Oid *tablespace)
{
..
/*
+ * Take the tablespace create/drop lock while we look the name up.
+ * This prevents the tablespace from being dropped while we're trying
+ * to resolve the name, or while the called is trying to create an
+ * undo log in it.  The caller will have to release this lock.
+ */
+ LWLockAcquire(TablespaceCreateLock, LW_EXCLUSIVE);
..

This appears quite expensive: to select an undo log to attach, we
might need to wait for an unrelated tablespace create/drop.  Have you
considered any other ideas to prevent this?  How other callers of
get_tablespace_oid prevent it from being dropped?  If we don't find
any better solution, then I think at the very least we should start a
separate thread to know the opinion of others on this matter.  I think
this is somewhat related to point-1.

5.
+static inline Oid
+UndoRecPtrGetTablespace(UndoRecPtr urp)
+{
+ UndoLogNumber logno = UndoRecPtrGetLogNo(urp);
+ UndoLogTableEntry  *entry;
+
+ /*
+  * Fast path, for undo logs we've seen before.  This is safe because
+  * tablespaces are constant for the lifetime of an undo log number.
+  */
+ entry = undologtable_lookup(undologtable_cache, logno);
+ if (likely(entry))
+ return entry->tablespace;
+
+ /*
+  * Slow path: force cache entry to be created.  Raises an error if the
+  * undo log has been entirely discarded, or hasn't been created yet.  That
+  * is appropriate here, because this interface is designed for accessing
+  * undo pages via bufmgr, and we should never be trying to access undo
+  * pages that have been discarded.
+  */
+ UndoLogGet(logno, false);

It seems UndoLogGet() probes the hash table first, so what is the need
for doing it in the caller?  And if you think it is better to perform it
in the caller, then maybe we should avoid doing it inside
UndoLogGet()->get_undo_log()->undologtable_lookup().

6.
+get_undo_log(UndoLogNumber logno, bool locked)
{
..
+ /*
+  * If we didn't find it, then it must already have been entirely
+  * discarded.  We create a negative cache entry so that we can answer
+  * this question quickly next time.
+  *
+  * TODO: We could track the lowest known undo log number, to reduce
+  * the negative cache entry bloat.
+  */
+ if (result == NULL)
+ {
+ /*
+  * Sanity check: the caller should not be asking about undo logs
+  * that have never existed.
+  */
+ if (logno >= shared->next_logno)
+ elog(ERROR, "undo log %u hasn't been created yet", logno);
+ entry = undologtable_insert(undologtable_cache, logno, &found);
+ entry->number = logno;
+ entry->control = NULL;
+ entry->tablespace = 0;
+ }
..
}

Are you planning to take care of this TODO?  In any case, do we have
any mechanism to clear this bloat, or will it stay till the end of the
session?  If it is the latter, then I think it is important to take care
of the TODO.

7.
+void UndoLogNewSegment(UndoLogNumber logno, Oid tablespace, int segno);
+/* Redo interface. */
+extern void undolog_redo(XLogReaderState *record);

You might want to add an extra line before /* Redo interface. */
following what has been done earlier in this file.

8.
+ * XXX For now an xl_undolog_meta object is filled in, in case it turns out
+ * to be necessary to write it into the WAL record (like FPI, this must be
+ * logged once for each undo log after each checkpoint).  I think this should
+ * be moved out of this interface and done differently -- to review.
+ */
+UndoRecPtr
+UndoLogAllocate(size_t size, UndoPersistence persistence)

This function doesn't appear to be filling xl_undolog_meta.  Am I
missing something?  If not, then this comment needs to be changed.

9.
+static bool
+choose_undo_tablespace(bool force_detach, Oid *tablespace)
+{
+ char   *rawname;
+ List   *namelist;
+ bool need_to_unlock;
+ int length;
+ int i;
+
+ /* We need a modifiable copy of string. */
+ rawname = pstrdup(undo_tablespaces);

I don't see any usage of rawname outside this function; isn't it
better to free it?  I understand that this function won't be called
frequently enough to matter, but still there is some theoretical
danger if a user continuously changes undo_tablespaces.

10.
+attach_undo_log(UndoPersistence persistence, Oid tablespace)
{
..
+ /*
+  * For now we have a simple linked list of unattached undo logs for each
+  * persistence level.  We'll grovel though it to find something for the

Typo.
/though/through

11.
+attach_undo_log(UndoPersistence persistence, Oid tablespace)
{
..
+ /* WAL-log the creation of this new undo log. */
+ {
+ xl_undolog_create xlrec;
+
+ xlrec.logno = logno;
+ xlrec.tablespace = log->meta.tablespace;
+ xlrec.persistence = log->meta.persistence;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+ XLogInsert(RM_UNDOLOG_ID, XLOG_UNDOLOG_CREATE);
+ }
..
}

Do we need to WAL log this for temporary/unlogged persistence level?

12.
+choose_undo_tablespace(bool force_detach, Oid *tablespace)
{
..
+ oid = get_tablespace_oid(name, true);
..

Do we need to check permissions to see if the current user is allowed
to create in this tablespace?

13.
+UndoLogAllocate(size_t size, UndoPersistence persistence)
{
..
+ log->meta.prevlogno = prevlogno;

Is it okay to update the meta information without a lock, or should we do
it a few lines down after taking the mutex lock?  If it is okay, then it
would be better to write a comment about it.

14.
+static void
+allocate_empty_undo_segment(UndoLogNumber logno, Oid tablespace,
+                            UndoLogOffset end)
{
..
+ /* Flush the contents of the file to disk. */
+ if (pg_fsync(fd) != 0)
+ elog(ERROR, "cannot fsync file \"%s\": %m", path);
..
}

You might want to have a wait event for this as we do have at other
places where we perform fsync.

15.
+allocate_empty_undo_segment(UndoLogNumber logno, Oid tablespace,
+                            UndoLogOffset end)
{
..
+ if (!InRecovery)
+ {
+ xl_undolog_extend xlrec;
+ XLogRecPtr ptr;
+
+ xlrec.logno = logno;
+ xlrec.end = end;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+ ptr = XLogInsert(RM_UNDOLOG_ID, XLOG_UNDOLOG_EXTEND);
+ XLogFlush(ptr);
+ }
..
}

Do we need it for temporary/unlogged persistence level?

16.
+static void
+undolog_xlog_create(XLogReaderState *record)
+{
+ xl_undolog_create *xlrec = (xl_undolog_create *) XLogRecGetData(record);
+ UndoLogControl *log;
+ UndoLogSharedData *shared = MyUndoLogState.shared;
+
+ /* Create meta-data space in shared memory. */
+ LWLockAcquire(UndoLogLock, LW_EXCLUSIVE);
+ /* TODO: assert that it doesn't exist already? */
+ log = allocate_undo_log();
+ LWLockAcquire(&log->mutex, LW_EXCLUSIVE);

Do we need to acquire UndoLogLock during replay?  What else can be
going on concurrently with this that could create a problem?

17.
UndoLogAllocate()
{
..
+ /*
+  * While we have the lock, check if we have been forcibly detached by
+  * DROP TABLESPACE.  That can only happen between transactions (see
+  * DropUndoLogsInsTablespace()).
+  */
..
}

Function name in above comment is wrong.

18.
+ {
+ {"undo_tablespaces", PGC_USERSET, CLIENT_CONN_STATEMENT,
+ gettext_noop("Sets the tablespace(s) to use for undo logs."),
+ NULL,
+ GUC_LIST_INPUT | GUC_LIST_QUOTE
+ },
+ &undo_tablespaces,
+ "",
+ check_undo_tablespaces, assign_undo_tablespaces, NULL
+ },

It seems you need to update variable_is_guc_list_quote for this variable.

Till now, I have mainly reviewed the undo log allocation part.  This is a
big patch and can take much more time to complete the review.  I will
review the other parts of the patch later.  I have changed the status
of this CF entry to "Waiting on Author"; feel free to change it once
you think all the comments are addressed.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: Undo logs

Dilip Kumar-2
In reply to this post by Dilip Kumar-2
On Mon, Sep 3, 2018 at 11:26 AM Dilip Kumar <[hidden email]> wrote:

Thomas has already posted the latest version of the undo log patches on
the 'Cleaning up orphaned files using undo logs' thread[1], so I have
rebased the undo-interface patch as well.  This patch also includes the
latest defect fixes from the main zheap branch [2].

I have also made some changes to the undo-log patches.  Basically, it
is just some cleanup work, and it also makes these patches independently
compilable.  I have moved some code into the undo-log patches and also
moved out some code which is not relevant to undo-log.

Some examples:
1. Moved UndoLog Startup and Checkpoint related code into
'0001-Add-undo-log-manager_v2.patch patch'
+ /* Recover undo log meta data corresponding to this checkpoint. */
+ StartupUndoLogs(ControlFile->checkPointCopy.redo);
+
2. Removed undo-worker related stuff out of this patch
+ case WAIT_EVENT_UNDO_DISCARD_WORKER_MAIN:
+ event_name = "UndoDiscardWorkerMain";
+ break;
+ case WAIT_EVENT_UNDO_LAUNCHER_MAIN:
+ event_name = "UndoLauncherMain";
+ break;

[1] https://www.postgresql.org/message-id/flat/CAEepm=0ULqYgM2aFeOnrx6YrtBg3xUdxALoyCG+XpssKqmezug@...
[2] https://github.com/EnterpriseDB/zheap/

Patch applying order:
0001-Add-undo-log-manager.patch
0002-Provide-access-to-undo-log-data-via-the-buffer-manag.patch
0003-undo-interface-v3.patch
0004-Add-tests-for-the-undo-log-manager.patch from the 'Cleaning up
orphaned files using undo logs' thread[1]
0004-undo-interface-test-v3.patch

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

0002-Provide-access-to-undo-log-data-via-the-buffer-manag_v2.patch (55K)
0001-Add-undo-log-manager_v2.patch (157K)
0003-undo-interface-v3.patch (92K)
0004-undo-interface-test-v3.patch (8K)

Re: Undo logs

akapila
On Mon, Nov 5, 2018 at 5:10 PM Dilip Kumar <[hidden email]> wrote:
>
[review for undo record layer (0003-undo-interface-v3)]

I might sound like I am repeating myself, but just to be clear: I was
involved in the design of this patch as well and I have given a few
high-level inputs for it.  I have used this interface in the zheap
development, but haven't done any sort of detailed review, which I am
doing now.  I encourage others also to review this patch.

1.
 * NOTES:
+ * Handling multilog -
+ *  It is possible that the undo record of a transaction can be spread across
+ *  multiple undo log.  And, we need some special handling while inserting the
+ *  undo for discard and rollback to work sanely.

I think before describing how the undo record is spread across
multiple logs, you can explain how it is laid out when that is not the
case.  You can also explain how undo record headers are linked.  I am
not sure whether the file header is the best place or whether it should
be mentioned in the README, but I think for now we can use the file
header for this purpose and later move it to the README if required.

2.
+/*
+ * FIXME:  Do we want to support undo tuple size which is more than the BLCKSZ
+ * if not than undo record can spread across 2 buffers at the max.
+ */

+#define MAX_BUFFER_PER_UNDO    2

I think here the right question is: what is the possibility of an undo
record being greater than BLCKSZ?  For zheap, as of today, we don't
have any such requirement, as the largest undo record is written for
update or multi_insert and in both cases we don't exceed the limit of
BLCKSZ.  I guess some user other than zheap could probably have such
requirement and I don't think it is impossible to enhance this if we
have any requirement.

If anybody else has an opinion here, please feel free to share it.

3.
+/*
+ * FIXME:  Do we want to support undo tuple size which is more than the BLCKSZ
+ * if not than undo record can spread across 2 buffers at the max.
+ */
+#define MAX_BUFFER_PER_UNDO    2
+
+/*
+ * Consider buffers needed for updating previous transaction's
+ * starting undo record. Hence increased by 1.
+ */
+#define MAX_UNDO_BUFFERS       (MAX_PREPARED_UNDO + 1) * MAX_BUFFER_PER_UNDO
+
+/* Maximum number of undo record that can be prepared before calling insert. */
+#define MAX_PREPARED_UNDO 2

I think it is better to define MAX_PREPARED_UNDO before
MAX_UNDO_BUFFERS, as the first one is used in the definition of the
second.

4.
+/*
+ * Call PrepareUndoInsert to tell the undo subsystem about the undo record you
+ * intended to insert.  Upon return, the necessary undo buffers are pinned.
+ * This should be done before any critical section is established, since it
+ * can fail.
+ *
+ * If not in recovery, 'xid' should refer to the top transaction id because
+ * undo log only stores mapping for the top most transactions.
+ * If in recovery, 'xid' refers to the transaction id stored in WAL.
+ */
+UndoRecPtr
+PrepareUndoInsert(UnpackedUndoRecord *urec, UndoPersistence upersistence,
+   TransactionId xid)

This function locks the buffers as well, which is right as we need to do
that before the critical section, but the function header comment doesn't
indicate it.  You can modify it as:
"Upon return, the necessary undo buffers are pinned and locked."

Note that similar modification is required in .h file as well.

5.
+/*
+ * Insert a previously-prepared undo record.  This will lock the buffers
+ * pinned in the previous step, write the actual undo record into them,
+ * and mark them dirty.  For persistent undo, this step should be performed
+ * after entering a critical section; it should never fail.
+ */
+void
+InsertPreparedUndo(void)

Here, the comments are wrong as buffers are already locked in the
previous step.  A similar change is required in .h file as well.

6.
+InsertPreparedUndo(void)
{
..
/*
+ * Try to insert the record into the current page. If it doesn't
+ * succeed then recall the routine with the next page.
+ */
+ if (InsertUndoRecord(uur, page, starting_byte, &already_written, false))
+ {
+ undo_len += already_written;
+ MarkBufferDirty(buffer);
+ break;
+ }
+
+ MarkBufferDirty(buffer);
..
}

Here, you are writing into a shared buffer and marking it dirty; isn't
it a good idea to Assert that we are in a critical section?
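
Something along these lines is what I have in mind (CritSectionCount is
the existing backend-global counter that START_CRIT_SECTION() bumps):

/* e.g., before modifying the buffer inside InsertPreparedUndo() */
Assert(CritSectionCount > 0);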

7.
+/* Maximum number of undo record that can be prepared before calling insert. */
+#define MAX_PREPARED_UNDO 2

/record/records

I think this definition doesn't define the maximum number of undo
records that can be prepared as the caller can use UndoSetPrepareSize
to change it.  I think you can modify the comment as below or
something on those lines:
"This defines the number of undo records that can be prepared before
calling insert by default.  If you need to prepare more than
MAX_PREPARED_UNDO undo records, then you must call UndoSetPrepareSize
first."

8.
+ * Caller should call this under log->discard_lock
+ */
+static bool
+IsPrevTxnUndoDiscarded(UndoLogControl *log, UndoRecPtr prev_xact_urp)
+{
+ if (log->oldest_data == InvalidUndoRecPtr)
..

Isn't it a good idea to have an Assert that we already hold the discard_lock?
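
For example (LWLockHeldByMe() is the existing helper for this kind of check):

/* e.g., at the top of IsPrevTxnUndoDiscarded() */
Assert(LWLockHeldByMe(&log->discard_lock));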

9.
+ UnpackedUndoRecord uur; /* prev txn's first undo record.*/
+} PreviousTxnInfo;

Extra space at the end of the comment is required.

10.
+/*
+ * Check if previous transactions undo is already discarded.
+ *
+ * Caller should call this under log->discard_lock
+ */
+static bool
+IsPrevTxnUndoDiscarded(UndoLogControl *log, UndoRecPtr prev_xact_urp)
+{

The name suggests that this function is doing something special for
the previous transaction, whereas it just checks whether undo is
discarded corresponding to a particular undo location.  Isn't it
better if we name it UndoRecordExists or UndoRecordIsValid?  Then
explain in the comments when you consider a particular record to exist.

Another point to note is that you are not releasing the lock in all
paths, so it is better to mention in the comments when it will be
released and when not.


11.
+IsPrevTxnUndoDiscarded(UndoLogControl *log, UndoRecPtr prev_xact_urp)
+{
+ if (log->oldest_data == InvalidUndoRecPtr)
+ {
+ /*
+ * oldest_data is not yet initialized.  We have to check
+ * UndoLogIsDiscarded and if it's already discarded then we have
+ * nothing to do.
+ */
+ LWLockRelease(&log->discard_lock);
+ if (UndoLogIsDiscarded(prev_xact_urp))
+ return true;

The comment in the above code is just trying to write the code in words.
I think here you should tell why we need to call UndoLogIsDiscarded
when oldest_data is not initialized, and/or the scenario in which
oldest_data will not be initialized.

12.
+ * The actual update will be done by UndoRecordUpdateTransInfo under the
+ * critical section.
+ */
+void
+PrepareUndoRecordUpdateTransInfo(UndoRecPtr urecptr, bool log_switched)
+{
..

I find this function name a bit awkward.  How about UndoRecordPrepareTransInfo?

13.
+PrepareUndoRecordUpdateTransInfo(UndoRecPtr urecptr, bool log_switched)
{
..
+ /*
+ * TODO: For now we don't know how to build a transaction chain for
+ * temporary undo logs.  That's because this log might have been used by a
+ * different backend, and we can't access its buffers.  What should happen
+ * is that the undo data should be automatically discarded when the other
+ * backend detaches, but that code doesn't exist yet and the undo worker
+ * can't do it either.
+ */
+ if (log->meta.persistence == UNDO_TEMP)
+ return;

Aren't we already dealing with this case in the other patch [1]?
Basically, I think we should discard it at commit time and/or when the
backend is detached.

14.
+PrepareUndoRecordUpdateTransInfo(UndoRecPtr urecptr, bool log_switched)
{
..
/*
+ * If previous transaction's urp is not valid means this backend is
+ * preparing its first undo so fetch the information from the undo log
+ * if it's still invalid urp means this is the first undo record for this
+ * log and we have nothing to update.
+ */
+ if (!UndoRecPtrIsValid(prev_xact_urp))
+ return;
..

This comment is confusing.  It appears to be saying the same thing twice.
You can write it along the lines of:

"The absence of previous transaction's undo indicate that this backend
is preparing its first undo in which case we have nothing to update.".

15.
+PrepareUndoRecordUpdateTransInfo(UndoRecPtr urecptr, bool log_switched)
/*
+ * Acquire the discard lock before accessing the undo record so that
+ * discard worker doen't remove the record while we are in process of
+ * reading it.
+ */

Typo doen't/doesn't.

I think you can use 'can't' instead of doesn't.

This is by no means a complete review; rather, I just noticed a few
things while reading the patch.

[1] - https://www.postgresql.org/message-id/CAFiTN-t8fv-qYG9zynhS-1jRrvt_o5C-wCMRtzOsK8S%3DMXvKKw%40mail.gmail.com

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: Undo logs

akapila
In reply to this post by akapila
On Wed, Oct 17, 2018 at 3:27 PM Amit Kapila <[hidden email]> wrote:

>
> On Mon, Oct 15, 2018 at 6:09 PM Amit Kapila <[hidden email]> wrote:
> >
> > On Sun, Sep 2, 2018 at 12:19 AM Thomas Munro
> > <[hidden email]> wrote:
> > > I have also pushed a new WIP version of the lower level undo log
> > > storage layer patch set to a public branch[1].  I'll leave the earlier
> > > branch[2] there because the record-level patch posted by Dilip depends
> > > on it for now.
> > >
> >
>
> Till now, I have mainly reviewed undo log allocation part.  This is a
> big patch and can take much more time to complete the review.  I will
> review the other parts of the patch later.
>

I think I see another issue with this patch.  Basically, during
extend_undo_log, there is an assumption that no one could update
log->meta.end concurrently, but that is not true, as it can be updated
by UndoLogDiscard, which can lead to an assertion failure in
extend_undo_log.

+static void
+extend_undo_log(UndoLogNumber logno, UndoLogOffset new_end)
{
..
/*
+ * Create all the segments needed to increase 'end' to the requested
+ * size.  This is quite expensive, so we will try to avoid it completely
+ * by renaming files into place in UndoLogDiscard instead.
+ */
+ end = log->meta.end;
+ while (end < new_end)
+ {
+ allocate_empty_undo_segment(logno, log->meta.tablespace, end);
+ end += UndoLogSegmentSize;
+ }
..
+ Assert(end == new_end);
..
/*
+ * We didn't need to acquire the mutex to read 'end' above because only
+ * we write to it.  But we need the mutex to update it, because the
+ * checkpointer might read it concurrently.
+ *
+ * XXX It's possible for meta.end to be higher already during
+ * recovery, because of the timing of a checkpoint; in that case we did
+ * nothing above and we shouldn't update shmem here.  That interaction
+ * needs more analysis.
+ */
+ LWLockAcquire(&log->mutex, LW_EXCLUSIVE);
+ if (log->meta.end < end)
+ log->meta.end = end;
+ LWLockRelease(&log->mutex);
..
}

Assume that, before we read log->meta.end in the above code, the discard
process concurrently discards the undo and moves the end pointer to a
later location; then the above assertion will fail.  Now, if the discard
happens just after we read log->meta.end and before we do the other
stuff in this function, then it will crash in recovery.

Can't we just remove this Assert?

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: Undo logs

Dilip Kumar-2
In reply to this post by Dilip Kumar-2
On Mon, Nov 5, 2018 at 5:09 PM Dilip Kumar <[hidden email]> wrote:
>
> On Mon, Sep 3, 2018 at 11:26 AM Dilip Kumar <[hidden email]> wrote:
>
> Thomas has already posted the latest version of undo log patches on
> 'Cleaning up orphaned files using undo logs' thread[1].  So I have
> rebased the undo-interface patch also.  This patch also includes
> latest defect fixes from the main zheap branch [2].
>
Hi Thomas,

The latest patch for undo log storage does not compile on head; I think
it needs to be rebased due to your commit related to "pg_pread() and
pg_pwrite() for data files and WAL".

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


Re: Undo logs

Dilip Kumar-2
In reply to this post by akapila
On Sat, Nov 10, 2018 at 9:27 AM Amit Kapila <[hidden email]> wrote:

>
> On Mon, Nov 5, 2018 at 5:10 PM Dilip Kumar <[hidden email]> wrote:
> >
> [review for undo record layer (0003-undo-interface-v3)]
>
> I might sound repeating myself, but just to be clear, I was involved
> in the design of this patch as well and I have given a few high-level
> inputs for this patch.  I have used this interface in the zheap
> development, but haven't done any sort of detailed review which I am
> doing now.  I encourage others also to review this patch.
Thanks for the review, please find my reply inline.

>
> 1.
>  * NOTES:
> + * Handling multilog -
> + *  It is possible that the undo record of a transaction can be spread across
> + *  multiple undo log.  And, we need some special handling while inserting the
> + *  undo for discard and rollback to work sanely.
>
> I think before describing how the undo record is spread across
> multiple logs, you can explain how it is laid out when that is not the
> case.  You can also explain how undo record headers are linked.  I am
> not sure file header is the best place or it should be mentioned in
> README, but I think for now we can use file header for this purpose
> and later we can move it to README if required.
Added in the header.

>
> 2.
> +/*
> + * FIXME:  Do we want to support undo tuple size which is more than the BLCKSZ
> + * if not than undo record can spread across 2 buffers at the max.
> + */
>
> +#define MAX_BUFFER_PER_UNDO    2
>
> I think here the right question is what is the possibility of undo
> record to be greater than BLCKSZ?  For zheap, as of today, we don'
> have any such requirement as the largest undo record is written for
> update or multi_insert and in both cases we don't exceed the limit of
> BLCKSZ.  I guess some user other than zheap could probably have such
> requirement and I don't think it is impossible to enhance this if we
> have any requirement.
>
> If anybody else has an opinion here, please feel to share it.
Should we remove this FIXME, or should we wait for some other opinions?
As of now I have kept it as it is.

>
> 3.
> +/*
> + * FIXME:  Do we want to support undo tuple size which is more than the BLCKSZ
> + * if not than undo record can spread across 2 buffers at the max.
> + */
> +#define MAX_BUFFER_PER_UNDO    2
> +
> +/*
> + * Consider buffers needed for updating previous transaction's
> + * starting undo record. Hence increased by 1.
> + */
> +#define MAX_UNDO_BUFFERS       (MAX_PREPARED_UNDO + 1) * MAX_BUFFER_PER_UNDO
> +
> +/* Maximum number of undo record that can be prepared before calling insert. */
> +#define MAX_PREPARED_UNDO 2
>
> I think it is better to define MAX_PREPARED_UNDO before
> MAX_UNDO_BUFFERS as the first one is used in the definition of a
> second.
Done

>
> 4.
> +/*
> + * Call PrepareUndoInsert to tell the undo subsystem about the undo record you
> + * intended to insert.  Upon return, the necessary undo buffers are pinned.
> + * This should be done before any critical section is established, since it
> + * can fail.
> + *
> + * If not in recovery, 'xid' should refer to the top transaction id because
> + * undo log only stores mapping for the top most transactions.
> + * If in recovery, 'xid' refers to the transaction id stored in WAL.
> + */
> +UndoRecPtr
> +PrepareUndoInsert(UnpackedUndoRecord *urec, UndoPersistence upersistence,
> +   TransactionId xid)
>
> This function locks the buffer as well which is right as we need to do
> that before critical section, but the function header comments doesn't
> indicate it.  You can modify it as:
> "Upon return, the necessary undo buffers are pinned and locked."
>
> Note that similar modification is required in .h file as well.
Done
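
As an aside, the intended calling pattern is something like the sketch
below (the wrapper function name, UNDO_PERMANENT and the WAL step are
illustrative assumptions, not code from the patch):

/* Hypothetical caller, for illustration only. */
static UndoRecPtr
insert_one_undo_record_example(UnpackedUndoRecord *urec)
{
	UndoRecPtr	urecptr;

	/*
	 * PrepareUndoInsert can fail (it allocates undo log space and pins and
	 * locks the needed buffers), so call it before any critical section.
	 */
	urecptr = PrepareUndoInsert(urec, UNDO_PERMANENT, GetTopTransactionId());

	START_CRIT_SECTION();

	/* Write the prepared record into the already pinned and locked buffers. */
	InsertPreparedUndo();

	/* The matching WAL record would be emitted here. */

	END_CRIT_SECTION();

	return urecptr;
}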

>
> 5.
> +/*
> + * Insert a previously-prepared undo record.  This will lock the buffers
> + * pinned in the previous step, write the actual undo record into them,
> + * and mark them dirty.  For persistent undo, this step should be performed
> + * after entering a critical section; it should never fail.
> + */
> +void
> +InsertPreparedUndo(void)
>
> Here, the comments are wrong as buffers are already locked in the
> previous step.  A similar change is required in .h file as well.
Fixed

>
> 6.
> +InsertPreparedUndo(void)
> {
> ..
> /*
> + * Try to insert the record into the current page. If it doesn't
> + * succeed then recall the routine with the next page.
> + */
> + if (InsertUndoRecord(uur, page, starting_byte, &already_written, false))
> + {
> + undo_len += already_written;
> + MarkBufferDirty(buffer);
> + break;
> + }
> +
> + MarkBufferDirty(buffer);
> ..
> }
>
> Here, you are writing into a shared buffer and marking it dirty, isn't
> it a good idea to Assert for being in the critical section?
Done
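
For reference, something along these lines is what I have in mind (sketch
only; CritSectionCount is the counter maintained by
START_CRIT_SECTION/END_CRIT_SECTION, and if temporary or unlogged undo may
be written outside a critical section the check would need to be
conditioned on the persistence level):

	/* We are about to modify a shared buffer holding persistent undo. */
	Assert(CritSectionCount > 0);
	MarkBufferDirty(buffer);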

>
> 7.
> +/* Maximum number of undo record that can be prepared before calling insert. */
> +#define MAX_PREPARED_UNDO 2
>
> /record/records
>
> I think this definition doesn't define the maximum number of undo
> records that can be prepared as the caller can use UndoSetPrepareSize
> to change it.  I think you can modify the comment as below or
> something on those lines:
> "This defines the number of undo records that can be prepared before
> calling insert by default.  If you need to prepare more than
> MAX_PREPARED_UNDO undo records, then you must call UndoSetPrepareSize
> first."
Fixed

>
> 8.
> + * Caller should call this under log->discard_lock
> + */
> +static bool
> +IsPrevTxnUndoDiscarded(UndoLogControl *log, UndoRecPtr prev_xact_urp)
> +{
> + if (log->oldest_data == InvalidUndoRecPtr)
> ..
>
> Isn't it a good idea to have an Assert that we already have discard_lock?
Done
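
For reference, the assertion is something like this (sketch; LWLockHeldByMe
is the existing lwlock helper):

	/* Caller must already hold this undo log's discard lock. */
	Assert(LWLockHeldByMe(&log->discard_lock));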

>
> 9.
> + UnpackedUndoRecord uur; /* prev txn's first undo record.*/
> +} PreviousTxnInfo;
>
> Extra space at the end of the comment is required.

Done

>
> 10.
> +/*
> + * Check if previous transactions undo is already discarded.
> + *
> + * Caller should call this under log->discard_lock
> + */
> +static bool
> +IsPrevTxnUndoDiscarded(UndoLogControl *log, UndoRecPtr prev_xact_urp)
> +{
>
> The name suggests that this function is doing something special for
> the previous transaction whereas this it just checks whether undo is
> discarded corresponding to a particular undo location.  Isn't it
> better if we name it as UndoRecordExists or UndoRecordIsValid?  Then
> explain in comments when do you consider particular record exists.
>
Changed to UndoRecordIsValid

> Another point to note is that you are not releasing the lock in all
> paths, so it is better to mention in comments when it will be released
> and when not.
Done
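
To make the locking contract explicit, the header comment could spell it
out along these lines (wording is only a sketch, assuming the lock is kept
only while the record still exists):

/*
 * UndoRecordIsValid - check whether the undo record at the given location
 * has not yet been discarded.
 *
 * Caller must hold log->discard_lock.  On a true return the lock is still
 * held, so the record cannot be discarded while the caller reads it; if
 * the record has already been discarded, the lock is released before
 * returning false.
 */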

>
>
> 11.
> +IsPrevTxnUndoDiscarded(UndoLogControl *log, UndoRecPtr prev_xact_urp)
> +{
> + if (log->oldest_data == InvalidUndoRecPtr)
> + {
> + /*
> + * oldest_data is not yet initialized.  We have to check
> + * UndoLogIsDiscarded and if it's already discarded then we have
> + * nothing to do.
> + */
> + LWLockRelease(&log->discard_lock);
> + if (UndoLogIsDiscarded(prev_xact_urp))
> + return true;
>
> The comment in above code is just trying to write the code in words.
> I think here you should tell why we need to call UndoLogIsDiscarded
> when oldest_data is not initialized and or the scenario when
> oldest_data will not be initialized.
Fixed

>
> 12.
> + * The actual update will be done by UndoRecordUpdateTransInfo under the
> + * critical section.
> + */
> +void
> +PrepareUndoRecordUpdateTransInfo(UndoRecPtr urecptr, bool log_switched)
> +{
> ..
>
> I find this function name a bit awkward.  How about UndoRecordPrepareTransInfo?
Changed as per the suggestion

>
> 13.
> +PrepareUndoRecordUpdateTransInfo(UndoRecPtr urecptr, bool log_switched)
> {
> ..
> + /*
> + * TODO: For now we don't know how to build a transaction chain for
> + * temporary undo logs.  That's because this log might have been used by a
> + * different backend, and we can't access its buffers.  What should happen
> + * is that the undo data should be automatically discarded when the other
> + * backend detaches, but that code doesn't exist yet and the undo worker
> + * can't do it either.
> + */
> + if (log->meta.persistence == UNDO_TEMP)
> + return;
>
> Aren't we already dealing with this case in the other patch [1]?
> Basically, I think we should discard it at commit time and or when the
> backend is detached.
Changed

>
> 14.
> +PrepareUndoRecordUpdateTransInfo(UndoRecPtr urecptr, bool log_switched)
> {
> ..
> /*
> + * If previous transaction's urp is not valid means this backend is
> + * preparing its first undo so fetch the information from the undo log
> + * if it's still invalid urp means this is the first undo record for this
> + * log and we have nothing to update.
> + */
> + if (!UndoRecPtrIsValid(prev_xact_urp))
> + return;
> ..
>
> This comment is confusing.  It appears to be saying the same thing twice.
> You can write it along something like:
>
> "The absence of previous transaction's undo indicate that this backend
> is preparing its first undo in which case we have nothing to update.".
Done as per the suggestion

>
> 15.
> +PrepareUndoRecordUpdateTransInfo(UndoRecPtr urecptr, bool log_switched)
> /*
> + * Acquire the discard lock before accessing the undo record so that
> + * discard worker doen't remove the record while we are in process of
> + * reading it.
> + */
>
> Typo doen't/doesn't.
>
> I think you can use 'can't' instead of doesn't.
Fixed

>
> This is by no means a complete review, rather just noticed a few
> things while reading the patch.
>
> [1] - https://www.postgresql.org/message-id/CAFiTN-t8fv-qYG9zynhS-1jRrvt_o5C-wCMRtzOsK8S%3DMXvKKw%40mail.gmail.com
>
> --
> With Regards,
> Amit Kapila.
> EnterpriseDB: http://www.enterprisedb.com

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

0003-undo-interface-v4.patch (93K) Download Attachment
0004-undo-interface-test-v4.patch (93K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Undo logs

akapila
On Wed, Nov 14, 2018 at 2:42 PM Dilip Kumar <[hidden email]> wrote:

>
> On Sat, Nov 10, 2018 at 9:27 AM Amit Kapila <[hidden email]> wrote:
> >
> > On Mon, Nov 5, 2018 at 5:10 PM Dilip Kumar <[hidden email]> wrote:
> > >
> > [review for undo record layer (0003-undo-interface-v3)]
> >
> > I might sound repeating myself, but just to be clear, I was involved
> > in the design of this patch as well and I have given a few high-level
> > inputs for this patch.  I have used this interface in the zheap
> > development, but haven't done any sort of detailed review which I am
> > doing now.  I encourage others also to review this patch.
>
> Thanks for the review, please find my reply inline.
> >
> > 1.
> >  * NOTES:
> > + * Handling multilog -
> > + *  It is possible that the undo record of a transaction can be spread across
> > + *  multiple undo log.  And, we need some special handling while inserting the
> > + *  undo for discard and rollback to work sanely.
> >
> > I think before describing how the undo record is spread across
> > multiple logs, you can explain how it is laid out when that is not the
> > case.  You can also explain how undo record headers are linked.  I am
> > not sure file header is the best place or it should be mentioned in
> > README, but I think for now we can use file header for this purpose
> > and later we can move it to README if required.
> Added in the header.
>
> >
> > 2.
> > +/*
> > + * FIXME:  Do we want to support undo tuple size which is more than the BLCKSZ
> > + * if not than undo record can spread across 2 buffers at the max.
> > + */
> >
> > +#define MAX_BUFFER_PER_UNDO    2
> >
> > I think here the right question is what is the possibility of undo
> > record to be greater than BLCKSZ?  For zheap, as of today, we don't
> > have any such requirement as the largest undo record is written for
> > update or multi_insert and in both cases we don't exceed the limit of
> > BLCKSZ.  I guess some user other than zheap could probably have such
> > requirement and I don't think it is impossible to enhance this if we
> > have any requirement.
> >
> > If anybody else has an opinion here, please feel free to share it.
>
> Should we remove this FIXME, or should we wait for some other opinion?  As
> of now I have kept it as it is.
> >

I think you can keep it with XXX instead of Fixme as there is nothing to fix.

Both the patches 0003-undo-interface-v4.patch and
0004-undo-interface-test-v4.patch appear to be the same except for the
name?

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Reply | Threaded
Open this post in threaded view
|

Re: Undo logs

Dilip Kumar-2
On Wed, Nov 14, 2018 at 2:58 PM Amit Kapila <[hidden email]> wrote:
> I think you can keep it with XXX instead of Fixme as there is nothing to fix.
Changed
>
> Both the patches 0003-undo-interface-v4.patch and
> > 0004-undo-interface-test-v4.patch appear to be the same except for the
> name?
My bad, please find the updated patch.

>
> --
> With Regards,
> Amit Kapila.
> EnterpriseDB: http://www.enterprisedb.com

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

0004-undo-interface-test-v5.patch (8K) Download Attachment
0003-undo-interface-v5.patch (93K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Undo logs

Dilip Kumar-2
On Wed, Nov 14, 2018 at 3:48 PM Dilip Kumar <[hidden email]> wrote:

>
> On Wed, Nov 14, 2018 at 2:58 PM Amit Kapila <[hidden email]> wrote:
> > I think you can keep it with XXX instead of Fixme as there is nothing to fix.
> Changed
> >
> > Both the patches 0003-undo-interface-v4.patch and
> > 0004-undo-interface-test-v4.patch appears to be same except for the
> > name?
> My bad, please find the updated patch.
>
There was a problem in an assert and one comment was not aligned
properly, so I have fixed those in the latest patch.



--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

0003-undo-interface-v6.patch (93K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Undo logs

Dilip Kumar-2
On Thu, Nov 15, 2018 at 12:14 PM Dilip Kumar <[hidden email]> wrote:
>
Updated patch (merged latest code from the zheap main branch [1]).
The main change is removing relfilenode and tablespace id from the
undo record and storing reloid instead.
Earlier, we kept them thinking that we would perform rollback without
a database connection, but that's not the case now.

[1]  https://github.com/EnterpriseDB/zheap

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

0003-undo-interface-v7.patch (93K) Download Attachment
0004-undo-interface-test-v7.patch (9K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Undo logs

akapila
On Fri, Nov 16, 2018 at 9:46 AM Dilip Kumar <[hidden email]> wrote:
>
> On Thu, Nov 15, 2018 at 12:14 PM Dilip Kumar <[hidden email]> wrote:
> >
> Updated patch (merged latest code from the zheap main branch [1]).
>

Review comments:
-------------------------------
1.
UndoRecordPrepareTransInfo()
{
..
+ /*
+ * The absence of previous transaction's undo indicate that this backend
+ * is preparing its first undo in which case we have nothing to update.
+ */
+ if (!UndoRecPtrIsValid(prev_xact_urp))
+ return;
..
}

It is expected that the caller of UndoRecPtrIsValid should hold the
discard lock, but I don't see how the call from this place ensures
that.

2.
UndoRecordPrepareTransInfo()
{
..
/*
+ * The absence of previous transaction's undo indicate that this backend
+ * is preparing its first undo in which case we have nothing to update.
+ */
+ if (!UndoRecPtrIsValid(prev_xact_urp))
+ return;
+
+ /*
+ * Acquire the discard lock before accessing the undo record so that
+ * discard worker doesn't remove the record while we are in process of
+ * reading it.
+ */
+ LWLockAcquire(&log->discard_lock, LW_SHARED);
+
+ if (!UndoRecordIsValid(log, prev_xact_urp))
+ return;
..
}

I don't understand this logic where you are checking the same
information with and without a lock; is there any reason for that?  It
seems the first call to UndoRecPtrIsValid is not required.

3.
UndoRecordPrepareTransInfo()
{
..
+ while (true)
+ {
+ bufidx = InsertFindBufferSlot(rnode, cur_blk,
+   RBM_NORMAL,
+   log->meta.persistence);
+ prev_txn_info.prev_txn_undo_buffers[index] = bufidx;
+ buffer = undo_buffer[bufidx].buf;
+ page = BufferGetPage(buffer);
+ index++;
+
+ if (UnpackUndoRecord(&prev_txn_info.uur, page, starting_byte,
+ &already_decoded, true))
+ break;
+
+ starting_byte = UndoLogBlockHeaderSize;
+ cur_blk++;
+ }


Can you write some commentary on what this code is doing?

There is no need for index++; as a separate statement; you can do it
while storing the buffer slot at that index.
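
To illustrate the kind of commentary I mean, here is my guess at what the
loop is doing, written as comments (same identifiers as the quoted code;
please correct it if I have misread it).  It also folds index++ into the
assignment:

	while (true)
	{
		/* Pin (and remember) the buffer for the current undo block. */
		bufidx = InsertFindBufferSlot(rnode, cur_blk, RBM_NORMAL,
									  log->meta.persistence);
		prev_txn_info.prev_txn_undo_buffers[index++] = bufidx;
		buffer = undo_buffer[bufidx].buf;
		page = BufferGetPage(buffer);

		/*
		 * Try to decode the previous transaction's first undo record from
		 * this page; UnpackUndoRecord returns true once the whole record
		 * has been decoded, which may take more than one page.
		 */
		if (UnpackUndoRecord(&prev_txn_info.uur, page, starting_byte,
							 &already_decoded, true))
			break;

		/* The record continues on the next block, after its block header. */
		starting_byte = UndoLogBlockHeaderSize;
		cur_blk++;
	}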

4.
+UndoRecordPrepareTransInfo(UndoRecPtr urecptr, bool log_switched)
+{
+ UndoRecPtr prev_xact_urp;

I think you can simply name this variable xact_urp.  All this and
related prev_* terminology used for variables seems confusing to me.  I
understand that you are trying to update the last transaction's undo
record information, but you can explain that via comments.  Keeping
such information as part of variable names not only makes them longer,
but is also confusing.

5.
/*
+ * Structure to hold the previous transaction's undo update information.
+ */
+typedef struct PreviousTxnUndoRecord
+{
+ UndoRecPtr prev_urecptr; /* prev txn's starting urecptr */
+ int prev_txn_undo_buffers[MAX_BUFFER_PER_UNDO];
+ UnpackedUndoRecord uur; /* prev txn's first undo record. */
+} PreviousTxnInfo;
+
+static PreviousTxnInfo prev_txn_info;

Due to the reasons mentioned in point 4, let's name the structure and its
variables as below:

typedef struct XactUndoRecordInfo
{
UndoRecPtr start_urecptr; /* prev txn's starting urecptr */
int idx_undo_buffers[MAX_BUFFER_PER_UNDO];
UnpackedUndoRecord first_uur; /* prev txn's first undo record. */
} XactUndoRecordInfo;

static XactUndoRecordInfo xact_ur_info;

6.
+static int
+InsertFindBufferSlot(RelFileNode rnode,

The name of this function is not clear; can we change it to
UndoGetBufferSlot or UndoGetBuffer?

7.
+UndoRecordAllocateMulti(UnpackedUndoRecord *undorecords, int nrecords,
+ UndoPersistence upersistence, TransactionId txid)
{
..
+ /*
+ * If this is the first undo record for this transaction then set the
+ * uur_next to the SpecialUndoRecPtr.  This is the indication to allocate
+ * the space for the transaction header and the valid value of the uur_next
+ * will be updated while preparing the first undo record of the next
+ * transaction.
+ */
+ first_rec_in_recovery = InRecovery && IsTransactionFirstRec(txid);
..
}

I think it will be better if we move this comment a few lines down:
+ if (need_start_undo && i == 0)
+ {
+ urec->uur_next = SpecialUndoRecPtr;

BTW, is the only reason to set a special value (SpecialUndoRecPtr) for
uur_next to allocate the transaction header?  If so, can't we
directly set the corresponding flag (UREC_INFO_TRANSACTION) in
uur_info and then remove it from UndoRecordSetInfo?

I think it would have been better if there were one central location to
set uur_info, but as that is becoming tricky, we should not try to add
special stuff to make it possible.
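
In other words, something like this (sketch only; it assumes
UREC_INFO_TRANSACTION is the flag that reserves space for the transaction
header, and that uur_next can simply start out invalid):

	if (need_start_undo && i == 0)
	{
		/* First record of this transaction: reserve the transaction header. */
		urec->uur_info |= UREC_INFO_TRANSACTION;
		/* Filled in while preparing the next transaction's first record. */
		urec->uur_next = InvalidUndoRecPtr;
	}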

8.
+UndoRecordExpectedSize(UnpackedUndoRecord *uur)
+{
+ Size size;
+
+ /* Fixme : Temporary hack to allow zheap to set some value for uur_info. */
+ /* if (uur->uur_info == 0) */
+ UndoRecordSetInfo(uur);

Can't we move UndoRecordSetInfo into its caller
(UndoRecordAllocateMulti)?  It seems another caller of this function
doesn't expect this.  If we do it that way, then we can have an Assert
for non-zero uur_info in UndoRecordExpectedSize.
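
That is, roughly (sketch of the suggested split, not tested):

	/* In UndoRecordAllocateMulti(), before sizing each record: */
	UndoRecordSetInfo(urec);
	size += UndoRecordExpectedSize(urec);

	/* And at the top of UndoRecordExpectedSize(): */
	Assert(uur->uur_info != 0);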

9.
bool
+InsertUndoRecord(UnpackedUndoRecord *uur, Page page,
+ int starting_byte, int *already_written, bool header_only)
+{
+ char   *writeptr = (char *) page + starting_byte;
+ char   *endptr = (char *) page + BLCKSZ;
+ int my_bytes_written = *already_written;
+
+ if (uur->uur_info == 0)
+ UndoRecordSetInfo(uur);

Do we really need UndoRecordSetInfo here?  If not, then just add an
assert for non-zero uur_info?

10
UndoRecordAllocateMulti()
{
..
else
+ {
+ /*
+ * It is okay to initialize these variables as these are used only
+ * with the first record of transaction.
+ */
+ urec->uur_next = InvalidUndoRecPtr;
+ urec->uur_xidepoch = 0;
+ urec->uur_dbid = 0;
+ urec->uur_progress = 0;
+ }
+
+
+ /* calculate the size of the undo record. */
+ size += UndoRecordExpectedSize(urec);
+ }

Remove one extra line before comment "calculate the size of ..".

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
