Avoiding smgrimmedsync() during nbtree index builds

Avoiding smgrimmedsync() during nbtree index builds

Andres Freund
Hi,

Every nbtree index build currently does an smgrimmedsync at the end:

/*
 * Read tuples in correct sort order from tuplesort, and load them into
 * btree leaves.
 */
static void
_bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
...
        /*
         * When we WAL-logged index pages, we must nonetheless fsync index files.
         * Since we're building outside shared buffers, a CHECKPOINT occurring
         * during the build has no way to flush the previously written data to
         * disk (indeed it won't know the index even exists).  A crash later on
         * would replay WAL from the checkpoint, therefore it wouldn't replay our
         * earlier WAL entries. If we do not fsync those pages here, they might
         * still not be on disk when the crash occurs.
         */
        if (wstate->btws_use_wal)
        {
                RelationOpenSmgr(wstate->index);
                smgrimmedsync(wstate->index->rd_smgr, MAIN_FORKNUM);
        }

In cases where we create lots of small indexes, e.g. because of an initial
schema load, partition creation or something like that, that turns out
to be a major limiting factor (unless one turns fsync off).


One way to address that would be to put newly built indexes into s_b
(using a strategy, to avoid blowing out the whole cache), instead of
using smgrwrite() etc directly. But that's a discussion with a bit more
complex tradeoffs.


What I wonder is why the issue addressed in the comment I copied above
can't be addressed more efficiently using sync requests, like we do for
other writes?  It's possibly a bit more complicated than just passing
skipFsync=false to smgrwrite/smgrextend, but it should be quite doable?
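To make the hoped-for benefit concrete, here is a toy cost model (every name
in it is invented for illustration, not PostgreSQL code): the current
smgrimmedsync() charges each index build one blocking fsync in the backend,
while skipFsync=false merely queues a sync request that the checkpointer
absorbs in the background later:

```c
#include <assert.h>
#include <stdint.h>

/* Toy cost model, not PostgreSQL code: where do the fsyncs happen? */

static uint64_t foreground_fsyncs;   /* fsyncs the backend blocks on */
static uint64_t background_fsyncs;   /* fsyncs done later by the checkpointer */
static uint64_t pending_requests;    /* queued sync requests */

/* current behavior: each build ends with a blocking smgrimmedsync() */
static void build_index_immedsync(void)
{
    foreground_fsyncs++;
}

/* proposed behavior: smgrwrite/smgrextend with skipFsync=false just
 * queues a sync request; the build itself never waits on fsync */
static void build_index_sync_request(void)
{
    pending_requests++;
}

/* at the next checkpoint, the checkpointer fsyncs the dirty files */
static void checkpointer_absorb(void)
{
    background_fsyncs += pending_requests;
    pending_requests = 0;
}
```

The model only shows where the fsyncs move: out of the backend's critical
path and into the checkpointer.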


A quick hack (probably not quite correct!) to evaluate the benefit shows
that the attached script takes 2m17.223s with the smgrimmedsync and
0m22.870s passing skipFsync=false to write/extend. Entirely IO bound in
the former case, CPU bound in the latter.

Creating lots of tables with indexes (directly or indirectly through
relations having a toast table) is pretty common, particularly after the
introduction of partitioning.


Thinking through the correctness of replacing smgrimmedsync() with sync
requests, the potential problems that I can see are:

1) redo point falls between the log_newpage() and the
   write()/register_dirty_segment() in smgrextend/smgrwrite.
2) redo point falls between write() and register_dirty_segment()

But both of these are fine in the context of initially filling a newly
created relfilenode, as far as I can tell? Otherwise the current
smgrimmedsync() approach wouldn't be safe either?

Greetings,

Andres Freund

Attachment: createlots.sql (291 bytes)

Re: Avoiding smgrimmedsync() during nbtree index builds

Heikki Linnakangas
On 21/01/2021 22:36, Andres Freund wrote:

> Hi,
>
> Every nbtree index build currently does an smgrimmedsync at the end:
>
> /*
>  * Read tuples in correct sort order from tuplesort, and load them into
>  * btree leaves.
>  */
> static void
> _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
> ...
>         /*
>          * When we WAL-logged index pages, we must nonetheless fsync index files.
>          * Since we're building outside shared buffers, a CHECKPOINT occurring
>          * during the build has no way to flush the previously written data to
>          * disk (indeed it won't know the index even exists).  A crash later on
>          * would replay WAL from the checkpoint, therefore it wouldn't replay our
>          * earlier WAL entries. If we do not fsync those pages here, they might
>          * still not be on disk when the crash occurs.
>          */
>         if (wstate->btws_use_wal)
>         {
>                 RelationOpenSmgr(wstate->index);
>                 smgrimmedsync(wstate->index->rd_smgr, MAIN_FORKNUM);
>         }
>
> In cases we create lots of small indexes, e.g. because of an initial
> schema load, partition creation or something like that, that turns out
> to be a major limiting factor (unless one turns fsync off).
>
>
> One way to address that would be to put newly built indexes into s_b
> (using a strategy, to avoid blowing out the whole cache), instead of
> using smgrwrite() etc directly. But that's a discussion with a bit more
> complex tradeoffs.
>
>
> What I wonder is why the issue addressed in the comment I copied above
> can't more efficiently be addressed using sync requests, like we do for
> other writes?  It's possibly bit more complicated than just passing
> skipFsync=false to smgrwrite/smgrextend, but it should be quite doable?

Makes sense.

> A quick hack (probably not quite correct!) to evaluate the benefit shows
> that the attached script takes 2m17.223s with the smgrimmedsync and
> 0m22.870s passing skipFsync=false to write/extend. Entirely IO bound in
> the former case, CPU bound in the latter.
>
> Creating lots of tables with indexes (directly or indirectly through
> relations having a toast table) is pretty common, particularly after the
> introduction of partitioning.
>
>
> Thinking through the correctness of replacing smgrimmedsync() with sync
> requests, the potential problems that I can see are:
>
> 1) redo point falls between the log_newpage() and the
>     write()/register_dirty_segment() in smgrextend/smgrwrite.
> 2) redo point falls between write() and register_dirty_segment()
>
> But both of these are fine in the context of initially filling a newly
> created relfilenode, as far as I can tell? Otherwise the current
> smgrimmedsync() approach wouldn't be safe either, as far as I can tell?

Hmm. If the redo point falls between write() and
register_dirty_segment(), and the checkpointer finishes the whole
checkpoint before register_dirty_segment(), you are not safe. That can't
happen with a write from the buffer manager, because the checkpointer
would block waiting for the flush of the buffer to finish.
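For illustration, that interleaving can be written down as a tiny
crash-recovery model (stand-in names only, nothing from the tree): a page
survives a crash if some fsync made it durable, or if its WAL record lies
at/after the redo point and therefore gets replayed.

```c
#include <assert.h>
#include <stdbool.h>

static bool page_written;       /* write() handed data to the kernel */
static bool page_durable;       /* an fsync made it reach disk */
static bool sync_registered;    /* checkpointer knows the file is dirty */
static bool wal_before_redo;    /* redo point advanced past the page's WAL */

static void reset(void)
{
    page_written = page_durable = sync_registered = wal_before_redo = false;
}

static void backend_write(void)         { page_written = true; }
static void backend_register_sync(void) { sync_registered = true; }

/* checkpoint: fsync everything registered so far, then move the redo
 * point past the page's WAL record */
static void checkpoint(void)
{
    if (sync_registered && page_written)
        page_durable = true;
    wal_before_redo = true;
}

/* simulate one build; returns whether the page survives a crash */
static bool survives_crash(bool register_before_checkpoint)
{
    reset();
    backend_write();
    if (register_before_checkpoint)
        backend_register_sync();
    checkpoint();
    if (!register_before_checkpoint)
        backend_register_sync();    /* too late: checkpoint already done */
    return page_durable || !wal_before_redo;
}
```

In the bad ordering the page is neither on disk nor reachable from the redo
point, which is exactly the unsafe window described above.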

- Heikki



Re: Avoiding smgrimmedsync() during nbtree index builds

Andres Freund
Hi,

On 2021-01-21 23:54:04 +0200, Heikki Linnakangas wrote:

> On 21/01/2021 22:36, Andres Freund wrote:
> > A quick hack (probably not quite correct!) to evaluate the benefit shows
> > that the attached script takes 2m17.223s with the smgrimmedsync and
> > 0m22.870s passing skipFsync=false to write/extend. Entirely IO bound in
> > the former case, CPU bound in the latter.
> >
> > Creating lots of tables with indexes (directly or indirectly through
> > relations having a toast table) is pretty common, particularly after the
> > introduction of partitioning.
> >
> >
> > Thinking through the correctness of replacing smgrimmedsync() with sync
> > requests, the potential problems that I can see are:
> >
> > 1) redo point falls between the log_newpage() and the
> >     write()/register_dirty_segment() in smgrextend/smgrwrite.
> > 2) redo point falls between write() and register_dirty_segment()
> >
> > But both of these are fine in the context of initially filling a newly
> > created relfilenode, as far as I can tell? Otherwise the current
> > smgrimmedsync() approach wouldn't be safe either, as far as I can tell?
>
> Hmm. If the redo point falls between write() and the
> register_dirty_segment(), and the checkpointer finishes the whole checkpoint
> before register_dirty_segment(), you are not safe. That can't happen with
> write from the buffer manager, because the checkpointer would block waiting
> for the flush of the buffer to finish.

Hm, right.

The easiest way to address that race would be to just record the redo
pointer in _bt_leafbuild() and continue to do the smgrimmedsync if a
checkpoint started since the start of the index build.
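A minimal sketch of that check, with a stub standing in for PostgreSQL's
GetRedoRecPtr() (all names here are illustrative, not a patch):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;    /* stand-in for PostgreSQL's XLogRecPtr */

static XLogRecPtr current_redo;                  /* advanced by checkpoints */

static XLogRecPtr get_redo_rec_ptr(void) { return current_redo; }
static void checkpoint_starts(void)      { current_redo++; }

static XLogRecPtr build_start_redo;

/* record the redo pointer when _bt_leafbuild() would start */
static void index_build_begin(void)
{
    build_start_redo = get_redo_rec_ptr();
}

/* at the end of the build: sync requests suffice unless a checkpoint
 * started meanwhile, in which case fall back to smgrimmedsync() */
static bool need_immedsync(void)
{
    return get_redo_rec_ptr() != build_start_redo;
}
```

In the common case no checkpoint starts during a short index build, so the
expensive immediate sync is skipped.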

Another approach would be to utilize PGPROC.delayChkpt, but I would
rather not unnecessarily expand the use of that.

It's kind of interesting - in my aio branch I moved the
register_dirty_segment() to before the actual asynchronous write (the
necessary data is available earlier there), which ought to be safe because
of the buffer interlocking. But that doesn't work here, or for other places
doing writes without going through s_b.  It'd be great if we could come
up with a general solution, but I don't immediately see one.

The best I can come up with is adding helper functions that wrap some of
the complexity of "unbuffered" writes, doing an immedsync iff the
redo pointer changed. Something very roughly like

typedef struct UnbufferedWriteState
{
        XLogRecPtr      redo;
        uint64          numwrites;
} UnbufferedWriteState;

void unbuffered_prep(UnbufferedWriteState *state);
void unbuffered_write(UnbufferedWriteState *state, ...);
void unbuffered_extend(UnbufferedWriteState *state, ...);
void unbuffered_finish(UnbufferedWriteState *state);

which wouldn't just do the dance to avoid the immedsync() if possible,
but would also take care of PageSetChecksumInplace() (and
PageEncryptInplace() if we get that [1]).
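To illustrate, here is a hypothetical fleshing-out of those helpers with
the storage, checksum and WAL calls stubbed out; every name below is an
assumption for the sketch, not a patch:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;

static XLogRecPtr fake_redo;          /* stand-in for GetRedoRecPtr() */
static bool did_immedsync;            /* did finish fall back to immedsync? */

typedef struct UnbufferedWriteState
{
    XLogRecPtr  redo;                 /* redo pointer at unbuffered_prep() */
    uint64_t    numwrites;
} UnbufferedWriteState;

static void
unbuffered_prep(UnbufferedWriteState *state)
{
    state->redo = fake_redo;
    state->numwrites = 0;
    did_immedsync = false;
}

static void
unbuffered_write(UnbufferedWriteState *state)
{
    /* the real version would PageSetChecksumInplace() the page and
     * smgrwrite() it with skipFsync=false, queuing a sync request */
    state->numwrites++;
}

static void
unbuffered_finish(UnbufferedWriteState *state)
{
    /* only if a checkpoint moved the redo pointer during the writes
     * do we need the expensive immediate sync */
    if (fake_redo != state->redo)
        did_immedsync = true;
}
```

Callers would then bracket their smgrwrite/smgrextend loops with prep and
finish, and the redo-pointer comparison decides the fallback in one place.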

Greetings,

Andres Freund

[1] https://www.postgresql.org/message-id/20210112193431.2edcz776qjen7kao%40alap3.anarazel.de