Postgres, fsync, and OSs (specifically linux)

Re: Postgres, fsync, and OSs (specifically linux)

Thomas Munro-3
On Thu, Aug 30, 2018 at 2:44 PM Craig Ringer <[hidden email]> wrote:
> On 15 August 2018 at 07:32, Thomas Munro <[hidden email]> wrote:
>> I will soon post some more fix-up patches that add EXEC_BACKEND
>> support, Windows support, and a counting scheme to fix the timing
>> issue that I mentioned in my first review.  I will probably squash it
>> all down to a tidy patch-set after that.

I went down a bit of a rabbit hole with the Windows support for
Andres's patch set.  I have something that works as far as I can tell,
but my Windows environment consists of throwing things at Appveyor and
seeing what sticks, so I'm hoping that someone with a real Windows
system and knowledge will be able to comment.

New patches in this WIP patch set:

0012: Fix for EXEC_BACKEND.

0013: Windows.  This involved teaching latch.c to deal with Windows
asynchronous IO events, since you can't wait for pipe readiness via
WSAEventSelect.  Pipes and sockets exist in different dimensions on
Windows, and there are no "Unix" domain sockets (well, there are but
they aren't usable yet[1]).  An alternative would be to use TCP
sockets for this, and then the code would look more like the Unix
code, but that seems a bit strange.  Note that the Windows version
doesn't actually hand off file handles like the Unix code (it could
fairly easily, but there is no reason to think that would actually be
useful on that platform).  I may be way off here...

The 0013 patch also fixes a mistake in the 0010 patch: it is not
appropriate to call CFI() while waiting to notify the checkpointer of
a dirty segment, because then ^C could cause the following checkpoint
not to flush dirty data.  SendFsyncRequest() is essentially blocking,
except that it uses non-blocking IO so that it can multiplex postmaster
death detection.
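
To illustrate the shape of that (a hand-wavy sketch only, not the
patch's actual code; fsync_pipe_fd and postmaster_alive_fd are made-up
names standing in for the real descriptors):

    #include "postgres.h"

    #include <poll.h>
    #include <unistd.h>

    #include "storage/ipc.h"        /* proc_exit() */

    /*
     * Sketch: a blocking-style send built on non-blocking IO, so that
     * postmaster death can be noticed while we wait for pipe space.
     */
    static void
    send_fsync_request_sketch(const char *buf, size_t len)
    {
        while (len > 0)
        {
            ssize_t     rc = write(fsync_pipe_fd, buf, len);
            struct pollfd fds[2];

            if (rc > 0)
            {
                buf += rc;
                len -= rc;
                continue;
            }
            if (rc < 0 && errno != EAGAIN && errno != EWOULDBLOCK && errno != EINTR)
                elog(ERROR, "could not write to fsync request pipe: %m");

            /* Pipe is full: wait for writability or postmaster death. */
            fds[0].fd = fsync_pipe_fd;
            fds[0].events = POLLOUT;
            fds[1].fd = postmaster_alive_fd;    /* read end of the death pipe */
            fds[1].events = POLLIN;
            if (poll(fds, 2, -1) < 0 && errno != EINTR)
                elog(ERROR, "poll() failed: %m");
            if (fds[1].revents != 0)
                proc_exit(1);       /* postmaster has exited */
        }
    }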

0014: Fix the ordering race condition mentioned upthread[2].  All
files are assigned an increasing sequence number after [re]opening (ie
before their first write), so that the checkpointer process can track
the fd that must have the oldest Linux f_wb_err that could be relevant
for writes done by PostgreSQL.
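
Roughly, the counting scheme looks like this (an illustrative sketch
with made-up names, not the patch's real data structures): each [re]open
stamps the file with an increasing number before the first write through
it, and when the checkpointer already holds a descriptor for that file
it keeps whichever one was opened earlier.

    static uint64 open_seq_counter = 0;

    static uint64
    assign_open_seq(void)
    {
        /* Called just after [re]opening a file, before the first write. */
        return ++open_seq_counter;
    }

    typedef struct PendingFsyncFile
    {
        int         fd;         /* descriptor kept for the eventual fsync() */
        uint64      open_seq;   /* sequence number assigned at [re]open time */
    } PendingFsyncFile;

    static void
    remember_fsync_fd(PendingFsyncFile *entry, int new_fd, uint64 new_seq)
    {
        if (new_seq < entry->open_seq)
        {
            /* The newly received descriptor was opened earlier: prefer it. */
            close(entry->fd);
            entry->fd = new_fd;
            entry->open_seq = new_seq;
        }
        else
            close(new_fd);
    }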

The other patches in this tarball are all as posted already, but are
now rebased and assembled in one place.  Also pushed to
https://github.com/macdice/postgres/tree/fsyncgate .

Thoughts?

[1] https://blogs.msdn.microsoft.com/commandline/2017/12/19/af_unix-comes-to-windows/
[2] https://www.postgresql.org/message-id/CAEepm%3D04ZCG_8N3m61kXZP-7Ecr02HUNNG-QsAhwyFLim4su2g%40mail.gmail.com

--
Thomas Munro
http://www.enterprisedb.com

Attachment: fsyncgate-v3.tgz (41K)

Re: Postgres, fsync, and OSs (specifically linux)

Thomas Munro-3
On Fri, Sep 28, 2018 at 9:37 PM Thomas Munro
<[hidden email]> wrote:
> The 0013 patch also fixes a mistake in the 0010 patch: it is not
> appropriate to call CFI() while waiting to notify the checkpointer of
> a dirty segment, because then ^C could cause the following checkpoint
> not to flush dirty data.

(Though of course it wouldn't actually do that, due to an LWLock being
held; but still, I removed the CFI because it was at best misleading.)

--
Thomas Munro
http://www.enterprisedb.com

Re: Postgres, fsync, and OSs (specifically linux)

Thomas Munro-3
In reply to this post by Thomas Munro-3
On Fri, Sep 28, 2018 at 9:37 PM Thomas Munro
<[hidden email]> wrote:
> The other patches in this tarball are all as posted already, but are
> now rebased and assembled in one place.  Also pushed to
> https://github.com/macdice/postgres/tree/fsyncgate .

Here is a new version that fixes an assertion failure during crash
recovery, revealed by cfbot.  I also took the liberty of squashing the
patch stack into one and writing a new commit message, except for the
Windows part, which seems worth keeping separate until we agree it's
the right way forward.

--
Thomas Munro
http://www.enterprisedb.com

Attachment: 0001-Keep-file-descriptors-open-to-avoid-losing-errors-v4.patch (87K)
Attachment: 0002-Add-an-fsync-request-pipe-for-Windows-v4.patch (22K)

Re: Postgres, fsync, and OSs (specifically linux)

Thomas Munro-3
Hello hackers,

Let's try to get this issue resolved.  Here is my position on the
course of action we should take in back-branches:

1.  I am -1 on back-patching the fd-transfer code.  It's a significant
change, and even when sufficiently debugged (I don't think it's there
yet), we have no idea what will happen on all the kernels we support
under extreme workloads.  IMHO there is no way we can spring this on
users in a point release.

2.  I am +1 on back-patching Craig's PANIC-on-failure logic.  Doing
nothing is not an option I like.  I have some feedback and changes to
propose though; see attached.

Responses to a review from Robert:

On Thu, Jul 19, 2018 at 7:23 AM Robert Haas <[hidden email]> wrote:

> 2. I don't like promote_ioerr_to_panic() very much, partly because the
> same pattern gets repeated over and over, and partly because it would
> be awkwardly-named if we discovered that another 2 or 3 errors needed
> similar handling (or some other variant handling).  I suggest instead
> having a function like report_critical_fsync_failure(char *path) that
> does something like this:
>
>     int elevel = ERROR;
>     if (errno == EIO)
>         elevel = PANIC;
>     ereport(elevel,
>                 (errcode_for_file_access(),
>                  errmsg("could not fsync file \"%s\": %m", path)));
>
> And similarly I'd add report_critical_close_failure.  In some cases,
> this would remove wording variations (e.g. in twophase.c) but I think
> that's fine, and maybe an improvement, as discussed on another recent
> thread.
I changed it to look like data_sync_elevel(ERROR) and made it treat
all errnos the same.  ENOSPC, EIO, EWOK, EIEIO, it makes no difference
to the level of faith I have that my data still exists.
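
For the record, the helper boils down to something like this (a minimal
sketch, assuming the data_sync_retry GUC described further down):

    /*
     * Sketch: pick the error level for a data-sync failure.  With
     * data_sync_retry at its default of false, the failure is escalated
     * to PANIC so that recovery from the WAL, rather than a doomed
     * fsync() retry, is what gets the data back on disk.
     */
    int
    data_sync_elevel(int elevel)
    {
        return data_sync_retry ? elevel : PANIC;
    }

Call sites then say ereport(data_sync_elevel(ERROR), ...) instead of
hard-coding ERROR or PANIC.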

> 3. slru.c calls pg_fsync() but isn't changed by the patch.  That looks wrong.

Fixed.

> 4. The comment changes in snapbuild.c interact with the TODO that
> immediately follows.  I think more adjustment is needed here.

I don't understand this.

> 5. It seems odd that you adjusted the comment for
> pg_fsync_no_writethrough() but not pg_fsync_writethrough() or
> pg_fsync().  Either pg_fsync_writethrough() doesn't have the same
> problem, in which case, awesome, but let's add a comment, or it does,
> in which case it should refer to the other one.  And I think
> pg_fsync() itself needs a comment saying that every caller must be
> careful to use promote_ioerr_to_panic() or
> report_critical_fsync_failure() or whatever we end up calling it
> unless the fsync is not critical for data integrity.

I removed these comments and many others; I don't see the point in
scattering descriptions of this problem and references to specific
versions of Linux and -hackers archive links all over the place.  I
added a comment in one place, and also added some user documentation
of the problem.

> 6. In md.c, there's a stray blank line added.  But more importantly,
> the code just above that looks like this:
>
>                      if (!FILE_POSSIBLY_DELETED(errno) ||
>                          failures > 0)
> -                        ereport(ERROR,
> +                        ereport(promote_ioerr_to_panic(ERROR),
>                                  (errcode_for_file_access(),
>                                   errmsg("could not fsync file \"%s\": %m",
>                                          path)));
>                      else
>                          ereport(DEBUG1,
>                                  (errcode_for_file_access(),
>                                   errmsg("could not fsync file \"%s\"
> but retrying: %m",
>                                          path)));
>
> I might be all wet here, but it seems like if we enter the bottom
> branch, we still need the promote-to-panic behavior.
That case is only reached if FILE_POSSIBLY_DELETED() is true on the
first time through the loop, and the errno value it sees doesn't
actually come from fsync().  It comes from FileSync(), when it tries to
reopen a virtual fd and gets ENOENT before calling fsync().  Code
further down then absorbs incoming requests before checking whether
that was expected, closing a race.  The comments could make that
clearer, admittedly.

> 7. The comment adjustment for SyncDataDirectory mentions an
> "important" fact about fsync behavior, but then doesn't seem to change
> any logic on that basis.  I think in general a number of these
> comments need a little more thought, but in this particular case, I
> think we also need to consider what the behavior should be (and the
> comment should reflect our considered judgement on that point, and the
> implementation should match).

I updated the comment.  I don't think this is too relevant to the
fsync() failure case, because we'll be rewriting all changes from the
WAL again during recovery; I think this function is mostly useful for
switching from fsync = off to fsync = on and restarting, not coping
with previous fsync() failures by retrying (which we know to be
useless anyway).  Someone could argue that if you restarted after
changing fsync from off to on, then this may be the first time you
learn that write-back failed, and then you're somewhat screwed whether
we panic or not, but I don't see any solution to that.  Don't run
databases with fsync = off.

> 8. Andres suggested to me off-list that we should have a GUC to
> disable the promote-to-panic behavior in case it turns out to be a
> show-stopper for some user.  I think that's probably a good idea.
> Adding many new ways to PANIC in a minor release without providing any
> way to go back to the old behavior sounds unfriendly.  Surely, anyone
> who suffers much from this has really serious other problems anyway,
> but all the same I think we should provide an escape hatch.

+1.  See the new GUC data_sync_retry, defaulting to false.  If set to
true, we also need to fix the problem reported in [1], so here's the
patch for that too.
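
For reference, a boolean GUC like that is declared roughly as follows
(a sketch of the usual guc.c boilerplate; the description text here is
my wording):

    /* Sketch of the entry in guc.c's ConfigureNamesBool[] */
    {
        {"data_sync_retry", PGC_POSTMASTER, ERROR_HANDLING_OPTIONS,
            gettext_noop("Whether to continue running after a failure to sync data files."),
            NULL
        },
        &data_sync_retry,
        false,
        NULL, NULL, NULL
    },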

Other comments:

I don't see why sync_file_range(SYNC_FILE_RANGE_WRITE) should get a
pass here.  Inspection of some version of the kernel might tell us it
can't advance the error counter and report failure, but what do we
gain by relying on that?  Changed.
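
Concretely, the sync_file_range() call site changes to roughly this (a
sketch rather than the exact diff; fd, offset and nbytes are
pg_flush_data()'s existing arguments):

    int         rc;

    /* Sketch: stop ignoring sync_file_range()'s return value. */
    rc = sync_file_range(fd, offset, nbytes, SYNC_FILE_RANGE_WRITE);
    if (rc != 0)
        ereport(data_sync_elevel(ERROR),
                (errcode_for_file_access(),
                 errmsg("could not flush dirty data: %m")));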

FD_DELETE_AT_CLOSE is not a good way to detect temporary files in
recent versions, as it doesn't detect the kind of shared temporary
files used by Parallel Hash; FD_TEMP_FILE_LIMIT is a better way.
Changed.  (We could also just not bother exempting temporary files?)

I plan to continue working on the fd-transfer system as part of a
larger sync queue redesign project for 12.  If we can get an agreement
that we can't possibly back-patch the fd-transfer logic, then we can
move all future discussion of that topic over to the other thread[2],
and this thread can be about consensus to back-patch the PANIC patch.
Thoughts?

[1] https://www.postgresql.org/message-id/flat/87y3i1ia4w.fsf@...
[2] https://www.postgresql.org/message-id/flat/CAEepm=2gTANm=e3ARnJT=n0h8hf88wqmaZxk0JYkxw+b21fNrw@...

Attachment: 0001-Don-t-forget-about-failed-fsync-requests-v4.patch (3K)
Attachment: 0002-PANIC-on-fsync-failure-v4.patch (21K)

Re: Postgres, fsync, and OSs (specifically linux)

Craig Ringer-3
On Fri, 19 Oct 2018 at 07:27, Thomas Munro <[hidden email]> wrote:

> 2.  I am +1 on back-patching Craig's PANIC-on-failure logic.  Doing
> nothing is not an option I like.  I have some feedback and changes to
> propose though; see attached.

Thanks very much for the work on reviewing and revising this.
 
> I don't see why sync_file_range(SYNC_FILE_RANGE_WRITE) should get a
> pass here.  Inspection of some version of the kernel might tell us it
> can't advance the error counter and report failure, but what do we
> gain by relying on that?  Changed.

I was sure it made sense at the time, but I can't explain that decision now, and it looks like we should treat it as a failure.

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: Postgres, fsync, and OSs (specifically linux)

Thomas Munro-3
On Fri, Oct 19, 2018 at 6:42 PM Craig Ringer <[hidden email]> wrote:
> On Fri, 19 Oct 2018 at 07:27, Thomas Munro <[hidden email]> wrote:
>> 2.  I am +1 on back-patching Craig's PANIC-on-failure logic.  Doing
>> nothing is not an option I like.  I have some feedback and changes to
>> propose though; see attached.
>
> Thanks very much for the work on reviewing and revising this.

My plan is to do a round of testing and review of this stuff next week
once the dust is settled on the current minor releases (including
fixing a few typos I just spotted and some word-smithing).  All going
well, I will then push the resulting patches to master and all
supported stable branches, unless other reviews or objections appear.
At some point not too far down the track I hope to be ready to
consider committing that other patch that will completely change all
of this code in the master branch, but in any case Craig's patch will
get almost a full minor release cycle to sit in the stable branches
before release.

--
Thomas Munro
http://www.enterprisedb.com

Re: Postgres, fsync, and OSs (specifically linux)

Robert Haas
On Wed, Nov 7, 2018 at 9:41 PM Thomas Munro
<[hidden email]> wrote:

> My plan is to do a round of testing and review of this stuff next week
> once the dust is settled on the current minor releases (including
> fixing a few typos I just spotted and some word-smithing).  All going
> well, I will then push the resulting patches to master and all
> supported stable branches, unless other reviews or objections appear.
> At some point not too far down the track I hope to be ready to
> consider committing that other patch that will completely change all
> of this code in the master branch, but in any case Craig's patch will
> get almost a full minor release cycle to sit in the stable branches
> before release.

I did a read-through of these patches.

+ new_requests = entry->requests[forknum];
+ entry->requests[forknum] =
+ bms_join(new_requests, requests);

What happens if bms_join fails, too?

+        recover from the WAL after any failure is reported, preferrably

preferably.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Postgres, fsync, and OSs (specifically linux)

Thomas Munro-3
On Fri, Nov 9, 2018 at 7:07 AM Robert Haas <[hidden email]> wrote:

> On Wed, Nov 7, 2018 at 9:41 PM Thomas Munro
> <[hidden email]> wrote:
> > My plan is to do a round of testing and review of this stuff next week
> > once the dust is settled on the current minor releases (including
> > fixing a few typos I just spotted and some word-smithing).  All going
> > well, I will then push the resulting patches to master and all
> > supported stable branches, unless other reviews or objections appear.
> > At some point not too far down the track I hope to be ready to
> > consider committing that other patch that will completely change all
> > of this code in the master branch, but in any case Craig's patch will
> > get almost a full minor release cycle to sit in the stable branches
> > before release.
>
> I did a read-through of these patches.
>
> + new_requests = entry->requests[forknum];
> + entry->requests[forknum] =
> + bms_join(new_requests, requests);
>
> What happens if bms_join fails, too?

My reasoning for choosing bms_join() is that it cannot fail, assuming
the heap is not corrupted.  It simply ORs the two bit-strings into
whichever is the longer input string, and frees the shorter input
string.  (In an earlier version I used bms_union(), this function's
non-destructive sibling, but then realised that it could fail to
allocate memory, causing us to lose track of a 1 bit.)
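
To spell that out, bms_join() behaves roughly like this simplified
sketch (the real function lives in src/backend/nodes/bitmapset.c):

    /*
     * Simplified sketch of bms_join(): recycle the longer input as the
     * result, OR the shorter one into it, and pfree() the shorter one.
     * No allocation happens, so there's no way to lose a bit on OOM.
     */
    Bitmapset *
    bms_join_sketch(Bitmapset *a, Bitmapset *b)
    {
        Bitmapset  *result;
        Bitmapset  *other;
        int         i;

        if (a == NULL)
            return b;
        if (b == NULL)
            return a;
        if (a->nwords >= b->nwords)
        {
            result = a;
            other = b;
        }
        else
        {
            result = b;
            other = a;
        }
        for (i = 0; i < other->nwords; i++)
            result->words[i] |= other->words[i];
        if (other != result)
            pfree(other);
        return result;
    }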

Philosophical point: if pfree() throws, then bms_join() throws, but
(assuming AllocSetFree() implementation) it can only throw if the heap
is corrupted, eg elog(ERROR, "could not find block containing chunk
%p", chunk) and possibly other errors.  Of course it's impossible to
make guarantees of any kind in case of arbitrary corruption.  But
perhaps we could do this in a critical section, so errors are promoted
to PANIC.
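
That would look something like this at the call site (a sketch, using
the names from the diff quoted above):

    /* Sketch: any error thrown inside here is escalated to PANIC. */
    START_CRIT_SECTION();
    entry->requests[forknum] = bms_join(entry->requests[forknum], requests);
    END_CRIT_SECTION();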

> +        recover from the WAL after any failure is reported, preferrably
>
> preferably.

Thanks.

--
Thomas Munro
http://www.enterprisedb.com

Re: Postgres, fsync, and OSs (specifically linux)

Robert Haas
On Thu, Nov 8, 2018 at 3:04 PM Thomas Munro
<[hidden email]> wrote:
> My reasoning for choosing bms_join() is that it cannot fail, assuming
> the heap is not corrupted.  It simply ORs the two bit-strings into
> whichever is the longer input string, and frees the shorter input
> string.  (In an earlier version I used bms_union(), this function's
> non-destructive sibling, but then realised that it could fail to
> allocate memory, causing us to lose track of a 1 bit.)

Oh, OK.  I was assuming it was allocating.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Postgres, fsync, and OSs (specifically linux)

Thomas Munro-3
On Fri, Nov 9, 2018 at 9:06 AM Robert Haas <[hidden email]> wrote:

> On Thu, Nov 8, 2018 at 3:04 PM Thomas Munro
> <[hidden email]> wrote:
> > My reasoning for choosing bms_join() is that it cannot fail, assuming
> > the heap is not corrupted.  It simply ORs the two bit-strings into
> > whichever is the longer input string, and frees the shorter input
> > string.  (In an earlier version I used bms_union(), this function's
> > non-destructive sibling, but then realised that it could fail to
> > allocate memory, causing us to lose track of a 1 bit.)
>
> Oh, OK.  I was assuming it was allocating.
I did some more testing using throw-away fault injection patch 0003.
I found one extra problem:  fsync_fname() needed data_sync_elevel()
treatment, because it is used in eg CheckPointCLOG().

With data_sync_retry = on, if you update a row, touch
/tmp/FileSync_EIO and try to checkpoint, then the checkpoint fails and
the cluster keeps running.  Future checkpoint attempts report the same
error about the same file, showing that patch 0001 works (we didn't
forget about the dirty file).  Then rm /tmp/FileSync_EIO, and the next
checkpoint should succeed.

With data_sync_retry = off (the default), the same test produces a
PANIC, showing that patch 0002 works.

It's similar if you touch /tmp/pg_sync_EIO instead.  That shows that
cases like fsync_fname("pg_xact") also cause PANIC when
data_sync_retry = off, but it hides the bug that 0001 fixes when
data_sync_retry = on, hence my desire to test the two different fault
injection points.
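
For anyone wanting to reproduce this, the injection amounts to
something like the following paraphrased sketch (the exact hook points
are my guess from the magic file names above; the real throw-away 0003
patch is attached below):

    #include <sys/stat.h>

    /*
     * Paraphrased sketch of the fault injection: pretend the sync failed
     * with EIO whenever a magic file exists.  Presumably one check lives
     * in FileSync() (/tmp/FileSync_EIO) and another in pg_fsync()
     * (/tmp/pg_sync_EIO).
     */
    static bool
    fault_requested(const char *magic_path)
    {
        struct stat st;

        return stat(magic_path, &st) == 0;
    }

    /* e.g. near the top of FileSync(): */
    if (fault_requested("/tmp/FileSync_EIO"))
    {
        errno = EIO;
        return -1;
    }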

I think these patches are looking good now.  If I don't spot any other
problems or hear any objections, I will commit them tomorrow-ish.

--
Thomas Munro
http://www.enterprisedb.com

Attachment: 0001-Don-t-forget-about-failed-fsync-requests-v5.patch (3K)
Attachment: 0002-PANIC-on-fsync-failure-v5.patch (21K)
Attachment: 0003-fsync-fault-injection.-For-testing-only-v5.patch (1K)

Re: Postgres, fsync, and OSs (specifically linux)

Thomas Munro-3
On Fri, Nov 9, 2018 at 9:03 AM Thomas Munro
<[hidden email]> wrote:
> On Fri, Nov 9, 2018 at 7:07 AM Robert Haas <[hidden email]> wrote:
> > On Wed, Nov 7, 2018 at 9:41 PM Thomas Munro
> > <[hidden email]> wrote:
> > > My plan is to do a round of testing and review of this stuff next week
> > > once the dust is settled on the current minor releases (including
> > > fixing a few typos I just spotted and some word-smithing).  All going
> > > well, I will then push the resulting patches to master and all
> > > supported stable branches, unless other reviews or objections appear.

...

On Sun, Nov 18, 2018 at 3:20 PM Thomas Munro
<[hidden email]> wrote:
> I think these patches are looking good now.  If I don't spot any other
> problems or hear any objections, I will commit them tomorrow-ish.

Hearing no objections, pushed to all supported branches.

Thank you to Craig for all his work getting to the bottom of this, to
Andres for his open source diplomacy, and to the Linux guys for their
change "errseq: Always report a writeback error once", which came out
of that.

Some more comments:

* The promotion of errors from close() to PANIC may or may not be
effective considering that it doesn't have interlocking with
concurrent checkpoints.  I'm not sure if it can really happen on local
file systems anyway... this may fall under the category of "making
PostgreSQL work reliably on NFS", a configuration that is not
recommended currently, and a separate project IMV.

* In 9.4 and 9.5 there is no checking of errors from
sync_file_range(), and I didn't add any for now.  It was claimed that
sync_file_range() without BEFORE/AFTER can't consume errors[1].
Errors are promoted in 9.6+ for consistency, because we already looked
at the return code there, and so that we don't rely on that knowledge
in the long term.

* I personally believe it is safe to run with data_sync_retry = on on
any file system on FreeBSD, and ZFS on any operating system... but I
see no need to make recommendations about that in the documentation,
other than that you should investigate the behaviour of your operating
system if you really want to turn it on.

* A PANIC (and possibly ensuing crash restart loop if the I/O error is
not transient) is of course a very unpleasant failure mode, but it is
one that we already had for the WAL and control file.  So I'm not sure
I'd personally bother to run with the non-default setting even on a
system where I believe it to be safe (considering the low likelihood
that I/O failure is isolated to certain files); at best it probably
gives you a better experience if the fs underneath a non-default
tablespace dies.

* The GUC is provided primarily because this patch is so drastic in
its effect that it seems like we owe our users a way to disable it on
principle, and that seems to outweigh a desire not to add GUCs in
back-branches.

* If I/O errors happen, your system is probably toast and you need to
fail over or restore from backups, but at least we won't tell you any
lies about checkpoints succeeding.  In rare scenarios, perhaps
involving a transient failure of virtualised storage with thin
provisioning as originally described by Craig, the system may actually
be able to continue running, and with this change we should now be
able to avoid data loss by recovering from the WAL.

* As noted in the commit message, this isn't quite the end of the story.
See the fsync queue redesign thread[2], WIP for master only.

[1] https://www.postgresql.org/message-id/20180430160945.5s5qfoqryhtmugxl%40alap3.anarazel.de
[2] https://www.postgresql.org/message-id/flat/CAEepm=2gTANm=e3ARnJT=n0h8hf88wqmaZxk0JYkxw+b21fNrw@...

--
Thomas Munro
http://www.enterprisedb.com
