Postgres, fsync, and OSs (specifically linux)

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
71 messages Options
1234
Reply | Threaded
Open this post in threaded view
|

Re: Postgres, fsync, and OSs (specifically linux)

Thomas Munro-3
On Sun, Apr 29, 2018 at 1:58 PM, Craig Ringer <[hidden email]> wrote:

> On 28 April 2018 at 23:25, Simon Riggs <[hidden email]> wrote:
>> On 27 April 2018 at 15:28, Andres Freund <[hidden email]> wrote:
>>>   While I'm a bit concerned adding user-code before a checkpoint, if
>>>   we'd do it as a shell command it seems pretty reasonable. And useful
>>>   even without concern for the fsync issue itself. Checking for IO
>>>   errors could e.g. also include checking for read errors - it'd not be
>>>   unreasonable to not want to complete a checkpoint if there'd been any
>>>   media errors.
>>
>> It seems clear that we need to evaluate our compatibility not just
>> with an OS, as we do now, but with an OS/filesystem.
>>
>> Although people have suggested some approaches, I'm more interested in
>> discovering how we can be certain we got it right.
>
> TBH, we can't be certain, because there are too many failure modes,
> some of which we can't really simulate in practical ways, or automated
> ways.

+1

Testing is good, but unless you have a categorical statement from the
relevant documentation or kernel team or you have the source code, I'm
not sure how you can ever really be sure about this.  I think we have
a fair idea now what several open kernels do, but we still haven't got
a clue about Windows, AIX, HPUX and Solaris and we only have half the
answer for Illumos, and no "negative" test result can prove that they
can't throw away write-back errors or data.

Considering the variety in interpretation and liberties taken, I
wonder if fsync() is underspecified and someone should file an issue
over at http://www.opengroup.org/austin/ about that.

--
Thomas Munro
http://www.enterprisedb.com

Reply | Threaded
Open this post in threaded view
|

Re: Postgres, fsync, and OSs (specifically linux)

Craig Ringer-3
On 30 April 2018 at 09:09, Thomas Munro <[hidden email]> wrote:

> Considering the variety in interpretation and liberties taken, I
> wonder if fsync() is underspecified and someone should file an issue
> over at http://www.opengroup.org/austin/ about that.

All it's going to achieve is adding an "is implementation-defined"
caveat, but that's at least a bit of a heads-up.

I filed patches for Linux man-pages ages ago. I'll update them and
post to LKML; apparently bugzilla has a lot of spam and many people
ignore notifications, so they might just bitrot forever otherwise.

Meanwhile, do we know if, on Linux 4.13+, if we get a buffered write
error due to dirty writeback before we close() a file we don't
fsync(), we'll get the error on close()?

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Reply | Threaded
Open this post in threaded view
|

Re: Postgres, fsync, and OSs (specifically linux)

Andres Freund
On 2018-04-30 10:14:23 +0800, Craig Ringer wrote:
> Meanwhile, do we know if, on Linux 4.13+, if we get a buffered write
> error due to dirty writeback before we close() a file we don't
> fsync(), we'll get the error on close()?

Not quite sure what you're getting at with "a file we don't fsync" - if
we don't, we don't care about durability anyway, no? Or do you mean
where we fsync in a different process?

Either way, the answer is mostly no: On NFS et al where close() implies
an fsync you'll get the error at that time, otherwise you'll get it at
the next fsync().

Greetings,

Andres Freund

Reply | Threaded
Open this post in threaded view
|

Re: Postgres, fsync, and OSs (specifically linux)

Craig Ringer-3
> Not quite sure what you're getting at with "a file we don't fsync" - if
> we don't, we don't care about durability anyway, no? Or do you mean
> where we fsync in a different process?

Right.

> Either way, the answer is mostly no: On NFS et al where close() implies
> an fsync you'll get the error at that time, otherwise you'll get it at
> the next fsync().

Thanks.

The reason I ask is that if we got notified of already-detected
writeback errors (on 4.13+) on close() too, it'd narrow the window a
little for problems, since normal backends could PANIC if close() of a
persistent file raised EIO. Otherwise we're less likely to see the
error, since the checkpointer won't see it - it happened before the
checkpointer open()ed the file. It'd still be no help for dirty
writeback that happens after we close() in a user backend / the
bgwriter and before we re-open(), but it'd be nice if the kernel would
tell us on close() if it knows of a writeback error.

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Reply | Threaded
Open this post in threaded view
|

Re: Postgres, fsync, and OSs (specifically linux)

Craig Ringer-3
Hrm, something else that just came up. On 9.6+ we use sync_file_range.
It's surely going to eat errors:

        rc = sync_file_range(fd, offset, nbytes,
                             SYNC_FILE_RANGE_WRITE);

        /* don't error out, this is just a performance optimization */
        if (rc != 0)
        {
            ereport(WARNING,
                    (errcode_for_file_access(),
                     errmsg("could not flush dirty data: %m")));
        }

so that has to panic too.

I'm very suspicious about the safety of the msync() path too.

I'll post an update to my PANIC-everywhere patch that add these cases.

Reply | Threaded
Open this post in threaded view
|

Re: Postgres, fsync, and OSs (specifically linux)

Andres Freund
On 2018-04-30 13:03:24 +0800, Craig Ringer wrote:

> Hrm, something else that just came up. On 9.6+ we use sync_file_range.
> It's surely going to eat errors:
>
>         rc = sync_file_range(fd, offset, nbytes,
>                              SYNC_FILE_RANGE_WRITE);
>
>         /* don't error out, this is just a performance optimization */
>         if (rc != 0)
>         {
>             ereport(WARNING,
>                     (errcode_for_file_access(),
>                      errmsg("could not flush dirty data: %m")));
>         }

It's not. Only SYNC_FILE_RANGE_WAIT_{BEFORE,AFTER} eat errors. Which
seems sensible, because they could be considered data integrity
operations.

fs/sync.c:
SYSCALL_DEFINE4(sync_file_range, int, fd, loff_t, offset, loff_t, nbytes,
                                unsigned int, flags)
{
...

        if (flags & SYNC_FILE_RANGE_WAIT_BEFORE) {
                ret = file_fdatawait_range(f.file, offset, endbyte);
                if (ret < 0)
                        goto out_put;
        }

        if (flags & SYNC_FILE_RANGE_WRITE) {
                ret = __filemap_fdatawrite_range(mapping, offset, endbyte,
                                                 WB_SYNC_NONE);
                if (ret < 0)
                        goto out_put;
        }

        if (flags & SYNC_FILE_RANGE_WAIT_AFTER)
                ret = file_fdatawait_range(f.file, offset, endbyte);


where
int file_fdatawait_range(struct file *file, loff_t start_byte, loff_t end_byte)
{
        struct address_space *mapping = file->f_mapping;

        __filemap_fdatawait_range(mapping, start_byte, end_byte);
        return file_check_and_advance_wb_err(file);
}
EXPORT_SYMBOL(file_fdatawait_range);

the critical call is file_check_and_advance_wb_err(). That's the call
that checks and clears errors.

Would be good to add a kernel xfstest (gets used on all relevant FS
these days), to make sure that doesn't change.


> I'm very suspicious about the safety of the msync() path too.

That seems justified however:

SYSCALL_DEFINE3(msync, unsigned long, start, size_t, len, int, flags)
{
...
                        error = vfs_fsync_range(file, fstart, fend, 1);
..


 */
int vfs_fsync_range(struct file *file, loff_t start, loff_t end, int datasync)
{
...
        return file->f_op->fsync(file, start, end, datasync);
}
EXPORT_SYMBOL(vfs_fsync_range);



int ext4_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
{
...
        ret = file_write_and_wait_range(file, start, end);
        if (ret)
                return ret;
...


STATIC int
xfs_file_fsync(
        struct file *file,
        loff_t start,
        loff_t end,
        int datasync)
{
...
        error = file_write_and_wait_range(file, start, end);
        if (error)
                return error;


int file_write_and_wait_range(struct file *file, loff_t lstart, loff_t lend)
{
...
        err2 = file_check_and_advance_wb_err(file);
        if (!err)
                err = err2;
        return err;
}


Greetings,

Andres Freund

Reply | Threaded
Open this post in threaded view
|

Re: Postgres, fsync, and OSs (specifically linux)

Craig Ringer-3
On 1 May 2018 at 00:09, Andres Freund <[hidden email]> wrote:

> It's not. Only SYNC_FILE_RANGE_WAIT_{BEFORE,AFTER} eat errors. Which
> seems sensible, because they could be considered data integrity
> operations.

Ah, I misread that. Thankyou.

>> I'm very suspicious about the safety of the msync() path too.
>
> That seems justified however:

I'll add EIO tests there.

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Reply | Threaded
Open this post in threaded view
|

Re: Postgres, fsync, and OSs (specifically linux)

Catalin Iacob
In reply to this post by Andres Freund
On Sat, Apr 28, 2018 at 12:28 AM, Andres Freund <[hidden email]> wrote:
> Before linux v4.13 errors in kernel writeback would be reported at most
> once, without a guarantee that that'd happen (IIUC memory pressure could
> lead to the relevant information being evicted) - but it was pretty
> likely.  After v4.13 (see https://lwn.net/Articles/724307/) errors are
> reported exactly once to all open file descriptors for a file with an
> error - but never for files that have been opened after the error
> occurred.

snip

> == Proposed Linux Changes ==
>
> - Matthew Wilcox proposed (and posted a patch) that'd partially revert
>   behaviour to the pre v4.13 world, by *also* reporting errors to
>   "newer" file-descriptors if the error hasn't previously been
>   reported. That'd still not guarantee that the error is reported
>   (memory pressure could evict information without open fd), but in most
>   situations we'll again get the error in the checkpointer.
>
>   This seems largely be agreed upon. It's unclear whether it'll go into
>   the stable backports for still-maintained >= v4.13 kernels.

This is now merged, if it's not reverted it will appear in v4.17.
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=fff75eb2a08c2ac96404a2d79685668f3cf5a7a3

The commit is cc-ed to stable so it should get picked up in the near future.
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=b4678df184b314a2bd47d2329feca2c2534aa12b

Reply | Threaded
Open this post in threaded view
|

Re: Postgres, fsync, and OSs (specifically linux)

Andres Freund
In reply to this post by Andres Freund
Hi,

On 2018-04-27 15:28:42 -0700, Andres Freund wrote:
> I went to LSF/MM 2018 to discuss [0] and related issues. Overall I'd say
> it was a very productive discussion.  I'll first try to recap the
> current situation, updated with knowledge I gained. Secondly I'll try to
> discuss the kernel changes that seem to have been agreed upon. Thirdly
> I'll try to sum up what postgres needs to change.

LWN summarized the discussion as well:

https://lwn.net/SubscriberLink/752952/6825e6a1ddcfb1f3/

- Andres

Reply | Threaded
Open this post in threaded view
|

Re: Postgres, fsync, and OSs (specifically linux)

Andres Freund
In reply to this post by Craig Ringer-3
On 2018-05-01 09:38:03 +0800, Craig Ringer wrote:

> On 1 May 2018 at 00:09, Andres Freund <[hidden email]> wrote:
>
> > It's not. Only SYNC_FILE_RANGE_WAIT_{BEFORE,AFTER} eat errors. Which
> > seems sensible, because they could be considered data integrity
> > operations.
>
> Ah, I misread that. Thankyou.
>
> >> I'm very suspicious about the safety of the msync() path too.
> >
> > That seems justified however:
>
> I'll add EIO tests there.

Do you have a patchset including that?  I didn't find anything after a
quick search...

Greetings,

Andres Freund

Reply | Threaded
Open this post in threaded view
|

Re: Postgres, fsync, and OSs (specifically linux)

Craig Ringer-3
On 10 May 2018 at 06:55, Andres Freund <[hidden email]> wrote:

> Do you have a patchset including that?  I didn't find anything after a
> quick search...

There was an earlier rev on the other thread but without msync checks.

I've added panic for msync in the attached, and tidied the comments a bit.

I didn't add comments on why we panic to each individual pg_fsync or
FileSync caller that panics; instead pg_fsync points to
pg_fsync_no_writethrough which explains it briefly and has links.

I looked at callers of pg_fsync, pg_fsync_no_writethrough,
pg_fsync_writethrough, mdsync, and FileSync when writing this.

WAL writing already PANIC'd on fsync failure, which helps, though we
now know that's not sufficient for complete safety.


Patch on top of v11 HEAD @ ddc1f32ee507

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

v2-0001-PANIC-when-we-detect-a-possible-fsync-I-O-error-i.patch (16K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Postgres, fsync, and OSs (specifically linux)

Andres Freund
Hi,

On 2018-05-10 09:50:03 +0800, Craig Ringer wrote:
>   while ((src = (RewriteMappingFile *) hash_seq_search(&seq_status)) != NULL)
>   {
>   if (FileSync(src->vfd, WAIT_EVENT_LOGICAL_REWRITE_SYNC) != 0)
> - ereport(ERROR,
> + ereport(PANIC,
>   (errcode_for_file_access(),
>   errmsg("could not fsync file \"%s\": %m", src->path)));

To me this (and the other callers) doesn't quite look right. First, I
think we should probably be a bit more restrictive about when PANIC
out. It seems like we should PANIC on ENOSPC and EIO, but possibly not
others.  Secondly, I think we should centralize the error handling. It
seems likely that we'll acrue some platform specific workarounds, and I
don't want to copy that knowledge everywhere.

Also, don't we need the same on close()?

- Andres

Reply | Threaded
Open this post in threaded view
|

Re: Postgres, fsync, and OSs (specifically linux)

Robert Haas
On Thu, May 17, 2018 at 12:44 PM, Andres Freund <[hidden email]> wrote:

> Hi,
>
> On 2018-05-10 09:50:03 +0800, Craig Ringer wrote:
>>       while ((src = (RewriteMappingFile *) hash_seq_search(&seq_status)) != NULL)
>>       {
>>               if (FileSync(src->vfd, WAIT_EVENT_LOGICAL_REWRITE_SYNC) != 0)
>> -                     ereport(ERROR,
>> +                     ereport(PANIC,
>>                                       (errcode_for_file_access(),
>>                                        errmsg("could not fsync file \"%s\": %m", src->path)));
>
> To me this (and the other callers) doesn't quite look right. First, I
> think we should probably be a bit more restrictive about when PANIC
> out. It seems like we should PANIC on ENOSPC and EIO, but possibly not
> others.  Secondly, I think we should centralize the error handling. It
> seems likely that we'll acrue some platform specific workarounds, and I
> don't want to copy that knowledge everywhere.

Maybe something like:

ereport(promote_eio_to_panic(ERROR), ...)

?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Reply | Threaded
Open this post in threaded view
|

Re: Postgres, fsync, and OSs (specifically linux)

Ashutosh Bapat
On Thu, May 17, 2018 at 11:45 PM, Robert Haas <[hidden email]> wrote:

> On Thu, May 17, 2018 at 12:44 PM, Andres Freund <[hidden email]> wrote:
>> Hi,
>>
>> On 2018-05-10 09:50:03 +0800, Craig Ringer wrote:
>>>       while ((src = (RewriteMappingFile *) hash_seq_search(&seq_status)) != NULL)
>>>       {
>>>               if (FileSync(src->vfd, WAIT_EVENT_LOGICAL_REWRITE_SYNC) != 0)
>>> -                     ereport(ERROR,
>>> +                     ereport(PANIC,
>>>                                       (errcode_for_file_access(),
>>>                                        errmsg("could not fsync file \"%s\": %m", src->path)));
>>
>> To me this (and the other callers) doesn't quite look right. First, I
>> think we should probably be a bit more restrictive about when PANIC
>> out. It seems like we should PANIC on ENOSPC and EIO, but possibly not
>> others.  Secondly, I think we should centralize the error handling. It
>> seems likely that we'll acrue some platform specific workarounds, and I
>> don't want to copy that knowledge everywhere.
>
> Maybe something like:
>
> ereport(promote_eio_to_panic(ERROR), ...)

Well, searching for places where error is reported with level PANIC,
using word PANIC would miss these instances. People will have to
remember to search with promote_eio_to_panic. May be handle the errors
inside FileSync() itself or a wrapper around that.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

Reply | Threaded
Open this post in threaded view
|

Re: Postgres, fsync, and OSs (specifically linux)

Andres Freund
In reply to this post by Andres Freund
Hi,

On 2018-04-27 15:28:42 -0700, Andres Freund wrote:

> == Potential Postgres Changes ==
>
> Several operating systems / file systems behave differently (See
> e.g. [2], thanks Thomas) than we expected. Even the discussed changes to
> e.g. linux don't get to where we thought we are. There's obviously also
> the question of how to deal with kernels / OSs that have not been
> updated.
>
> Changes that appear to be necessary, even for kernels with the issues
> addressed:
>
> - Clearly we need to treat fsync() EIO, ENOSPC errors as a PANIC and
>   retry recovery. While ENODEV (underlying device went away) will be
>   persistent, it probably makes sense to treat it the same or even just
>   give up and shut down.  One question I see here is whether we just
>   want to continue crash-recovery cycles, or whether we want to limit
>   that.
Craig has a patch for this, although I'm not yet 100% happy with it.


> - We need more aggressive error checking on close(), for ENOSPC and
>   EIO. In both cases afaics we'll have to trigger a crash recovery
>   cycle. It's entirely possible to end up in a loop on NFS etc, but I
>   don't think there's a way around that.

This needs to be handled.


> - The outstanding fsync request queue isn't persisted properly [3]. This
>   means that even if the kernel behaved the way we'd expected, we'd not
>   fail a second checkpoint :(. It's possible that we don't need to deal
>   with this because we'll henceforth PANIC, but I'd argue we should fix
>   that regardless. Seems like a time-bomb otherwise (e.g. after moving
>   to DIO somebody might want to relax the PANIC...).

> What we could do:
>
> - forward file descriptors from backends to checkpointer (using
>   SCM_RIGHTS) when marking a segment dirty. That'd require some
>   optimizations (see [4]) to avoid doing so repeatedly.  That'd
>   guarantee correct behaviour in all linux kernels >= 4.13 (possibly
>   backported by distributions?), and I think it'd also make it vastly
>   more likely that errors are reported in earlier kernels.
>
>   This should be doable without a noticeable performance impact, I
>   believe.  I don't think it'd be that hard either, but it'd be a bit of
>   a pain to backport it to all postgres versions, as well as a bit
>   invasive for that.
>
>   The infrastructure this'd likely end up building (hashtable of open
>   relfilenodes), would likely be useful for further things (like caching
>   file size).
I've written a patch series for this. Took me quite a bit longer than I
had hoped.

The attached patchseries consists out of a few preparatory patches:
- freespace optimization to not call smgrexists() unnecessarily
- register_dirty_segment() optimization to not queue requests for
  segments that locally are known to already have been dirtied.  This
  seems like a good optimization regardless of further changes.  Doesn't
  yet deal with the mdsync counter wrapping around (which is unlikely to
  ever happen in practice, it's 32bit).
- some fd.c changes, I don't think they're quite right yet
- new functions to send/recv data over a unix domain socket, *including*
  a file descriptor.

The main patch guarantees that fsync requests are forwarded from
backends to the checkpointer, including the file descriptor. As we do so
immediately at mdwrite() time, that guarantees that the fd has been open
from before the write started, therefore linux will guarantee that that
FD will see errors.

The design of the patch went through a few iterations. I initially
attempted to make the fsync request hashtable shared, but that turned
out to be a lot harder to do reliably *and* fast than I was anticipating
(we'd need to hold a lock for the entirety of mdsync(), dynahash doesn't
allow iteration while other backends modify).

So what I instead did was to replace the shared memory fsync request
queue with a unix domain socket (created in postmaster, using
socketpair()).  CheckpointerRequest structs are written to that queue,
including the associated file descriptor.  The checkpointer absorbs
those requests, and updates the local pending requests hashtable in
local process memory.  To facilitate that mdsync() has read all requests
from the last cycle, checkpointer self-enqueues a token, which allows
to detect the end of the relevant portion of the queue.

The biggest complication in all this scheme is that checkpointer now
needs to keep a file descriptor open for every segment. That obviously
requires adding a few new fields to the hashtable entry. But the bigger
issue is that it's now possible that pending requests need to be
processed earlier than the next checkpoint, because of file descriptor
limits.  To address that absorbing the fsync request queue will now do a
mdsync() style pass, doing the necessary fsyncs.

Because mdsync() (or rather its new workhorse mdsyncpass()) will now not
open files itself, there's no need to do deal with retries for files
that have been deleted. For the cases where we didn't yet receive a
fsync cancel request, we'll just fsync the fd. That's unnecessary, but
harmless.


Obviously this is currently heavily unix specific (according to my
research all our unix platforms support say that they support sending
fds across unix domain sockets w/ SCM_RIGHTS).  It's unclear whether any
OS but linux benefits from not closing file descriptors before fsync().

We could make this work for windows, without *too* much trouble (one can
just open fds in another process, using that process' handle).

I think there's some advantage in using the same approach
everywhere. For one not maintaining two radically different approaches
for complicated code.  It'd also allow us to offload more fsyncs to
checkpointer, not just the ones for normal relation files, which does
seem advantageous. Not having ugly retry logic around deleted files in
mdsync() also seems nice.  But there's cases where this is likely
slower, due to the potential of having to wait for checkpointer when the
queue is full.

I'll note that I think the new mdsync() is considerably simpler. Even if
we do not decide to use an approach as presented here, I think we should
make some of those changes. Specifically not unlinking the pending
requests bitmap in mdsync() seems like it both resolves existing bug
(see upthread) and makes the code simpler.


I plan to switch to working on something else for a day or two next
week, and then polish this further. I'd greatly appreciate comments till
then.


I didn't want to do this now, but I think we should also consider
removing all awareness of segments from the fsync request queue. Instead
it should deal with individual files, and the segmentation should be
handled by md.c.  That'll allow us to move all the necessary code to
smgr.c (or checkpointer?); Thomas said that'd be helpful for further
work.  I personally think it'd be a lot simpler, because having to have
long bitmaps with only the last bit set for large append only relations
isn't a particularly sensible approach imo.  The only thing that that'd
make more complicated is that the file/database unlink requests get more
expensive (as they'd likely need to search the whole table), but that
seems like a sensible tradeoff. Alternatively using a tree structure
would be an alternative obviously.  Personally I was thinking that we
should just make the hashtable be over a pathname, that seems most
generic.

Greetings,

Andres Freund

v1-0001-freespace-Don-t-constantly-close-files-when-readi.patch (1K) Download Attachment
v1-0002-Add-functions-to-send-receive-data-FD-over-a-unix.patch (3K) Download Attachment
v1-0003-Make-FileGetRawDesc-ensure-there-s-an-associated-.patch (795 bytes) Download Attachment
v1-0004-WIP-Allow-to-create-a-transient-file-for-a-previo.patch (2K) Download Attachment
v1-0005-WIP-Allow-more-transient-files-and-allow-to-query.patch (2K) Download Attachment
v1-0006-WIP-Optimize-register_dirty_segment-to-not-repeat.patch (10K) Download Attachment
v1-0007-Heavily-WIP-Send-file-descriptors-to-checkpointer.patch (46K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Postgres, fsync, and OSs (specifically linux)

Stephen Frost
In reply to this post by Ashutosh Bapat
Greetings,

* Ashutosh Bapat ([hidden email]) wrote:

> On Thu, May 17, 2018 at 11:45 PM, Robert Haas <[hidden email]> wrote:
> > On Thu, May 17, 2018 at 12:44 PM, Andres Freund <[hidden email]> wrote:
> >> Hi,
> >>
> >> On 2018-05-10 09:50:03 +0800, Craig Ringer wrote:
> >>>       while ((src = (RewriteMappingFile *) hash_seq_search(&seq_status)) != NULL)
> >>>       {
> >>>               if (FileSync(src->vfd, WAIT_EVENT_LOGICAL_REWRITE_SYNC) != 0)
> >>> -                     ereport(ERROR,
> >>> +                     ereport(PANIC,
> >>>                                       (errcode_for_file_access(),
> >>>                                        errmsg("could not fsync file \"%s\": %m", src->path)));
> >>
> >> To me this (and the other callers) doesn't quite look right. First, I
> >> think we should probably be a bit more restrictive about when PANIC
> >> out. It seems like we should PANIC on ENOSPC and EIO, but possibly not
> >> others.  Secondly, I think we should centralize the error handling. It
> >> seems likely that we'll acrue some platform specific workarounds, and I
> >> don't want to copy that knowledge everywhere.
> >
> > Maybe something like:
> >
> > ereport(promote_eio_to_panic(ERROR), ...)
>
> Well, searching for places where error is reported with level PANIC,
> using word PANIC would miss these instances. People will have to
> remember to search with promote_eio_to_panic. May be handle the errors
> inside FileSync() itself or a wrapper around that.
No, that search wouldn't miss those instances- such a search would find
promote_eio_to_panic() and then someone would go look up the uses of
that function.  That hardly seems like a serious issue for folks hacking
on PG.

I'm not saying that having a wrapper around FileSync() would be bad or
having it handle things, but I don't agree with the general notion that
we can't have a function which handles the complicated bits about the
kind of error because someone grep'ing the source for PANIC might have
to do an additional lookup.

Thanks!

Stephen

signature.asc (836 bytes) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Postgres, fsync, and OSs (specifically linux)

Abhijit Menon-Sen-2
At 2018-05-18 20:27:57 -0400, [hidden email] wrote:
>
> I don't agree with the general notion that we can't have a function
> which handles the complicated bits about the kind of error because
> someone grep'ing the source for PANIC might have to do an additional
> lookup.

Or we could just name the function promote_eio_to_PANIC.

(I understood the objection to be about how 'grep PANIC' wouldn't find
these lines at all, not that there would be an additional lookup.)

-- Abhijit

Reply | Threaded
Open this post in threaded view
|

Re: Postgres, fsync, and OSs (specifically linux)

Stephen Frost
Greetings,

* Abhijit Menon-Sen ([hidden email]) wrote:
> At 2018-05-18 20:27:57 -0400, [hidden email] wrote:
> >
> > I don't agree with the general notion that we can't have a function
> > which handles the complicated bits about the kind of error because
> > someone grep'ing the source for PANIC might have to do an additional
> > lookup.
>
> Or we could just name the function promote_eio_to_PANIC.

Ugh, I'm not thrilled with that either.

> (I understood the objection to be about how 'grep PANIC' wouldn't find
> these lines at all, not that there would be an additional lookup.)

... and my point was that 'grep PANIC' would, almost certainly, find the
function promote_eio_to_panic(), and someone could trivially look up all
the callers of that function then.

Thanks!

Stephen

signature.asc (836 bytes) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Postgres, fsync, and OSs (specifically linux)

Thomas Munro-3
In reply to this post by Andres Freund
On Sat, May 19, 2018 at 9:03 AM, Andres Freund <[hidden email]> wrote:
> I've written a patch series for this. Took me quite a bit longer than I
> had hoped.

Great.

> I plan to switch to working on something else for a day or two next
> week, and then polish this further. I'd greatly appreciate comments till
> then.

Took it for a spin on macOS and FreeBSD.  First problem:

+       if (socketpair(AF_UNIX, SOCK_STREAM | SOCK_CLOEXEC, 0, fsync_fds) < 0)

SOCK_CLOEXEC isn't portable (FreeBSD yes since 10, macOS no, others
who knows).  Adding FD_CLOEXEC to your later fcntl() calls is probably
the way to do it?  I understand from reading the Linux man pages that
there are race conditions with threads but that doesn't apply here.

Next, make check hangs in initdb on both of my pet OSes when md.c
raises an error (fseek fails) and we raise and error while raising and
error and deadlock against ourselves.  Backtrace here:
https://paste.debian.net/1025336/

Apparently the initial error was that mdextend() called _mdnblocks()
which called FileSeek() on vfd 43 == fd 30, pathname "base/1/2704",
but when I check my operating system open file descriptor table I find
that there is no fd 30: there is a 29 and a 31, so it has already been
unexpectedly closed.

I could dig further and/or provide a shell on a system with dev tools.

> I didn't want to do this now, but I think we should also consider
> removing all awareness of segments from the fsync request queue. Instead
> it should deal with individual files, and the segmentation should be
> handled by md.c.  That'll allow us to move all the necessary code to
> smgr.c (or checkpointer?); Thomas said that'd be helpful for further
> work.  I personally think it'd be a lot simpler, because having to have
> long bitmaps with only the last bit set for large append only relations
> isn't a particularly sensible approach imo.  The only thing that that'd
> make more complicated is that the file/database unlink requests get more
> expensive (as they'd likely need to search the whole table), but that
> seems like a sensible tradeoff. Alternatively using a tree structure
> would be an alternative obviously.  Personally I was thinking that we
> should just make the hashtable be over a pathname, that seems most
> generic.

+1

I'll be posting a patch shortly that also needs similar machinery, but
can't easily share with md.c due to technical details.  I'd love there
to be just one of those, and for it to be simpler and general.

--
Thomas Munro
http://www.enterprisedb.com

Reply | Threaded
Open this post in threaded view
|

Re: Postgres, fsync, and OSs (specifically linux)

Thomas Munro-3
On Sat, May 19, 2018 at 4:51 PM, Thomas Munro
<[hidden email]> wrote:
> Next, make check hangs in initdb on both of my pet OSes when md.c
> raises an error (fseek fails) and we raise and error while raising and
> error and deadlock against ourselves.  Backtrace here:
> https://paste.debian.net/1025336/

Ah, I see now that something similar is happening on Linux too, so I
guess you already knew this.

https://travis-ci.org/postgresql-cfbot/postgresql/builds/380913223

--
Thomas Munro
http://www.enterprisedb.com

1234