Hello hackers,
I want to be able to do synchronous vectored file I/O, so I made wrapper macros for preadv() and pwritev() with fallbacks for systems that don't have them. Following the precedent of the pg_pread() and pg_pwrite() macros, the "pg_" prefix reflects a subtle contract change: the fallback paths might have the side effect of changing the file position. They're non-standard system calls, but the BSDs and Linux have had them for a long time, and for other systems we can use POSIX readv()/writev() with an additional lseek(). The worst case is Windows (and maybe our favourite antique Unix build farm animal?) which has none of those things, so there is a further fallback to a loop. Windows does have ReadFileScatter() and WriteFileGather(), but those only work for overlapped (= asynchronous), unbuffered, page aligned access. They'll very likely be useful for native AIO+DIO support in the future, but don't fit the bill here. This is part of a project to consolidate and offload I/O (about which more soon), but seemed isolated enough to post separately and I guess it could be independently useful. |
Thomas Munro <[hidden email]> writes:
> I want to be able to do synchronous vectored file I/O, so I made > wrapper macros for preadv() and pwritev() with fallbacks for systems > that don't have them. Following the precedent of the pg_pread() and > pg_pwrite() macros, the "pg_" prefix reflects a subtle contract > change: the fallback paths might have the side effect of changing the > file position. In a quick look, seems OK with some nits: 1. port.h cannot assume that <limits.h> has already been included; nor do I want to fix that by including <limits.h> there. Do we really need to define a fallback value of IOV_MAX? If so, maybe the answer is to put the replacement struct iovec and IOV_MAX in some new header. 2. I'm not really that happy about loading <sys/uio.h> into every compilation we do, which would be another reason for a new specialized header that either includes <sys/uio.h> or provides fallback definitions. 3. The patch as given won't prove anything except that the code compiles. Is it worth fixing at least one code path to make use of pg_preadv and pg_pwritev, so we can make sure this code is tested before there's layers of other new code on top? regards, tom lane |
On Sun, Dec 20, 2020 at 12:34 PM Tom Lane <[hidden email]> wrote:
> Thomas Munro <[hidden email]> writes: > > I want to be able to do synchronous vectored file I/O, so I made > > wrapper macros for preadv() and pwritev() with fallbacks for systems > > that don't have them. Following the precedent of the pg_pread() and > > pg_pwrite() macros, the "pg_" prefix reflects a subtle contract > > change: the fallback paths might have the side effect of changing the > > file position. > > In a quick look, seems OK with some nits: Thanks for looking! > 1. port.h cannot assume that <limits.h> has already been included; > nor do I want to fix that by including <limits.h> there. Do we > really need to define a fallback value of IOV_MAX? If so, > maybe the answer is to put the replacement struct iovec and > IOV_MAX in some new header. Ok, I moved all this stuff into port/pg_uio.h. > 2. I'm not really that happy about loading <sys/uio.h> into > every compilation we do, which would be another reason for a > new specialized header that either includes <sys/uio.h> or > provides fallback definitions. Ack. > 3. The patch as given won't prove anything except that the code > compiles. Is it worth fixing at least one code path to make > use of pg_preadv and pg_pwritev, so we can make sure this code > is tested before there's layers of other new code on top? OK, here's a patch to zero-fill fresh WAL segments with pwritev(). I'm drawing a blank on trivial candidate uses for preadv(), without infrastructure from later patches. ![]() ![]() |
Thomas Munro <[hidden email]> writes:
> OK, here's a patch to zero-fill fresh WAL segments with pwritev(). > I'm drawing a blank on trivial candidate uses for preadv(), without > infrastructure from later patches. This looks OK to me. I tried it on prairiedog (has writev and pwrite but not pwritev) as well as gaur (has only writev). They seem happy. One minor thought is that in + struct iovec iov[Min(IOV_MAX, 1024)]; /* cap stack space */ it seems like pretty much every use of IOV_MAX would want some similar cap. Should we centralize that idea with, say, #define PG_IOV_MAX Min(IOV_MAX, 1024) ? Or will the plausible cap vary across uses? regards, tom lane |
On Sun, Dec 20, 2020 at 8:07 PM Tom Lane <[hidden email]> wrote:
> One minor thought is that in > > + struct iovec iov[Min(IOV_MAX, 1024)]; /* cap stack space */ > > it seems like pretty much every use of IOV_MAX would want some > similar cap. Should we centralize that idea with, say, > > #define PG_IOV_MAX Min(IOV_MAX, 1024) > > ? Or will the plausible cap vary across uses? Hmm. For the real intended user of this, namely worker processes that simulate AIO when native AIO isn't available, higher level code will limit the iov count to much smaller numbers anyway. It wants to try to stay under typical device limits for vectored I/O, because split requests would confound attempts to model and limit queue depth and control latency. In Andres's AIO prototype he currently has a macro PGAIO_MAX_COMBINE set to 16 (meaning approximately 16 data block or wal reads/writes = 128KB worth of scatter/gather per I/O request); I guess it should really be Min(IOV_MAX, <something>), but I don't currently have an opinion on the <something>, except that it should surely be closer to 16 than 1024 (for example /sys/block/nvme0n1/queue/max_segments is 33 here). I mention all this to explain that I don't think the code in patch 0002 is going to turn out to be very typical: it's trying to minimise system calls by staying under an API limit (though I cap it for allocation sanity), whereas more typical code probably wants to stay under a device limit, so I don't immediately have another use for eg PG_IOV_MAX. |
In reply to this post by Thomas Munro-5
Hi,
On 2020-12-20 16:26:42 +1300, Thomas Munro wrote: > > 1. port.h cannot assume that <limits.h> has already been included; > > nor do I want to fix that by including <limits.h> there. Do we > > really need to define a fallback value of IOV_MAX? If so, > > maybe the answer is to put the replacement struct iovec and > > IOV_MAX in some new header. > > Ok, I moved all this stuff into port/pg_uio.h. Can we come up with a better name than 'uio'? I find that a not exactly meaningful name. Or perhaps we could just leave the functions in port/port.h, but extract the value of the define at configure time? That way only pread.c / pwrite.c would need it. > > 3. The patch as given won't prove anything except that the code > > compiles. Is it worth fixing at least one code path to make > > use of pg_preadv and pg_pwritev, so we can make sure this code > > is tested before there's layers of other new code on top? > > OK, here's a patch to zero-fill fresh WAL segments with pwritev(). I think that's a good idea. However, I see two small issues: 1) If we write larger amounts at once, we need to handle partial writes. That's a large enough amount of IO that it's much more likely to hit a memory shortage or such in the kernel - we had to do that a while a go for WAL writes (which can also be larger), if memory serves. Perhaps we should have pg_pwritev/readv unconditionally go through pwrite.c/pread.c and add support for "continuing" partial writes/reads in one central place? > I'm drawing a blank on trivial candidate uses for preadv(), without > infrastructure from later patches. Can't immediately think of something either. Greetings, Andres Freund |
> > I'm drawing a blank on trivial candidate uses for preadv(), without > > infrastructure from later patches. > > Can't immediately think of something either. This might be not that trivial , but maybe acquire_sample_rows() from analyze.c ? Please note however there's patch https://www.postgresql.org/message-id/20201109180644.GJ16415%40tamriel.snowman.net ( https://commitfest.postgresql.org/30/2799/ ) for adding fadvise, but maybe those two could be even combined so you would be doing e.g. 16x fadvise() and then grab 8 pages in one preadv() call ? I'm find unlikely however that preadv give any additional performance benefit there after having fadvise, but clearly a potential place to test. -J. |
On Mon, Dec 21, 2020 at 8:25 PM Jakub Wartak <[hidden email]> wrote:
> > > I'm drawing a blank on trivial candidate uses for preadv(), without > > > infrastructure from later patches. > > > > Can't immediately think of something either. > > This might be not that trivial , but maybe acquire_sample_rows() from analyze.c ? > > Please note however there's patch https://www.postgresql.org/message-id/20201109180644.GJ16415%40tamriel.snowman.net ( https://commitfest.postgresql.org/30/2799/ ) for adding fadvise, but maybe those two could be even combined so you would be doing e.g. 16x fadvise() and then grab 8 pages in one preadv() call ? I'm find unlikely however that preadv give any additional performance benefit there after having fadvise, but clearly a potential place to test. Oh, interesting, that looks like another test case to study with the AIO patch set, but I don't think it's worth trying to do a simpler/half-baked version in the meantime. (Since that ANALYZE patch uses PrefetchBuffer() it should automatically benefit: the posix_fadvise() calls will be replaced with consolidated preadv() calls in a worker process or native AIO equivalent so that system calls are mostly gone from the initiating process, and by the time you try to access the buffer it'll hopefully see that it's finished without any further system calls. Refinements are possible though, like making use of recent_buffer to avoid double-lookup, and tuning/optimisation for how often IOs should be consolidated and submitted.) |
In reply to this post by Andres Freund
On Mon, Dec 21, 2020 at 11:40 AM Andres Freund <[hidden email]> wrote:
> On 2020-12-20 16:26:42 +1300, Thomas Munro wrote: > > > 1. port.h cannot assume that <limits.h> has already been included; > > > nor do I want to fix that by including <limits.h> there. Do we > > > really need to define a fallback value of IOV_MAX? If so, > > > maybe the answer is to put the replacement struct iovec and > > > IOV_MAX in some new header. > > > > Ok, I moved all this stuff into port/pg_uio.h. > > Can we come up with a better name than 'uio'? I find that a not exactly > meaningful name. > Or perhaps we could just leave the functions in port/port.h, but extract > the value of the define at configure time? That way only pread.c / > pwrite.c would need it. That won't work for the struct definition, so client code would need to remember to do: #ifdef HAVE_SYS_UIO_H #include <sys/uio.h> #endif ... which is a little tedious, or port.h would need to do that and incur an overhead in every translation unit, which Tom objected to. That's why I liked the separate header idea. > > > 3. The patch as given won't prove anything except that the code > > > compiles. Is it worth fixing at least one code path to make > > > use of pg_preadv and pg_pwritev, so we can make sure this code > > > is tested before there's layers of other new code on top? > > > > OK, here's a patch to zero-fill fresh WAL segments with pwritev(). > > I think that's a good idea. However, I see two small issues: 1) If we > write larger amounts at once, we need to handle partial writes. That's a > large enough amount of IO that it's much more likely to hit a memory > shortage or such in the kernel - we had to do that a while a go for WAL > writes (which can also be larger), if memory serves. > > Perhaps we should have pg_pwritev/readv unconditionally go through > pwrite.c/pread.c and add support for "continuing" partial writes/reads > in one central place? 1. Define PG_IOV_MAX, a reasonable size for local variables, not larger than IOV_MAX. 2 Use 32 rather than 1024, based on off-list complaint about 1024 potentially swamping the IO system unfairly. 3. Add a wrapper pg_pwritev_retry() to retry automatically on short writes. (I didn't write pg_preadv_retry(), because I don't currently need it for anything so I didn't want to have to think about EOF vs short-reads-for-implementation-reasons.) 4. I considered whether pg_pwrite() should have built-in retry instead of a separate wrapper, but I thought of an argument against hiding the "raw" version: the AIO patch set already understands short reads/writes and knows how to retry at a higher level, as that's needed for native AIO too, so I think it makes sense to be able to keep things the same and not solve the same problem twice. A counter argument would be that you could get the retry underway sooner with a tight loop, but I'm not expecting this to be common. > > I'm drawing a blank on trivial candidate uses for preadv(), without > > infrastructure from later patches. > > Can't immediately think of something either. One idea I had for the future is for xlogreader.c to read the WAL into a large multi-page circular buffer, which could wrap around using a pair of iovecs, but that requires lots more work . ![]() ![]() |
On Wed, Dec 23, 2020 at 12:06 AM Thomas Munro <[hidden email]> wrote:
> On Mon, Dec 21, 2020 at 11:40 AM Andres Freund <[hidden email]> wrote: > > Can we come up with a better name than 'uio'? I find that a not exactly > > meaningful name. > > Ok, let's try port/pg_iovec.h. I pushed it with that name, and a couple more cosmetic changes. I'll keep an eye on the build farm. |
On Mon, Jan 11, 2021 at 3:34 PM Thomas Munro <[hidden email]> wrote:
> On Wed, Dec 23, 2020 at 12:06 AM Thomas Munro <[hidden email]> wrote: > > On Mon, Dec 21, 2020 at 11:40 AM Andres Freund <[hidden email]> wrote: > > > Can we come up with a better name than 'uio'? I find that a not exactly > > > meaningful name. > > > > Ok, let's try port/pg_iovec.h. > > I pushed it with that name, and a couple more cosmetic changes. I'll > keep an eye on the build farm. Since only sifaka has managed to return a result so far (nice CPU), I had plenty of time to notice that macOS Big Sur has introduced preadv/pwritev. They were missing on Catalina[1]. [1] https://cirrus-ci.com/task/6002157537198080 |
On Mon, Jan 11, 2021 at 3:59 PM Thomas Munro <[hidden email]> wrote:
> On Mon, Jan 11, 2021 at 3:34 PM Thomas Munro <[hidden email]> wrote: > > I pushed it with that name, and a couple more cosmetic changes. I'll > > keep an eye on the build farm. > > Since only sifaka has managed to return a result so far (nice CPU), I > had plenty of time to notice that macOS Big Sur has introduced > preadv/pwritev. They were missing on Catalina[1]. The rest of buildfarm was OK with it too, but I learned of a small problem through CI testing of another patch: it's not OK for src/port/pwrite.c to do this: if (part > 0) elog(ERROR, "unexpectedly wrote more than requested"); ... because now when I try to use pg_pwrite() in pg_test_fsync, Windows fails to link: libpgport.lib(pwrite.obj) : error LNK2019: unresolved external symbol errstart referenced in function pg_pwritev_with_retry [C:\projects\postgresql\pg_test_fsync.vcxproj] I'll go and replace that with an assertion. |
In reply to this post by Thomas Munro-5
On 11.01.2021 05:59, Thomas Munro wrote:
> Since only sifaka has managed to return a result so far (nice CPU), I > had plenty of time to notice that macOS Big Sur has introduced > preadv/pwritev. They were missing on Catalina[1]. > > [1] https://cirrus-ci.com/task/6002157537198080 Hi, Thomas! Indeed, pwritev is not available on macOS Catalina. So I get compiler warnings about that: /Users/shinderuk/src/pgwork/devel/build/../src/port/pwrite.c:117:10: warning: 'pwritev' is only available on macOS 11.0 or newer [-Wunguarded-availability-new] part = pg_pwritev(fd, iov, iovcnt, offset); ^~~~~~~~~~ /Users/shinderuk/src/pgwork/devel/build/../src/include/port/pg_iovec.h:49:20: note: expanded from macro 'pg_pwritev' #define pg_pwritev pwritev ^~~~~~~ /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX11.1.sdk/usr/include/sys/uio.h:104:9: note: 'pwritev' has been marked as being introduced in macOS 11.0 here, but the deployment target is macOS 10.15.0 ssize_t pwritev(int, const struct iovec *, int, off_t) __DARWIN_NOCANCEL(pwritev) __API_AVAILABLE(macos(11.0), ios(14.0), watchos(7.0), tvos(14.0)); ^ /Users/shinderuk/src/pgwork/devel/build/../src/port/pwrite.c:117:10: note: enclose 'pwritev' in a __builtin_available check to silence this warning part = pg_pwritev(fd, iov, iovcnt, offset); ^~~~~~~~~~ /Users/shinderuk/src/pgwork/devel/build/../src/include/port/pg_iovec.h:49:20: note: expanded from macro 'pg_pwritev' #define pg_pwritev pwritev ^~~~~~~ 1 warning generated. (... several more warnings ...) And initdb fails: running bootstrap script ... dyld: lazy symbol binding failed: Symbol not found: _pwritev Referenced from: /Users/shinderuk/src/pgwork/devel/install/bin/postgres Expected in: /usr/lib/libSystem.B.dylib dyld: Symbol not found: _pwritev Referenced from: /Users/shinderuk/src/pgwork/devel/install/bin/postgres Expected in: /usr/lib/libSystem.B.dylib Regards. -- Sergey Shinderuk Postgres Professional: http://www.postgrespro.com The Russian Postgres Company |
On Wed, Jan 13, 2021 at 10:40 PM Sergey Shinderuk
<[hidden email]> wrote: > /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX11.1.sdk/usr/include/sys/uio.h:104:9: > note: 'pwritev' has been marked as being introduced in macOS 11.0 here, > but the deployment target is macOS > 10.15.0 > ssize_t pwritev(int, const struct iovec *, int, off_t) > __DARWIN_NOCANCEL(pwritev) __API_AVAILABLE(macos(11.0), ios(14.0), > watchos(7.0), tvos(14.0)); > ^ Hrm... So why did "configure" think you have pwritev, then? It seems like you must have been using different compilers or options at configure time and compile time, no? |
On 13.01.2021 12:56, Thomas Munro wrote:
> On Wed, Jan 13, 2021 at 10:40 PM Sergey Shinderuk > <[hidden email]> wrote: >> /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX11.1.sdk/usr/include/sys/uio.h:104:9: >> note: 'pwritev' has been marked as being introduced in macOS 11.0 here, >> but the deployment target is macOS >> 10.15.0 >> ssize_t pwritev(int, const struct iovec *, int, off_t) >> __DARWIN_NOCANCEL(pwritev) __API_AVAILABLE(macos(11.0), ios(14.0), >> watchos(7.0), tvos(14.0)); >> ^ > > Hrm... So why did "configure" think you have pwritev, then? It seems > like you must have been using different compilers or options at > configure time and compile time, no? > No, i've just rerun configure from clean checkout without any options. It does think that pwritev is available. I'll try to figure this out later and come back to you. Thanks. |
Sergey Shinderuk <[hidden email]> writes:
> On 13.01.2021 12:56, Thomas Munro wrote: >> On Wed, Jan 13, 2021 at 10:40 PM Sergey Shinderuk >> <[hidden email]> wrote: >>> note: 'pwritev' has been marked as being introduced in macOS 11.0 here, >>> but the deployment target is macOS 10.15.0 >> Hrm... So why did "configure" think you have pwritev, then? It seems >> like you must have been using different compilers or options at >> configure time and compile time, no? > No, i've just rerun configure from clean checkout without any options. > It does think that pwritev is available. I'll try to figure this out > later and come back to you. Thanks. The symptoms sound consistent with using bleeding-edge Xcode on a Catalina machine ... please report exact OS and Xcode versions. I have a different complaint, using Big Sur and Xcode 12.3: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/ranlib: file: libpgport.a(pread.o) has no symbols /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/ranlib: file: libpgport_shlib.a(pread_shlib.o) has no symbols /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/ranlib: file: libpgport_srv.a(pread_srv.o) has no symbols Looks like we need to be more careful about not including pread.c in the build unless it actually has code to contribute. regards, tom lane |
I wrote:
> Sergey Shinderuk <[hidden email]> writes: >>>> note: 'pwritev' has been marked as being introduced in macOS 11.0 here, >>>> but the deployment target is macOS 10.15.0 > The symptoms sound consistent with using bleeding-edge Xcode on a > Catalina machine ... please report exact OS and Xcode versions. I can reproduce these warnings on Big Sur + Xcode 12.3 by doing export MACOSX_DEPLOYMENT_TARGET=10.15 before building; however the executable runs anyway, which I guess is unsurprising. AFAICS from config.log, configure has no idea that anything is wrong. (BTW, at least the rather-old version of ccache I'm using does not seem to realize that that environment variable is significant; I had to clear ~/.ccache to get consistent results.) We've had issues before with weird results from Xcode versions newer than the underlying OS. In the past we've been able to work around that, but I'm not sure that I see a way here. regards, tom lane |
Hmmm ... I can further report that on Catalina + Xcode 12.0,
everything seems fine. configure correctly detects that preadv and pwritev aren't there: configure:15161: checking for preadv configure:15161: ccache gcc -o conftest -Wall -Wmissing-prototypes -Wpointer-ar\ ith -Wdeclaration-after-statement -Werror=vla -Wendif-labels -Wmissing-format-a\ ttribute -Wformat-security -fno-strict-aliasing -fwrapv -Wno-unused-command-lin\ e-argument -g -O2 -isysroot /Applications/Xcode.app/Contents/Developer/Platform\ s/MacOSX.platform/Developer/SDKs/MacOSX10.15.sdk -isysroot /Applications/Xcod\ e.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.15.s\ dk conftest.c -lz -lm >&5 Undefined symbols for architecture x86_64: "_preadv", referenced from: _main in conftest-fca7e9.o ld: symbol(s) not found for architecture x86_64 clang: error: linker command failed with exit code 1 (use -v to see invocation) configure:15161: $? = 1 So I'm a little confused as to why this test is failing to fail with (I assume) newer Xcode. Can we see the relevant part of config.log on your machine? regards, tom lane |
In reply to this post by Tom Lane-2
On Thu, Jan 14, 2021 at 5:13 AM Tom Lane <[hidden email]> wrote:
> I have a different complaint, using Big Sur and Xcode 12.3: > > /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/ranlib: file: libpgport.a(pread.o) has no symbols > /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/ranlib: file: libpgport_shlib.a(pread_shlib.o) has no symbols > /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/ranlib: file: libpgport_srv.a(pread_srv.o) has no symbols > > Looks like we need to be more careful about not including pread.c > in the build unless it actually has code to contribute. I did it that way because it made it easy to test different combinations of the replacements on computers that do actually have pwrite and pwritev, just by tweaking pg_config.h. Here's an attempt to do it with AC_REPLACE_FUNCS, which avoids creating empty .o files. It means that to test the replacements on modern systems you have to tweak pg_config.h and also add the relevant .o files to LIBOBJS in src/Makefile.global, but that seems OK. |
Thomas Munro <[hidden email]> writes:
> On Thu, Jan 14, 2021 at 5:13 AM Tom Lane <[hidden email]> wrote: >> Looks like we need to be more careful about not including pread.c >> in the build unless it actually has code to contribute. > I did it that way because it made it easy to test different > combinations of the replacements on computers that do actually have > pwrite and pwritev, just by tweaking pg_config.h. Here's an attempt > to do it with AC_REPLACE_FUNCS, which avoids creating empty .o files. > It means that to test the replacements on modern systems you have to > tweak pg_config.h and also add the relevant .o files to LIBOBJS in > src/Makefile.global, but that seems OK. Yeah, this looks better. Two gripes, one major and one minor: * You need to remove pread.o and pwrite.o from the hard-wired part of the list in src/port/Makefile, else they get built whether needed or not. * I don't much like this in fd.h: @@ -46,6 +46,7 @@ #include <dirent.h> +struct iovec; typedef int File; because it makes it look like iovec and File are of similar status, which they hardly are. Perhaps more like #include <dirent.h> + +struct iovec; /* avoid including sys/uio.h here */ typedef int File; I confirm clean builds on Big Sur and Catalina with this. regards, tom lane |
Free forum by Nabble | Edit this page |