Hi,
over the last ~year I spent a lot of time trying to figure out how we could add AIO (asynchronous IO) and DIO (direct IO) support to postgres. While there's still a *lot* of open questions, I think I now have a decent handle on most of the bigger architectural questions. Thus this long email.

Just to be clear: I don't expect the current design to survive as-is. If there's a few sentences below that sound a bit like they're describing the new world, that's because they're from the README.md in the patch series...

## Why Direct / unbuffered IO?

The main reasons to want to use Direct IO are:

- Lower CPU usage / higher throughput. Particularly on modern storage, buffered writes are bottlenecked by the operating system having to copy data from the kernel's page cache to postgres' buffer pool using the CPU. Direct IO, in contrast, can often move the data directly between the storage device and postgres' buffer cache using DMA. While that transfer is ongoing, the CPU is free to perform other work, with little impact.
- Avoiding double buffering between the operating system cache and postgres' shared_buffers.
- Better control over the timing and pace of dirty data writeback.
- Potential for concurrent WAL writes (via O_DIRECT | O_DSYNC writes).

The main reasons *not* to use Direct IO are:

- Without AIO, Direct IO is unusably slow for most purposes.
- Even with AIO, many parts of postgres need to be modified to perform explicit prefetching.
- In situations where shared_buffers cannot be set appropriately large, e.g. because there are many different postgres instances hosted on shared hardware, performance will often be worse than when using buffered IO.

## Why Asynchronous IO

- Without AIO we cannot use DIO.
- Without asynchronous IO (AIO) PG has to rely on the operating system to hide the cost of synchronous IO from Postgres. While this works surprisingly well in a lot of workloads, it does not do as good a job of prefetching and controlled writeback as we would like.
- There are important expensive operations like fdatasync() where the operating system cannot hide the storage latency. This is particularly important for WAL writes, where the ability to asynchronously issue fdatasync() or O_DSYNC writes can yield significantly higher throughput.
- Fetching data into shared buffers asynchronously and concurrently with query execution means there is more CPU time for query execution.

## High level difficulties adding AIO/DIO support

- Optionally using AIO leads to convoluted and/or duplicated code.
- Platform dependency: the common AIO APIs are typically specific to one platform (linux AIO, linux io_uring, windows IOCP, windows overlapped IO) or a few platforms (posix AIO, but there are many differences).
- There are a lot of separate places doing IO in PG. Moving all of these to efficiently use AIO is an, um, large undertaking.
- Nothing in the buffer management APIs expects there to be more than one IO in progress at the same time - which is required to do AIO.

## Portability & Duplication

To avoid needing non-AIO codepaths to support platforms without native AIO support, a worker process based AIO implementation exists (and is currently the default). This is also convenient for checking whether a problem is related to the native IO implementation or not.

Thanks to Thomas Munro for helping a *lot* in this area. He wrote the worker mode, the posix aio mode, added CI, did a lot of other testing, listened to me...
## Deadlock and Starvation Dangers due to AIO

Using AIO in a naive way can easily lead to deadlocks in an environment where the source/target of AIO are shared resources, like pages in postgres' shared_buffers.

Consider one backend performing readahead on a table, initiating IO for a number of buffers ahead of the current "scan position". If that backend then performs some operation that blocks, or even just is slow, the IO completion for the asynchronously initiated read may not be processed.

This AIO implementation solves this problem by requiring that AIO methods either allow AIO completions to be processed by any backend in the system (e.g. io_uring, and indirectly posix, via signal handlers), or guarantee that AIO processing will happen even when the issuing backend is blocked (e.g. worker mode, which offloads completion processing to the AIO workers).

## AIO API overview

The main steps to use AIO (without higher level helpers) are:

1) Acquire an "unused" AIO: pgaio_io_get().

2) Start some IO. This is done by functions like
   pgaio_io_start_(read|write|fsync|flush_range)_(smgr|sb|raw|wal)
   The (read|write|fsync|flush_range) part indicates the operation, whereas (smgr|sb|raw|wal) determines how IO completions, errors, ... are handled. (See below for more details about this design choice - it might or might not be right.)

3) Optionally: assign a backend-local completion callback to the IO (pgaio_io_on_completion_local()).

4) Step 2) alone does *not* cause the IO to be submitted to the kernel; it is put on a per-backend list of pending IOs. The pending IOs can be explicitly flushed with pgaio_submit_pending(), but will also be submitted if the pending list gets too large, or if the current backend waits for the IO.

   There are two main reasons not to submit the IO immediately:
   - If adjacent, we can merge several IOs into one "kernel level" IO during submission. Larger IOs are considerably more efficient.
   - Several AIO APIs allow submitting a batch of IOs in one system call.

5) Wait for the IO: pgaio_io_wait() waits for an IO "owned" by the current backend. When other backends may need to wait for an IO to finish, pgaio_io_ref() can put a reference to that AIO in shared memory (e.g. a BufferDesc), which can be waited for using pgaio_io_wait_ref().

6) Process the results of the request. If a callback was registered in 3), this isn't always necessary. The results of AIO can be accessed using pgaio_io_result(), which returns an integer where negative numbers are -errno, and positive numbers are the [partial] success conditions (e.g. potentially indicating a short read).

7) Release ownership of the IO (pgaio_io_release()) or reuse the IO for another operation (pgaio_io_recycle()).

Most places that want to use AIO shouldn't themselves need to care about managing the number of writes in flight, or the readahead distance. To help with that there are two helper utilities, a "streaming read" and a "streaming write".

The "streaming read" helper uses a callback to determine which blocks to prefetch - that allows doing readahead in a sequential fashion but, importantly, also allows asynchronously "reading ahead" non-sequential blocks. E.g. for vacuum, lazy_scan_heap() has a callback that uses the visibility map to figure out which block needs to be read next. Similarly lazy_vacuum_heap() uses the tids in LVDeadTuples to figure out which blocks are going to be needed.
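To make the raw steps 1-7 above a bit more concrete, here is a minimal sketch of reading a single block without the streaming helpers. The flow and function names come from the description above; the exact signatures (other than pgaio_io_start_read_smgr(), shown in the next section) are assumptions made purely for illustration, not copied from the patch:

    /*
     * Sketch only: read one block with the raw AIO API, following steps 1-7.
     * Signatures other than pgaio_io_start_read_smgr() are assumed from the
     * prose above, not taken from the patch series.
     */
    static void
    read_one_block_raw(struct SMgrRelationData *smgr, ForkNumber forknum,
                       BlockNumber blocknum, char *bufdata)
    {
        PgAioInProgress *io = pgaio_io_get();        /* 1) acquire an unused AIO */

        /* 2) stage the read; this only puts it on the backend's pending list */
        pgaio_io_start_read_smgr(io, smgr, forknum, blocknum, bufdata);

        /* 3) optionally: pgaio_io_on_completion_local(io, ...); */

        pgaio_submit_pending();                      /* 4) submit, merging adjacent IOs */

        pgaio_io_wait(io);                           /* 5) wait for the IO we own */

        if (pgaio_io_result(io) < 0)                 /* 6) negative results are -errno */
            elog(ERROR, "read failed: %s", strerror(-pgaio_io_result(io)));

        pgaio_io_release(io);                        /* 7) or pgaio_io_recycle() to reuse it */
    }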
Here's the latter (lazy_vacuum_heap()) as an example:
https://github.com/anarazel/postgres/commit/a244baa36bfb252d451a017a273a6da1c09f15a3#diff-3198152613d9a28963266427b380e3d4fbbfabe96a221039c6b1f37bc575b965R1906

## IO initialization layering

One difficulty I had in this process was how to initialize IOs in light of the layering (from bufmgr.c over smgr.c and md.c to fd.c and back, but also e.g. xlog.c). Sometimes AIO needs to be initialized on the bufmgr.c level, sometimes on the md.c level, sometimes on the level of fd.c. But to be able to react to the completion of any such IO, metadata about the operation is needed.

Early on, fd.c initialized IOs and the context information was just passed through to fd.c. But that seems quite wrong - fd.c shouldn't have to know about which Buffer an IO is about. But higher levels shouldn't know about which files an operation resides in either, so they can't do all the work either...

To avoid that, I ended up splitting the "start an AIO" operation into a higher level part, e.g. pgaio_io_start_read_smgr() - which doesn't know which smgr implementation is in use, and thus also not what file/offset we're dealing with - which calls into smgr->md->fd to actually "prepare" the IO (i.e. figure out file / offset). This currently looks like:

    void
    pgaio_io_start_read_smgr(PgAioInProgress *io, struct SMgrRelationData* smgr,
                             ForkNumber forknum, BlockNumber blocknum, char *bufdata)
    {
        pgaio_io_prepare(io, PGAIO_OP_READ);

        smgrstartread(io, smgr, forknum, blocknum, bufdata);

        io->scb_data.read_smgr.tag = (AioBufferTag) {
            .rnode = smgr->smgr_rnode,
            .forkNum = forknum,
            .blockNum = blocknum
        };

        pgaio_io_stage(io, PGAIO_SCB_READ_SMGR);
    }

Once this reaches the fd.c layer, the new FileStartRead() function calls pgaio_io_prep_read() on the IO - but doesn't need to know anything about weird higher level stuff like relfilenodes. The _sb variant (_sb for shared_buffers) stores the Buffer, backend and mode (as in ReadBufferMode).

I'm not sure this is the right design - but it seems a lot better than what I had earlier...

## Callbacks

In the core AIO pieces there are two different types of callbacks at the moment:

Shared callbacks, which can be invoked by any backend (normally the issuing backend / the AIO workers, but can be other backends if they are waiting for the IO to complete). For operations on shared resources (e.g. shared buffer reads/writes, or WAL writes) these shared callbacks need to transition the state of the object the IO is being done for to completion. E.g. for a shared buffer read that means setting BM_VALID / unsetting BM_IO_IN_PROGRESS.

The main reason these callbacks exist is that they make it safe for a backend to issue non-blocking IO on buffers (see the deadlock section above). As any blocked backend can cause the IO to complete, the deadlock danger is gone.

Local callbacks, one of which the issuer of an IO can associate with the IO. These can be used to issue further readahead. I initially did not have these, but I found it hard to have a controllable number of IOs in flight without them. They are currently mainly used for error handling (e.g. erroring out when XLogFileInit() cannot create the file due to ENOSPC), and to issue more IO (e.g. readahead for heapam). The local callback system isn't quite right yet, and needs more work.

## AIO conversions

Currently the patch series converts a number of subsystems to AIO. They are of very varying quality. I mainly did the conversions that I considered either to be of interest architecturally, or that caused a fair bit of pain due to slowness (e.g.
VACUUMing without AIO is no fun at all when using DIO). Some also for fun ;)

Most conversions are fairly simple. E.g. heap scans, checkpointer, bgwriter, VACUUM are all not too complicated.

There are two conversions that are a good bit more complicated/experimental:

1) Asynchronous, concurrent WAL writes. This is important because right now we are very bottlenecked by IO latency, since effectively only one WAL IO is ever in flight at the same time - even though in many situations it would be possible to issue a WAL write, have one [set of] backends wait for that write (as its completion satisfies their XLogFlush() needs), but concurrently already issue the next WAL write(s) that other backends need. The code here is very crufty, but I think the concept is mostly right.

2) Asynchronous buffer replacement. Even with buffered IO we experience a lot of pain when ringbuffers need to write out data (VACUUM!). But with DIO the issue gets a lot worse - the kernel can't hide the write latency from us anymore. This change makes each backend asynchronously clean out buffers that it will need soon. When a ringbuffer is in use, this means cleaning out buffers in the ringbuffer; when not, it means performing the clock sweep and cleaning out victim buffers. Due to 1), the XLogFlush() can also be done asynchronously.

There are a *lot* of places that haven't been converted to use AIO.

## Stats

There are two new views: pg_stat_aios showing AIOs that are currently in-progress, pg_stat_aio_backends showing per-backend statistics about AIO.

## Code:

https://github.com/anarazel/postgres/tree/aio

I was not planning to attach all the patches on the way to AIO - there are too many right now... I hope I can reduce the size of the series iteratively into easier to look at chunks.

## TL;DR: Performance numbers

This is worth an email on its own, and it's pretty late here already and I want to rerun benchmarks before posting more numbers. So here are just a few that I could run before falling asleep.

1) 60s of parallel COPY BINARY of a 89MB file into separate tables (s_b = 96GB):

slow NVMe SSD:

  branch  dio  clients  tps/stddev    checkpoint write time
  master  n    8        3.0/2296 ms   4.1s  /  379647 buffers =  723MiB/s
  aio     n    8        3.8/1985 ms   11.5s / 1028669 buffers =  698MiB/s
  aio     y    8        4.7/204 ms    10.0s / 1164933 buffers =  910MiB/s

raid of 2 fast NVMe SSDs (on pcie3):

  branch  dio  clients  tps/stddev    checkpoint write time
  master  n    8        9.7/62 ms     7.6s  / 1206376 buffers = 1240MiB/s
  aio     n    8        11.4/82 ms    14.3s / 2838129 buffers = 1550MiB/s
  aio     y    8        18.1/56 ms    8.9s  / 4486170 buffers = 3938MiB/s

2) pg_prewarm speed, raid of 2 fast NVMe SSDs (on pcie3):

pg_prewarm(62GB, read):

  branch  dio  time    bw
  master  n    17.4s   3626MiB/s
  aio     n    10.3s   6126MiB/s   (higher cpu usage)
  aio     y    9.8s    6438MiB/s

pg_prewarm(62GB, buffer):

  branch  dio  time    bw
  master  n    38.3s   1647MiB/s
  aio     n    13.6s   4639MiB/s   (higher cpu usage)
  aio     y    10.7s   5897MiB/s

3) parallel sequential scan speed

Parallel sequential scan + count(*) of a 59GB table:

  branch  dio  max_parallel  time
  master  n    0             40.5s
  master  n    1             22.6s
  master  n    2             16.4s
  master  n    4             10.9s
  master  n    8             9.3s
  aio     y    0             33.1s
  aio     y    1             17.2s
  aio     y    2             11.8s
  aio     y    4             9.0s
  aio     y    8             9.2s

On local SSDs there's some, but not a huge, performance advantage in most transactional r/w workloads. But on cloud storage - which has a lot higher latency - AIO can yield huge advantages. I've seen over 4x. There are definitely also cases where AIO currently hurts - most of those I just didn't get around to addressing.
There are a lot more cases in which DIO currently hurts - mostly because the necessary smarts haven't yet been added.

Comments? Questions?

I plan to send separate emails about smaller chunks of this - the whole topic is just too big. In particular I plan to send something around buffer locking / state management - it's one of the core issues around this imo.

Regards,

Andres
On Tue, 23 Feb 2021 at 05:04, Andres Freund <[hidden email]> wrote:
> ## Callbacks
>
> In the core AIO pieces there are two different types of callbacks at the
> moment:
>
> Shared callbacks, which can be invoked by any backend (normally the issuing
> backend / the AIO workers, but can be other backends if they are waiting for
> the IO to complete). For operations on shared resources (e.g. shared buffer
> reads/writes, or WAL writes) these shared callbacks need to transition the
> state of the object the IO is being done for to completion. E.g. for a shared
> buffer read that means setting BM_VALID / unsetting BM_IO_IN_PROGRESS.
>
> The main reason these callbacks exist is that they make it safe for a backend
> to issue non-blocking IO on buffers (see the deadlock section above). As any
> blocked backend can cause the IO to complete, the deadlock danger is gone.

So firstly this is all just awesome work, and I have questions, but I don't want them to come across in any way as criticism or as a demand for more work. This is really great stuff, thank you so much!

The callbacks make me curious about two questions:

1) Is there a chance that a backend issues i/o, the i/o completes in some other backend, and by the time this backend gets around to looking at the buffer it's already been overwritten again? Do we have to initiate I/O again, or have you found a way to arrange that this backend has the buffer pinned from the time the i/o starts even though it doesn't handle the completion?

2) Have you made (or considered making) things like sequential scans (or more likely bitmap index scans) asynchronous at a higher level? That is, issue a bunch of asynchronous i/o and then handle the pages and return the tuples as the pages arrive. Since sequential scans and bitmap scans don't guarantee to read the pages in order, they're generally free to return tuples from any page in any order. I'm not sure how much of a win that would actually be, since all the same i/o would be getting executed and the savings in shared buffers would be small, but if there are mostly hot pages you could imagine interleaving a lot of in-memory pages with the few i/os instead of sitting idle waiting for the async i/o to return.

> ## Stats
>
> There are two new views: pg_stat_aios showing AIOs that are currently
> in-progress, pg_stat_aio_backends showing per-backend statistics about AIO.

This is impressive. How easy is it to correlate with system aio stats?

--
greg
Hi,
On 2021-02-23 14:58:32 -0500, Greg Stark wrote:
> So firstly this is all just awesome work and I have questions but I
> don't want them to come across in any way as criticism or as a demand
> for more work.

I posted it to get argued with ;)

> The callbacks make me curious about two questions:
>
> 1) Is there a chance that a backend issues i/o, the i/o completes in
> some other backend, and by the time this backend gets around to looking
> at the buffer it's already been overwritten again? Do we have to
> initiate I/O again, or have you found a way to arrange that this
> backend has the buffer pinned from the time the i/o starts even though
> it doesn't handle the completion?

The initiator of the IO can just keep a pin for the buffer to prevent that.

There's a lot of complexity around how to handle pinning and locking around asynchronous buffer IO. I plan to send a separate email with more details.

In short: I made it so that for shared buffer IO, the IO itself holds a separate refcount for the duration of the IO - that way the issuer can release its own pin without causing a problem (consider e.g. an error while an IO is in flight). The pin held by the IO gets released in the completion callback. There's similar trickery with locks - but more about that later.

> 2) Have you made (or considered making) things like sequential scans
> (or more likely bitmap index scans) asynchronous at a higher level?
> That is, issue a bunch of asynchronous i/o and then handle the pages
> and return the tuples as the pages arrive. Since sequential scans and
> bitmap scans don't guarantee to read the pages in order, they're
> generally free to return tuples from any page in any order. I'm not
> sure how much of a win that would actually be, since all the same i/o
> would be getting executed and the savings in shared buffers would be
> small, but if there are mostly hot pages you could imagine interleaving
> a lot of in-memory pages with the few i/os instead of sitting idle
> waiting for the async i/o to return.

I have not. Mostly because it'd basically break the entire regression test suite. And probably a lot of user expectations (yes, synchronize_seqscans exists, but it pretty rarely triggers).

I'm not sure how big the win would be - the readahead for heapam.c that is in the patch set tries to keep ahead of the "current scan position" by a certain amount, so it'll issue the reads for the "missing" pages before they're needed. Hopefully early enough to avoid unnecessary stalls. But I'm pretty sure there'll be cases where that'd require a prohibitively long "readahead distance".

I think there's a lot of interesting things along these lines that we could tackle, but since they involve changing results and/or larger changes to avoid those (e.g. detecting when sequential scan order isn't visible to the user), I think it'd make sense to separate them from the aio patchset itself.

If we had the infrastructure to detect whether seqscan order matters, we could also switch the tuple-iteration order in heapgetpage() to "backwards" - right now iterating forward in "itemid order" causes a lot of cache misses, because the hardware prefetcher doesn't predict that the tuples are laid out in "decreasing pointer order".

> > ## Stats
> >
> > There are two new views: pg_stat_aios showing AIOs that are currently
> > in-progress, pg_stat_aio_backends showing per-backend statistics about AIO.
>
> This is impressive. How easy is it to correlate with system aio stats?

Could you say a bit more about what you are trying to correlate?
Here's some example IOs from pg_stat_aios.

  ┌─[ RECORD 1 ]─┬─────────────────────────────────────────────────────────────────────────────────────────────────
  │ backend_type │ client backend
  │ id           │ 98
  │ gen          │ 13736
  │ op           │ write
  │ scb          │ sb
  │ flags        │ PGAIOIP_IN_PROGRESS | PGAIOIP_INFLIGHT
  │ ring         │ 5
  │ owner_pid    │ 1866051
  │ merge_with   │ (null)
  │ result       │ 0
  │ desc         │ fd: 38, offset: 329588736, nbytes: 8192, already_done: 0, release_lock: 1, buffid: 238576
  ├─[ RECORD 24 ]┼─────────────────────────────────────────────────────────────────────────────────────────────────
  │ backend_type │ checkpointer
  │ id           │ 1501
  │ gen          │ 15029
  │ op           │ write
  │ scb          │ sb
  │ flags        │ PGAIOIP_IN_PROGRESS | PGAIOIP_INFLIGHT
  │ ring         │ 3
  │ owner_pid    │ 1865275
  │ merge_with   │ 1288
  │ result       │ 0
  │ desc         │ fd: 24, offset: 105136128, nbytes: 8192, already_done: 0, release_lock: 1, buffid: 202705
  ├─[ RECORD 31 ]┼─────────────────────────────────────────────────────────────────────────────────────────────────
  │ backend_type │ walwriter
  │ id           │ 90
  │ gen          │ 26498
  │ op           │ write
  │ scb          │ wal
  │ flags        │ PGAIOIP_IN_PROGRESS | PGAIOIP_INFLIGHT
  │ ring         │ 5
  │ owner_pid    │ 1865281
  │ merge_with   │ 181
  │ result       │ 0
  │ desc         │ write_no: 17, fd: 12, offset: 6307840, nbytes: 1048576, already_done: 0, bufdata: 0x7f94dd670000
  └──────────────┴─────────────────────────────────────────────────────────────────────────────────────────────────

And the per-backend AIO view shows information like this (too wide to display all cols):

  SELECT pid, last_context, backend_type,
         executed_total_count, issued_total_count, submissions_total_count, inflight_count
  FROM pg_stat_aio_backends sab
  JOIN pg_stat_activity USING (pid);

     pid   │ last_context │         backend_type         │ executed_total_count │ issued_total_count │ submissions_total_count │ inflight_count
  ─────────┼──────────────┼──────────────────────────────┼──────────────────────┼────────────────────┼─────────────────────────┼────────────────
   1865291 │            3 │ logical replication launcher │                    0 │                  0 │                       0 │              0
   1865296 │            0 │ client backend               │                   85 │                 85 │                      85 │              0
   1865341 │            7 │ client backend               │              9574416 │            2905321 │                  345642 │              0
   ...
   1873501 │            3 │ client backend               │                 3565 │               3565 │                     467 │             10
   ...
   1865277 │            0 │ background writer            │               695402 │             575906 │                   13513 │              0
   1865275 │            3 │ checkpointer                 │              4664110 │            3488530 │                 1399896 │              0
   1865281 │            3 │ walwriter                    │                77203 │               7759 │                    7747 │              3

It's not super obvious at this point, but executed_total_count / issued_total_count shows the rate at which IOs have been merged. issued_total_count / submissions_total_count shows how many (already merged) IOs were submitted together in one "submission batch" (for io_method=io_uring, io_uring_enter()).

So pg_stat_aios should allow to enrich some system stats - e.g. by being able to split out WAL writes from data file writes. And - except that the current callbacks for that aren't great - it should even allow to split the IO by different relfilenodes etc.

I assume pg_stat_aio_backends also can be helpful, e.g.
by seeing which backends currently have the deep IO queues that cause latency delays in other backends, and which backends do a lot of sequential IO (high executed_total_count / issued_total_count) and which do only random IO...

Greetings,

Andres Freund
On Tue, Feb 23, 2021 at 11:03 PM Andres Freund <[hidden email]> wrote:
> over the last ~year I spent a lot of time trying to figure out how we could
> add AIO (asynchronous IO) and DIO (direct IO) support to postgres. While
> there's still a *lot* of open questions, I think I now have a decent handle on
> most of the bigger architectural questions. Thus this long email.

Hello,

Very cool to see this project escaping onto -hackers! I have done some work on a couple of low level parts of it, and I wanted to show a quick "hey, where'd my system calls go?" demo, which might help illustrate some very simple things about this stuff.

Even though io_uring is the new hotness in systems programming, I'm going to use io_method=worker here. It's the default in the current patch set, it works on all our supported OSes, and it is easier to understand without knowledge of shiny new or obscure old AIO system interfaces. I'll also use io_workers=1, an artificially low setting, to make it easy to spy on (pseudo) async I/O with strace/truss/dtruss on a single process, and max_parallel_workers_per_gather=0 to keep executor parallelism from confusing matters.

The first thing to notice is that there's an "io worker" process, and while filling up a big table with "insert into t select generate_series(1, 100000000)", it's doing a constant stream of 128KB pwritev() calls. These are writing out 16 blocks from shared buffers at a time:

  pwritev(44, [{iov_base=..., iov_len=73728},
               {iov_base=..., iov_len=24576},
               {iov_base=..., iov_len=32768}], 3, 228032512) = 131072

The reason there are 3 vectors there rather than 16 is just that some of the buffers happened to be adjacent in memory, and we might as well use the smallest number of vectors. Just after we've started up and the buffer pool is empty, it's easy to find big single vector I/Os, but things soon get more fragmented (blocks adjacent on disk become less likely to be adjacent in shared buffers) and that number goes up. That shouldn't make much difference to the OS or hardware, assuming decent scatter/gather support through the stack. If io_data_direct=on (not the default) and the blocks are in one physical extent on the file system, that might even go all the way down to the disk as a single multi-segment write command for the storage hardware DMA engine to beam directly in/out of our buffer pool without CPU involvement.

Mixed into that stream of I/O worker system calls, you'll also see WAL going out to disk:

  pwritev(15, [{iov_base=..., iov_len=1048576}], 1, 4194304) = 1048576

Meanwhile, the user session process running the big INSERT can be seen signalling the I/O worker to wake it up. The same thing happens for bgwriter, checkpointer, autovacuum and walwriter: you can see them all handing off most of their I/O work to the pool of I/O workers, with a bit of new signalling going on (which we try to minimise, and can probably minimise much more). (You might be able to see some evidence of Andres's new buffer cleaning scheme too, which avoids some bad patterns of interleaving small reads and writes, but I'm skipping right over that here...)

Working through a very simple example of how the I/O comes to be consolidated and parallelised, let's look at a simple non-parallel SELECT COUNT(*) query on a large table.
The I/O worker does a stream of scattered reads into our buffer pool:

  preadv(51, [{iov_base=..., iov_len=24576},
              {iov_base=..., iov_len=8192},
              {iov_base=..., iov_len=16384},
              {iov_base=..., iov_len=16384},
              {iov_base=..., iov_len=16384},
              {iov_base=..., iov_len=49152}], 6, 190808064) = 131072

Meanwhile our user session backend can be seen waking it up whenever it's trying to start I/O and finds it snoozing:

  kill(1803835, SIGURG) = 0
  kill(1803835, SIGURG) = 0
  kill(1803835, SIGURG) = 0
  kill(1803835, SIGURG) = 0
  kill(1803835, SIGURG) = 0

Notice that there are no sleeping system calls in the query backend, meaning that the I/O in this example is always finished by the time the executor gets around to accessing the page it requested, so we're staying far enough ahead and we can be 100% CPU bound. In unpatched PostgreSQL we'd hope to have no actual sleeping in such a simple case anyway, thanks to the OS's readahead heuristics; but (1) we'd still do individual pread(8KB) calls, meaning that the user's query is at least having to pay the CPU cost of a return trip into the kernel and a copyout of 8KB from kernel space to user space, here avoided, (2) in io_data_direct=on mode, there's no page cache and thus no kernel read ahead, so we need to replace that mechanism with something anyway, and (3) it's needed for non-sequential access like btree scans.

Sometimes I/Os are still run in user backends, for example because (1) existing non-AIO code paths are still reached, (2) in worker mode, some kinds of I/Os can't be handed off to another process due to lack of a way to open some fds, or because we're in single process mode, or (3) a heuristic kicks in when we know there's only one I/O to run and we know we'll immediately wait for it, so we can skip a lot of communication by doing a traditional synchronous syscall (worker mode only for now, needs to be done for the others).

In order to be able to generate a stream of big vectored reads/writes, and start them far enough ahead of time that they're finished before we need the data, there are several layers of new infrastructure that Andres already mentioned and can explain far better than I, but super briefly: heapam.c uses a "pg_streaming_read" object (aio_util.c) to get buffers to scan, instead of directly calling ReadBuffer(). It gives the pg_streaming_read a callback of its own, so that heapam.c remains in control of what is read, but the pg_streaming_read is in control of readahead distance and also "submission". heapam.c's callback calls ReadBufferAsync() to initiate reads of pages that it will need soon, which it does with pgaio_io_start_read_sb() if there's a cache miss. This results in 8KB reads queued up in the process's pending I/O list, with pgaio_read_sb_complete as the completion function to run when each read has eventually completed.

When the pending list grows to a certain length, it is submitted by pg_streaming_read code. That involves first "combining" pending I/Os: this is where reads/writes of adjacent ranges of files are merged into larger I/Os, up to a limit. Then the I/Os are submitted to the OS, and we'll eventually learn about their completion via io_method-specific means (see aio_worker.c, aio_uring.c, aio_posix.c and one day probably also aio_win32.c). At that point, merged I/Os will be uncombined. Skipping over some complication about retrying on some kinds of failure/partial I/O, that leads to ReadBufferCompleteWrite() being called for each buffer.
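As a rough illustration of that shape - the consumer's callback chooses the next block, while the helper controls readahead distance and submission - here is a small sketch. The PgStreamingRead type and the pg_streaming_read_* names, argument lists and callback signature are invented for illustration and will not match the real aio_util.c API; only the overall pattern comes from the description above:

    /*
     * Sketch of the "streaming read" pattern. All pg_streaming_read_*
     * names and signatures are assumed, not taken from the patch set.
     */
    typedef struct DemoScanState
    {
        PgStreamingRead *stream;     /* assumed helper object type */
        BlockNumber     *blocks;     /* blocks the scan will need, in order */
        int              nblocks;
        int              next;
    } DemoScanState;

    /* Callback: hand the helper the next block to prefetch, if any. */
    static BlockNumber
    demo_scan_next_block(void *private_data)
    {
        DemoScanState *state = (DemoScanState *) private_data;

        if (state->next >= state->nblocks)
            return InvalidBlockNumber;
        return state->blocks[state->next++];
    }

    /*
     * Consumer loop: the helper keeps enough reads in flight that, ideally,
     * each buffer is already valid by the time we ask for it.
     */
    static void
    demo_scan(DemoScanState *state)
    {
        Buffer  buf;

        state->stream = pg_streaming_read_alloc(demo_scan_next_block, state);
        while ((buf = pg_streaming_read_get_next(state->stream)) != InvalidBuffer)
        {
            /* ... process the page in buf ... */
            ReleaseBuffer(buf);
        }
        pg_streaming_read_free(state->stream);
    }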
(Far be it from me to try to explain, at this stage, the rather complex interlocking required to deal with pins and locks between ReadBufferAsync() and ReadBufferCompleteWrite() running in (potentially) another process while the I/O is in progress.)

Places in the tree that want to do carefully controlled I/O depth management can consider using pg_streaming_{read,write}, providing their own callback to do the real work (though it's not necessary, and not all AIO uses suit the "streaming" model). There's also the traditional PrefetchBuffer() mechanism, which can still be used to initiate buffer reads as before. It's comparatively primitive; since you don't know when the I/O completes, you have to use conservative models, as I do in my proposed WAL prefetching patch. That patch (like probably many others, such as CF #2799) works just fine on top of the AIO branch, with some small tweaks: it happily shifts all I/O system calls out of the recovery process, so that instead of calling posix_fadvise() and then a bit later pread() for each cold page accessed, it makes one submission system call for every N cold pages (or, in some cases, no system calls at all). A future, better integration would probably use pg_streaming_read for precise control of the I/O depth instead of the little LSN queue it currently uses, but I haven't tried to write that yet.

If you do simple large INSERTs and SELECTs with one of the native io_method settings instead of worker mode, it'd be much the same in terms of most of the architecture. The information in the pg_stat_XXX views is almost exactly the same. There are two major differences: (1) the other methods have no I/O worker processes, because the kernel manages the I/O (or, in some unfortunate cases, runtime libraries fake it with threads), and (2) the "shared completion callbacks" (see aio_scb.c) are run by I/O workers in worker mode, but are run by whichever process "drains" the I/O in the other modes. That is, initiating processes never hear about I/Os completing from the operating system; they just eventually wait on them and find that they're already completed (though they do run the "local callback" if there is one, which is, for example, the point at which pg_streaming_read might initiate more I/O), or alternatively see that they aren't, and wait on a condition variable for an I/O worker to signal completion. So far this seems like a good choice...

Hope that helped show off a couple of features of this scheme.
Hi,
Thank you for the amazing and great work.

On 23.02.2021 15:03, Andres Freund wrote:
> ## Stats
>
> There are two new views: pg_stat_aios showing AIOs that are currently
> in-progress, pg_stat_aio_backends showing per-backend statistics about AIO.

As a DBA I would like to propose a few amendments that might help with practical usage of the stats when the feature is finally implemented. My suggestions aren't related to the central idea of the proposed changes, but rather to the stats part.

A quick side note: there are two relevant terms in Prometheus (https://prometheus.io/docs/concepts/metric_types/):

1. Counter. A counter is a cumulative metric that represents a single monotonically increasing counter whose value can only increase or be reset to zero on restart.
2. Gauge. A gauge is a metric that represents a single numerical value that can arbitrarily go up and down.

For the purposes of long-term stats collection, COUNTERs are preferred over GAUGEs, because COUNTERs allow us to understand how metrics change over time without missing potential spikes in activity. As a result, we have a much better historic perspective. Measuring and collecting GAUGEs is limited to the moments in time when the stats are taken (snapshots), so the changes that took place between the snapshots remain unmeasured. In systems with a high rate of transactions per second (even with a 1 second interval between snapshots), GAUGE measuring won't provide the full picture. In addition, most monitoring systems like Prometheus, Zabbix, etc. use longer intervals (from 10-15 to 60 seconds).

The main idea is to try to expose almost all numeric stats as COUNTERs - this increases the overall observability of the implemented feature.

pg_stat_aios.
In general, this stat is a set of text values, and at the same time it looks GAUGE-like (similar to pg_stat_activity or pg_locks), and is only relevant for the moment when the user is looking at it. I think it would be better to rename this view to pg_stat_progress_aios. And keep pg_stat_aios for other AIO stats with global COUNTERs (like stuff in pg_stat_user_tables or pg_stat_statements, or system-wide /proc/stat, /proc/diskstats).

pg_stat_aio_backends.
This stat is based on COUNTERs, which is great, but the issue here is that its lifespan is limited by the lifespan of the backend processes - once the backend exits, the stat will no longer be available - which could be inappropriate in workloads with short-lived backends.

I think there might be a few existing examples in the current code that could be repurposed to implement the suggestions above (such as pg_stat_user_tables, pg_stat_database, etc). With this in mind, I think having these changes incorporated shouldn't take significant effort considering the benefit it will bring to the final user.

Once again, huge respect for your work on these changes, and good luck.

Regards,

Alexey
I guess what I would be looking for in stats would be a way to tell
what the bandwidth, latency, and queue depth is. Ideally one day broken down by relation/index and pg_stat_statement record.

I think seeing the actual in flight async requests in a connection is probably not going to be very useful in production. It's very low level, and in production the user is just going to find that overwhelming detail. It is kind of cool to see the progress in sequential operations, but I think that should be solved in a higher level way than this anyways.

What we need to calculate these values would be the kinds of per-op stats nfsiostat uses from /proc/self/mountstats:
https://utcc.utoronto.ca/~cks/space/blog/linux/NFSMountstatsNFSOps

So number of async reads we've initiated, how many callbacks have been called, total cumulative elapsed time between i/o issued and i/o completed, total bytes of i/o initiated, total bytes of i/o completed. As well as a counter of requests which returned errors (eof? i/o error?). If there are other locks or queues internal to postgres, total time spent in those states too.

I have some vague idea that we should have a generic infrastructure for stats that automatically counts things associated with plan nodes and automatically bubbles that data up to the per-transaction, per-backend, per-relation, and pg_stat_statements stats. But that's a whole other ball of wax :)
> On Tue, Feb 23, 2021 at 02:03:44AM -0800, Andres Freund wrote:
> over the last ~year I spent a lot of time trying to figure out how we could
> add AIO (asynchronous IO) and DIO (direct IO) support to postgres. While
> there's still a *lot* of open questions, I think I now have a decent handle on
> most of the bigger architectural questions. Thus this long email.
>
> Just to be clear: I don't expect the current design to survive as-is. If
> there's a few sentences below that sound a bit like they're describing the
> new world, that's because they're from the README.md in the patch series...

Thanks!

> Comments? Questions?
>
> I plan to send separate emails about smaller chunks of this - the whole
> topic is just too big. In particular I plan to send something around buffer
> locking / state management - it's one of the core issues around this imo.

I'm curious about control knobs for this feature; it's somewhat related to the stats questions also discussed in this thread. I guess the most important of those are max_aio_in_flight, io_max_concurrency etc., and they're going to be hard limits, right? I'm curious if it makes sense to explore the possibility to have some sort of "backpressure", e.g. if the number of inflight requests is too large, calculate inflight_limit a bit lower than possible (to avoid hard performance deterioration when the db is trying to do too much IO, and rather do it smoothly). From what I remember io_uring does have something similar, but only for SQPOLL. Another similar question: could this be used for throttling of some overloaded workers in case of misconfigured clients or such?
Hi,
On 2021-02-24 21:15:14 +0500, Alexey Lesovsky wrote:
> pg_stat_aios.
> In general, this stat is a set of text values, and at the same time it looks
> GAUGE-like (similar to pg_stat_activity or pg_locks), and is only relevant
> for the moment when the user is looking at it. I think it would be better to
> rename this view to pg_stat_progress_aios. And keep pg_stat_aios for other
> AIO stats with global COUNTERs (like stuff in pg_stat_user_tables or
> pg_stat_statements, or system-wide /proc/stat, /proc/diskstats).

Right - arguably it really shouldn't even have _stat_ in the name... I don't particularly like the idea of adding _progress_, as that seems like it'd lead to confusing it with pg_stat_progress_vacuum etc. - and it's quite a different beast.

> pg_stat_aio_backends.
> This stat is based on COUNTERs, which is great, but the issue here is that
> its lifespan is limited by the lifespan of the backend processes - once the
> backend exits, the stat will no longer be available - which could be
> inappropriate in workloads with short-lived backends.

There's a todo somewhere to roll over the per-connection stats into a global stats piece on disconnect. In addition I was thinking of adding a view that sums up the values of "already disconnected backends" and the currently connected ones. Would that mostly address your concerns?

> I think there might be a few existing examples in the current code that could
> be repurposed to implement the suggestions above (such as
> pg_stat_user_tables, pg_stat_database, etc). With this in mind, I think
> having these changes incorporated shouldn't take significant effort
> considering the benefit it will bring to the final user.

Yea - I kind of was planning to go somewhere roughly in the direction you suggest, but took a few shortcuts due to the size of the project. Having the views made it a lot easier to debug / develop, but supporting longer-lived stuff wasn't yet crucial. But I agree, we really should have it...

Greetings,

Andres Freund
Hi,
On 2021-02-24 14:59:19 -0500, Greg Stark wrote:
> I guess what I would be looking for in stats would be a way to tell
> what the bandwidth, latency, and queue depth is. Ideally one day
> broken down by relation/index and pg_stat_statement record.

I think doing it at that granularity will likely be too expensive...

> I think seeing the actual in flight async requests in a connection is
> probably not going to be very useful in production.

I think it's good for analyzing concrete performance issues, but probably not much more than that. Although it's not too hard to build sampling on top of it with a tiny bit of work (it should display the relfilenode etc).

> So number of async reads we've initiated, how many callbacks have been
> called, total cumulative elapsed time between i/o issued and i/o
> completed, total bytes of i/o initiated, total bytes of i/o completed.

Much of that is already in pg_stat_aio_backends - but is lost after disconnect (easy to solve). We don't track bytes of IO currently, but that'd not be hard to add.

However, it's surprisingly hard to do the measurement between "issued" and "completed" in a meaningful way. It's obviously not hard to measure the time at which the request was issued, but there's no real way to determine the time at which it was completed. If a backend is busy doing other things (e.g. invoking aggregate transition functions), we'll not see the completion immediately, and therefore not have an accurate timestamp. With several methods of doing AIO we can set up signals that fire on completion, but that's pretty darn expensive. And it's painful to write such signal handlers in a safe way.

> I have some vague idea that we should have a generic infrastructure
> for stats that automatically counts things associated with plan nodes
> and automatically bubbles that data up to the per-transaction,
> per-backend, per-relation, and pg_stat_statements stats. But that's a
> whole other ball of wax :)

Heh, yea, let's tackle that separately ;)

Greetings,

Andres Freund
Hi,
On 2021-02-24 21:41:16 +0100, Dmitry Dolgov wrote:
> I'm curious about control knobs for this feature; it's somewhat related
> to the stats questions also discussed in this thread. I guess the most
> important of those are max_aio_in_flight, io_max_concurrency etc., and
> they're going to be hard limits, right?

Yea - there's a lot more work needed in that area. io_max_concurrency especially really should be a GUC, I was just too lazy for that so far.

> I'm curious if it makes sense
> to explore the possibility to have some sort of "backpressure", e.g. if
> the number of inflight requests is too large, calculate inflight_limit a bit
> lower than possible (to avoid hard performance deterioration when the db
> is trying to do too much IO, and rather do it smoothly).

It's decidedly nontrivial to compute "too large" - and it's pretty workload dependent (e.g. lower QDs are better for latency-sensitive OLTP, higher QDs are better for bulk r/w heavy analytics). So I don't really want to go there for now - the project is already very large.

What I do think is needed and feasible (there's a bunch of TODOs in the code about it already) is to be better at only utilizing deeper queues when shallower queues don't suffice. So that we e.g. don't read ahead more than a few blocks for a scan where the query is spending most of the time "elsewhere".

There's definitely also some need for a bit better global, instead of per-backend, control over the number of IOs in flight. That's not too hard to implement - the hardest part probably is to avoid it becoming a scalability issue.

I think the area with the most need for improvement is figuring out how we determine the queue depths for different things using IO. I don't really want to end up with 30 parameters influencing what queue depth to use for (vacuum, index builds, sequential scans, index scans, bitmap heap scans, ...) - but how much they benefit from a deeper queue will differ between places.

> From what I remember io_uring does have something similar, but only for
> SQPOLL. Another similar question: could this be used for throttling
> of some overloaded workers in case of misconfigured clients or such?

You mean dynamically? Or just by setting the concurrency lower for certain users? I think doing so dynamically is way too complicated for now. But I'd expect configuring it on a per-user basis or such to be a reasonable thing. That might require splitting it into two GUCs - one SUSET one and a second one that's settable by any user, but which can only lower the depth.

I think it'll be pretty useful to e.g. configure autovacuum to have a low queue depth instead of using the current cost limiting. That way the impact on the overall system is limited, but it's not slowed down unnecessarily as much.

Greetings,

Andres Freund
> On Wed, Feb 24, 2021 at 01:45:10PM -0800, Andres Freund wrote:
> > I'm curious if it makes sense
> > to explore the possibility to have some sort of "backpressure", e.g. if
> > the number of inflight requests is too large, calculate inflight_limit a bit
> > lower than possible (to avoid hard performance deterioration when the db
> > is trying to do too much IO, and rather do it smoothly).
>
> What I do think is needed and feasible (there's a bunch of TODOs in the
> code about it already) is to be better at only utilizing deeper queues
> when shallower queues don't suffice. So that we e.g. don't read ahead more
> than a few blocks for a scan where the query is spending most of the time
> "elsewhere".
>
> There's definitely also some need for a bit better global, instead of
> per-backend, control over the number of IOs in flight. That's not too
> hard to implement - the hardest part probably is to avoid it becoming a
> scalability issue.
>
> I think the area with the most need for improvement is figuring out how
> we determine the queue depths for different things using IO. I don't
> really want to end up with 30 parameters influencing what queue depth to
> use for (vacuum, index builds, sequential scans, index scans, bitmap
> heap scans, ...) - but how much they benefit from a deeper queue will
> differ between places.

Yeah, that sounds like an interesting opportunity for improvements. I'm preparing a few benchmarks to understand better how this all works, so I will keep this in mind.

> > From what I remember io_uring does have something similar, but only for
> > SQPOLL. Another similar question: could this be used for throttling
> > of some overloaded workers in case of misconfigured clients or such?
>
> You mean dynamically? Or just by setting the concurrency lower for
> certain users? I think doing so dynamically is way too complicated for
> now. But I'd expect configuring it on a per-user basis or such to be a
> reasonable thing. That might require splitting it into two GUCs - one
> SUSET one and a second one that's settable by any user, but which can
> only lower the depth.
>
> I think it'll be pretty useful to e.g. configure autovacuum to have a
> low queue depth instead of using the current cost limiting. That way the
> impact on the overall system is limited, but it's not slowed down
> unnecessarily as much.

Yes, you got it right: not dynamically, but rather exposing this to be configured on e.g. a per-user basis.
Hi,
On 25.02.2021 02:03, Andres Freund wrote:
> > pg_stat_aio_backends.
> > This stat is based on COUNTERs, which is great, but the issue here is that
> > its lifespan is limited by the lifespan of the backend processes - once the
> > backend exits, the stat will no longer be available - which could be
> > inappropriate in workloads with short-lived backends.
>
> There's a todo somewhere to roll over the per-connection stats into a
> global stats piece on disconnect. In addition I was thinking of adding a
> view that sums up the values of "already disconnected backends" and the
> currently connected ones. Would that mostly address your concerns?

Yes, the approach with separate stats for live and disconnected backends looks good and solves the "stats loss" problem. Or it could be done like the stats for shared objects in pg_stat_database, where a special NULL database is used.

Regards,

Alexey
Sorry for another late reply; I finally found some time to formulate a couple of thoughts.
> On Thu, Feb 25, 2021 at 09:22:43AM +0100, Dmitry Dolgov wrote:
> > On Wed, Feb 24, 2021 at 01:45:10PM -0800, Andres Freund wrote:
> > > I'm curious if it makes sense
> > > to explore the possibility to have some sort of "backpressure", e.g. if
> > > the number of inflight requests is too large, calculate inflight_limit a bit
> > > lower than possible (to avoid hard performance deterioration when the db
> > > is trying to do too much IO, and rather do it smoothly).
> >
> > What I do think is needed and feasible (there's a bunch of TODOs in the
> > code about it already) is to be better at only utilizing deeper queues
> > when shallower queues don't suffice. So that we e.g. don't read ahead more
> > than a few blocks for a scan where the query is spending most of the time
> > "elsewhere".
> >
> > There's definitely also some need for a bit better global, instead of
> > per-backend, control over the number of IOs in flight. That's not too
> > hard to implement - the hardest part probably is to avoid it becoming a
> > scalability issue.
> >
> > I think the area with the most need for improvement is figuring out how
> > we determine the queue depths for different things using IO. I don't
> > really want to end up with 30 parameters influencing what queue depth to
> > use for (vacuum, index builds, sequential scans, index scans, bitmap
> > heap scans, ...) - but how much they benefit from a deeper queue will
> > differ between places.

Talking about parameters: from what I understand, the actual number of queues (e.g. io_urings) created is specified by PGAIO_NUM_CONTEXTS - shouldn't that be configurable? Maybe in fact there should not be that many knobs after all, if the model assumes the storage has:

* Some number of hardware queues; then the number of queues the AIO implementation needs to use depends on that. For example, lowering the number of contexts between different benchmark runs, I could see that some of the hardware queues were significantly underutilized. Potentially there could also be such a thing as too many contexts.

* A certain bandwidth; then the submit batch size (io_max_concurrency or PGAIO_SUBMIT_BATCH_SIZE) depends on that. This would allow distinguishing attached storage with high bandwidth and high latency from local storage.

From what I see, max_aio_in_flight is used as the queue depth for contexts, which is workload dependent and not easy to figure out, as you mentioned. To avoid having 30 different parameters, maybe it's more feasible to introduce "shallow" and "deep" queues, where the particular depth for those could be derived from the depth of the hardware queues. The question of which activity should use which queue is not easy, but if I get it right from queueing theory (assuming IO producers are stationary processes and fixed IO latency from the storage), it depends on the IO arrival distribution in every particular case, and this in turn could be roughly estimated for each type of activity. One can expect different IO arrival distributions for e.g. a normal point-query backend and a checkpoint or vacuum process, no matter what the other conditions are (collecting those for a few benchmark runs indeed gives pretty distinct distributions).

If I understand correctly, those contexts defined by PGAIO_NUM_CONTEXTS are the main workhorse, right? I'm asking because there is also something called local_ring, but it seems no IOs are submitted into those. Assuming that contexts are the main way of submitting IO, it would also be interesting to explore contexts isolated for different purposes.
I haven't yet finished my changes here enough to give any results, but at least some tests with fio show different latencies when two io_urings are processing mixed reads/writes vs. isolated reads or writes. As a side note, at the end of the day there are so many queues - application queue, io_uring, mq software queue, hardware queue - that I'm really curious whether this amplifies tail latencies.

Another thing I've noticed is that the AIO implementation is much more significantly affected by side IO activity than the synchronous one. E.g. the AIO version's tps drops from tens of thousands to a couple of hundred just because some kworker started to flush dirty buffers (especially with writeback throttling disabled), while the synchronous version doesn't suffer that much. Not sure what to make of it.

Btw, overall I've managed to get better numbers from the AIO implementation on IO-bound test cases with a local NVMe device, but non-IO-bound ones were mostly a bit slower - is that expected, or am I missing something?

An interesting thing to note is that the io_uring implementation apparently relaxed the requirements for polling operations; now one needs only the CAP_SYS_NICE capability, not CAP_SYS_ADMIN. I guess theoretically there are no issues using it within the current design?