Asynchronous and "direct" IO support for PostgreSQL.

Andres Freund
Hi,

over the last ~year I spent a lot of time trying to figure out how we could
add AIO (asynchronous IO) and DIO (direct IO) support to postgres. While
there's still a *lot* of open questions, I think I now have a decent handle on
most of the bigger architectural questions.  Thus this long email.


Just to be clear: I don't expect the current design to survive as-is. If
there's a few sentences below that sound a bit like describing the new world,
that's because they're from the README.md in the patch series...


## Why Direct / unbuffered IO?

The main reasons to want to use Direct IO are:

- Lower CPU usage / higher throughput. Particularly on modern storage,
  buffered reads and writes are bottlenecked by the operating system
  having to copy data between the kernel's page cache and postgres'
  buffer pool using the CPU. Direct IO, in contrast, can often move the
  data directly between the storage devices and postgres' buffer cache,
  using DMA. While that transfer is ongoing, the CPU is free to perform
  other work, with little impact.
- Avoiding double buffering between operating system cache and
  postgres' shared_buffers.
- Better control over the timing and pace of dirty data writeback.
- Potential for concurrent WAL writes (via O_DIRECT | O_DSYNC writes)


The main reasons *not* to use Direct IO are:

- Without AIO, Direct IO is unusably slow for most purposes.
- Even with AIO, many parts of postgres need to be modified to perform
  explicit prefetching.
- In situations where shared_buffers cannot be set appropriately
  large, e.g. because there are many different postgres instances
  hosted on shared hardware, performance will often be worse than when
  using buffered IO.


## Why Asynchronous IO

- Without AIO we cannot use DIO

- Without asynchronous IO (AIO) PG has to rely on the operating system
  to hide the cost of synchronous IO. While this works surprisingly
  well in a lot of workloads, it does not do as good a job at
  prefetching and controlled writeback as we would like.
- There are important expensive operations like fdatasync() where the
  operating system cannot hide the storage latency. This is particularly
  important for WAL writes, where the ability to asynchronously issue
  fdatasync() or O_DSYNC writes can yield significantly higher
  throughput.
- Fetching data into shared buffers asynchronously and concurrently with query
  execution means there is more CPU time for query execution.


## High level difficulties adding AIO/DIO support

- Optionally using AIO leads to convoluted and / or duplicated code.

- Platform dependency: The common AIO APIs are typically specific to one
  platform (linux AIO, linux io_uring, windows IOCP, windows overlapped IO) or
  a few platforms (posix AIO, but there's many differences).

- There are a lot of separate places doing IO in PG. Moving all of these to
  efficiently use AIO is an, um, large undertaking.

- Nothing in the buffer management APIs expects more than one IO to be
  in progress at the same time - which is required to do AIO.


## Portability & Duplication

To avoid needing non-AIO codepaths to support platforms without native
AIO support, a worker-process-based AIO implementation exists (and is
currently the default). This is also convenient for checking whether a
problem is related to the native IO implementation or not.

Thanks to Thomas Munro for helping a *lot* around this area. He wrote
the worker mode, the posix aio mode, added CI, did a lot of other
testing, listened to me...


## Deadlock and Starvation Dangers due to AIO

Using AIO in a naive way can easily lead to deadlocks in an environment where
the source/target of AIO are shared resources, like pages in postgres'
shared_buffers.

Consider one backend performing readahead on a table, initiating IO for a
number of buffers ahead of the current "scan position". If that backend then
performs some operation that blocks, or even just is slow, the IO completion
for the asynchronously initiated read may not be processed.

This AIO implementation solves this problem by requiring that AIO methods
either allow AIO completions to be processed by any backend in the system
(e.g. io_uring, and indirectly posix, via signal handlers), or guarantee
that AIO processing will happen even when the issuing backend is blocked
(e.g. worker mode, which offloads completion processing to the AIO workers).


## AIO API overview

The main steps to use AIO (without higher level helpers) are the
following (a sketch putting them together follows the list):

1) acquire an "unused" AIO: pgaio_io_get()

2) start some IO, this is done by functions like
   pgaio_io_start_(read|write|fsync|flush_range)_(smgr|sb|raw|wal)

   The (read|write|fsync|flush_range) indicates the operation, whereas
   (smgr|sb|raw|wal) determines how IO completions, errors, ... are handled.

   (see below for more details about this design choice - it might or
   might not be right)

3) optionally: assign a backend-local completion callback to the IO
   (pgaio_io_on_completion_local())

4) Step 2) alone does *not* cause the IO to be submitted to the kernel, but to be
   put on a per-backend list of pending IOs. The pending IOs can be explicitly
   flushed with pgaio_submit_pending(), but will also be submitted if the
   pending list gets to be too large, or if the current backend waits for the
   IO.

   There are two main reasons not to submit the IO immediately:
   - If adjacent, we can merge several IOs into one "kernel level" IO during
     submission. Larger IOs are considerably more efficient.
   - Several AIO APIs allow submitting a batch of IOs in one system call.

5) wait for the IO: pgaio_io_wait() waits for an IO "owned" by the current
   backend. When other backends may need to wait for an IO to finish,
   pgaio_io_ref() can put a reference to that AIO in shared memory (e.g. a
   BufferDesc), which can be waited for using pgaio_io_wait_ref().

6) Process the results of the request. If a callback was registered in 3),
   this isn't always necessary. The results of AIO can be accessed using
   pgaio_io_result() which returns an integer where negative numbers are
   -errno, and positive numbers are the [partial] success conditions
   (e.g. potentially indicating a short read).

7) release ownership of the io (pgaio_io_release()) or reuse the IO for
   another operation (pgaio_io_recycle())
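
Putting those steps together, a minimal sketch for a single smgr-level read
could look roughly like the code below. Only pgaio_io_start_read_smgr()'s
signature is taken from the patch (it is shown further down); the other
argument lists and my_readahead_callback are illustrative guesses, not the
actual API:

static void
read_one_block_sketch(SMgrRelation smgr, ForkNumber forknum,
                      BlockNumber blocknum, char *bufdata)
{
    PgAioInProgress *io;
    int         result;

    /* 1) acquire an unused AIO */
    io = pgaio_io_get();

    /* 2) stage a read; this only puts it on the backend's pending list */
    pgaio_io_start_read_smgr(io, smgr, forknum, blocknum, bufdata);

    /* 3) optionally register a backend-local completion callback
     *    (argument list is a guess) */
    pgaio_io_on_completion_local(io, my_readahead_callback);

    /* 4) submit pending IOs explicitly (also happens implicitly when the
     *    pending list grows too large, or when we wait for the IO) */
    pgaio_submit_pending();

    /* 5) wait for an IO owned by this backend */
    pgaio_io_wait(io);

    /* 6) negative results are -errno, positive ones may e.g. indicate a
     *    short read */
    result = pgaio_io_result(io);
    if (result < 0)
        elog(ERROR, "read failed: %s", strerror(-result));

    /* 7) release the IO (or pgaio_io_recycle() it for another operation) */
    pgaio_io_release(io);
}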


Most places that want to use AIO shouldn't themselves need to care about
managing the number of writes in flight, or the readahead distance. To help
with that there are two helper utilities, a "streaming read" and a "streaming
write".

The "streaming read" helper uses a callback to determine which blocks to
prefetch - that allows to do readahead in a sequential fashion but importantly
also allows to asynchronously "read ahead" non-sequential blocks.

E.g. for vacuum, lazy_scan_heap() has a callback that uses the visibility map
to figure out which block needs to be read next. Similarly lazy_vacuum_heap()
uses the tids in LVDeadTuples to figure out which blocks are going to be
needed. Here's the latter as an example:
https://github.com/anarazel/postgres/commit/a244baa36bfb252d451a017a273a6da1c09f15a3#diff-3198152613d9a28963266427b380e3d4fbbfabe96a221039c6b1f37bc575b965R1906
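
To sketch the shape of such a block-selection callback: in the fragment
below the struct, the helper block_is_all_visible(), the callback signature
and the use of InvalidBlockNumber as an end-of-stream marker are all
invented for illustration; the patch's actual pg_streaming_read API differs
in detail:

/* Invented illustration - not the patch's actual callback signature. */
typedef struct VacuumScanState
{
    BlockNumber next_block;     /* next block to consider */
    BlockNumber nblocks;        /* size of the relation */
    /* ... visibility map information ... */
} VacuumScanState;

static BlockNumber
vacuum_scan_next_block(void *arg)
{
    VacuumScanState *scan = (VacuumScanState *) arg;

    /* skip blocks the visibility map says don't need to be read */
    while (scan->next_block < scan->nblocks &&
           block_is_all_visible(scan, scan->next_block))   /* hypothetical */
        scan->next_block++;

    if (scan->next_block >= scan->nblocks)
        return InvalidBlockNumber;      /* end of scan */

    return scan->next_block++;
}

The streaming read helper then calls this as often as needed to keep a
controlled number of asynchronous reads in flight ahead of the scan.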


## IO initialization layering

One difficulty I had in this process was how to initialize IOs in light of the
layering (from bufmgr.c over smgr.c and md.c to fd.c and back, but also
e.g. xlog.c). Sometimes AIO needs to be initialized on the bufmgr.c level,
sometimes on the md.c level, sometimes on the level of fd.c. But to be able to
react to the completion of any such IO, metadata about the operation is needed.

Early on fd.c initialized IOs, and the context information was just passed
through to fd.c. But that seems quite wrong - fd.c shouldn't have to know
about which Buffer an IO is about. But higher levels shouldn't know which
file an operation resides in either, so they can't do all the work
themselves...

To avoid that, I ended up splitting the "start an AIO" operation into a higher
level part, e.g. pgaio_io_start_read_smgr() - which doesn't know about which
smgr implementation is in use and thus also not what file/offset we're dealing
with, which calls into smgr->md->fd to actually "prepare" the IO (i.e. figure
out file / offset).  This currently looks like:

void
pgaio_io_start_read_smgr(PgAioInProgress *io, struct SMgrRelationData* smgr, ForkNumber forknum,
                                                 BlockNumber blocknum, char *bufdata)
{
        pgaio_io_prepare(io, PGAIO_OP_READ);

        smgrstartread(io, smgr, forknum, blocknum, bufdata);

        io->scb_data.read_smgr.tag = (AioBufferTag){
                .rnode = smgr->smgr_rnode,
                .forkNum = forknum,
                .blockNum = blocknum
        };

        pgaio_io_stage(io, PGAIO_SCB_READ_SMGR);
}

Once this reaches the fd.c layer the new FileStartRead() function calls
pgaio_io_prep_read() on the IO - but doesn't need to know anything about weird
higher level stuff like relfilenodes.

The _sb (_sb for shared_buffers) variant stores the Buffer, backend and mode
(as in ReadBufferMode).


I'm not sure this is the right design - but it seems a lot better than what I
had earlier...


## Callbacks

In the core AIO pieces there are two different types of callbacks at the
moment:

Shared callbacks, which can be invoked by any backend (normally the issuing
backend / the AIO workers, but can be other backends if they are waiting for
the IO to complete). For operations on shared resources (e.g. shared buffer
reads/writes, or WAL writes) these shared callbacks need to transition the
state of the object the IO is being done for to completion. E.g. for a shared
buffer read that means setting BM_VALID / unsetting BM_IO_IN_PROGRESS.

The main reason these callbacks exist is that they make it safe for a backend
to issue non-blocking IO on buffers (see the deadlock section above). As any
blocked backend can cause the IO to complete, the deadlock danger is gone.


Local callbacks, one of which the issuer of an IO can associate with the
IO. These can be used to issue further readahead. I initially did not have
these, but I found it hard to have a controllable number of IOs in
flight. They are currently mainly used for error handling (e.g. erroring out
when XLogFileInit() cannot create the file due to ENOSPC), and to issue more
IO (e.g. readahead for heapam).

The local callback system isn't quite right yet, and there's more work
needed in that area.
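
To illustrate what a local callback might do for readahead: in the sketch
below the callback signature, the ScanReadaheadState struct and the
start_readahead_for_block() helper are invented for illustration, they are
not the patch's actual API.

/* Invented illustration - not the patch's actual callback signature. */
typedef struct ScanReadaheadState
{
    int         inflight;       /* IOs currently in flight for this scan */
    int         target_depth;   /* how far we try to stay ahead */
    BlockNumber next_block;
    BlockNumber nblocks;
} ScanReadaheadState;

static void
my_readahead_callback(PgAioInProgress *io, void *private_data)
{
    ScanReadaheadState *ra = (ScanReadaheadState *) private_data;

    ra->inflight--;

    /* keep a bounded number of asynchronous reads in flight ahead of the
     * scan, issuing more IO from the completion of earlier IO */
    while (ra->inflight < ra->target_depth && ra->next_block < ra->nblocks)
    {
        start_readahead_for_block(ra, ra->next_block++);    /* hypothetical */
        ra->inflight++;
    }
}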


## AIO conversions

Currently the patch series converts a number of subsystems to AIO. They are of
very varying quality. I mainly did the conversions that I considered to either
be of interest architecturally, or that caused a fair bit of pain due to slowness
(e.g. VACUUMing without AIO is no fun at all when using DIO). Some also for
fun ;)

Most conversions are fairly simple. E.g. heap scans, checkpointer, bgwriter,
VACUUM are all not too complicated.

There are two conversions that are a good bit more complicated/experimental:

1) Asynchronous, concurrent, WAL writes. This is important because we right
   now are very bottlenecked by IO latency, because there effectively only
   ever is one WAL IO in flight at the same time - even though in many
   situations it would be possible to issue a WAL write, have one [set of]
   backends wait for that write as its completion satisfies their XLogFlush()
   needs, and concurrently already issue the next WAL write(s) that other
   backends need.

   The code here is very crufty, but I think the concept is mostly right.

2) Asynchronous buffer replacement. Even with buffered IO we experience a lot
   of pain when ringbuffers need to write out data (VACUUM!). But with DIO the
   issue gets a lot worse - the kernel can't hide the write latency from us
   anymore.  This change makes each backend asynchronously clean out buffers
   that it will need soon. When a ringbuffer is in use this means cleaning out
   buffers in the ringbuffer, when not, performing the clock sweep and cleaning
   out victim buffers.  Due to 1) the XLogFlush() can also be done
   asynchronously.


There are a *lot* of places that haven't been converted to use AIO.


## Stats

There are two new views: pg_stat_aios showing AIOs that are currently
in-progress, pg_stat_aio_backends showing per-backend statistics about AIO.


## Code:

https://github.com/anarazel/postgres/tree/aio

I was not planning to attach all the patches on the way to AIO - it's too many
right now... I hope I can reduce the size of the series iteratively into
easier to look at chunks.


## TL;DR: Performance numbers

This is worth an email on its own, and it's pretty late here already and I
want to rerun benchmarks before posting more numbers. So here are just a few
that I could run before falling asleep.


1) 60s of parallel COPY BINARY of an 89MB file into separate tables (s_b = 96GB):

slow NVMe SSD:
branch  dio  clients  tps/stddev   checkpoint write time
master  n    8        3.0/2296 ms  4.1s  /  379647 buffers =  723MiB/s
aio     n    8        3.8/1985 ms  11.5s / 1028669 buffers =  698MiB/s
aio     y    8        4.7/204 ms   10.0s / 1164933 buffers =  910MiB/s

raid of 2 fast NVMe SSDs (on pcie3):
branch  dio  clients  tps/stddev   checkpoint write time
master  n    8        9.7/62 ms    7.6s  / 1206376 buffers = 1240MiB/s
aio     n    8        11.4/82 ms   14.3s / 2838129 buffers = 1550MiB/s
aio     y    8        18.1/56 ms   8.9s  / 4486170 buffers = 3938MiB/s


2) pg prewarm speed

raid of 2 fast NVMe SSDs (on pcie3):

pg_prewarm(62GB, read)
branch  dio  time   bw
master  n    17.4s  3626MiB/s
aio     n    10.3s  6126MiB/s (higher cpu usage)
aio     y    9.8s   6438MiB/s

pg_prewarm(62GB, buffer)
branch  dio  time   bw
master  n    38.3s  1647MiB/s
aio     n    13.6s  4639MiB/s (higher cpu usage)
aio     y    10.7s  5897MiB/s



3) parallel sequential scan speed

parallel sequential scan + count(*) of 59GB table:

branch  dio  max_parallel  time
master  n    0             40.5s
master  n    1             22.6s
master  n    2             16.4s
master  n    4             10.9s
master  n    8             9.3s

aio     y    0             33.1s
aio     y    1             17.2s
aio     y    2             11.8s
aio     y    4             9.0s
aio     y    8             9.2s



On local SSDs there's some, but not a huge performance advantage in most
transactional r/w workloads. But on cloud storage - which has a lot higher
latency - AIO can yield huge advantages. I've seen over 4x.


There's definitely also cases where AIO currently hurts - most of those I just
didn't get around to addressing.

There's a lot more cases in which DIO currently hurts - mostly because the
necessary smarts haven't yet been added.


Comments? Questions?

I plan to send separate emails about smaller chunks of this - the whole
topic is just too big. In particular I plan to send something around
buffer locking / state management - it's one of the core issues around
this imo.


Regards,

Andres



Re: Asynchronous and "direct" IO support for PostgreSQL.

Greg Stark
On Tue, 23 Feb 2021 at 05:04, Andres Freund <[hidden email]> wrote:

>
> ## Callbacks
>
> In the core AIO pieces there are two different types of callbacks at the
> moment:
>
> Shared callbacks, which can be invoked by any backend (normally the issuing
> backend / the AIO workers, but can be other backends if they are waiting for
> the IO to complete). For operations on shared resources (e.g. shared buffer
> reads/writes, or WAL writes) these shared callbacks need to transition the
> state of the object the IO is being done for to completion. E.g. for a shared
> buffer read that means setting BM_VALID / unsetting BM_IO_IN_PROGRESS.
>
> The main reason these callbacks exist is that they make it safe for a backend
> to issue non-blocking IO on buffers (see the deadlock section above). As any
> blocked backend can cause the IO to complete, the deadlock danger is gone.

So firstly this is all just awesome work and I have questions but I
don't want them to come across in any way as criticism or as a demand
for more work. This is really great stuff, thank you so much!

The callbacks make me curious about two questions:

1) Is there a chance that a backend issues i/o, the i/o completes in
some other backend and by the time this backend gets around to looking
at the buffer it's already been overwritten again? Do we have to
initiate I/O again or have you found a way to arrange that this
backend has the buffer pinned from the time the i/o starts even though
it doesn't handle the completion?

2) Have you made (or considered making) things like sequential scans
(or more likely bitmap index scans) asynchronous at a higher level.
That is, issue a bunch of asynchronous i/o and then handle the pages
and return the tuples as the pages arrive. Since sequential scans and
bitmap scans don't guarantee to read the pages in order they're
generally free to return tuples from any page in any order. I'm not
sure how much of a win that would actually be since all the same i/o
would be getting executed and the savings in shared buffers would be
small but if there are mostly hot pages you could imagine interleaving
a lot of in-memory pages with the few i/os instead of sitting idle
waiting for the async i/o to return.



> ## Stats
>
> There are two new views: pg_stat_aios showing AIOs that are currently
> in-progress, pg_stat_aio_backends showing per-backend statistics about AIO.

This is impressive. How easy is it to correlate with system aio stats?


--
greg



Re: Asynchronous and "direct" IO support for PostgreSQL.

Andres Freund
Hi,

On 2021-02-23 14:58:32 -0500, Greg Stark wrote:
> So firstly this is all just awesome work and I have questions but I
> don't want them to come across in any way as criticism or as a demand
> for more work.

I posted it to get argued with ;).


> The callbacks make me curious about two questions:
>
> 1) Is there a chance that a backend issues i/o, the i/o completes in
> some other backend and by the time this backend gets around to looking
> at the buffer it's already been overwritten again? Do we have to
> initiate I/O again or have you found a way to arrange that this
> backend has the buffer pinned from the time the i/o starts even though
> it doesn't handle the completion?

The initiator of the IO can just keep a pin for the buffer to prevent
that.

There's a lot of complexity around how to handle pinning and locking
around asynchronous buffer IO. I plan to send a separate email with
more details.

In short: I made it so that a shared buffer IO holds a separate
refcount for the duration of the IO - that way the issuer can release
its own pin without causing a problem (consider e.g. an error while an
IO is in flight). The pin held by the IO gets released in the completion
callback.  There's similar trickery with locks - but more about that later.


> 2) Have you made (or considered making) things like sequential scans
> (or more likely bitmap index scans) asynchronous at a higher level.
> That is, issue a bunch of asynchronous i/o and then handle the pages
> and return the tuples as the pages arrive. Since sequential scans and
> bitmap scans don't guarantee to read the pages in order they're
> generally free to return tuples from any page in any order. I'm not
> sure how much of a win that would actually be since all the same i/o
> would be getting executed and the savings in shared buffers would be
> small but if there are mostly hot pages you could imagine interleaving
> a lot of in-memory pages with the few i/os instead of sitting idle
> waiting for the async i/o to return.

I have not. Mostly because it'd basically break the entire regression
test suite. And probably a lot of user expectations (yes,
synchronize_seqscans exists, but it pretty rarely triggers).

I'm not sure how big the win would be - the readahead for heapam.c that
is in the patch set tries to keep ahead of the "current scan position"
by a certain amount - so it'll issue the reads for the "missing" pages
before they're needed. Hopefully early enough to avoid unnecessary
stalls. But I'm pretty sure there'll be cases where that'd require a
prohibitively long "readahead distance".

I think there's a lot of interesting things along these lines that we
could tackle, but since they involve changing results and/or larger
changes to avoid those (e.g. detecting when sequential scan order isn't
visible to the user) I think it'd make sense to separate them from the
aio patchset itself.

If we had the infrastructure to detect whether seqscan order matters, we
could also switch to the tuple-iteration order to "backwards" in
heapgetpage() - right now iterating forward in "itemid order" causes a
lot of cache misses because the hardware prefetcher doesn't predict that
the tuples are layed out "decreasing pointer order".


> > ## Stats
> >
> > There are two new views: pg_stat_aios showing AIOs that are currently
> > in-progress, pg_stat_aio_backends showing per-backend statistics about AIO.
>
> This is impressive. How easy is it to correlate with system aio stats?

Could you say a bit more about what you are trying to correlate?

Here's some example IOs from pg_stat_aios.

┌─[ RECORD 1 ]─┬───────────────────────────────────────────────────────────────────────────────────────────────────┐
│ backend_type │ client backend                                                                                    │
│ id           │ 98                                                                                                │
│ gen          │ 13736                                                                                             │
│ op           │ write                                                                                             │
│ scb          │ sb                                                                                                │
│ flags        │ PGAIOIP_IN_PROGRESS | PGAIOIP_INFLIGHT                                                            │
│ ring         │ 5                                                                                                 │
│ owner_pid    │ 1866051                                                                                           │
│ merge_with   │ (null)                                                                                            │
│ result       │ 0                                                                                                 │
│ desc         │ fd: 38, offset: 329588736, nbytes: 8192, already_done: 0, release_lock: 1, buffid: 238576         │

├─[ RECORD 24 ]┼───────────────────────────────────────────────────────────────────────────────────────────────┤
│ backend_type │ checkpointer                                                                                  │
│ id           │ 1501                                                                                          │
│ gen          │ 15029                                                                                         │
│ op           │ write                                                                                         │
│ scb          │ sb                                                                                            │
│ flags        │ PGAIOIP_IN_PROGRESS | PGAIOIP_INFLIGHT                                                        │
│ ring         │ 3                                                                                             │
│ owner_pid    │ 1865275                                                                                       │
│ merge_with   │ 1288                                                                                          │
│ result       │ 0                                                                                             │
│ desc         │ fd: 24, offset: 105136128, nbytes: 8192, already_done: 0, release_lock: 1, buffid: 202705     │

├─[ RECORD 31 ]┼───────────────────────────────────────────────────────────────────────────────────────────────────┤
│ backend_type │ walwriter                                                                                         │
│ id           │ 90                                                                                                │
│ gen          │ 26498                                                                                             │
│ op           │ write                                                                                             │
│ scb          │ wal                                                                                               │
│ flags        │ PGAIOIP_IN_PROGRESS | PGAIOIP_INFLIGHT                                                            │
│ ring         │ 5                                                                                                 │
│ owner_pid    │ 1865281                                                                                           │
│ merge_with   │ 181                                                                                               │
│ result       │ 0                                                                                                 │
│ desc         │ write_no: 17, fd: 12, offset: 6307840, nbytes: 1048576, already_done: 0, bufdata: 0x7f94dd670000  │
└──────────────┴───────────────────────────────────────────────────────────────────────────────────────────────────┘


And the per-backend AIO view shows information like this (too wide to
display all cols):

SELECT
    pid, last_context, backend_type,
    executed_total_count,issued_total_count,submissions_total_count, inflight_count
FROM pg_stat_aio_backends sab JOIN pg_stat_activity USING (pid);
┌─────────┬──────────────┬──────────────────────────────┬──────────────────────┬────────────────────┬─────────────────────────┬────────────────┐
│   pid   │ last_context │         backend_type         │ executed_total_count │ issued_total_count │ submissions_total_count │ inflight_count │
├─────────┼──────────────┼──────────────────────────────┼──────────────────────┼────────────────────┼─────────────────────────┼────────────────┤
│ 1865291 │            3 │ logical replication launcher │                    0 │                  0 │                       0 │              0 │
│ 1865296 │            0 │ client backend               │                   85 │                 85 │                      85 │              0 │
│ 1865341 │            7 │ client backend               │              9574416 │            2905321 │                  345642 │              0 │
...
│ 1873501 │            3 │ client backend               │                 3565 │               3565 │                     467 │             10 │
...
│ 1865277 │            0 │ background writer            │               695402 │             575906 │                   13513 │              0 │
│ 1865275 │            3 │ checkpointer                 │              4664110 │            3488530 │                 1399896 │              0 │
│ 1865281 │            3 │ walwriter                    │                77203 │               7759 │                    7747 │              3 │
└─────────┴──────────────┴──────────────────────────────┴──────────────────────┴────────────────────┴─────────────────────────┴────────────────┘

It's not super obvious at this point, but executed_total_count /
issued_total_count shows the rate at which IOs have been merged - e.g. for
pid 1865341 above, 9574416 / 2905321 ≈ 3.3 block-level IOs per merged IO.
issued_total_count / submissions_total_count shows how many (already
merged) IOs were submitted together in one "submission batch" (for
io_method=io_uring, io_uring_enter()) - here 2905321 / 345642 ≈ 8.4.


So pg_stat_aios should allow enriching some system stats - e.g. by being
able to split out WAL writes from data file writes. And - except that
the current callbacks for that aren't great - it should even allow splitting
split the IO by different relfilenodes etc.


I assume pg_stat_aio_backends can also be helpful, e.g. by seeing which
backends currently have the deep IO queues that cause latency delays in other
backends, and which backends do a lot of sequential IO (high
executed_total_count / issued_total_count) and which only random...

Greetings,

Andres Freund



Re: Asynchronous and "direct" IO support for PostgreSQL.

Thomas Munro
In reply to this post by Andres Freund
On Tue, Feb 23, 2021 at 11:03 PM Andres Freund <[hidden email]> wrote:
> over the last ~year I spent a lot of time trying to figure out how we could
> add AIO (asynchronous IO) and DIO (direct IO) support to postgres. While
> there's still a *lot* of open questions, I think I now have a decent handle on
> most of the bigger architectural questions.  Thus this long email.

Hello,

Very cool to see this project escaping onto -hackers!

I have done some work on a couple of low level parts of it, and I
wanted to show a quick "hey, where'd my system calls go?" demo, which
might help illustrate some very simple things about this stuff.  Even
though io_uring is the new hotness in systems programming, I'm going
to use io_method=worker here.  It's the default in the current patch
set, it works on all our supported OSes and is easier to understand
without knowledge of shiny new or obscure old AIO system interfaces.
I'll also use io_workers=1, an artificially low setting to make it
easy to spy on (pseudo) async I/O with strace/truss/dtruss on a single
process, and max_parallel_workers_per_gather=0 to keep executor
parallelism from confusing matters.

The first thing to notice is that there's an "io worker" process, and
while filling up a big table with "insert into t select
generate_series(1, 100000000)", it's doing a constant stream of 128KB
pwritev() calls.  These are writing out 16 blocks from shared buffers
at a time:

  pwritev(44, [{iov_base=..., iov_len=73728},
               {iov_base=..., iov_len=24576},
               {iov_base=..., iov_len=32768}], 3, 228032512) = 131072

The reason there are 3 vectors there rather than 16 is just that some
of the buffers happened to be adjacent in memory and we might as well
use the smallest number of vectors.  Just after we've started up and
the buffer pool is empty, it's easy to find big single vector I/Os,
but things soon get more fragmented (blocks adjacent on disk become
less likely to be adjacent in shared buffers) and that number goes up,
but that shouldn't make much difference to the OS or hardware assuming
decent scatter/gather support through the stack.  If io_data_direct=on
(not the default) and the blocks are in one physical extent on the
file system, that might even go all the way down to the disk as a
single multi-segment write command for the storage hardware DMA engine
to beam directly in/out of our buffer pool without CPU involvement.
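
To make the 3-versus-16 point concrete, here is a small standalone sketch
(not code from the patch) of how adjacent 8KB buffers collapse into fewer
iovecs before a single preadv()/pwritev() call:

#include <sys/uio.h>

#define BLCKSZ 8192

/*
 * Build an iovec array for one preadv()/pwritev() call from an array of
 * 8KB buffers, merging buffers that happen to be adjacent in memory.
 * Returns the number of vectors actually needed.
 */
static int
build_iovecs(char *bufs[], int nbufs, struct iovec *iov)
{
    int         niov = 0;

    for (int i = 0; i < nbufs; i++)
    {
        if (niov > 0 &&
            (char *) iov[niov - 1].iov_base + iov[niov - 1].iov_len == bufs[i])
        {
            /* adjacent in memory: extend the previous vector */
            iov[niov - 1].iov_len += BLCKSZ;
        }
        else
        {
            /* not adjacent: start a new vector */
            iov[niov].iov_base = bufs[i];
            iov[niov].iov_len = BLCKSZ;
            niov++;
        }
    }

    return niov;    /* e.g. 3 vectors for 16 blocks, as in the trace above */
}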

Mixed into that stream of I/O worker system calls, you'll also see WAL
going out to disk:

  pwritev(15, [{iov_base=..., iov_len=1048576}], 1, 4194304) = 1048576

Meanwhile, the user session process running the big INSERT can be seen
signalling the I/O worker to wake it up.  The same thing happens for
bgwriter, checkpointer, autovacuum and walwriter: you can see them all
handing off most of their I/O work to the pool of I/O workers, with a
bit of new signalling going on (which we try to minimise, and can
probably minimise much more).  (You might be able to see some evidence
of Andres's new buffer cleaning scheme too, which avoids some bad
patterns of interleaving small reads and writes, but I'm skipping
right over here...)

Working through a very simple example of how the I/O comes to be
consolidated and parallelised, let's look at a simple non-parallel
SELECT COUNT(*) query on a large table.  The I/O worker does a stream
of scattered reads into our buffer pool:

  preadv(51, [{iov_base=..., iov_len=24576},
              {iov_base=..., iov_len=8192},
              {iov_base=..., iov_len=16384},
              {iov_base=..., iov_len=16384},
              {iov_base=..., iov_len=16384},
              {iov_base=..., iov_len=49152}], 6, 190808064) = 131072

Meanwhile our user session backend can be seen waking it up whenever
it's trying to start I/O and finds it snoozing:

  kill(1803835, SIGURG)                   = 0
  kill(1803835, SIGURG)                   = 0
  kill(1803835, SIGURG)                   = 0
  kill(1803835, SIGURG)                   = 0
  kill(1803835, SIGURG)                   = 0

Notice that there are no sleeping system calls in the query backend,
meaning the I/O in this example is always finished by the time the
executor gets around to accessing the page it requested, so we're
staying far enough ahead and we can be 100% CPU bound.  In unpatched
PostgreSQL we'd hope to have no actual sleeping in such a simple case
anyway, thanks to the OS's readahead heuristics; but (1) we'd still do
individual pread(8KB) calls, meaning that the user's query is at least
having to pay the CPU cost of a return trip into the kernel and a
copyout of 8KB from kernel space to user space, here avoided, (2) in
io_data_direct=on mode, there's no page cache and thus no kernel read
ahead, so we need to replace that mechanism with something anyway, (3)
it's needed for non-sequential access like btree scans.

Sometimes I/Os are still run in user backends, for example because (1)
existing non-AIO code paths are still reached, (2) in worker mode,
some kinds of I/Os can't be handed off to another process due to lack
of a way to open some fds or because we're in single process mode, (3)
because a heuristic kicks in when we know there's only one I/O to run
and we know we'll immediately wait for it and we can skip a lot of
communication with a traditional synchronous syscall (worker mode only
for now, needs to be done for others).

In order to be able to generate a stream of big vectored reads/writes,
and start them far enough ahead of time that they're finished before
we need the data, there are several layers of new infrastructure that
Andres already mentioned and can explain far better than I, but super
briefly:

heapam.c uses a "pg_streaming_read" object (aio_util.c) to get buffers
to scan, instead of directly calling ReadBuffer().  It gives the
pg_streaming_read a callback of its own, so that heapam.c remains in
control of what is read, but the pg_streaming_read is in control of
readahead distance and also "submission".  heapam.c's callback calls
ReadBufferAsync() to initiate reads of pages that it will need soon,
which it does with pgaio_io_start_read_sb() if there's a cache miss.
This results in 8KB reads queued up in the process's pending I/O list,
with pgaio_read_sb_complete as the completion function to run when
each read has eventually completed.  When the pending list grows to a
certain length, it is submitted by pg_streaming_read code.  That
involves first "combining" pending I/Os: this is where read/write of
adjacent ranges of files are merged into larger I/Os up to a limit.
Then the I/Os are submitted to the OS, and we'll eventually learn
about their completion, via io_method-specific means (see
aio_worker.c, aio_uring.c, aio_posix.c and one day probably also
aio_win32.c).  At that point, merged I/Os will be uncombined.
Skipping over some complication about retrying on some kinds of
failure/partial I/O, that leads to ReadBufferCompleteWrite() being
called for each buffer.  (Far be it from me to try to explain the
rather complex interlocking required to deal with pins and locks
between ReadBufferAsync() and ReadBufferCompleteWrite() in
(potentially) another process while the I/O is in progress, at this
stage.)

Places in the tree that want to do carefully controlled I/O depth
management can consider using pg_streaming_{read,write}, providing
their own callback to do the real work (though it's not necessary, and
not all AIO uses suit the "streaming" model).  There's also the
traditional PrefetchBuffer() mechanism, which can still be used to
initiate buffer reads as before.  It's comparatively primitive; since
you don't know when the I/O completes, you have to use conservative
models as I do in my proposed WAL prefetching patch.  That patch (like
probably many others like CF #2799) works just fine on top of the AIO
branch, with some small tweaks: it happily shifts all I/O system calls
out of the recovery process, so that instead of calling
posix_fadvise() and then a bit later pread() for each cold page
accessed, it makes one submission system call for every N cold pages
(or, in some cases, no system calls at all).  A future better
integration would probably use pg_streaming_read for precise control
of the I/O depth instead of the little LSN queue it currently uses,
but I haven't tried to write that yet.
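
For contrast, the traditional pattern looks roughly like the sketch below.
PrefetchBuffer(), ReadBuffer() and ReleaseBuffer() are the existing bufmgr
APIs, but the fixed distance and the surrounding function are simplified
illustrations rather than real tree code:

#include "postgres.h"
#include "storage/bufmgr.h"

#define PREFETCH_DISTANCE 32    /* illustrative, conservatively chosen */

/* Sketch only: issue advisory prefetches a fixed distance ahead of the
 * scan; there is no way to know when (or whether) the prefetched IO has
 * finished before ReadBuffer() is called. */
static void
scan_with_prefetch(Relation rel, BlockNumber nblocks)
{
    for (BlockNumber blkno = 0; blkno < nblocks; blkno++)
    {
        /* hint that we'll need a block a fixed distance ahead */
        if (blkno + PREFETCH_DISTANCE < nblocks)
            PrefetchBuffer(rel, MAIN_FORKNUM, blkno + PREFETCH_DISTANCE);

        /* may still block on IO if the prefetched data hasn't arrived */
        Buffer      buf = ReadBuffer(rel, blkno);

        /* ... process the page ... */
        ReleaseBuffer(buf);
    }
}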

If you do simple large INSERTs and SELECTs with one of the native
io_method settings instead of worker mode, it'd be much the same, in
terms of most of the architecture.  The information in the pg_stat_XXX
views is almost exactly the same.  There are two major differences:
(1) the other methods have no I/O worker processes, because the kernel
manages the I/O (or in some unfortunate cases runtime libraries fake
it with threads), (2) the "shared completion callbacks" (see
aio_scb.c) are run by I/O workers in worker mode, but are run by
whichever process "drains" the I/O in the other modes.  That is,
initiating processes never hear about I/Os completing from the
operating system, they just eventually wait on them and find that
they're already completed (though they do run the "local callback" if
there there is one, which is for example the point at which eg
pg_streaming_read might initiate more I/O), or alternatively see that
they aren't, and wait on a condition variable for an I/O worker to
signal completion.  So far this seems like a good choice...

Hope that helped show off a couple of features of this scheme.



Re: Asynchronous and "direct" IO support for PostgreSQL.

dataegret
In reply to this post by Andres Freund
Hi,

Thank you for the amazing and great work.

On 23.02.2021 15:03, Andres Freund wrote:
> ## Stats
>
> There are two new views: pg_stat_aios showing AIOs that are currently
> in-progress, pg_stat_aio_backends showing per-backend statistics about AIO.

As a DBA I would like to propose a few amendments that might help with
practical usage of the stats when the feature is finally implemented. My
suggestions aren’t related to the central idea of the proposed changes,
but rather to the stats part.

A quick side note, there are two terms in Prometheus
(https://prometheus.io/docs/concepts/metric_types/):
1. Counter. A counter is a cumulative metric that represents a single
monotonically increasing counter whose value can only increase or be
reset to zero on restart.
2. Gauge. A gauge is a metric that represents a single numerical value
that can arbitrarily go up and down.

For the purposes of long-term stats collection, COUNTERs are preferred
over GAUGEs, because COUNTERs allow us to understand how metrics change
over time without missing potential spikes in activity. As a
result, we have a much better historic perspective.

Measuring and collecting GAUGEs is limited to the moments in time when
the stats are taken (snapshots) so the changes that took place between
the snapshots remain unmeasured. In systems with a high rate of
transactions per second (even with a 1 second interval between snapshots),
GAUGE measurements won't provide the full picture.  In addition, most of
the monitoring systems like Prometheus, Zabbix, etc. use longer
intervals (from 10-15 to 60 seconds).

The main idea is to try to expose almost all numeric stats as COUNTERs -
this increases the overall observability of the implemented feature.

pg_stat_aios.
In general, this stat is a set of text values, and at the same time it
looks GAUGE-like (similar to pg_stat_activity or pg_locks), and is only
relevant for the moment when the user is looking at it. I think it would
be better to rename this view to pg_stat_progress_aios. And keep
pg_stat_aios for other AIO stats with global COUNTERs (like stuff in
pg_stat_user_tables or pg_stat_statements, or system-wide /proc/stat,
/proc/diskstats).

pg_stat_aio_backends.
This stat is based on COUNTERs, which is great, but the issue here is
that its lifespan is limited by the lifespan of the backend processes -
once the backend exits the stat will no longer be available - which
could be inappropriate in workloads with short-lived backends.

I think there might be a few existing examples in the current code that
could be repurposed to implement the suggestions above (such as
pg_stat_user_tables, pg_stat_database, etc). With this in mind, I think
having these changes incorporated shouldn’t take significant effort
considering the benefit it will bring to the final user.

Once again, huge respect for your work on these changes, and good luck.

Regards, Alexey




Re: Asynchronous and "direct" IO support for PostgreSQL.

Greg Stark
I guess what I would be looking for in stats would be a way to tell
what the bandwidth, latency, and queue depth is. Ideally one day
broken down by relation/index and pg_stat_statement record.

I think seeing the actual in flight async requests in a connection is
probably not going to be very useful in production. It's very low
level and in production the user is just going to find that
overwhelming detail. It is kind of cool to see the progress in
sequential operations but I think that should be solved in a higher
level way than this anyways.

What we need to calculate these values would be the kinds of per-op
stats nfsiostat uses from /proc/self/mountstats:
https://utcc.utoronto.ca/~cks/space/blog/linux/NFSMountstatsNFSOps

So number of async reads we've initiated, how many callbacks have been
called, total cumulative elapsed time between i/o issued and i/o
completed, total bytes of i/o initiated, total bytes of i/o completed.
As well as a counter of requests which returned errors (eof? i/o error?),
and, if there are other locks or queues internal to postgres, total time
spent in those states.

I have some vague idea that we should have a generic infrastructure
for stats that automatically counts things associated with plan nodes
and automatically bubbles that data up to the per-transaction,
per-backend, per-relation, and pg_stat_statements stats. But that's a
whole other ball of wax :)



Re: Asynchronous and "direct" IO support for PostgreSQL.

Dmitry Dolgov
In reply to this post by Andres Freund
> On Tue, Feb 23, 2021 at 02:03:44AM -0800, Andres Freund wrote:
>
> over the last ~year I spent a lot of time trying to figure out how we could
> add AIO (asynchronous IO) and DIO (direct IO) support to postgres. While
> there's still a *lot* of open questions, I think I now have a decent handle on
> most of the bigger architectural questions.  Thus this long email.
>
> Just to be clear: I don't expect the current design to survive as-is. If
> there's a few sentences below that sound a bit like describing the new world,
> that's because they're from the README.md in the patch series...

Thanks!

> Comments? Questions?
>
> I plan to send separate emails about smaller chunks of this - the whole
> topic is just too big. In particular I plan to send something around
> buffer locking / state management - it's one of the core issues around
> this imo.

I'm curious about control knobs for this feature, it's somewhat related
to the stats questions also discussed in this thread. I guess most
important of those are max_aio_in_flight, io_max_concurrency etc, and
they're going to be hard limits, right? I'm curious if it makes sense
to explore possibility to have these sort of "backpressure", e.g. if
number of inflight requests is too large calculate inflight_limit a bit
lower than possible (to avoid hard performance deterioration when the db
is trying to do too much IO, and rather do it smooth). From what I
remember io_uring does have something similar only for SQPOLL. Another
similar question if this could be used for throttling of some overloaded
workers in case of misconfigured clients or such?



Re: Asynchronous and "direct" IO support for PostgreSQL.

Andres Freund
In reply to this post by dataegret
Hi,

On 2021-02-24 21:15:14 +0500, Alexey Lesovsky wrote:
> pg_stat_aios.
> In general, this stat is a set of text values, and at the same time it looks
> GAUGE-like (similar to pg_stat_activity or pg_locks), and is only relevant
> for the moment when the user is looking at it. I think it would be better to
> rename this view to pg_stat_progress_aios. And keep pg_stat_aios for other
> AIO stats with global COUNTERs (like stuff in pg_stat_user_tables or
> pg_stat_statements, or system-wide /proc/stat, /proc/diskstats).

Right - arguably it really shouldn't even have _stat_ in the name... I
don't particularly like the idea of adding _progress_ as it seems that'd
lead to confusing it with pg_stat_progress_vacuum etc - and it's quite a
different beast.


> pg_stat_aio_backends.
> This stat is based on COUNTERs, which is great, but the issue here is that
> its lifespan is limited by the lifespan of the backend processes - once the
> backend exits the stat will no longer be available - which could be
> inappropriate in workloads with short-lived backends.

There's a todo somewhere to roll over the per-connection stats into a
global stats piece on disconnect. In addition I was thinking of adding a
view that sums up the value of "already disconnected backends" and the
currently connected ones.  Would that mostly address your concerns?


> I think there might be a few existing examples in the current code that could
> be repurposed to implement the suggestions above (such as
> pg_stat_user_tables, pg_stat_database, etc). With this in mind, I think
> having these changes incorporated shouldn’t take significant effort
> considering the benefit it will bring to the final user.

Yea - I kind of was planning to go somewhere roughly in the direction
you suggest, but took a few shortcuts due to the size of the
project. Having the views made it a lot easier to debug / develop, but
supporting longer lived stuff wasn't yet crucial. But I agree, we really
should have it...

Greetings,

Andres Freund



Re: Asynchronous and "direct" IO support for PostgreSQL.

Andres Freund
In reply to this post by Greg Stark
Hi,

On 2021-02-24 14:59:19 -0500, Greg Stark wrote:
> I guess what I would be looking for in stats would be a way to tell
> what the bandwidth, latency, and queue depth is. Ideally one day
> broken down by relation/index and pg_stat_statement record.

I think doing it at that granularity will likely be too expensive...


> I think seeing the actual in flight async requests in a connection is
> probably not going to be very useful in production.

I think it's good for analyzing concrete performance issues, but
probably not that much more. Although, it's not too hard to build
sampling based on top of it with a tiny bit of work (should display the
relfilenode etc).


> So number of async reads we've initiated, how many callbacks have been
> called, total cumulative elapsed time between i/o issued and i/o
> completed, total bytes of i/o initiated, total bytes of i/o completed.

Much of that is already in pg_stat_aio_backends - but is lost after
disconnect (easy to solve). We don't track bytes of IO currently, but
that'd not be hard.

However, it's surprisingly hard to do the measurement between "issued"
and "completed" in a meaningful way. It's obviously not hard to measure
the time at which the request was issued, but there's no real way to
determine the time at which it was completed. If a backend is busy doing
other things (e.g. invoke aggregate transition functions), we'll not see
the completion immediately, and therefore not have an accurate
timestamp.

With several methods of doing AIO we can set up signals that fire on
completion, but that's pretty darn expensive. And it's painful to write
such signal handlers in a safe way.


> I have some vague idea that we should have a generic infrastructure
> for stats that automatically counts things associated with plan nodes
> and automatically bubbles that data up to the per-transaction,
> per-backend, per-relation, and pg_stat_statements stats. But that's a
> whole other ball of wax :)

Heh, yea, let's tackle that separately ;)

Greetings,

Andres Freund



Re: Asynchronous and "direct" IO support for PostgreSQL.

Andres Freund
In reply to this post by Dmitry Dolgov
Hi,

On 2021-02-24 21:41:16 +0100, Dmitry Dolgov wrote:
> I'm curious about control knobs for this feature, it's somewhat related
> to the stats questions also discussed in this thread. I guess most
> important of those are max_aio_in_flight, io_max_concurrency etc, and
> they're going to be hard limits, right?

Yea - there's a lot more work needed in that area.

io_max_concurrency especially really should be a GUC, I was just too
lazy for that so far.


> I'm curious if it makes sense
> to explore possibility to have these sort of "backpressure", e.g. if
> number of inflight requests is too large calculate inflight_limit a bit
> lower than possible (to avoid hard performance deterioration when the db
> is trying to do too much IO, and rather do it smooth).

It's decidedly nontrivial to compute "too large" - and pretty workload
dependent (e.g. lower QDs are better for latency sensitive OLTP, higher QD
is better for bulk r/w heavy analytics). So I don't really want to go
there for now - the project is already very large.

What I do think is needed and feasible (there's a bunch of TODOs in the
code about it already) is to be better at only utilizing deeper queues
when lower queues don't suffice. So we e.g. don't read ahead more than a
few blocks for a scan where the query is spending most of the time
"elsewhere.

There's definitely also some need for a bit better global, instead of
per-backend, control over the number of IOs in flight. That's not too
hard to implement - the hardest probably is to avoid it becoming a
scalability issue.

I think the area with the most need for improvement is figuring out how
we determine the queue depths for different things using IO. Don't
really want to end up with 30 parameters influencing what queue depth to
use for (vacuum, index builds, sequential scans, index scans, bitmap
heap scans, ...) - but how much they benefit from a deeper queue will differ
between places.


> From what I remember io_uring does have something similar only for
> SQPOLL. Another similar question if this could be used for throttling
> of some overloaded workers in case of misconfigured clients or such?

You mean dynamically? Or just by setting the concurrency lower for
certain users? I think doing so dynamically is way too complicated for
now. But I'd expect configuring it on a per-user basis or such to be a
reasonable thing. That might require splitting it into two GUCs - one
SUSET one and a second one that's settable by any user, but can only
lower the depth.

I think it'll be pretty useful to e.g. configure autovacuum to have a
low queue depth instead of using the current cost limiting. That way the
impact on the overall system is limited, but it's not slowed down
unnecessarily as much.

Greetings,

Andres Freund



Re: Asynchronous and "direct" IO support for PostgreSQL.

Dmitry Dolgov
> On Wed, Feb 24, 2021 at 01:45:10PM -0800, Andres Freund wrote:
>
> > I'm curious if it makes sense
> > to explore possibility to have these sort of "backpressure", e.g. if
> > number of inflight requests is too large calculate inflight_limit a bit
> > lower than possible (to avoid hard performance deterioration when the db
> > is trying to do too much IO, and rather do it smooth).
>
> What I do think is needed and feasible (there's a bunch of TODOs in the
> code about it already) is to be better at only utilizing deeper queues
> when lower queues don't suffice. So we e.g. don't read ahead more than a
> few blocks for a scan where the query is spending most of the time
> "elsewhere.
>
> There's definitely also some need for a bit better global, instead of
> per-backend, control over the number of IOs in flight. That's not too
> hard to implement - the hardest probably is to avoid it becoming a
> scalability issue.
>
> I think the area with the most need for improvement is figuring out how
> we determine the queue depths for different things using IO. Don't
> really want to end up with 30 parameters influencing what queue depth to
> use for (vacuum, index builds, sequential scans, index scans, bitmap
> heap scans, ...) - but how much they benefit from a deeper queue will differ
> between places.

Yeah, sounds like an interesting opportunity for improvements. I'm
preparing a few benchmarks to better understand how this all works, so
will keep this in mind.

> > From what I remember io_uring does have something similar only for
> > SQPOLL. Another similar question if this could be used for throttling
> > of some overloaded workers in case of misconfigured clients or such?
>
> You mean dynamically? Or just by setting the concurrency lower for
> certain users? I think doing so dynamically is way too complicated for
> now. But I'd expect configuring it on a per-user basis or such to be a
> reasonable thing. That might require splitting it into two GUCs - one
> SUSET one and a second one that's settable by any user, but can only
> lower the depth.
>
> I think it'll be pretty useful to e.g. configure autovacuum to have a
> low queue depth instead of using the current cost limiting. That way the
> impact on the overall system is limited, but it's not slowed down
> unnecessarily as much.

Yes, you got it right, not dynamically, but rather expose this to be
configured on e.g. a per-user basis.



Re: Asynchronous and "direct" IO support for PostgreSQL.

dataegret
In reply to this post by Andres Freund
Hi,

On 25.02.2021 02:03, Andres Freund wrote:
>> pg_stat_aio_backends.
>> This stat is based on COUNTERs, which is great, but the issue here is that
>> its lifespan is limited by the lifespan of the backend processes - once the
>> backend exits the stat will no longer be available - which could be
>> inappropriate in workloads with short-lived backends.
> There's a todo somewhere to roll over the per-connection stats into a
> global stats piece on disconnect. In addition I was thinking of adding a
> view that sums up the value of "already disconnected backends" and the
> currently connected ones.  Would that mostly address your concerns?

Yes, the approach with separated stats for live and disconnected backends
looks good and solves the problem with "stats loss".

Or it could be done like the stats for shared objects in pg_stat_database,
where a special NULL database is used.

Regards, Alexey