mirror of
https://github.com/postgres/postgres.git
synced 2025-04-18 13:44:19 +03:00
aio: Add README.md explaining higher level design
Reviewed-by: Noah Misch <noah@leadboat.com>
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Discussion: https://postgr.es/m/uvrtrknj4kdytuboidbhwclo4gxhswwcpgadptsjvjqcluzmah%40brqs62irg4dt
Discussion: https://postgr.es/m/20210223100344.llw5an2aklengrmn@alap3.anarazel.de
Discussion: https://postgr.es/m/stj36ea6yyhoxtqkhpieia2z4krnam7qyetc57rfezgk4zgapf@gcnactj4z56m
parent 5aec7e07fb
commit fdd146a8ef
424
src/backend/storage/aio/README.md
Normal file
@ -0,0 +1,424 @@
# Asynchronous & Direct IO

## Motivation

### Why Asynchronous IO

Until the introduction of asynchronous IO, postgres relied on the operating
system to hide the cost of synchronous IO. While this worked surprisingly
well for a lot of workloads, the operating system does not do as good a job
of prefetching and controlled writeback as we would like.

There are important, expensive operations like `fdatasync()` where the
operating system cannot hide the storage latency. This is particularly
important for WAL writes, where the ability to asynchronously issue
`fdatasync()` or `O_DSYNC` writes can yield significantly higher throughput.


### Why Direct / unbuffered IO

The main reasons to want to use Direct IO are:

- Lower CPU usage / higher throughput. Particularly on modern storage,
  buffered IO is bottlenecked by the operating system having to copy data
  between the kernel's page cache and postgres' buffer pool using the CPU.
  Direct IO, in contrast, can often move the data directly between the
  storage devices and postgres' buffer cache, using DMA. While that transfer
  is ongoing, the CPU is free to perform other work.
- Reduced latency - Direct IO can have substantially lower latency than
  buffered IO, which can be impactful for OLTP workloads bottlenecked by WAL
  write latency.
- Avoiding double buffering between operating system cache and postgres'
  shared_buffers.
- Better control over the timing and pace of dirty data writeback.


The main reasons *not* to use Direct IO are:

- Without AIO, Direct IO is unusably slow for most purposes.
- Even with AIO, many parts of postgres need to be modified to perform
  explicit prefetching.
- In situations where shared_buffers cannot be set appropriately large,
  e.g. because there are many different postgres instances hosted on shared
  hardware, performance will often be worse than when using buffered IO.


## AIO Usage Example

In many cases code that can benefit from AIO does not directly have to
interact with the AIO interface, but can use AIO via higher-level
abstractions. See [Helpers](#helpers).

In this example, a buffer will be read into shared buffers.

```C
/*
 * Result of the operation, only to be accessed in this backend.
 */
PgAioReturn ioret;

/*
 * Acquire an AIO Handle, ioret will get result upon completion.
 *
 * Note that ioret needs to stay alive until the IO completes or
 * CurrentResourceOwner is released (i.e. an error is thrown).
 */
PgAioHandle *ioh = pgaio_io_acquire(CurrentResourceOwner, &ioret);

/*
 * Reference that can be used to wait for the IO we initiate below. This
 * reference can reside in local or shared memory and waited upon by any
 * process. An arbitrary number of references can be made for each IO.
 */
PgAioWaitRef iow;

pgaio_io_get_wref(ioh, &iow);

/*
 * Arrange for shared buffer completion callbacks to be called upon completion
 * of the IO. This callback will update the buffer descriptors associated with
 * the AioHandle, which e.g. allows other backends to access the buffer.
 *
 * A callback can be passed a small bit of data, e.g. to indicate whether to
 * zero a buffer if it is invalid.
 *
 * Multiple completion callbacks can be registered for each handle.
 */
pgaio_io_register_callbacks(ioh, PGAIO_HCB_SHARED_BUFFER_READV, 0);

/*
 * The completion callback needs to know which buffers to update when the IO
 * completes. As the AIO subsystem does not know about buffers, we have to
 * associate this information with the AioHandle, for use by the completion
 * callback registered above.
 *
 * The function expects a pointer to an array of values, so the address of
 * the buffer variable is passed. In this example we're reading only a single
 * buffer, hence the 1.
 */
pgaio_io_set_handle_data_32(ioh, (uint32 *) &buffer, 1);

/*
 * Pass the AIO handle to lower-level function. When operating on the level of
 * buffers, we don't know how exactly the IO is performed, that is the
 * responsibility of the storage manager implementation.
 *
 * E.g. md.c needs to translate block numbers into offsets in segments.
 *
 * Once the IO handle has been handed off to smgrstartreadv(), it may not
 * further be used, as the IO may immediately get executed below
 * smgrstartreadv() and the handle reused for another IO.
 *
 * To issue multiple IOs in an efficient way, a caller can call
 * pgaio_enter_batchmode() before starting multiple IOs, and end that batch
 * with pgaio_exit_batchmode(). Note that one needs to be careful while there
 * may be unsubmitted IOs, as another backend may need to wait for one of the
 * unsubmitted IOs. If this backend then had to wait for the other backend,
 * it'd end in an undetected deadlock. See pgaio_enter_batchmode() for more
 * details.
 *
 * Note that even while in batchmode an IO might get submitted immediately,
 * e.g. due to reaching a limit on the number of unsubmitted IOs, and even
 * complete before smgrstartreadv() returns.
 *
 * smgrstartreadv() takes an array of page pointers, here of length 1.
 */
void *page = BufferGetBlock(buffer);

smgrstartreadv(ioh, operation->smgr, forknum, blkno,
               &page, 1);

/*
 * To benefit from AIO, other work should be performed, including
 * submitting other IOs, before waiting for the IO to complete. Otherwise
 * we could just have used synchronous, blocking IO.
 */
perform_other_work();

/*
 * We did some other work and now need the IO operation to have completed to
 * continue.
 */
pgaio_wref_wait(&iow);

/*
 * At this point the IO has completed. We do not yet know whether it succeeded
 * or failed, however. The buffer's state has been updated, which allows other
 * backends to use the buffer (if the IO succeeded), or retry the IO (if it
 * failed).
 *
 * Note that in case the IO has failed, a LOG message may have been emitted,
 * but no ERROR has been raised. This is crucial, as another backend waiting
 * for this IO should not see an ERROR.
 *
 * To check whether the operation succeeded, and to raise an ERROR, or if more
 * appropriate LOG, the PgAioReturn we passed to pgaio_io_acquire() is used.
 */
if (ioret.result.status == PGAIO_RS_ERROR)
    pgaio_result_report(ioret.result, &ioret.target_data, ERROR);

/*
 * Besides having succeeded completely, the IO could also have a) partially
 * completed or b) succeeded with a warning (e.g. due to zero_damaged_pages).
 * If we e.g. tried to read many blocks at once, the read might have
 * only succeeded for the first few blocks.
 *
 * If the IO partially succeeded and this backend needs all blocks to have
 * completed, this backend needs to reissue the IO for the remaining buffers.
 * The AIO subsystem cannot handle this retry transparently.
 *
 * As this example is already long, and we only read a single block, we'll just
 * error out if there's a partial read or a warning.
 */
if (ioret.result.status != PGAIO_RS_OK)
    pgaio_result_report(ioret.result, &ioret.target_data, ERROR);

/*
 * The IO succeeded, so we can use the buffer now.
 */
```


## Design Criteria & Motivation

### Deadlock and Starvation Dangers due to AIO

Using AIO in a naive way can easily lead to deadlocks in an environment where
the source/target of AIO are shared resources, like pages in postgres'
shared_buffers.

Consider one backend performing readahead on a table, initiating IO for a
number of buffers ahead of the current "scan position". If that backend then
performs some operation that blocks, or even just is slow, the IO completion
for the asynchronously initiated read may not be processed. Another backend
that needs one of those buffers would then be stuck waiting, and if the
issuing backend is in turn blocked on that other backend, neither can make
progress.

This AIO implementation solves this problem by requiring that AIO methods
either allow AIO completions to be processed by any backend in the system
(e.g. io_uring), or guarantee that AIO processing will happen even when the
issuing backend is blocked (e.g. worker mode, which offloads completion
processing to the AIO workers).


### IO can be started in critical sections

Using AIO for WAL writes can reduce the overhead of WAL logging substantially:

- AIO allows WAL writes to be started eagerly, so they complete before a
  backend needs to wait for them
- AIO allows multiple WAL flushes to be in progress at the same time
- AIO makes it more realistic to use O\_DIRECT + O\_DSYNC, which can reduce
  the number of roundtrips to storage on some OSs and storage HW (buffered IO
  and direct IO without O\_DSYNC need to issue a write and, after the write's
  completion, a cache flush, whereas O\_DIRECT + O\_DSYNC can use a single
  Force Unit Access (FUA) write).

The need to be able to execute IO in critical sections has substantial design
implications for the AIO subsystem, mainly because completing IOs (see prior
section) needs to be possible within a critical section, even if the
to-be-completed IO itself was not issued in a critical section. Consider
e.g. the case of a backend first starting a number of writes from shared
buffers and then starting to flush the WAL. Because only a limited amount of
IO can be in progress at the same time, initiating IO for flushing the WAL may
first require completing IO that was started earlier.


### State for AIO needs to live in shared memory

Because postgres uses a process model and because AIOs need to be
complete-able by any backend, much of the state of the AIO subsystem needs to
live in shared memory.

In an `EXEC_BACKEND` build, a backend's executable code and other
process-local state is not necessarily mapped to the same addresses in each
process due to ASLR. This means that shared memory cannot contain pointers to
callbacks.


## Design of the AIO Subsystem


### AIO Methods

To achieve portability and performance, multiple methods of performing AIO are
implemented and others are likely worth adding in the future.


#### Synchronous Mode

`io_method=sync` does not actually perform AIO but allows the AIO API to be
used while performing synchronous IO. This can be useful for debugging. The
code for the synchronous mode is also used as a fallback, e.g. the [worker
mode](#worker) uses it to execute IO that cannot be executed by workers.


#### Worker

`io_method=worker` is available on every platform postgres runs on, and
implements asynchronous IO - from the view of the issuing process - by
dispatching the IO to one of several worker processes performing the IO in a
synchronous manner.


#### io_uring

`io_method=io_uring` is available on Linux 5.1+. In contrast to worker mode,
it dispatches all IO from within the process, lowering context switch rate /
latency.


### AIO Handles

AIO handles are the central piece of postgres' AIO abstraction. To execute an
IO one first has to acquire an IO handle (`pgaio_io_acquire()`) and then
"define" it, i.e. associate an IO operation with the handle.

Often AIO handles are acquired on a higher level and then passed to a lower
level to be fully defined. E.g., for IO to/from shared buffers, bufmgr.c
routines acquire the handle, which is then passed through smgr.c, md.c to be
finally fully defined in fd.c.

The functions used at the lowest level to define the operation are
`pgaio_io_start_*()`.

Because acquisition of an IO handle
[must always succeed](#io-can-be-started-in-critical-sections)
and the number of AIO Handles
[has to be limited](#state-for-aio-needs-to-live-in-shared-memory),
AIO handles can be reused as soon as they have completed. Obviously code needs
to be able to react to IO completion. State can be updated using
[AIO Completion callbacks](#aio-callbacks)
and the issuing backend can provide a backend local variable to receive the
result of the IO, as described in
[AIO Results](#aio-results).
An IO can be waited for, by both the issuing and any other backend, using
[AIO Wait References](#aio-wait-references).


Because an AIO Handle is not executable just after calling
`pgaio_io_acquire()` and because `pgaio_io_acquire()` needs to always succeed
(absent a PANIC), only a single AIO Handle may be acquired (i.e. returned by
`pgaio_io_acquire()`) without causing the IO to have been defined (by,
potentially indirectly, causing `pgaio_io_start_*()` to have been
called). Otherwise a backend could trivially self-deadlock by using up all AIO
Handles without the ability to wait for some of the IOs to complete.

If it turns out that an AIO Handle is not needed, e.g., because the handle was
acquired before holding a contended lock, it can be released without being
defined using `pgaio_io_release()`.
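
The acquisition rule above can be illustrated with a toy model. This is only a sketch: every `demo_*` name, the three states and the pool size of two are assumptions made up for illustration and are not the data structures used by the AIO subsystem. It shows why acquisition can always make progress as long as at most one handle is acquired but not yet defined: whenever no handle is idle, all of them are in flight, so waiting for one frees a handle.

```c
#include <assert.h>

#define DEMO_NUM_HANDLES 2

/* Hypothetical handle states, far simpler than the real ones. */
typedef enum DemoHandleState
{
    DEMO_IDLE,                  /* free for acquisition */
    DEMO_HANDED_OUT,            /* acquired, but no IO defined yet */
    DEMO_IN_FLIGHT              /* IO defined and started */
} DemoHandleState;

DemoHandleState demo_handles[DEMO_NUM_HANDLES]; /* zero-initialized: idle */
int         demo_handed_out = -1;   /* the one undefined handle, if any */
int         demo_waits = 0;         /* how often acquisition had to wait */

/* Stand-in for waiting on an IO: completes the first in-flight handle. */
void
demo_wait_for_one(void)
{
    for (int i = 0; i < DEMO_NUM_HANDLES; i++)
    {
        if (demo_handles[i] == DEMO_IN_FLIGHT)
        {
            demo_handles[i] = DEMO_IDLE;
            demo_waits++;
            return;
        }
    }
    assert(0 && "no in-flight IO to wait for");
}

/* Always succeeds, possibly after waiting for an earlier IO. */
int
demo_acquire(void)
{
    /* The rule: at most one acquired-but-undefined handle at a time. */
    assert(demo_handed_out == -1);

    for (;;)
    {
        for (int i = 0; i < DEMO_NUM_HANDLES; i++)
        {
            if (demo_handles[i] == DEMO_IDLE)
            {
                demo_handles[i] = DEMO_HANDED_OUT;
                demo_handed_out = i;
                return i;
            }
        }
        /* Nothing is idle, so every handle is in flight and waiting helps. */
        demo_wait_for_one();
    }
}

/* "Define" the IO: the handle may complete and be reused from here on. */
void
demo_start(int handle)
{
    assert(handle == demo_handed_out);
    demo_handles[handle] = DEMO_IN_FLIGHT;
    demo_handed_out = -1;
}

/* Give back a handle that turned out not to be needed. */
void
demo_release(int handle)
{
    assert(handle == demo_handed_out);
    demo_handles[handle] = DEMO_IDLE;
    demo_handed_out = -1;
}
```

In this model, calling `demo_acquire()` twice without an intervening `demo_start()` or `demo_release()` trips the assertion, which mirrors the self-deadlock the rule is meant to prevent.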


### AIO Callbacks

Commonly several layers need to react to completion of an IO. E.g. for a read
md.c needs to check if the IO outright failed or was shorter than needed, and
bufmgr.c needs to verify that the page looks valid and update the BufferDesc
to reflect the buffer's new state.

The fact that several layers / subsystems need to react to IO completion poses
a few challenges:

- Upper layers should not need to know details of lower layers. E.g. bufmgr.c
  should not assume the IO will pass through md.c. Therefore upper levels
  cannot know what lower layers would consider an error.

- Lower layers should not need to know about upper layers. E.g. smgr APIs are
  used going through shared buffers but are also used bypassing shared
  buffers. This means that e.g. md.c is not in a position to validate
  checksums.

- Having code in the AIO subsystem for every possible combination of layers
  would lead to a lot of duplication.

The "solution" to this is the ability to associate multiple completion
callbacks with a handle. E.g. bufmgr.c can have a callback to update the
BufferDesc state and to verify the page and md.c can have another callback to
check if the IO operation was successful.

As [mentioned](#state-for-aio-needs-to-live-in-shared-memory), shared memory
currently cannot contain function pointers. Because of that, completion
callbacks are not directly identified by function pointers but by IDs
(`PgAioHandleCallbackID`). A substantial added benefit is that this allows
callbacks to be identified using a much smaller amount of memory (a single
byte currently).
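
A minimal sketch of the ID-based indirection follows. All `Demo*` / `demo_*` names, the two example callbacks and the choice to run callbacks in reverse registration order are assumptions of this sketch, not the definitions behind `PgAioHandleCallbackID`. The point is only that the per-handle state holds one-byte IDs, while each process resolves those IDs through its own table of function addresses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical callback IDs, small enough to fit into a single byte. */
typedef enum DemoCallbackID
{
    DEMO_CB_INVALID = 0,
    DEMO_CB_CHECK_RESULT,       /* stands in for a storage-level check */
    DEMO_CB_UPDATE_BUFFER,      /* stands in for a buffer-level update */
    DEMO_CB_COUNT
} DemoCallbackID;

typedef int (*DemoCompleteFn) (int prior_result, uint8_t cb_data);

#define DEMO_MAX_CALLBACKS 4

/* What would live in shared memory: IDs and small data, no pointers. */
typedef struct DemoCallbackState
{
    uint8_t     num_callbacks;
    uint8_t     callbacks[DEMO_MAX_CALLBACKS];
    uint8_t     callbacks_data[DEMO_MAX_CALLBACKS];
} DemoCallbackState;

int
demo_check_result(int prior_result, uint8_t cb_data)
{
    (void) cb_data;
    return prior_result < 0 ? -1 : prior_result;
}

int
demo_update_buffer(int prior_result, uint8_t cb_data)
{
    /* In this sketch, cb_data != 0 means "tolerate a failed IO". */
    if (prior_result < 0 && cb_data != 0)
        return 0;
    return prior_result;
}

/* Process-local table: each process maps IDs to its own addresses. */
static const DemoCompleteFn demo_callback_table[DEMO_CB_COUNT] = {
    [DEMO_CB_CHECK_RESULT] = demo_check_result,
    [DEMO_CB_UPDATE_BUFFER] = demo_update_buffer,
};

void
demo_register_callback(DemoCallbackState *state, DemoCallbackID id,
                       uint8_t cb_data)
{
    assert(id > DEMO_CB_INVALID && id < DEMO_CB_COUNT);
    assert(state->num_callbacks < DEMO_MAX_CALLBACKS);
    state->callbacks[state->num_callbacks] = (uint8_t) id;
    state->callbacks_data[state->num_callbacks] = cb_data;
    state->num_callbacks++;
}

/*
 * Run the callbacks, most recently registered first, so that a lower layer
 * registered later gets to judge the raw result before upper layers do.
 */
int
demo_run_completion(const DemoCallbackState *state, int raw_result)
{
    int         result = raw_result;

    for (int i = state->num_callbacks; i > 0; i--)
    {
        DemoCompleteFn fn = demo_callback_table[state->callbacks[i - 1]];

        assert(fn != NULL);
        result = fn(result, state->callbacks_data[i - 1]);
    }
    return result;
}
```

Because `DemoCallbackState` contains no addresses, it can be copied between or shared by processes whose code is mapped at different locations.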

In addition to completion, AIO callbacks are also called to "stage" an
IO. This is, e.g., used to increase buffer reference counts to account for the
AIO subsystem referencing the buffer, which is required to handle the case
where the issuing backend errors out and releases its own pins while the IO is
still ongoing.

As [explained earlier](#io-can-be-started-in-critical-sections), IO
completions need to be safe to execute in critical sections. To allow the
backend that issued the IO to error out in case of failure,
[AIO Results](#aio-results) can be used.


### AIO Targets

In addition to the completion callbacks described above, each AIO Handle has
exactly one "target". Each target has some space inside an AIO Handle with
information specific to the target and can provide callbacks that allow
reopening the underlying file (required for worker mode) and describing the IO
operation (used for debug logging and error messages).

I.e., if two different uses of AIO can describe the identity of the file being
operated on the same way, it likely makes sense to use the same
target. E.g. different smgr implementations can describe IO with
RelFileLocator, ForkNumber and BlockNumber and can thus share a target. In
contrast, IO for a WAL file would be described with TimeLineID and XLogRecPtr
and it would not make sense to use the same target for smgr and WAL.
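
One way to picture a target is as a tagged union plus a per-target table of callbacks. The sketch below is an assumption-laden illustration: the `Demo*` types, the field names and the message formats are invented here and do not reflect the layout of the real target data. It only shows how a fixed amount of per-handle space can describe either an smgr-style or a WAL-style IO.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical target IDs. */
typedef enum DemoTargetID
{
    DEMO_TID_SMGR,
    DEMO_TID_WAL,
    DEMO_TID_COUNT
} DemoTargetID;

/* Fixed-size space inside a handle, interpreted according to the target. */
typedef union DemoTargetData
{
    struct
    {
        uint32_t    rel_number;
        int32_t     fork;
        uint32_t    block;
    }           smgr;
    struct
    {
        uint32_t    timeline;
        uint64_t    position;
    }           wal;
} DemoTargetData;

typedef struct DemoTargetInfo
{
    const char *name;
    /* Describe the IO, e.g. for debug logging and error messages. */
    int         (*describe) (const DemoTargetData *data,
                             char *buf, size_t buflen);
} DemoTargetInfo;

int
demo_describe_smgr(const DemoTargetData *data, char *buf, size_t buflen)
{
    return snprintf(buf, buflen, "block %u of relation %u, fork %d",
                    (unsigned) data->smgr.block,
                    (unsigned) data->smgr.rel_number,
                    (int) data->smgr.fork);
}

int
demo_describe_wal(const DemoTargetData *data, char *buf, size_t buflen)
{
    return snprintf(buf, buflen, "WAL on timeline %u at position %llX",
                    (unsigned) data->wal.timeline,
                    (unsigned long long) data->wal.position);
}

/* Process-local table, indexed by target ID. */
static const DemoTargetInfo demo_targets[DEMO_TID_COUNT] = {
    [DEMO_TID_SMGR] = {"smgr", demo_describe_smgr},
    [DEMO_TID_WAL] = {"wal", demo_describe_wal},
};

int
demo_describe_io(DemoTargetID tid, const DemoTargetData *data,
                 char *buf, size_t buflen)
{
    assert(tid >= DEMO_TID_SMGR && tid < DEMO_TID_COUNT);
    return demo_targets[tid].describe(data, buf, buflen);
}
```

Two smgr implementations that can both fill in the `smgr` member would share `DEMO_TID_SMGR` in this model, whereas WAL IO needs its own member and its own `describe` callback.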


### AIO Wait References

As [described above](#aio-handles), AIO Handles can be reused immediately
after completion and therefore cannot be used to wait for completion of the
IO. Waiting is enabled using AIO wait references, which do not just identify
an AIO Handle but also include the handle's "generation".

A reference to an AIO Handle can be acquired using `pgaio_io_get_wref()` and
then waited upon using `pgaio_wref_wait()`.
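
The role of the generation can be sketched as follows. The slot array, the `Demo*` types and the polling-style `demo_wref_is_done()` are hypothetical simplifications, not the implementation behind `pgaio_wref_wait()`. The idea shown is that a reference remembers which use of a slot it refers to, so a recycled slot is never mistaken for the original IO.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define DEMO_NUM_SLOTS 4

/* One reusable handle slot. */
typedef struct DemoSlot
{
    uint64_t    generation;     /* incremented each time the slot is recycled */
    bool        in_flight;
} DemoSlot;

/* A reference names a slot *and* the generation it was taken at. */
typedef struct DemoWaitRef
{
    uint32_t    index;
    uint64_t    generation;
} DemoWaitRef;

DemoSlot    demo_slots[DEMO_NUM_SLOTS];

/* Start an IO on a free slot and hand back a reference to this use of it. */
DemoWaitRef
demo_start_io(uint32_t index)
{
    assert(index < DEMO_NUM_SLOTS);
    assert(!demo_slots[index].in_flight);

    demo_slots[index].in_flight = true;
    return (DemoWaitRef) {.index = index,
                          .generation = demo_slots[index].generation};
}

/* Completion makes the slot reusable and invalidates older references. */
void
demo_complete_io(uint32_t index)
{
    assert(index < DEMO_NUM_SLOTS);
    assert(demo_slots[index].in_flight);

    demo_slots[index].in_flight = false;
    demo_slots[index].generation++;
}

/*
 * Has the IO this reference was created for finished? A generation mismatch
 * means the slot has been recycled since, so the referenced IO must have
 * completed, regardless of what the slot is doing now.
 */
bool
demo_wref_is_done(const DemoWaitRef *ref)
{
    const DemoSlot *slot = &demo_slots[ref->index];

    if (slot->generation != ref->generation)
        return true;
    return !slot->in_flight;
}
```

Without the generation check, a waiter holding a reference to a completed IO could end up waiting for an unrelated IO that happens to reuse the same slot.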


### AIO Results

As AIO completion callbacks
[are executed in critical sections](#io-can-be-started-in-critical-sections)
and [may be executed by any backend](#deadlock-and-starvation-dangers-due-to-aio),
completion callbacks cannot be used to, e.g., make the query that triggered an
IO ERROR out.

To allow reacting to failed IOs, the issuing backend can pass a pointer to a
`PgAioReturn` in backend local memory. Before an AIO Handle is reused the
`PgAioReturn` is filled with information about the IO. This includes
information about whether the IO was successful (as a value of
`PgAioResultStatus`) and enough information to raise an error in case of a
failure (via `pgaio_result_report()`, with the error details encoded in
`PgAioResult`).


### AIO Errors

It would be very convenient to have shared completion callbacks encode the
details of errors as an `ErrorData` that could be raised at a later
time. Unfortunately doing so would require allocating memory. While elog.c can
guarantee (well, kinda) that logging a message will not run out of memory,
that only works because a very limited number of messages are in the process
of being logged. With AIO a large number of concurrently issued AIOs might
fail.

To avoid the need for preallocating a potentially large amount of memory (in
shared memory no less!), completion callbacks instead have to encode errors in
a more compact format that can be converted into an error message.
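
As a rough illustration of such a compact format, the sketch below packs a status, an error kind and a small payload into a fixed-size struct and only builds the message text when it is asked for. The `Demo*` names, the bit widths and the message wording are assumptions of this sketch and are not the layout of `PgAioResult`.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

typedef enum DemoStatus
{
    DEMO_STATUS_OK,
    DEMO_STATUS_PARTIAL,
    DEMO_STATUS_ERROR
} DemoStatus;

typedef enum DemoErrorKind
{
    DEMO_ERR_NONE,
    DEMO_ERR_SHORT_READ,        /* error_data holds the bytes requested */
    DEMO_ERR_OS                 /* error_data holds an errno value */
} DemoErrorKind;

/* Fixed size, no pointers, no allocation: cheap to keep per IO. */
typedef struct DemoResult
{
    uint32_t    status:2;
    uint32_t    kind:6;
    uint32_t    error_data:24;
    int32_t     result;         /* bytes transferred, or -1 */
} DemoResult;

/* Called at completion time: only records a few integers. */
DemoResult
demo_encode_short_read(int32_t got, uint32_t wanted)
{
    assert(wanted < (1u << 24));

    DemoResult  r = {.status = DEMO_STATUS_PARTIAL,
                     .kind = DEMO_ERR_SHORT_READ,
                     .error_data = wanted,
                     .result = got};

    return r;
}

DemoResult
demo_encode_os_error(int os_errno)
{
    assert(os_errno > 0 && os_errno < (1 << 24));

    DemoResult  r = {.status = DEMO_STATUS_ERROR,
                     .kind = DEMO_ERR_OS,
                     .error_data = (uint32_t) os_errno,
                     .result = -1};

    return r;
}

/* Called later, by whoever wants to report: builds the actual message. */
int
demo_format_error(DemoResult r, char *buf, size_t buflen)
{
    switch ((DemoErrorKind) r.kind)
    {
        case DEMO_ERR_SHORT_READ:
            return snprintf(buf, buflen, "short read: got %d of %u bytes",
                            (int) r.result, (unsigned) r.error_data);
        case DEMO_ERR_OS:
            return snprintf(buf, buflen, "IO failed: %s",
                            strerror((int) r.error_data));
        case DEMO_ERR_NONE:
            break;
    }
    return snprintf(buf, buflen, "no error");
}
```

The encoding step needs no memory beyond the struct itself, which is what makes it usable from a completion callback; the comparatively expensive formatting is deferred to the backend that decides to raise or log the error.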


## Helpers

The low-level AIO API introduces too much complexity to be used directly all
over the tree. Most uses of AIO should be done via reusable, higher-level
helpers.


### Read Stream

A common and very beneficial use of AIO is for reads where a substantial
number of to-be-read locations are known ahead of time. E.g., for a sequential
scan the set of blocks that need to be read can be determined solely by
knowing the current position and checking the buffer mapping table.

The [Read Stream](../../../include/storage/read_stream.h) interface makes it
comparatively easy to use AIO for such use cases.
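
The general shape of such a helper can be sketched with a toy model: the consumer supplies a callback that yields the next block number, and the stream keeps a bounded number of reads started ahead of the block it hands out. Everything below (`DemoReadStream`, `demo_stream_*`, the fixed lookahead) is a hypothetical simplification and not the interface declared in read_stream.h; where a real implementation would start an asynchronous read, the model merely counts it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define DEMO_INVALID_BLOCK UINT32_MAX
#define DEMO_MAX_LOOKAHEAD 8

/* Callback through which the consumer tells the stream what to read next. */
typedef uint32_t (*DemoNextBlockCB) (void *cb_private);

typedef struct DemoReadStream
{
    DemoNextBlockCB next_block;
    void       *cb_private;
    int         lookahead;      /* how many reads to keep started ahead */
    uint32_t    queue[DEMO_MAX_LOOKAHEAD];  /* ring buffer of started reads */
    int         head;
    int         count;
    bool        exhausted;
    int         ios_started;    /* stands in for actually issuing IO */
} DemoReadStream;

/* Example callback: a sequential scan over blocks [next, nblocks). */
typedef struct DemoSeqScan
{
    uint32_t    next;
    uint32_t    nblocks;
} DemoSeqScan;

uint32_t
demo_seq_next(void *cb_private)
{
    DemoSeqScan *scan = cb_private;

    return scan->next < scan->nblocks ? scan->next++ : DEMO_INVALID_BLOCK;
}

void
demo_stream_init(DemoReadStream *s, DemoNextBlockCB cb, void *cb_private,
                 int lookahead)
{
    assert(lookahead >= 1 && lookahead <= DEMO_MAX_LOOKAHEAD);
    *s = (DemoReadStream) {.next_block = cb, .cb_private = cb_private,
                           .lookahead = lookahead};
}

/* Start reads until the lookahead window is full or the callback is done. */
void
demo_stream_fill(DemoReadStream *s)
{
    while (!s->exhausted && s->count < s->lookahead)
    {
        uint32_t    blk = s->next_block(s->cb_private);

        if (blk == DEMO_INVALID_BLOCK)
        {
            s->exhausted = true;
            break;
        }
        s->queue[(s->head + s->count) % DEMO_MAX_LOOKAHEAD] = blk;
        s->count++;
        s->ios_started++;       /* a real stream would start the read here */
    }
}

/* Return the next block in order, or DEMO_INVALID_BLOCK at the end. */
uint32_t
demo_stream_next(DemoReadStream *s)
{
    uint32_t    blk;

    demo_stream_fill(s);
    if (s->count == 0)
        return DEMO_INVALID_BLOCK;

    blk = s->queue[s->head];
    s->head = (s->head + 1) % DEMO_MAX_LOOKAHEAD;
    s->count--;
    return blk;
}
```

In this model the consumer only ever calls `demo_stream_next()`; how far ahead reads are started is hidden behind the stream, which is the property that keeps the calling code simple.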

@ -24,6 +24,8 @@
 *
 * - read_stream.c - helper for reading buffered relation data
 *
 * - README.md - higher-level overview over AIO
 *
 *
 * Portions Copyright (c) 1996-2025, PostgreSQL Global Development Group
 * Portions Copyright (c) 1994, Regents of the University of California