1
0
mirror of https://github.com/postgres/postgres.git synced 2025-12-21 05:21:08 +03:00
Commit Graph

43 Commits

Author SHA1 Message Date
Andres Freund
fdd146a8ef aio: Add README.md explaining higher level design
Reviewed-by: Noah Misch <noah@leadboat.com>
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Discussion: https://postgr.es/m/uvrtrknj4kdytuboidbhwclo4gxhswwcpgadptsjvjqcluzmah%40brqs62irg4dt
Discussion: https://postgr.es/m/20210223100344.llw5an2aklengrmn@alap3.anarazel.de
Discussion: https://postgr.es/m/stj36ea6yyhoxtqkhpieia2z4krnam7qyetc57rfezgk4zgapf@gcnactj4z56m
2025-04-01 16:06:48 -04:00
Andres Freund
60f566b4f2 aio: Add pg_aios view
The new view lists all IO handles that are currently in use and is mainly
useful for PG developers, but may also be useful when tuning PG.

Bumps catversion.

Reviewed-by: Noah Misch <noah@leadboat.com>
Discussion: https://postgr.es/m/uvrtrknj4kdytuboidbhwclo4gxhswwcpgadptsjvjqcluzmah%40brqs62irg4dt
2025-04-01 13:30:33 -04:00
Andres Freund
ae3df4b341 read_stream: Introduce and use optional batchmode support
Submitting IO in larger batches can be more efficient than doing so
one-by-one, particularly for many small reads. It does, however, require
the ReadStreamBlockNumberCB callback to abide by the restrictions of AIO
batching (c.f. pgaio_enter_batchmode()). Basically, the callback may not:
a) block without first calling pgaio_submit_staged(), unless a
   to-be-waited-on lock cannot be part of a deadlock, e.g. because it is
   never held while waiting for IO.

b) directly or indirectly start another batch pgaio_enter_batchmode()

As this requires care and is nontrivial in some cases, batching is only
used with explicit opt-in.

This patch adds an explicit flag (READ_STREAM_USE_BATCHING) to read_stream and
uses it where appropriate.

There are two cases where batching would likely be beneficial, but where we
aren't using it yet:

1) bitmap heap scans, because the callback reads the VM

   This should soon be solved, because we are planning to remove the use of
   the VM, due to that not being sound.

2) The first phase of heap vacuum

   This could be made to support batchmode, but would require some care.

Reviewed-by: Noah Misch <noah@leadboat.com>
Reviewed-by: Thomas Munro <thomas.munro@gmail.com>
Discussion: https://postgr.es/m/uvrtrknj4kdytuboidbhwclo4gxhswwcpgadptsjvjqcluzmah%40brqs62irg4dt
2025-03-30 18:36:41 -04:00
Andres Freund
f4d0730bbc aio: Basic read_stream adjustments for real AIO
Adapt the read stream logic for real AIO:
- If AIO is enabled, we shouldn't issue advice, but if it isn't, we should
  continue issuing advice
- AIO benefits from reading ahead with direct IO
- If effective_io_concurrency=0, pass READ_BUFFERS_SYNCHRONOUSLY to
  StartReadBuffers() to ensure synchronous IO execution

There are further improvements we should consider:

- While in read_stream_look_ahead(), we can use AIO batch submission mode for
  increased efficiency. That however requires care to avoid deadlocks and thus
  done separately.
- It can be beneficial to defer starting new IOs until we can issue multiple
  IOs at once. That however requires non-trivial heuristics to decide when to
  do so.

Reviewed-by: Noah Misch <noah@leadboat.com>
Co-authored-by: Andres Freund <andres@anarazel.de>
Co-authored-by: Thomas Munro <thomas.munro@gmail.com>
2025-03-30 18:26:44 -04:00
Andres Freund
047cba7fa0 bufmgr: Implement AIO read support
This commit implements the infrastructure to perform asynchronous reads into
the buffer pool.

To do so, it:

- Adds readv AIO callbacks for shared and local buffers

  It may be worth calling out that shared buffer completions may be run in a
  different backend than where the IO started.

- Adds an AIO wait reference to BufferDesc, to allow backends to wait for
  in-progress asynchronous IOs

- Adapts StartBufferIO(), WaitIO(), TerminateBufferIO(), and their localbuf.c
  equivalents, to be able to deal with AIO

- Moves the code to handle BM_PIN_COUNT_WAITER into a helper function, as it
  now also needs to be called on IO completion

As of this commit, nothing issues AIO on shared/local buffers. A future commit
will update StartReadBuffers() to do so.

Buffer reads executed through this infrastructure will report invalid page /
checksum errors / warnings differently than before:

In the error case the error message will cover all the blocks that were
included in the read, rather than just the reporting the first invalid
block. If more than one block is invalid, the error will include information
about the range of the read, the first invalid block and the number of invalid
pages, with a HINT towards the server log for per-block details.

For the warning case (i.e. zero_damaged_buffers) we would previously emit one
warning message for each buffer in a multi-block read. Now there is only a
single warning message for the entire read, again referring to the server log
for more details in case of multiple checksum failures within a single larger
read.

Reviewed-by: Noah Misch <noah@leadboat.com>
Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com>
Discussion: https://postgr.es/m/uvrtrknj4kdytuboidbhwclo4gxhswwcpgadptsjvjqcluzmah%40brqs62irg4dt
Discussion: https://postgr.es/m/20210223100344.llw5an2aklengrmn@alap3.anarazel.de
Discussion: https://postgr.es/m/stj36ea6yyhoxtqkhpieia2z4krnam7qyetc57rfezgk4zgapf@gcnactj4z56m
2025-03-30 17:28:03 -04:00
Andres Freund
ef64fe26ba aio: Add WARNING result status
If an IO succeeds, but issues a warning, e.g. due to a page verification
failure with zero_damaged_pages, we want to issue that warning in the context
of the issuer of the IO, not the process that executes the completion (always
the case for worker).

It's already possible for a completion callback to report a custom error
message, we just didn't have a result status that allowed a user of AIO to
know that a warning should be emitted even though the IO request succeeded.

All that's needed for that is a dedicated PGAIO_RS_ value.

Previously there were not enough bits in PgAioResult.id for the new
value. Increase. While at that, add defines for the amount of bits and static
asserts to check that the widths are appropriate.

Reviewed-by: Noah Misch <noah@leadboat.com>
Discussion: https://postgr.es/m/20250329212929.a6.nmisch@google.com
2025-03-30 16:27:10 -04:00
Andres Freund
50cb7505b3 aio: Implement support for reads in smgr/md/fd
This implements the following:

1) An smgr AIO target, for AIO on smgr files. This should be usable not just
   for md.c but also other SMGR implementation if we ever get them.
2) readv support in fd.c, which requires a small bit of infrastructure work in
   fd.c
3) smgr.c and md.c support for readv

There still is nothing performing AIO, but as of this commit it would be
possible.

As part of this change FileGetRawDesc() actually ensures that the file is
opened - previously it was basically not usable. It's used to reopen a file in
IO workers.

Reviewed-by: Noah Misch <noah@leadboat.com>
Discussion: https://postgr.es/m/uvrtrknj4kdytuboidbhwclo4gxhswwcpgadptsjvjqcluzmah%40brqs62irg4dt
Discussion: https://postgr.es/m/20210223100344.llw5an2aklengrmn@alap3.anarazel.de
Discussion: https://postgr.es/m/stj36ea6yyhoxtqkhpieia2z4krnam7qyetc57rfezgk4zgapf@gcnactj4z56m
2025-03-29 13:38:35 -04:00
Andres Freund
c325a7633f aio: Add io_method=io_uring
Performing AIO using io_uring can be considerably faster than
io_method=worker, particularly when lots of small IOs are issued, as
a) the context-switch overhead for worker based AIO becomes more significant
b) the number of IO workers can become limiting

io_uring, however, is linux specific and requires an additional compile-time
dependency (liburing).

This implementation is fairly simple and there are substantial optimization
opportunities.

The description of the existing AIO_IO_COMPLETION wait event is updated to
make the difference between it and the new AIO_IO_URING_EXECUTION clearer.

Reviewed-by: Noah Misch <noah@leadboat.com>
Reviewed-by: Jakub Wartak <jakub.wartak@enterprisedb.com>
Discussion: https://postgr.es/m/uvrtrknj4kdytuboidbhwclo4gxhswwcpgadptsjvjqcluzmah%40brqs62irg4dt
Discussion: https://postgr.es/m/20210223100344.llw5an2aklengrmn@alap3.anarazel.de
Discussion: https://postgr.es/m/stj36ea6yyhoxtqkhpieia2z4krnam7qyetc57rfezgk4zgapf@gcnactj4z56m
2025-03-26 19:49:13 -04:00
Andres Freund
9469d7fdd2 aio: Rename pgaio_io_prep_* to pgaio_io_start_*
The old naming pattern (mirroring liburing's naming) was inconsistent with
the (not yet introduced) callers. It seems better to get rid of the
inconsistency now than to grow more users of the odd naming.

Reported-by: Noah Misch <noah@leadboat.com>
Discussion: https://postgr.es/m/20250326001915.bc.nmisch@google.com
2025-03-26 16:10:29 -04:00
Andres Freund
f321ec237a aio: Pass result of local callbacks to ->report_return
Otherwise the results of e.g. temp table buffer verification errors will not
reach bufmgr.c. Obviously that's not right. Found while expanding the tests
for invalid buffer contents.

Reviewed-by: Noah Misch <noah@leadboat.com>
Discussion: https://postgr.es/m/20250326001915.bc.nmisch@google.com
2025-03-26 16:06:54 -04:00
Andres Freund
96da9050a5 aio: Be more paranoid about interrupts
As reported by Noah, it's possible, although practically very unlikely, that
interrupts could be processed in between pgaio_io_reopen() and
pgaio_io_perform_synchronously(). Prevent that by explicitly holding
interrupts.

It also seems good to add an assertion to pgaio_io_before_prep() to ensure
that interrupts are held, as otherwise FDs referenced by the IO could be
closed during interrupt processing. All code in the aio series currently runs
the code with interrupts held, but it seems better to be paranoid.

Reviewed-by: Noah Misch <noah@leadboat.com>
Reported-by: Noah Misch <noah@leadboat.com>
Discussion: https://postgr.es/m/20250324002939.5c.nmisch@google.com
2025-03-26 16:06:54 -04:00
Andres Freund
ca3067cc57 aio: Change prefix of PgAioResultStatus values to PGAIO_RS_
The previous prefix wasn't consistent with the naming of other AIO related
enum values. It seems best to rename it before the users are introduced.

Reported-by: Melanie Plageman <melanieplageman@gmail.com>
Discussion: https://postgr.es/m/CAAKRu_Yb+JzQpNsgUxCB0gBi+sE-mi_HmcJF6ALnmO4W+UgwpA@mail.gmail.com
2025-03-22 17:30:44 -04:00
Thomas Munro
e51ca405ed Fix ps display for IO workers.
This code must have missed a memo about the backend type description
being supplied automatically these days, and was duplicating that
information.

Before: "io worker io worker: N"
After:  "io worker N"
2025-03-22 10:13:23 +13:00
Thomas Munro
ed0b87caac Support buffer forwarding in read_stream.c.
In preparation for a follow-up change to the buffer manager, teach
read_stream.c to manage buffers "forwarded" from one StartReadBuffers()
call to the next after a short read.  This involves a small amount of
extra book-keeping, and opens the way for lower levels to split I/O
operations without having to drop pins, as required for efficient
handling of various edge cases.

Concretely, the "buffers" argument will change from an out parameter to
an in/out parameter.  Buffer queue elements must be initialized on first
use and cleared after they're consumed, but forwarded buffers are left
where they fall ahead of the current pending read in the queue, ready
for use by the operation that continues where a short read left off.
The stream also needs to count them for pin limit management and release
them on reset/early end.

Tested-by: Andres Freund <andres@anarazel.de> (earlier versions)
Discussion: https://postgr.es/m/CA%2BhUKGK_%3D4CVmMHvsHjOVrK6t4F%3DLBpFzsrr3R%2BaJYN8kcTfWg%40mail.gmail.com
2025-03-21 18:44:47 +13:00
Thomas Munro
06fb5612c9 Increase io_combine_limit range to 1MB.
The default of 128kB is unchanged, but the upper limit is changed from
32 blocks to 128 blocks, unless the operating system's IOV_MAX is too
low.  Some other RDBMSes seem to cap their multi-block buffer pool I/O
around this number, and it seems useful to allow experimentation.

The concrete change is to our definition of PG_IOV_MAX, which provides
the maximum for io_combine_limit and io_max_combine_limit.  It also
affects a couple of other places that work with arrays of struct iovec
or smaller objects on the stack, so we still don't want to use the
system IOV_MAX directly without a clamp: it is not under our control and
likely to be 1024.  128 seems acceptable for our current usage.

For Windows, we can't use real scatter/gather yet, so we continue to
define our own IOV_MAX value of 16 and emulate preadv()/pwritev() with
loops.  Someone would need to research the trade-offs of raising that
number.

NB if trying to see this working: you might temporarily need to hack
BAS_BULKREAD to be bigger, since otherwise the obvious way of "a very
big SELECT" is limited by that for now.

Suggested-by: Tomas Vondra <tomas@vondra.me>
Discussion: https://postgr.es/m/CA%2BhUKG%2B2T9p-%2BzM6Eeou-RAJjTML6eit1qn26f9twznX59qtCA%40mail.gmail.com
2025-03-19 15:40:35 +13:00
Thomas Munro
10f6646847 Introduce io_max_combine_limit.
The existing io_combine_limit can be changed by users.  The new
io_max_combine_limit is fixed at server startup time, and functions as a
silent clamp on the user setting.  That in itself is probably quite
useful, but the primary motivation is:

aio_init.c allocates shared memory for all asynchronous IOs including
some per-block data, and we didn't want to waste memory you'd never used
by assuming they could be up to PG_IOV_MAX.  This commit already halves
the size of 'AioHandleIov' and 'AioHandleData'.  A follow-up commit can
now expand PG_IOV_MAX without affecting that.

Since our GUC system doesn't support dependencies or cross-checks
between GUCs, the user-settable one now assigns a "raw" value to
io_combine_limit_guc, and the lower of io_combine_limit_guc and
io_max_combine_limit is maintained in io_combine_limit.

Reviewed-by: Andres Freund <andres@anarazel.de> (earlier version)
Discussion: https://postgr.es/m/CA%2BhUKG%2B2T9p-%2BzM6Eeou-RAJjTML6eit1qn26f9twznX59qtCA%40mail.gmail.com
2025-03-19 15:23:54 +13:00
Andres Freund
247ce06b88 aio: Add io_method=worker
The previous commit introduced the infrastructure to start io_workers. This
commit actually makes the workers execute IOs.

IO workers consume IOs from a shared memory submission queue, run traditional
synchronous system calls, and perform the shared completion handling
immediately.  Client code submits most requests by pushing IOs into the
submission queue, and waits (if necessary) using condition variables.  Some
IOs cannot be performed in another process due to lack of infrastructure for
reopening the file, and must processed synchronously by the client code when
submitted.

For now the default io_method is changed to "worker". We should re-evaluate
that around beta1, we might want to be careful and set the default to "sync"
for 18.

Reviewed-by: Noah Misch <noah@leadboat.com>
Co-authored-by: Thomas Munro <thomas.munro@gmail.com>
Co-authored-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/uvrtrknj4kdytuboidbhwclo4gxhswwcpgadptsjvjqcluzmah%40brqs62irg4dt
Discussion: https://postgr.es/m/20210223100344.llw5an2aklengrmn@alap3.anarazel.de
Discussion: https://postgr.es/m/stj36ea6yyhoxtqkhpieia2z4krnam7qyetc57rfezgk4zgapf@gcnactj4z56m
2025-03-18 11:54:01 -04:00
Andres Freund
55b454d0e1 aio: Infrastructure for io_method=worker
This commit contains the basic, system-wide, infrastructure for
io_method=worker. It does not yet actually execute IO, this commit just
provides the infrastructure for running IO workers, kept separate for easier
review.

The number of IO workers can be adjusted with a PGC_SIGHUP GUC. Eventually
we'd like to make the number of workers dynamically scale up/down based on the
current "IO load".

To allow the number of IO workers to be increased without a restart, we need
to reserve PGPROC entries for the workers unconditionally. This has been
judged to be worth the cost. If it turns out to be problematic, we can
introduce a PGC_POSTMASTER GUC to control the maximum number.

As io workers might be needed during shutdown, e.g. for AIO during the
shutdown checkpoint, a new PMState phase is added. IO workers are shut down
after the shutdown checkpoint has been performed and walsender/archiver have
shut down, but before the checkpointer itself shuts down. See also
87a6690cc6.

Updates PGSTAT_FILE_FORMAT_ID due to the addition of a new BackendType.

Reviewed-by: Noah Misch <noah@leadboat.com>
Co-authored-by: Thomas Munro <thomas.munro@gmail.com>
Co-authored-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/uvrtrknj4kdytuboidbhwclo4gxhswwcpgadptsjvjqcluzmah%40brqs62irg4dt
Discussion: https://postgr.es/m/20210223100344.llw5an2aklengrmn@alap3.anarazel.de
Discussion: https://postgr.es/m/stj36ea6yyhoxtqkhpieia2z4krnam7qyetc57rfezgk4zgapf@gcnactj4z56m
2025-03-18 11:54:01 -04:00
Andres Freund
da7226993f aio: Add core asynchronous I/O infrastructure
The main motivations to use AIO in PostgreSQL are:

a) Reduce the time spent waiting for IO by issuing IO sufficiently early.

   In a few places we have approximated this using posix_fadvise() based
   prefetching, but that is fairly limited (no completion feedback, double the
   syscalls, only works with buffered IO, only works on some OSs).

b) Allow to use Direct-I/O (DIO).

   DIO can offload most of the work for IO to hardware and thus increase
   throughput / decrease CPU utilization, as well as reduce latency.  While we
   have gained the ability to configure DIO in d4e71df6, it is not yet usable
   for real world workloads, as every IO is executed synchronously.

For portability, the new AIO infrastructure allows to implement AIO using
different methods. The choice of the AIO method is controlled by the new
io_method GUC. As of this commit, the only implemented method is "sync",
i.e. AIO is not actually executed asynchronously. The "sync" method exists to
allow to bypass most of the new code initially.

Subsequent commits will introduce additional IO methods, including a
cross-platform method implemented using worker processes and a linux specific
method using io_uring.

To allow different parts of postgres to use AIO, the core AIO infrastructure
does not need to know what kind of files it is operating on. The necessary
behavioral differences for different files are abstracted as "AIO
Targets". One example target would be smgr. For boring portability reasons,
all targets currently need to be added to an array in aio_target.c.  This
commit does not implement any AIO targets, just the infrastructure for
them. The smgr target will be added in a later commit.

Completion (and other events) of IOs for one type of file (i.e. one AIO
target) need to be reacted to differently, based on the IO operation and the
callsite. This is made possible by callbacks that can be registered on
IOs. E.g. an smgr read into a local buffer does not need to update the
corresponding BufferDesc (as there is none), but a read into shared buffers
does.  This commit does not contain any callbacks, they will be added in
subsequent commits.

For now the AIO infrastructure only understands READV and WRITEV operations,
but it is expected that more operations will be added. E.g. fsync/fdatasync,
flush_range and network operations like send/recv.

As of this commit, nothing uses the AIO infrastructure. Later commits will add
an smgr target, md.c and bufmgr.c callbacks and then finally use AIO for
read_stream.c IO, which, in one fell swoop, will convert all read stream users
to AIO.

The goal is to use AIO in many more places. There are patches to use AIO for
checkpointer and bgwriter that are reasonably close to being ready. There also
are prototypes to use it for WAL, relation extension, backend writes and many
more. Those prototypes were important to ensure the design of the AIO
subsystem is not too limiting (e.g. WAL writes need to happen in critical
sections, which influenced a lot of the design).

A future commit will add an AIO README explaining the AIO architecture and how
to use the AIO subsystem. The README is added later, as it references details
only added in later commits.

Many many more people than the folks named below have contributed with
feedback, work on semi-independent patches etc. E.g. various folks have
contributed patches to use the read stream infrastructure (added by Thomas in
b5a9b18cd0) in more places. Similarly, a *lot* of folks have contributed to
the CI infrastructure, which I had started to work on to make adding AIO
feasible.

Some of the work by contributors has gone into the "v1" prototype of AIO,
which heavily influenced the current design of the AIO subsystem. None of the
code from that directly survives, but without the prototype, the current
version of the AIO infrastructure would not exist.

Similarly, the reviewers below have not necessarily looked at the current
design or the whole infrastructure, but have provided very valuable input. I
am to blame for problems, not they.

Author: Andres Freund <andres@anarazel.de>
Co-authored-by: Thomas Munro <thomas.munro@gmail.com>
Co-authored-by: Nazir Bilal Yavuz <byavuz81@gmail.com>
Co-authored-by: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-by: Noah Misch <noah@leadboat.com>
Reviewed-by: Jakub Wartak <jakub.wartak@enterprisedb.com>
Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Robert Haas <robertmhaas@gmail.com>
Reviewed-by: Dmitry Dolgov <9erthalion6@gmail.com>
Reviewed-by: Antonin Houska <ah@cybertec.at>
Discussion: https://postgr.es/m/uvrtrknj4kdytuboidbhwclo4gxhswwcpgadptsjvjqcluzmah%40brqs62irg4dt
Discussion: https://postgr.es/m/20210223100344.llw5an2aklengrmn@alap3.anarazel.de
Discussion: https://postgr.es/m/stj36ea6yyhoxtqkhpieia2z4krnam7qyetc57rfezgk4zgapf@gcnactj4z56m
2025-03-17 18:51:33 -04:00
Andres Freund
02844012b3 aio: Basic subsystem initialization
This commit just does the minimal wiring up of the AIO subsystem, added in the
next commit, to the rest of the system. The next commit contains more details
about motivation and architecture.

This commit is kept separate to make it easier to review, separating the
changes across the tree, from the implementation of the new subsystem.

We discussed squashing this commit with the main commit before merging AIO,
but there has been a mild preference for keeping it separate.

Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-by: Noah Misch <noah@leadboat.com>
Discussion: https://postgr.es/m/uvrtrknj4kdytuboidbhwclo4gxhswwcpgadptsjvjqcluzmah%40brqs62irg4dt
2025-03-17 18:51:33 -04:00
Thomas Munro
799959dc7c Simplify distance heuristics in read_stream.c.
Make the distance control heuristics simpler and more aggressive in
preparation for asynchronous I/O.

The v17 version of read_stream.c made a conservative choice to limit the
look-ahead distance when streaming sequential blocks, because it
couldn't benefit very much from looking ahead further yet.  It had a
three-behavior model where only random I/O would rapidly increase the
look-ahead distance, to support read-ahead advice.  Sequential I/O would
move it towards the io_combine_limit setting, just enough to build one
full-sized synchronous I/O at a time, and then expect kernel read-ahead
to avoid I/O stalls.

That already left I/O performance on the table with advice-based I/O
concurrency, since sequential blocks could be followed by random jumps,
eg with the proposed streaming Bitmap Heap Scan patch.

It is time to delete the cautious middle option and adjust the distance
based on recent I/O needs only, since asynchronous reads will need to be
started ahead of time whether random or sequential.  It is still limited
by io_combine_limit, *_io_concurrency, buffer availability and
strategy ring size, as before.

Reviewed-by: Andres Freund <andres@anarazel.de> (earlier version)
Tested-by: Melanie Plageman <melanieplageman@gmail.com>
Discussion: https://postgr.es/m/CA%2BhUKGK_%3D4CVmMHvsHjOVrK6t4F%3DLBpFzsrr3R%2BaJYN8kcTfWg%40mail.gmail.com
2025-03-16 03:05:07 +13:00
Thomas Munro
7ea8cd1566 Improve read_stream.c advice for dense streams.
read_stream.c tries not to issue read-ahead advice when it thinks the
kernel's own read-ahead should be active, ie when using buffered I/O and
reading sequential blocks.  It previously gave up too easily, and issued
advice only for the first read of up to io_combine_limit blocks in a
larger range of sequential blocks after random jump.  The following read
could suffer an avoidable I/O stall.

Fix, by continuing to issue advice until the corresponding preadv()
calls catch up with the start of the region we're currently issuing
advice for, if ever.  That's when the kernel actually sees the
sequential pattern.  Advice is now disabled only when the stream is
entirely sequential as far as we can see in the look-ahead window, or
in other words, when a sequential region is larger than we can cover
with the current io_concurrency and io_combine_limit settings.

While refactoring the advice control logic, also get rid of the
"suppress_advice" argument that was passed around between functions to
skip useless posix_fadvise() calls immediately followed by preadv().
read_stream_start_pending_read() can figure that out, so let's
concentrate knowledge of advice heuristics in fewer places (our goal
being to make advice-based I/O concurrency a legacy mode soon).

The problem cases were revealed by Tomas Vondra's extensive regression
testing with many different disk access patterns using Melanie
Plageman's streaming Bitmap Heap Scan patch, in a battle against the
venerable always-issue-advice-and-always-one-block-at-a-time code.

Reviewed-by: Andres Freund <andres@anarazel.de> (earlier version)
Reported-by: Melanie Plageman <melanieplageman@gmail.com>
Reported-by: Tomas Vondra <tomas@vondra.me>
Reported-by: Andres Freund <andres@anarazel.de>
Tested-by: Melanie Plageman <melanieplageman@gmail.com>
Discussion: https://postgr.es/m/CA%2BhUKGK_%3D4CVmMHvsHjOVrK6t4F%3DLBpFzsrr3R%2BaJYN8kcTfWg%40mail.gmail.com
Discussion: https://postgr.es/m/CA%2BhUKGJ3HSWciQCz8ekP1Zn7N213RfA4nbuotQawfpq23%2Bw-5Q%40mail.gmail.com
2025-03-15 19:04:54 +13:00
Thomas Munro
92fc6856cb Respect changing pin limits in read_stream.c.
To avoid pinning too much of the buffer pool at once, read_stream.c
previously used LimitAdditionalPins().  The coding was naive, and only
considered the available buffers at stream construction time.

This commit checks before each StartReadBuffers() call with
GetAdditionalPinLimit().  The result might change over time due to pins
acquired outside this stream by the same backend.  No extra CPU cycles
are added to the all-buffered fast-path code, but the I/O-starting path
now considers the up-to-date remaining buffer limit.

In practice it was quite difficult to exceed limits and cause any real
problems in v17, so no back-patch for now, but proposed changes will
make it easier.

Per code review from Andres, in the course of testing his AIO patches.

Reviewed-by: Andres Freund <andres@anarazel.de> (earlier versions)
Discussion: https://postgr.es/m/CA%2BhUKGK_%3D4CVmMHvsHjOVrK6t4F%3DLBpFzsrr3R%2BaJYN8kcTfWg%40mail.gmail.com
2025-03-14 21:21:09 +13:00
Thomas Munro
75da2bece6 Fix read_stream.c for changing io_combine_limit.
In a couple of places, read_stream.c assumed that io_combine_limit would
be stable during the lifetime of a stream.  That is not true in at least
one unusual case: streams held by CURSORs where you could change the GUC
between FETCH commands, with unpredictable results.

Fix, by storing stream->io_combine_limit and referring only to that
after construction.  This mirrors the treatment of the other important
setting {effective,maintenance}_io_concurrency, which is stored in
stream->max_ios.

One of the cases was the queue overflow space, which was sized for
io_combine_limit and could be overrun if the GUC was increased.  Since
that coding was a little hard to follow, also introduce a variable for
better readability instead of open-coding the arithmetic.  Doing so
revealed an off-by-one thinko while clamping max_pinned_buffers to
INT16_MAX, though that wasn't a live bug due to the current limits on
GUC values.

Back-patch to 17.

Discussion: https://postgr.es/m/CA%2BhUKG%2B2T9p-%2BzM6Eeou-RAJjTML6eit1qn26f9twznX59qtCA%40mail.gmail.com
2025-03-13 15:43:34 +13:00
Thomas Munro
55918f798b Remove arbitrary cap on read_stream.c buffer queue.
Previously the internal queue of buffers was capped at max_ios * 4,
though not less than io_combine_limit, at allocation time.  That was
done in the first version based on conservative theories about resource
usage and heuristics pending later work.  The configured I/O depth could
not always be reached with dense random streams generated by ANALYZE,
VACUUM, the proposed Bitmap Heap Scan patch, and also sequential streams
with the proposed AIO subsystem to name some examples.

The new formula is (max_ios + 1) * io_combine_limit, enough buffers for
the full configured I/O concurrency level using the full configured I/O
combine size, plus the buffers from one finished but not yet consumed
full-sized I/O.  Significantly more memory would be needed for high GUC
values if the client code requests a large per-buffer data size, but
that is discouraged (existing and proposed stream users try to keep it
under a few words, if not zero).

With this new formula, an intermediate variable could have overflowed
under maximum GUC values, so its data type is adjusted to cope.

Discussion: https://postgr.es/m/CA%2BhUKGK_%3D4CVmMHvsHjOVrK6t4F%3DLBpFzsrr3R%2BaJYN8kcTfWg%40mail.gmail.com
2025-02-27 20:49:48 +13:00
Thomas Munro
2509b857cc Fix typo in 2a8a0067.
Builds configured with Valgrind but without assertions would fail due to
a typo in the recent change.  This should be included when back-patching
2a8a0067 into v17.
2025-02-18 14:44:59 +13:00
Thomas Munro
2a8a00674e Fix explicit valgrind interaction in read_stream.c.
By calling wipe_mem() on per-buffer data memory that has been released,
we are also telling Valgrind that the memory is "noaccess".  We need to
set it to "undefined" before giving it to the registered callback to
fill in, when a slot is reused.

As discovered by build farm animal skink when the VACUUM streamification
patches landed (the first users of per-buffer data).

Pushing to master only for now, to clear the error on skink.  It's also
possible that external code might discover the per-buffer data feature
in v17, and reasonable to expect Valgrind not to produce spurious
memcheck reports, but the back-patch is deferred until after the
imminent minor release is out of the way.

Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Tested-by: Melanie Plageman <melanieplageman@gmail.com>
Discussion: https://postgr.es/m/CA%2BhUKG%2Bg6aXpi2FEHqeLOzE%2BxYw%3DOV%2B-N5jhOEnnV%2BF0USM9xA%40mail.gmail.com
2025-02-15 13:14:03 +13:00
Bruce Momjian
50e6eb731d Update copyright for 2025
Backpatch-through: 13
2025-01-01 11:21:55 -05:00
Peter Eisentraut
e18512c000 Remove unused #include's from backend .c files
as determined by IWYU

These are mostly issues that are new since commit dbbca2cf29.

Discussion: https://www.postgresql.org/message-id/flat/0df1d5b1-8ca8-4f84-93be-121081bde049%40eisentraut.org
2024-10-27 08:26:50 +01:00
Thomas Munro
70d38e3d8a Allow ReadStream to be consumed as raw block numbers.
Commits 041b9680 and 6377e12a changed the interface of
scan_analyze_next_block() to take a ReadStream instead of a BlockNumber
and a BufferAccessStrategy, and to return a value to indicate when the
stream has run out of blocks.

This caused integration problems for at least one known extension that
uses specially encoded BlockNumber values that map to different
underlying storage, because acquire_sample_rows() sets up the stream so
that read_stream_next_buffer() reads blocks from the main fork of the
relation's SMgrRelation.

Provide read_stream_next_block(), as a way for such an extension to
access the stream of raw BlockNumbers directly and forward them to its
own ReadBuffer() calls after decoding, as it could in earlier releases.
The new function returns the BlockNumber and BufferAccessStrategy that
were previously passed directly to scan_analyze_next_block().
Alternatively, an extension could wrap the stream of BlockNumbers in
another ReadStream with a callback that performs any decoding required
to arrive at real storage manager BlockNumber values, so that it could
benefit from the I/O combining and concurrency provided by
read_stream.c.

Another class of table access method that does nothing in
scan_analyze_next_block() because it is not block-oriented could use
this function to control the number of block sampling loops.  It could
match the previous behavior with "return read_stream_next_block(stream,
&bas) != InvalidBlockNumber".

Ongoing work is expected to provide better ANALYZE support for table
access methods that don't behave like heapam with respect to storage
blocks, but that will be for future releases.

Back-patch to 17.

Reported-by: Mats Kindahl <mats@timescale.com>
Reviewed-by: Mats Kindahl <mats@timescale.com>
Discussion: https://postgr.es/m/CA%2B14425%2BCcm07ocG97Fp%2BFrD9xUXqmBKFvecp0p%2BgV2YYR258Q%40mail.gmail.com
2024-09-18 11:34:28 +12:00
Thomas Munro
813fde73d4 Standardize "read-ahead advice" terminology.
Commit 6654bb920 added macOS's equivalent of POSIX_FADV_WILLNEED, and
changed some explicit references to posix_fadvise to use this more
general name for the concept.  Update some remaining references.

Reviewed-by: Peter Eisentraut <peter@eisentraut.org>
Discussion: https://postgr.es/m/0827edec-1317-4917-a186-035eb1e3241d%40eisentraut.org
2024-09-04 10:28:53 +12:00
Noah Misch
c582b75851 Add block_range_read_stream_cb(), to deduplicate code.
This replaces two functions for iterating over all blocks in a range.  A
pending patch will use this instead of adding a third.

Nazir Bilal Yavuz

Discussion: https://postgr.es/m/20240820184742.f2.nmisch@google.com
2024-09-03 10:46:20 -07:00
Michael Paquier
4236825197 Fix typos and grammar in code comments and docs
Author: Alexander Lakhin
Discussion: https://postgr.es/m/f7e514cf-2446-21f1-a5d2-8c089a6e2168@gmail.com
2024-09-03 14:49:04 +09:00
Thomas Munro
4effd0844d Fix unfairness in all-cached parallel seq scan.
Commit b5a9b18c introduced block streaming infrastructure with a special
fast path for all-cached scans, and commit b7b0f3f2 connected the
infrastructure up to sequential scans.  One of the fast path
micro-optimizations had an unintended consequence: it interfered with
parallel sequential scan's block range allocator (from commit 56788d21),
which has its own ramp-up and ramp-down algorithm when handing out
groups of pages to workers.  A scan of an all-cached table could give
extra blocks to one worker, when others had finished.  In some plans
(probably already very bad plans, such as the one reported by
Alexander), the unfairness could be magnified.

An internal buffer of 16 block numbers is removed, keeping just a single
block buffer for technical reasons.

Back-patch to 17.

Reported-by: Alexander Lakhin <exclusion@gmail.com>
Discussion: https://postgr.es/m/63a63690-dd92-c809-0b47-af05459e95d1%40gmail.com
2024-08-31 17:28:02 +12:00
Bruce Momjian
151da217a3 C comment: fix for commit b5a9b18cd0
The commit was "Provide API for streaming relation data.".

Reported-by: Nazir Bilal Yavuz

Discussion: https://postgr.es/m/CAN55FZ3KsZ2faZs1sK5J0W+_8B3myB232CfLYGie4u4BBMwP3g@mail.gmail.com

Backpatch-through: master
2024-08-16 21:12:18 -04:00
Noah Misch
a858be17c3 Add a way to create read stream object by using SMgrRelation.
Currently read stream object can be created only by using Relation.

Nazir Bilal Yavuz

Discussion: https://postgr.es/m/CAN55FZ0JKL6vk1xQp6rfOXiNFV1u1H0tJDPPGHWoiO3ea2Wc=A@mail.gmail.com
2024-07-20 04:22:12 -07:00
Noah Misch
af07a827b9 Refactor PinBufferForBlock() to remove checks about persistence.
There are checks in PinBufferForBlock() function to set persistence of
the relation.  This function is called for each block in the relation.
Instead, set persistence of the relation before PinBufferForBlock().

Nazir Bilal Yavuz

Discussion: https://postgr.es/m/CAN55FZ0JKL6vk1xQp6rfOXiNFV1u1H0tJDPPGHWoiO3ea2Wc=A@mail.gmail.com
2024-07-20 04:22:12 -07:00
David Rowley
2ea4b29277 Fix typos and incorrect type in read_stream.c
max_ios should be int rather than int16, otherwise there's not much
point in doing:

max_ios = Min(max_ios, PG_INT16_MAX);

Discussion: https://postgr.es/m/CAApHDvr9Un-XpDr_+AFdOGM38O2K8SpfoHimqZ838gguTGYBiQ@mail.gmail.com
2024-05-01 17:04:52 +12:00
Daniel Gustafsson
950d4a2cb1 Fix typos and duplicate words
This fixes various typos, duplicated words, and tiny bits of whitespace
mainly in code comments but also in docs.

Author: Daniel Gustafsson <daniel@yesql.se>
Author: Heikki Linnakangas <hlinnaka@iki.fi>
Author: Alexander Lakhin <exclusion@gmail.com>
Author: David Rowley <dgrowleyml@gmail.com>
Author: Nazir Bilal Yavuz <byavuz81@gmail.com>
Discussion: https://postgr.es/m/3F577953-A29E-4722-98AD-2DA9EFF2CBB8@yesql.se
2024-04-18 21:28:07 +02:00
Thomas Munro
158f581923 Fix if/while thinko in read_stream.c edge case.
When we determine that a wanted block can't be combined with the current
pending read, it's time to start that read to get it out of the way.  An
"if" in that code path should have been a "while", because it might take
more than one go in case of partial reads.  This was only broken for
smaller ranges, as the more common case of io_combine_limit-sized ranges
is handled earlier in the code and knows how to loop, hiding the bug for
a while.

Discovered while testing large parallel sequential scans of partially
cached tables.  The ramp-up-and-down block allocator for parallel scans
could hit the problem case and skip some blocks near the end that should
have been streamed.

Defect in commit b5a9b18c.

Discussion: https://postgr.es/m/CA%2BhUKG%2Bh8Whpv0YsJqjMVkjYX%2B80fTVc6oi-V%2BzxJvykLpLHYQ%40mail.gmail.com
2024-04-07 14:51:33 +12:00
Thomas Munro
3bd8439ed6 Allow BufferAccessStrategy to limit pin count.
While pinning extra buffers to look ahead, users of strategies are in
danger of using too many buffers.  For some strategies, that means
"escaping" from the ring, and in others it means forcing dirty data to
disk very frequently with associated WAL flushing.  Since external code
has no insight into any of that, allow individual strategy types to
expose a clamp that should be applied when deciding how many buffers to
pin at once.

Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Discussion: https://postgr.es/m/CAAKRu_aJXnqsyZt6HwFLnxYEBgE17oypkxbKbT1t1geE_wvH2Q%40mail.gmail.com
2024-04-06 23:11:45 +13:00
Thomas Munro
aa1e8c2064 Improve read_stream.c's fast path.
The "fast path" for well cached scans that don't do any I/O was
accidentally coded in a way that could only be triggered by pg_prewarm's
usage pattern, which starts out with a higher distance because of the
flags it passes in.  We want it to work for streaming sequential scans
too, once that patch is committed.  Adjust.

Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Discussion: https://postgr.es/m/CA%2BhUKGKXZALJ%3D6aArUsXRJzBm%3Dqvc4AWp7%3DiJNXJQqpbRLnD_w%40mail.gmail.com
2024-04-06 18:00:35 +13:00
Thomas Munro
b5a9b18cd0 Provide API for streaming relation data.
Introduce an abstraction allowing relation data to be accessed as a
stream of buffers, with an implementation that is more efficient than
the equivalent sequence of ReadBuffer() calls.

Client code supplies a callback that can say which block number it wants
next, and then consumes individual buffers one at a time from the
stream.  This division puts read_stream.c in control of how far ahead it
can see and allows it to read clusters of neighboring blocks with
StartReadBuffers().  It also issues POSIX_FADV_WILLNEED advice ahead of
time when random access is detected.

Other variants of I/O stream will be proposed in future work (for
example to support recovery, whose LsnReadQueue device in
xlogprefetcher.c is a distant cousin of this code and should eventually
be replaced by this), but this basic API is sufficient for many common
executor usage patterns involving predictable access to a single fork of
a single relation.

Several patches using this API are proposed separately.

This stream concept is loosely based on ideas from Andres Freund on how
we should pave the way for later work on asynchronous I/O.

Author: Thomas Munro <thomas.munro@gmail.com>
Author: Heikki Linnakangas <hlinnaka@iki.fi> (contributions)
Author: Melanie Plageman <melanieplageman@gmail.com> (contributions)
Suggested-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Tested-by: Tomas Vondra <tomas.vondra@enterprisedb.com>
Discussion: https://postgr.es/m/CA+hUKGJkOiOCa+mag4BF+zHo7qo=o9CFheB8=g6uT5TUm2gkvA@mail.gmail.com
2024-04-03 00:49:46 +13:00