postgres

mirror of https://github.com/postgres/postgres.git synced 2025-12-21 05:21:08 +03:00

Author	SHA1	Message	Date
Andres Freund	fdd146a8ef	aio: Add README.md explaining higher level design Reviewed-by: Noah Misch <noah@leadboat.com> Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi> Discussion: https://postgr.es/m/uvrtrknj4kdytuboidbhwclo4gxhswwcpgadptsjvjqcluzmah%40brqs62irg4dt Discussion: https://postgr.es/m/20210223100344.llw5an2aklengrmn@alap3.anarazel.de Discussion: https://postgr.es/m/stj36ea6yyhoxtqkhpieia2z4krnam7qyetc57rfezgk4zgapf@gcnactj4z56m	2025-04-01 16:06:48 -04:00
Andres Freund	60f566b4f2	aio: Add pg_aios view The new view lists all IO handles that are currently in use and is mainly useful for PG developers, but may also be useful when tuning PG. Bumps catversion. Reviewed-by: Noah Misch <noah@leadboat.com> Discussion: https://postgr.es/m/uvrtrknj4kdytuboidbhwclo4gxhswwcpgadptsjvjqcluzmah%40brqs62irg4dt	2025-04-01 13:30:33 -04:00
Andres Freund	ae3df4b341	read_stream: Introduce and use optional batchmode support Submitting IO in larger batches can be more efficient than doing so one-by-one, particularly for many small reads. It does, however, require the ReadStreamBlockNumberCB callback to abide by the restrictions of AIO batching (c.f. pgaio_enter_batchmode()). Basically, the callback may not: a) block without first calling pgaio_submit_staged(), unless a to-be-waited-on lock cannot be part of a deadlock, e.g. because it is never held while waiting for IO. b) directly or indirectly start another batch pgaio_enter_batchmode() As this requires care and is nontrivial in some cases, batching is only used with explicit opt-in. This patch adds an explicit flag (READ_STREAM_USE_BATCHING) to read_stream and uses it where appropriate. There are two cases where batching would likely be beneficial, but where we aren't using it yet: 1) bitmap heap scans, because the callback reads the VM This should soon be solved, because we are planning to remove the use of the VM, due to that not being sound. 2) The first phase of heap vacuum This could be made to support batchmode, but would require some care. Reviewed-by: Noah Misch <noah@leadboat.com> Reviewed-by: Thomas Munro <thomas.munro@gmail.com> Discussion: https://postgr.es/m/uvrtrknj4kdytuboidbhwclo4gxhswwcpgadptsjvjqcluzmah%40brqs62irg4dt	2025-03-30 18:36:41 -04:00
Andres Freund	f4d0730bbc	aio: Basic read_stream adjustments for real AIO Adapt the read stream logic for real AIO: - If AIO is enabled, we shouldn't issue advice, but if it isn't, we should continue issuing advice - AIO benefits from reading ahead with direct IO - If effective_io_concurrency=0, pass READ_BUFFERS_SYNCHRONOUSLY to StartReadBuffers() to ensure synchronous IO execution There are further improvements we should consider: - While in read_stream_look_ahead(), we can use AIO batch submission mode for increased efficiency. That however requires care to avoid deadlocks and thus done separately. - It can be beneficial to defer starting new IOs until we can issue multiple IOs at once. That however requires non-trivial heuristics to decide when to do so. Reviewed-by: Noah Misch <noah@leadboat.com> Co-authored-by: Andres Freund <andres@anarazel.de> Co-authored-by: Thomas Munro <thomas.munro@gmail.com>	2025-03-30 18:26:44 -04:00
Andres Freund	047cba7fa0	bufmgr: Implement AIO read support This commit implements the infrastructure to perform asynchronous reads into the buffer pool. To do so, it: - Adds readv AIO callbacks for shared and local buffers It may be worth calling out that shared buffer completions may be run in a different backend than where the IO started. - Adds an AIO wait reference to BufferDesc, to allow backends to wait for in-progress asynchronous IOs - Adapts StartBufferIO(), WaitIO(), TerminateBufferIO(), and their localbuf.c equivalents, to be able to deal with AIO - Moves the code to handle BM_PIN_COUNT_WAITER into a helper function, as it now also needs to be called on IO completion As of this commit, nothing issues AIO on shared/local buffers. A future commit will update StartReadBuffers() to do so. Buffer reads executed through this infrastructure will report invalid page / checksum errors / warnings differently than before: In the error case the error message will cover all the blocks that were included in the read, rather than just the reporting the first invalid block. If more than one block is invalid, the error will include information about the range of the read, the first invalid block and the number of invalid pages, with a HINT towards the server log for per-block details. For the warning case (i.e. zero_damaged_buffers) we would previously emit one warning message for each buffer in a multi-block read. Now there is only a single warning message for the entire read, again referring to the server log for more details in case of multiple checksum failures within a single larger read. Reviewed-by: Noah Misch <noah@leadboat.com> Reviewed-by: Melanie Plageman <melanieplageman@gmail.com> Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com> Discussion: https://postgr.es/m/uvrtrknj4kdytuboidbhwclo4gxhswwcpgadptsjvjqcluzmah%40brqs62irg4dt Discussion: https://postgr.es/m/20210223100344.llw5an2aklengrmn@alap3.anarazel.de Discussion: https://postgr.es/m/stj36ea6yyhoxtqkhpieia2z4krnam7qyetc57rfezgk4zgapf@gcnactj4z56m	2025-03-30 17:28:03 -04:00
Andres Freund	ef64fe26ba	aio: Add WARNING result status If an IO succeeds, but issues a warning, e.g. due to a page verification failure with zero_damaged_pages, we want to issue that warning in the context of the issuer of the IO, not the process that executes the completion (always the case for worker). It's already possible for a completion callback to report a custom error message, we just didn't have a result status that allowed a user of AIO to know that a warning should be emitted even though the IO request succeeded. All that's needed for that is a dedicated PGAIO_RS_ value. Previously there were not enough bits in PgAioResult.id for the new value. Increase. While at that, add defines for the amount of bits and static asserts to check that the widths are appropriate. Reviewed-by: Noah Misch <noah@leadboat.com> Discussion: https://postgr.es/m/20250329212929.a6.nmisch@google.com	2025-03-30 16:27:10 -04:00
Andres Freund	50cb7505b3	aio: Implement support for reads in smgr/md/fd This implements the following: 1) An smgr AIO target, for AIO on smgr files. This should be usable not just for md.c but also other SMGR implementation if we ever get them. 2) readv support in fd.c, which requires a small bit of infrastructure work in fd.c 3) smgr.c and md.c support for readv There still is nothing performing AIO, but as of this commit it would be possible. As part of this change FileGetRawDesc() actually ensures that the file is opened - previously it was basically not usable. It's used to reopen a file in IO workers. Reviewed-by: Noah Misch <noah@leadboat.com> Discussion: https://postgr.es/m/uvrtrknj4kdytuboidbhwclo4gxhswwcpgadptsjvjqcluzmah%40brqs62irg4dt Discussion: https://postgr.es/m/20210223100344.llw5an2aklengrmn@alap3.anarazel.de Discussion: https://postgr.es/m/stj36ea6yyhoxtqkhpieia2z4krnam7qyetc57rfezgk4zgapf@gcnactj4z56m	2025-03-29 13:38:35 -04:00
Andres Freund	c325a7633f	aio: Add io_method=io_uring Performing AIO using io_uring can be considerably faster than io_method=worker, particularly when lots of small IOs are issued, as a) the context-switch overhead for worker based AIO becomes more significant b) the number of IO workers can become limiting io_uring, however, is linux specific and requires an additional compile-time dependency (liburing). This implementation is fairly simple and there are substantial optimization opportunities. The description of the existing AIO_IO_COMPLETION wait event is updated to make the difference between it and the new AIO_IO_URING_EXECUTION clearer. Reviewed-by: Noah Misch <noah@leadboat.com> Reviewed-by: Jakub Wartak <jakub.wartak@enterprisedb.com> Discussion: https://postgr.es/m/uvrtrknj4kdytuboidbhwclo4gxhswwcpgadptsjvjqcluzmah%40brqs62irg4dt Discussion: https://postgr.es/m/20210223100344.llw5an2aklengrmn@alap3.anarazel.de Discussion: https://postgr.es/m/stj36ea6yyhoxtqkhpieia2z4krnam7qyetc57rfezgk4zgapf@gcnactj4z56m	2025-03-26 19:49:13 -04:00
Andres Freund	9469d7fdd2	aio: Rename pgaio_io_prep_* to pgaio_io_start_* The old naming pattern (mirroring liburing's naming) was inconsistent with the (not yet introduced) callers. It seems better to get rid of the inconsistency now than to grow more users of the odd naming. Reported-by: Noah Misch <noah@leadboat.com> Discussion: https://postgr.es/m/20250326001915.bc.nmisch@google.com	2025-03-26 16:10:29 -04:00
Andres Freund	f321ec237a	aio: Pass result of local callbacks to ->report_return Otherwise the results of e.g. temp table buffer verification errors will not reach bufmgr.c. Obviously that's not right. Found while expanding the tests for invalid buffer contents. Reviewed-by: Noah Misch <noah@leadboat.com> Discussion: https://postgr.es/m/20250326001915.bc.nmisch@google.com	2025-03-26 16:06:54 -04:00
Andres Freund	96da9050a5	aio: Be more paranoid about interrupts As reported by Noah, it's possible, although practically very unlikely, that interrupts could be processed in between pgaio_io_reopen() and pgaio_io_perform_synchronously(). Prevent that by explicitly holding interrupts. It also seems good to add an assertion to pgaio_io_before_prep() to ensure that interrupts are held, as otherwise FDs referenced by the IO could be closed during interrupt processing. All code in the aio series currently runs the code with interrupts held, but it seems better to be paranoid. Reviewed-by: Noah Misch <noah@leadboat.com> Reported-by: Noah Misch <noah@leadboat.com> Discussion: https://postgr.es/m/20250324002939.5c.nmisch@google.com	2025-03-26 16:06:54 -04:00
Andres Freund	ca3067cc57	aio: Change prefix of PgAioResultStatus values to PGAIO_RS_ The previous prefix wasn't consistent with the naming of other AIO related enum values. It seems best to rename it before the users are introduced. Reported-by: Melanie Plageman <melanieplageman@gmail.com> Discussion: https://postgr.es/m/CAAKRu_Yb+JzQpNsgUxCB0gBi+sE-mi_HmcJF6ALnmO4W+UgwpA@mail.gmail.com	2025-03-22 17:30:44 -04:00
Thomas Munro	e51ca405ed	Fix ps display for IO workers. This code must have missed a memo about the backend type description being supplied automatically these days, and was duplicating that information. Before: "io worker io worker: N" After: "io worker N"	2025-03-22 10:13:23 +13:00
Thomas Munro	ed0b87caac	Support buffer forwarding in read_stream.c. In preparation for a follow-up change to the buffer manager, teach read_stream.c to manage buffers "forwarded" from one StartReadBuffers() call to the next after a short read. This involves a small amount of extra book-keeping, and opens the way for lower levels to split I/O operations without having to drop pins, as required for efficient handling of various edge cases. Concretely, the "buffers" argument will change from an out parameter to an in/out parameter. Buffer queue elements must be initialized on first use and cleared after they're consumed, but forwarded buffers are left where they fall ahead of the current pending read in the queue, ready for use by the operation that continues where a short read left off. The stream also needs to count them for pin limit management and release them on reset/early end. Tested-by: Andres Freund <andres@anarazel.de> (earlier versions) Discussion: https://postgr.es/m/CA%2BhUKGK_%3D4CVmMHvsHjOVrK6t4F%3DLBpFzsrr3R%2BaJYN8kcTfWg%40mail.gmail.com	2025-03-21 18:44:47 +13:00
Thomas Munro	06fb5612c9	Increase io_combine_limit range to 1MB. The default of 128kB is unchanged, but the upper limit is changed from 32 blocks to 128 blocks, unless the operating system's IOV_MAX is too low. Some other RDBMSes seem to cap their multi-block buffer pool I/O around this number, and it seems useful to allow experimentation. The concrete change is to our definition of PG_IOV_MAX, which provides the maximum for io_combine_limit and io_max_combine_limit. It also affects a couple of other places that work with arrays of struct iovec or smaller objects on the stack, so we still don't want to use the system IOV_MAX directly without a clamp: it is not under our control and likely to be 1024. 128 seems acceptable for our current usage. For Windows, we can't use real scatter/gather yet, so we continue to define our own IOV_MAX value of 16 and emulate preadv()/pwritev() with loops. Someone would need to research the trade-offs of raising that number. NB if trying to see this working: you might temporarily need to hack BAS_BULKREAD to be bigger, since otherwise the obvious way of "a very big SELECT" is limited by that for now. Suggested-by: Tomas Vondra <tomas@vondra.me> Discussion: https://postgr.es/m/CA%2BhUKG%2B2T9p-%2BzM6Eeou-RAJjTML6eit1qn26f9twznX59qtCA%40mail.gmail.com	2025-03-19 15:40:35 +13:00
Thomas Munro	10f6646847	Introduce io_max_combine_limit. The existing io_combine_limit can be changed by users. The new io_max_combine_limit is fixed at server startup time, and functions as a silent clamp on the user setting. That in itself is probably quite useful, but the primary motivation is: aio_init.c allocates shared memory for all asynchronous IOs including some per-block data, and we didn't want to waste memory you'd never used by assuming they could be up to PG_IOV_MAX. This commit already halves the size of 'AioHandleIov' and 'AioHandleData'. A follow-up commit can now expand PG_IOV_MAX without affecting that. Since our GUC system doesn't support dependencies or cross-checks between GUCs, the user-settable one now assigns a "raw" value to io_combine_limit_guc, and the lower of io_combine_limit_guc and io_max_combine_limit is maintained in io_combine_limit. Reviewed-by: Andres Freund <andres@anarazel.de> (earlier version) Discussion: https://postgr.es/m/CA%2BhUKG%2B2T9p-%2BzM6Eeou-RAJjTML6eit1qn26f9twznX59qtCA%40mail.gmail.com	2025-03-19 15:23:54 +13:00
Andres Freund	247ce06b88	aio: Add io_method=worker The previous commit introduced the infrastructure to start io_workers. This commit actually makes the workers execute IOs. IO workers consume IOs from a shared memory submission queue, run traditional synchronous system calls, and perform the shared completion handling immediately. Client code submits most requests by pushing IOs into the submission queue, and waits (if necessary) using condition variables. Some IOs cannot be performed in another process due to lack of infrastructure for reopening the file, and must processed synchronously by the client code when submitted. For now the default io_method is changed to "worker". We should re-evaluate that around beta1, we might want to be careful and set the default to "sync" for 18. Reviewed-by: Noah Misch <noah@leadboat.com> Co-authored-by: Thomas Munro <thomas.munro@gmail.com> Co-authored-by: Andres Freund <andres@anarazel.de> Discussion: https://postgr.es/m/uvrtrknj4kdytuboidbhwclo4gxhswwcpgadptsjvjqcluzmah%40brqs62irg4dt Discussion: https://postgr.es/m/20210223100344.llw5an2aklengrmn@alap3.anarazel.de Discussion: https://postgr.es/m/stj36ea6yyhoxtqkhpieia2z4krnam7qyetc57rfezgk4zgapf@gcnactj4z56m	2025-03-18 11:54:01 -04:00
Andres Freund	55b454d0e1	aio: Infrastructure for io_method=worker This commit contains the basic, system-wide, infrastructure for io_method=worker. It does not yet actually execute IO, this commit just provides the infrastructure for running IO workers, kept separate for easier review. The number of IO workers can be adjusted with a PGC_SIGHUP GUC. Eventually we'd like to make the number of workers dynamically scale up/down based on the current "IO load". To allow the number of IO workers to be increased without a restart, we need to reserve PGPROC entries for the workers unconditionally. This has been judged to be worth the cost. If it turns out to be problematic, we can introduce a PGC_POSTMASTER GUC to control the maximum number. As io workers might be needed during shutdown, e.g. for AIO during the shutdown checkpoint, a new PMState phase is added. IO workers are shut down after the shutdown checkpoint has been performed and walsender/archiver have shut down, but before the checkpointer itself shuts down. See also `87a6690cc6`. Updates PGSTAT_FILE_FORMAT_ID due to the addition of a new BackendType. Reviewed-by: Noah Misch <noah@leadboat.com> Co-authored-by: Thomas Munro <thomas.munro@gmail.com> Co-authored-by: Andres Freund <andres@anarazel.de> Discussion: https://postgr.es/m/uvrtrknj4kdytuboidbhwclo4gxhswwcpgadptsjvjqcluzmah%40brqs62irg4dt Discussion: https://postgr.es/m/20210223100344.llw5an2aklengrmn@alap3.anarazel.de Discussion: https://postgr.es/m/stj36ea6yyhoxtqkhpieia2z4krnam7qyetc57rfezgk4zgapf@gcnactj4z56m	2025-03-18 11:54:01 -04:00
Andres Freund	da7226993f	aio: Add core asynchronous I/O infrastructure The main motivations to use AIO in PostgreSQL are: a) Reduce the time spent waiting for IO by issuing IO sufficiently early. In a few places we have approximated this using posix_fadvise() based prefetching, but that is fairly limited (no completion feedback, double the syscalls, only works with buffered IO, only works on some OSs). b) Allow to use Direct-I/O (DIO). DIO can offload most of the work for IO to hardware and thus increase throughput / decrease CPU utilization, as well as reduce latency. While we have gained the ability to configure DIO in `d4e71df6`, it is not yet usable for real world workloads, as every IO is executed synchronously. For portability, the new AIO infrastructure allows to implement AIO using different methods. The choice of the AIO method is controlled by the new io_method GUC. As of this commit, the only implemented method is "sync", i.e. AIO is not actually executed asynchronously. The "sync" method exists to allow to bypass most of the new code initially. Subsequent commits will introduce additional IO methods, including a cross-platform method implemented using worker processes and a linux specific method using io_uring. To allow different parts of postgres to use AIO, the core AIO infrastructure does not need to know what kind of files it is operating on. The necessary behavioral differences for different files are abstracted as "AIO Targets". One example target would be smgr. For boring portability reasons, all targets currently need to be added to an array in aio_target.c. This commit does not implement any AIO targets, just the infrastructure for them. The smgr target will be added in a later commit. Completion (and other events) of IOs for one type of file (i.e. one AIO target) need to be reacted to differently, based on the IO operation and the callsite. This is made possible by callbacks that can be registered on IOs. E.g. an smgr read into a local buffer does not need to update the corresponding BufferDesc (as there is none), but a read into shared buffers does. This commit does not contain any callbacks, they will be added in subsequent commits. For now the AIO infrastructure only understands READV and WRITEV operations, but it is expected that more operations will be added. E.g. fsync/fdatasync, flush_range and network operations like send/recv. As of this commit, nothing uses the AIO infrastructure. Later commits will add an smgr target, md.c and bufmgr.c callbacks and then finally use AIO for read_stream.c IO, which, in one fell swoop, will convert all read stream users to AIO. The goal is to use AIO in many more places. There are patches to use AIO for checkpointer and bgwriter that are reasonably close to being ready. There also are prototypes to use it for WAL, relation extension, backend writes and many more. Those prototypes were important to ensure the design of the AIO subsystem is not too limiting (e.g. WAL writes need to happen in critical sections, which influenced a lot of the design). A future commit will add an AIO README explaining the AIO architecture and how to use the AIO subsystem. The README is added later, as it references details only added in later commits. Many many more people than the folks named below have contributed with feedback, work on semi-independent patches etc. E.g. various folks have contributed patches to use the read stream infrastructure (added by Thomas in `b5a9b18cd0`) in more places. Similarly, a lot of folks have contributed to the CI infrastructure, which I had started to work on to make adding AIO feasible. Some of the work by contributors has gone into the "v1" prototype of AIO, which heavily influenced the current design of the AIO subsystem. None of the code from that directly survives, but without the prototype, the current version of the AIO infrastructure would not exist. Similarly, the reviewers below have not necessarily looked at the current design or the whole infrastructure, but have provided very valuable input. I am to blame for problems, not they. Author: Andres Freund <andres@anarazel.de> Co-authored-by: Thomas Munro <thomas.munro@gmail.com> Co-authored-by: Nazir Bilal Yavuz <byavuz81@gmail.com> Co-authored-by: Melanie Plageman <melanieplageman@gmail.com> Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi> Reviewed-by: Noah Misch <noah@leadboat.com> Reviewed-by: Jakub Wartak <jakub.wartak@enterprisedb.com> Reviewed-by: Melanie Plageman <melanieplageman@gmail.com> Reviewed-by: Robert Haas <robertmhaas@gmail.com> Reviewed-by: Dmitry Dolgov <9erthalion6@gmail.com> Reviewed-by: Antonin Houska <ah@cybertec.at> Discussion: https://postgr.es/m/uvrtrknj4kdytuboidbhwclo4gxhswwcpgadptsjvjqcluzmah%40brqs62irg4dt Discussion: https://postgr.es/m/20210223100344.llw5an2aklengrmn@alap3.anarazel.de Discussion: https://postgr.es/m/stj36ea6yyhoxtqkhpieia2z4krnam7qyetc57rfezgk4zgapf@gcnactj4z56m	2025-03-17 18:51:33 -04:00
Andres Freund	02844012b3	aio: Basic subsystem initialization This commit just does the minimal wiring up of the AIO subsystem, added in the next commit, to the rest of the system. The next commit contains more details about motivation and architecture. This commit is kept separate to make it easier to review, separating the changes across the tree, from the implementation of the new subsystem. We discussed squashing this commit with the main commit before merging AIO, but there has been a mild preference for keeping it separate. Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi> Reviewed-by: Noah Misch <noah@leadboat.com> Discussion: https://postgr.es/m/uvrtrknj4kdytuboidbhwclo4gxhswwcpgadptsjvjqcluzmah%40brqs62irg4dt	2025-03-17 18:51:33 -04:00
Thomas Munro	799959dc7c	Simplify distance heuristics in read_stream.c. Make the distance control heuristics simpler and more aggressive in preparation for asynchronous I/O. The v17 version of read_stream.c made a conservative choice to limit the look-ahead distance when streaming sequential blocks, because it couldn't benefit very much from looking ahead further yet. It had a three-behavior model where only random I/O would rapidly increase the look-ahead distance, to support read-ahead advice. Sequential I/O would move it towards the io_combine_limit setting, just enough to build one full-sized synchronous I/O at a time, and then expect kernel read-ahead to avoid I/O stalls. That already left I/O performance on the table with advice-based I/O concurrency, since sequential blocks could be followed by random jumps, eg with the proposed streaming Bitmap Heap Scan patch. It is time to delete the cautious middle option and adjust the distance based on recent I/O needs only, since asynchronous reads will need to be started ahead of time whether random or sequential. It is still limited by io_combine_limit, *_io_concurrency, buffer availability and strategy ring size, as before. Reviewed-by: Andres Freund <andres@anarazel.de> (earlier version) Tested-by: Melanie Plageman <melanieplageman@gmail.com> Discussion: https://postgr.es/m/CA%2BhUKGK_%3D4CVmMHvsHjOVrK6t4F%3DLBpFzsrr3R%2BaJYN8kcTfWg%40mail.gmail.com	2025-03-16 03:05:07 +13:00
Thomas Munro	7ea8cd1566	Improve read_stream.c advice for dense streams. read_stream.c tries not to issue read-ahead advice when it thinks the kernel's own read-ahead should be active, ie when using buffered I/O and reading sequential blocks. It previously gave up too easily, and issued advice only for the first read of up to io_combine_limit blocks in a larger range of sequential blocks after random jump. The following read could suffer an avoidable I/O stall. Fix, by continuing to issue advice until the corresponding preadv() calls catch up with the start of the region we're currently issuing advice for, if ever. That's when the kernel actually sees the sequential pattern. Advice is now disabled only when the stream is entirely sequential as far as we can see in the look-ahead window, or in other words, when a sequential region is larger than we can cover with the current io_concurrency and io_combine_limit settings. While refactoring the advice control logic, also get rid of the "suppress_advice" argument that was passed around between functions to skip useless posix_fadvise() calls immediately followed by preadv(). read_stream_start_pending_read() can figure that out, so let's concentrate knowledge of advice heuristics in fewer places (our goal being to make advice-based I/O concurrency a legacy mode soon). The problem cases were revealed by Tomas Vondra's extensive regression testing with many different disk access patterns using Melanie Plageman's streaming Bitmap Heap Scan patch, in a battle against the venerable always-issue-advice-and-always-one-block-at-a-time code. Reviewed-by: Andres Freund <andres@anarazel.de> (earlier version) Reported-by: Melanie Plageman <melanieplageman@gmail.com> Reported-by: Tomas Vondra <tomas@vondra.me> Reported-by: Andres Freund <andres@anarazel.de> Tested-by: Melanie Plageman <melanieplageman@gmail.com> Discussion: https://postgr.es/m/CA%2BhUKGK_%3D4CVmMHvsHjOVrK6t4F%3DLBpFzsrr3R%2BaJYN8kcTfWg%40mail.gmail.com Discussion: https://postgr.es/m/CA%2BhUKGJ3HSWciQCz8ekP1Zn7N213RfA4nbuotQawfpq23%2Bw-5Q%40mail.gmail.com	2025-03-15 19:04:54 +13:00
Thomas Munro	92fc6856cb	Respect changing pin limits in read_stream.c. To avoid pinning too much of the buffer pool at once, read_stream.c previously used LimitAdditionalPins(). The coding was naive, and only considered the available buffers at stream construction time. This commit checks before each StartReadBuffers() call with GetAdditionalPinLimit(). The result might change over time due to pins acquired outside this stream by the same backend. No extra CPU cycles are added to the all-buffered fast-path code, but the I/O-starting path now considers the up-to-date remaining buffer limit. In practice it was quite difficult to exceed limits and cause any real problems in v17, so no back-patch for now, but proposed changes will make it easier. Per code review from Andres, in the course of testing his AIO patches. Reviewed-by: Andres Freund <andres@anarazel.de> (earlier versions) Discussion: https://postgr.es/m/CA%2BhUKGK_%3D4CVmMHvsHjOVrK6t4F%3DLBpFzsrr3R%2BaJYN8kcTfWg%40mail.gmail.com	2025-03-14 21:21:09 +13:00
Thomas Munro	75da2bece6	Fix read_stream.c for changing io_combine_limit. In a couple of places, read_stream.c assumed that io_combine_limit would be stable during the lifetime of a stream. That is not true in at least one unusual case: streams held by CURSORs where you could change the GUC between FETCH commands, with unpredictable results. Fix, by storing stream->io_combine_limit and referring only to that after construction. This mirrors the treatment of the other important setting {effective,maintenance}_io_concurrency, which is stored in stream->max_ios. One of the cases was the queue overflow space, which was sized for io_combine_limit and could be overrun if the GUC was increased. Since that coding was a little hard to follow, also introduce a variable for better readability instead of open-coding the arithmetic. Doing so revealed an off-by-one thinko while clamping max_pinned_buffers to INT16_MAX, though that wasn't a live bug due to the current limits on GUC values. Back-patch to 17. Discussion: https://postgr.es/m/CA%2BhUKG%2B2T9p-%2BzM6Eeou-RAJjTML6eit1qn26f9twznX59qtCA%40mail.gmail.com	2025-03-13 15:43:34 +13:00
Thomas Munro	55918f798b	Remove arbitrary cap on read_stream.c buffer queue. Previously the internal queue of buffers was capped at max_ios * 4, though not less than io_combine_limit, at allocation time. That was done in the first version based on conservative theories about resource usage and heuristics pending later work. The configured I/O depth could not always be reached with dense random streams generated by ANALYZE, VACUUM, the proposed Bitmap Heap Scan patch, and also sequential streams with the proposed AIO subsystem to name some examples. The new formula is (max_ios + 1) * io_combine_limit, enough buffers for the full configured I/O concurrency level using the full configured I/O combine size, plus the buffers from one finished but not yet consumed full-sized I/O. Significantly more memory would be needed for high GUC values if the client code requests a large per-buffer data size, but that is discouraged (existing and proposed stream users try to keep it under a few words, if not zero). With this new formula, an intermediate variable could have overflowed under maximum GUC values, so its data type is adjusted to cope. Discussion: https://postgr.es/m/CA%2BhUKGK_%3D4CVmMHvsHjOVrK6t4F%3DLBpFzsrr3R%2BaJYN8kcTfWg%40mail.gmail.com	2025-02-27 20:49:48 +13:00
Thomas Munro	2509b857cc	Fix typo in `2a8a0067`. Builds configured with Valgrind but without assertions would fail due to a typo in the recent change. This should be included when back-patching `2a8a0067` into v17.	2025-02-18 14:44:59 +13:00
Thomas Munro	2a8a00674e	Fix explicit valgrind interaction in read_stream.c. By calling wipe_mem() on per-buffer data memory that has been released, we are also telling Valgrind that the memory is "noaccess". We need to set it to "undefined" before giving it to the registered callback to fill in, when a slot is reused. As discovered by build farm animal skink when the VACUUM streamification patches landed (the first users of per-buffer data). Pushing to master only for now, to clear the error on skink. It's also possible that external code might discover the per-buffer data feature in v17, and reasonable to expect Valgrind not to produce spurious memcheck reports, but the back-patch is deferred until after the imminent minor release is out of the way. Reviewed-by: Melanie Plageman <melanieplageman@gmail.com> Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Tested-by: Melanie Plageman <melanieplageman@gmail.com> Discussion: https://postgr.es/m/CA%2BhUKG%2Bg6aXpi2FEHqeLOzE%2BxYw%3DOV%2B-N5jhOEnnV%2BF0USM9xA%40mail.gmail.com	2025-02-15 13:14:03 +13:00
Bruce Momjian	50e6eb731d	Update copyright for 2025 Backpatch-through: 13	2025-01-01 11:21:55 -05:00
Peter Eisentraut	e18512c000	Remove unused #include's from backend .c files as determined by IWYU These are mostly issues that are new since commit `dbbca2cf29`. Discussion: https://www.postgresql.org/message-id/flat/0df1d5b1-8ca8-4f84-93be-121081bde049%40eisentraut.org	2024-10-27 08:26:50 +01:00
Thomas Munro	70d38e3d8a	Allow ReadStream to be consumed as raw block numbers. Commits `041b9680` and `6377e12a` changed the interface of scan_analyze_next_block() to take a ReadStream instead of a BlockNumber and a BufferAccessStrategy, and to return a value to indicate when the stream has run out of blocks. This caused integration problems for at least one known extension that uses specially encoded BlockNumber values that map to different underlying storage, because acquire_sample_rows() sets up the stream so that read_stream_next_buffer() reads blocks from the main fork of the relation's SMgrRelation. Provide read_stream_next_block(), as a way for such an extension to access the stream of raw BlockNumbers directly and forward them to its own ReadBuffer() calls after decoding, as it could in earlier releases. The new function returns the BlockNumber and BufferAccessStrategy that were previously passed directly to scan_analyze_next_block(). Alternatively, an extension could wrap the stream of BlockNumbers in another ReadStream with a callback that performs any decoding required to arrive at real storage manager BlockNumber values, so that it could benefit from the I/O combining and concurrency provided by read_stream.c. Another class of table access method that does nothing in scan_analyze_next_block() because it is not block-oriented could use this function to control the number of block sampling loops. It could match the previous behavior with "return read_stream_next_block(stream, &bas) != InvalidBlockNumber". Ongoing work is expected to provide better ANALYZE support for table access methods that don't behave like heapam with respect to storage blocks, but that will be for future releases. Back-patch to 17. Reported-by: Mats Kindahl <mats@timescale.com> Reviewed-by: Mats Kindahl <mats@timescale.com> Discussion: https://postgr.es/m/CA%2B14425%2BCcm07ocG97Fp%2BFrD9xUXqmBKFvecp0p%2BgV2YYR258Q%40mail.gmail.com	2024-09-18 11:34:28 +12:00
Thomas Munro	813fde73d4	Standardize "read-ahead advice" terminology. Commit `6654bb920` added macOS's equivalent of POSIX_FADV_WILLNEED, and changed some explicit references to posix_fadvise to use this more general name for the concept. Update some remaining references. Reviewed-by: Peter Eisentraut <peter@eisentraut.org> Discussion: https://postgr.es/m/0827edec-1317-4917-a186-035eb1e3241d%40eisentraut.org	2024-09-04 10:28:53 +12:00
Noah Misch	c582b75851	Add block_range_read_stream_cb(), to deduplicate code. This replaces two functions for iterating over all blocks in a range. A pending patch will use this instead of adding a third. Nazir Bilal Yavuz Discussion: https://postgr.es/m/20240820184742.f2.nmisch@google.com	2024-09-03 10:46:20 -07:00
Michael Paquier	4236825197	Fix typos and grammar in code comments and docs Author: Alexander Lakhin Discussion: https://postgr.es/m/f7e514cf-2446-21f1-a5d2-8c089a6e2168@gmail.com	2024-09-03 14:49:04 +09:00
Thomas Munro	4effd0844d	Fix unfairness in all-cached parallel seq scan. Commit `b5a9b18c` introduced block streaming infrastructure with a special fast path for all-cached scans, and commit `b7b0f3f2` connected the infrastructure up to sequential scans. One of the fast path micro-optimizations had an unintended consequence: it interfered with parallel sequential scan's block range allocator (from commit `56788d21`), which has its own ramp-up and ramp-down algorithm when handing out groups of pages to workers. A scan of an all-cached table could give extra blocks to one worker, when others had finished. In some plans (probably already very bad plans, such as the one reported by Alexander), the unfairness could be magnified. An internal buffer of 16 block numbers is removed, keeping just a single block buffer for technical reasons. Back-patch to 17. Reported-by: Alexander Lakhin <exclusion@gmail.com> Discussion: https://postgr.es/m/63a63690-dd92-c809-0b47-af05459e95d1%40gmail.com	2024-08-31 17:28:02 +12:00
Bruce Momjian	151da217a3	C comment: fix for commit `b5a9b18cd0` The commit was "Provide API for streaming relation data.". Reported-by: Nazir Bilal Yavuz Discussion: https://postgr.es/m/CAN55FZ3KsZ2faZs1sK5J0W+_8B3myB232CfLYGie4u4BBMwP3g@mail.gmail.com Backpatch-through: master	2024-08-16 21:12:18 -04:00
Noah Misch	a858be17c3	Add a way to create read stream object by using SMgrRelation. Currently read stream object can be created only by using Relation. Nazir Bilal Yavuz Discussion: https://postgr.es/m/CAN55FZ0JKL6vk1xQp6rfOXiNFV1u1H0tJDPPGHWoiO3ea2Wc=A@mail.gmail.com	2024-07-20 04:22:12 -07:00
Noah Misch	af07a827b9	Refactor PinBufferForBlock() to remove checks about persistence. There are checks in PinBufferForBlock() function to set persistence of the relation. This function is called for each block in the relation. Instead, set persistence of the relation before PinBufferForBlock(). Nazir Bilal Yavuz Discussion: https://postgr.es/m/CAN55FZ0JKL6vk1xQp6rfOXiNFV1u1H0tJDPPGHWoiO3ea2Wc=A@mail.gmail.com	2024-07-20 04:22:12 -07:00
David Rowley	2ea4b29277	Fix typos and incorrect type in read_stream.c max_ios should be int rather than int16, otherwise there's not much point in doing: max_ios = Min(max_ios, PG_INT16_MAX); Discussion: https://postgr.es/m/CAApHDvr9Un-XpDr_+AFdOGM38O2K8SpfoHimqZ838gguTGYBiQ@mail.gmail.com	2024-05-01 17:04:52 +12:00
Daniel Gustafsson	950d4a2cb1	Fix typos and duplicate words This fixes various typos, duplicated words, and tiny bits of whitespace mainly in code comments but also in docs. Author: Daniel Gustafsson <daniel@yesql.se> Author: Heikki Linnakangas <hlinnaka@iki.fi> Author: Alexander Lakhin <exclusion@gmail.com> Author: David Rowley <dgrowleyml@gmail.com> Author: Nazir Bilal Yavuz <byavuz81@gmail.com> Discussion: https://postgr.es/m/3F577953-A29E-4722-98AD-2DA9EFF2CBB8@yesql.se	2024-04-18 21:28:07 +02:00
Thomas Munro	158f581923	Fix if/while thinko in read_stream.c edge case. When we determine that a wanted block can't be combined with the current pending read, it's time to start that read to get it out of the way. An "if" in that code path should have been a "while", because it might take more than one go in case of partial reads. This was only broken for smaller ranges, as the more common case of io_combine_limit-sized ranges is handled earlier in the code and knows how to loop, hiding the bug for a while. Discovered while testing large parallel sequential scans of partially cached tables. The ramp-up-and-down block allocator for parallel scans could hit the problem case and skip some blocks near the end that should have been streamed. Defect in commit `b5a9b18c`. Discussion: https://postgr.es/m/CA%2BhUKG%2Bh8Whpv0YsJqjMVkjYX%2B80fTVc6oi-V%2BzxJvykLpLHYQ%40mail.gmail.com	2024-04-07 14:51:33 +12:00
Thomas Munro	3bd8439ed6	Allow BufferAccessStrategy to limit pin count. While pinning extra buffers to look ahead, users of strategies are in danger of using too many buffers. For some strategies, that means "escaping" from the ring, and in others it means forcing dirty data to disk very frequently with associated WAL flushing. Since external code has no insight into any of that, allow individual strategy types to expose a clamp that should be applied when deciding how many buffers to pin at once. Reviewed-by: Andres Freund <andres@anarazel.de> Reviewed-by: Melanie Plageman <melanieplageman@gmail.com> Discussion: https://postgr.es/m/CAAKRu_aJXnqsyZt6HwFLnxYEBgE17oypkxbKbT1t1geE_wvH2Q%40mail.gmail.com	2024-04-06 23:11:45 +13:00
Thomas Munro	aa1e8c2064	Improve read_stream.c's fast path. The "fast path" for well cached scans that don't do any I/O was accidentally coded in a way that could only be triggered by pg_prewarm's usage pattern, which starts out with a higher distance because of the flags it passes in. We want it to work for streaming sequential scans too, once that patch is committed. Adjust. Reviewed-by: Melanie Plageman <melanieplageman@gmail.com> Discussion: https://postgr.es/m/CA%2BhUKGKXZALJ%3D6aArUsXRJzBm%3Dqvc4AWp7%3DiJNXJQqpbRLnD_w%40mail.gmail.com	2024-04-06 18:00:35 +13:00
Thomas Munro	b5a9b18cd0	Provide API for streaming relation data. Introduce an abstraction allowing relation data to be accessed as a stream of buffers, with an implementation that is more efficient than the equivalent sequence of ReadBuffer() calls. Client code supplies a callback that can say which block number it wants next, and then consumes individual buffers one at a time from the stream. This division puts read_stream.c in control of how far ahead it can see and allows it to read clusters of neighboring blocks with StartReadBuffers(). It also issues POSIX_FADV_WILLNEED advice ahead of time when random access is detected. Other variants of I/O stream will be proposed in future work (for example to support recovery, whose LsnReadQueue device in xlogprefetcher.c is a distant cousin of this code and should eventually be replaced by this), but this basic API is sufficient for many common executor usage patterns involving predictable access to a single fork of a single relation. Several patches using this API are proposed separately. This stream concept is loosely based on ideas from Andres Freund on how we should pave the way for later work on asynchronous I/O. Author: Thomas Munro <thomas.munro@gmail.com> Author: Heikki Linnakangas <hlinnaka@iki.fi> (contributions) Author: Melanie Plageman <melanieplageman@gmail.com> (contributions) Suggested-by: Andres Freund <andres@anarazel.de> Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi> Reviewed-by: Melanie Plageman <melanieplageman@gmail.com> Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com> Reviewed-by: Andres Freund <andres@anarazel.de> Tested-by: Tomas Vondra <tomas.vondra@enterprisedb.com> Discussion: https://postgr.es/m/CA+hUKGJkOiOCa+mag4BF+zHo7qo=o9CFheB8=g6uT5TUm2gkvA@mail.gmail.com	2024-04-03 00:49:46 +13:00

43 Commits