postgres

mirror of https://github.com/postgres/postgres.git synced 2025-12-21 05:21:08 +03:00

Author	SHA1	Message	Date
Amit Kapila	105b2cb336	Stabilize 035_standby_logical_decoding.pl. Some tests try to invalidate logical slots on the standby server by running VACUUM on the primary. The problem is that xl_running_xacts was getting generated and replayed before the VACUUM command, leading to the advancement of the active slot's catalog_xmin. Due to this, active slots were not getting invalidated, leading to test failures. We fix it by skipping the generation of xl_running_xacts for the required tests with the help of injection points. As the required interface for injection points was not present in back branches, we fixed the failing tests in them by disallowing the slot to become active for the required cases (where rows_removed conflict could be generated). Author: Hayato Kuroda <kuroda.hayato@fujitsu.com> Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com> Reviewed-by: Amit Kapila <amit.kapila16@gmail.com> Backpatch-through: 16, where it was introduced Discussion: https://postgr.es/m/Z6oQXc8LmiTLfwLA@ip-10-97-1-34.eu-west-3.compute.internal	2025-04-08 09:38:02 +05:30
Tomas Vondra	8cc139bec3	Introduce pg_shmem_allocations_numa view Introduce new pg_shmem_alloctions_numa view with information about how shared memory is distributed across NUMA nodes. For each shared memory segment, the view returns one row for each NUMA node backing it, with the total amount of memory allocated from that node. The view may be relatively expensive, especially when executed for the first time in a backend, as it has to touch all memory pages to get reliable information about the NUMA node. This may also force allocation of the shared memory. Unlike pg_shmem_allocations, the view does not show anonymous shared memory allocations. It also does not show memory allocated using the dynamic shared memory infrastructure. Author: Jakub Wartak <jakub.wartak@enterprisedb.com> Reviewed-by: Andres Freund <andres@anarazel.de> Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com> Reviewed-by: Tomas Vondra <tomas@vondra.me> Discussion: https://postgr.es/m/CAKZiRmxh6KWo0aqRqvmcoaX2jUxZYb4kGp3N%3Dq1w%2BDiH-696Xw%40mail.gmail.com	2025-04-07 23:08:17 +02:00
Andres Freund	8e293e689b	aio: Make AIO more compatible with valgrind In some edge cases valgrind flags issues with the memory referenced by IOs. All of the cases addressed in this change are false positives. Most of the false positives are caused by UnpinBuffer[NoOwner] marking buffer data as inaccessible. This happens even though the AIO subsystem still holds a pin. That's good, there shouldn't be accesses to the buffer outside of AIO related code until it is pinned by "user" code again. But it requires some explicit work - if the buffer is not pinned by the current backend, we need to explicitly mark the buffer data accessible/inaccessible while executing completion callbacks. That however causes a cascading issue in IO workers: After the completion callbacks for a buffer is executed, the page is marked as inaccessible. If subsequently the same worker is executing IO targeting the same buffer, we would get an error, as the memory is still marked inaccessible. To avoid that, we need to explicitly mark the memory as accessible in IO workers. Another issue is that IO executed in workers or via io_uring will not mark memory as DEFINED. In the case of workers that is because valgrind does not track memory definedness across processes. For io_uring that is because valgrind does not understand io_uring, and therefore its IOs never mark memory as defined, whether the completions are processed in the defining process or in another context. It's not entirely clear how to best solve that. The current user of AIO is not affected, as it explicitly marks buffers as DEFINED & NOACCESS anyway. Defer solving this issue until we have a user with different needs. Per buildfarm animal skink. Reviewed-by: Noah Misch <noah@leadboat.com> Co-authored-by: Noah Misch <noah@leadboat.com> Discussion: https://postgr.es/m/3pd4322mogfmdd5nln3zphdwhtmq3rzdldqjwb2sfqzcgs22lf@ok2gletdaoe6	2025-04-07 15:20:30 -04:00
Andres Freund	8ab4241b9f	localbuf: Add Valgrind buffer access instrumentation This mirrors `1e0dfd166b` (+ `46ef520b95`), for temporary table buffers. This is mainly interesting right now because the AIO work currently triggers spurious valgrind errors, and the fix for that is cleaner if temp buffers behave the same as shared buffers. This requires one change beyond the annotations themselves, namely to pin local buffers while writing them out in FlushRelationBuffers(). Reviewed-by: Noah Misch <noah@leadboat.com> Co-authored-by: Noah Misch <noah@leadboat.com> Discussion: https://postgr.es/m/3pd4322mogfmdd5nln3zphdwhtmq3rzdldqjwb2sfqzcgs22lf@ok2gletdaoe6	2025-04-07 15:20:30 -04:00
Andres Freund	8ce79483dc	read_stream: Fix overflow hazard with large shared buffers If the limit returned by GetAdditionalPinLimit() is large, the buffer_limit variable in read_stream_start_pending_read() can overflow. While the code is careful to limit buffer_limit PG_INT16_MAX, we subsequently add the number of forwarded buffers. The overflow can lead to assertion failures, crashes or wrong query results when using large shared buffers. It seems easier to avoid this if we make the buffer_limit variable an int, instead of an int16. Do so, and clamp buffer_limit after adding the number of forwarded buffers. It's possible we might want to address this and related issues more widely by changing to int instead of int16 more widely, but since the consequences of this bug can be confusing, it seems better to fix it now. This bug was introduced in `ed0b87caac`. Discussion: https://postgr.es/m/ewvz3cbtlhrwqk7h6ca6cctiqh7r64ol3pzb3iyjycn2r5nxk5@tnhw3a5zatlr	2025-04-07 09:45:00 -04:00
Andres Freund	57dec20fd4	aio: Avoid spurious coverity warning PgAioResult.result is never accessed in the relevant path, but coverity complains about an uninitialized access anyway. So just zero-initialize the whole thing. While at it, reduce the scope of the variable. Reported-by: Ranier Vilela <ranier.vf@gmail.com> Reviewed-by: Noah Misch <noah@leadboat.com> Discussion: https://postgr.es/m/CAEudQApsKqd-s+fsUQ0OmxJAMHmBSXxrAz3dCs+uvqb3iRtjSw@mail.gmail.com	2025-04-06 12:07:02 -04:00
Tomas Vondra	1aff1dc8df	Revert "Improve accounting for memory used by shared hash tables" This reverts commit `f5930f9a98`. This broke the expansion of private hash tables, which reallocates the directory. But that's impossible when it's allocated together with the other fields, and dir_realloc() failed with BogusFree. Clearly, this needs rethinking. Discussion: https://postgr.es/m/CAApHDvriCiNkm=v521AP6PKPfyWkJ++jqZ9eqX4cXnhxLv8w-A@mail.gmail.com	2025-04-04 04:43:50 +02:00
Tomas Vondra	46df9487d9	Improve accounting for PredXactList, RWConflictPool and PGPROC Various places allocated shared memory by first allocating a small chunk using ShmemInitStruct(), followed by ShmemAlloc() calls to allocate more memory. Unfortunately, ShmemAlloc() does not update ShmemIndex, so this affected pg_shmem_allocations - it only shown the initial chunk. This commit modifies the following allocations, to allocate everything as a single chunk, and then split it internally. - PredXactList - RWConflictPool - PGPROC structures - Fast-Path Lock Array The fast-path lock array is allocated separately, not as a part of the PGPROC structures allocation. Author: Rahila Syed <rahilasyed90@gmail.com> Reviewed-by: Andres Freund <andres@anarazel.de> Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com> Reviewed-by: Tomas Vondra <tomas@vondra.me> Discussion: https://postgr.es/m/CAH2L28vHzRankszhqz7deXURxKncxfirnuW68zD7+hVAqaS5GQ@mail.gmail.com	2025-04-02 17:14:28 +02:00
Tomas Vondra	f5930f9a98	Improve accounting for memory used by shared hash tables pg_shmem_allocations tracks the memory allocated by ShmemInitStruct(), but for shared hash tables that covered only the header and hash directory. The remaining parts (segments and buckets) were allocated later using ShmemAlloc(), which does not update the shmem accounting. Thus, these allocations were not shown in pg_shmem_allocations. This commit improves the situation by allocating all the hash table parts at once, using a single ShmemInitStruct() call. This way the ShmemIndex entries (and thus pg_shmem_allocations) better reflect the proper size of the hash table. This affects allocations for private (non-shared) hash tables too, as the hash_create() code is shared. For non-shared tables this however makes no practical difference. This changes the alignment a bit. ShmemAlloc() aligns the chunks using CACHELINEALIGN(), which means some parts (header, directory, segments) were aligned this way. Allocating all parts as a single chunk removes this (implicit) alignment. We've considered adding explicit alignment, but we've decided not to - it seems to be merely a coincidence due to using the ShmemAlloc() API, not due to necessity. Author: Rahila Syed <rahilasyed90@gmail.com> Reviewed-by: Andres Freund <andres@anarazel.de> Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com> Reviewed-by: Tomas Vondra <tomas@vondra.me> Discussion: https://postgr.es/m/CAH2L28vHzRankszhqz7deXURxKncxfirnuW68zD7+hVAqaS5GQ@mail.gmail.com	2025-04-02 17:14:28 +02:00
Heikki Linnakangas	a460251f0a	Make cancel request keys longer Currently, the cancel request key is a 32-bit token, which isn't very much entropy. If you want to cancel another session's query, you can brute-force it. In most environments, an unauthorized cancellation of a query isn't very serious, but it nevertheless would be nice to have more protection from it. Hence make the key longer, to make it harder to guess. The longer cancellation keys are generated when using the new protocol version 3.2. For connections using version 3.0, short 4-bytes keys are still used. The new longer key length is not hardcoded in the protocol anymore, the client is expected to deal with variable length keys, up to 256 bytes. This flexibility allows e.g. a connection pooler to add more information to the cancel key, which might be useful for finding the connection. Reviewed-by: Jelte Fennema-Nio <postgres@jeltef.nl> Reviewed-by: Robert Haas <robertmhaas@gmail.com> (earlier versions) Discussion: https://www.postgresql.org/message-id/508d0505-8b7a-4864-a681-e7e5edfe32aa@iki.fi	2025-04-02 16:41:48 +03:00
Melanie Plageman	b3219c69fc	aio: Add errcontext for processing I/Os for another backend Push an ErrorContextCallback adding additional detail about the process performing the I/O and the owner of the I/O when those are not the same. For io_method worker, this adds context specifying which process owns the I/O that the I/O worker is processing. For io_method io_uring, this adds context only when a backend is completing I/O for another backend. It specifies the pid of the owning process. Author: Melanie Plageman <melanieplageman@gmail.com> Reviewed-by: Andres Freund <andres@anarazel.de> Discussion: https://postgr.es/m/rdml3fpukrqnas7qc5uimtl2fyytrnu6ymc2vjf2zuflbsjuul%40hyizyjsexwmm	2025-04-01 19:53:07 -04:00
Andres Freund	e19dc74491	aio: Minor comment improvements Reviewed-by: Noah Misch <noah@leadboat.com> Discussion: https://postgr.es/m/usbwzckj7q3jhfx3ann3nrfnukmupbs35axvq5zfyeo6nvrzrm@onjhxs2du4st	2025-04-01 16:06:48 -04:00
Andres Freund	fdd146a8ef	aio: Add README.md explaining higher level design Reviewed-by: Noah Misch <noah@leadboat.com> Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi> Discussion: https://postgr.es/m/uvrtrknj4kdytuboidbhwclo4gxhswwcpgadptsjvjqcluzmah%40brqs62irg4dt Discussion: https://postgr.es/m/20210223100344.llw5an2aklengrmn@alap3.anarazel.de Discussion: https://postgr.es/m/stj36ea6yyhoxtqkhpieia2z4krnam7qyetc57rfezgk4zgapf@gcnactj4z56m	2025-04-01 16:06:48 -04:00
Andres Freund	00066aa173	md: Add comment & assert to buffer-zeroing path in md[start]readv() mdreadv() has a codepath to zero out buffers when a read returns zero bytes, guarded by a check for zero_damaged_pages \|\| InRecovery. The InRecovery codepath to zero out buffers in mdreadv() appears to be unreachable. The only known paths to reach mdreadv()/mdstartreadv() in recovery are XLogReadBufferExtended(), vm_readbuf(), and fsm_readbuf(), each of which takes care to extend the relation if necessary. This looks to either have been the case for a long time, or the code was never reachable. The zero_damaged_pages path is incomplete, as missing segments are not created. Putting blocks into the buffer-pool that do not exist on disk is rather problematic, as such blocks will, at least initially, not be found by scans that rely on smgrnblocks(), as they are beyond EOF. It also can cause weird problems with relation extension, as relation extension does not expect blocks beyond EOF to exist. Therefore we would like to remove that path. mdstartreadv(), which I added in e5fe570b51c, does not implement this zeroing logic. I had started a discussion about that a while ago (linked below), but forgot to act on the conclusion of the discussion, namely to disable the in-memory-zeroing behavior. We could certainly implement equivalent zeroing logic in mdstartreadv(), but it would have to be more complicated due to potential differences in the zero_damaged_pages setting between the definer and completor of IO. Given that we want to remove the logic, that does not seem worth implementing the necessary logic. For now, put an Assert(false) and comments documenting this choice into mdreadv() and comments documenting the deprecation of the path in mdreadv() and the non-implementation of it in mdstartreadv(). If we, during testing, discover that we do need the path, we can implement it at that time. Reviewed-by: Noah Misch <noah@leadboat.com> Discussion: https://postgr.es/m/postgr.es/m/20250330024513.ac.nmisch@google.com Discussion: https://postgr.es/m/postgr.es/m/3qxxsnciyffyf3wyguiz4besdp5t5uxvv3utg75cbcszojlz7p@uibfzmnukkbd	2025-04-01 13:50:39 -04:00
Andres Freund	93bc3d75d8	aio: Add test_aio module To make the tests possible, a few functions from bufmgr.c/localbuf.c had to be exported, via buf_internals.h. Reviewed-by: Noah Misch <noah@leadboat.com> Co-authored-by: Andres Freund <andres@anarazel.de> Co-authored-by: Nazir Bilal Yavuz <byavuz81@gmail.com> Discussion: https://postgr.es/m/uvrtrknj4kdytuboidbhwclo4gxhswwcpgadptsjvjqcluzmah%40brqs62irg4dt	2025-04-01 13:47:46 -04:00
Andres Freund	60f566b4f2	aio: Add pg_aios view The new view lists all IO handles that are currently in use and is mainly useful for PG developers, but may also be useful when tuning PG. Bumps catversion. Reviewed-by: Noah Misch <noah@leadboat.com> Discussion: https://postgr.es/m/uvrtrknj4kdytuboidbhwclo4gxhswwcpgadptsjvjqcluzmah%40brqs62irg4dt	2025-04-01 13:30:33 -04:00
Andres Freund	ae3df4b341	read_stream: Introduce and use optional batchmode support Submitting IO in larger batches can be more efficient than doing so one-by-one, particularly for many small reads. It does, however, require the ReadStreamBlockNumberCB callback to abide by the restrictions of AIO batching (c.f. pgaio_enter_batchmode()). Basically, the callback may not: a) block without first calling pgaio_submit_staged(), unless a to-be-waited-on lock cannot be part of a deadlock, e.g. because it is never held while waiting for IO. b) directly or indirectly start another batch pgaio_enter_batchmode() As this requires care and is nontrivial in some cases, batching is only used with explicit opt-in. This patch adds an explicit flag (READ_STREAM_USE_BATCHING) to read_stream and uses it where appropriate. There are two cases where batching would likely be beneficial, but where we aren't using it yet: 1) bitmap heap scans, because the callback reads the VM This should soon be solved, because we are planning to remove the use of the VM, due to that not being sound. 2) The first phase of heap vacuum This could be made to support batchmode, but would require some care. Reviewed-by: Noah Misch <noah@leadboat.com> Reviewed-by: Thomas Munro <thomas.munro@gmail.com> Discussion: https://postgr.es/m/uvrtrknj4kdytuboidbhwclo4gxhswwcpgadptsjvjqcluzmah%40brqs62irg4dt	2025-03-30 18:36:41 -04:00
Andres Freund	f4d0730bbc	aio: Basic read_stream adjustments for real AIO Adapt the read stream logic for real AIO: - If AIO is enabled, we shouldn't issue advice, but if it isn't, we should continue issuing advice - AIO benefits from reading ahead with direct IO - If effective_io_concurrency=0, pass READ_BUFFERS_SYNCHRONOUSLY to StartReadBuffers() to ensure synchronous IO execution There are further improvements we should consider: - While in read_stream_look_ahead(), we can use AIO batch submission mode for increased efficiency. That however requires care to avoid deadlocks and thus done separately. - It can be beneficial to defer starting new IOs until we can issue multiple IOs at once. That however requires non-trivial heuristics to decide when to do so. Reviewed-by: Noah Misch <noah@leadboat.com> Co-authored-by: Andres Freund <andres@anarazel.de> Co-authored-by: Thomas Munro <thomas.munro@gmail.com>	2025-03-30 18:26:44 -04:00
Andres Freund	12ce89fd07	bufmgr: Use AIO in StartReadBuffers() This finally introduces the first actual use of AIO. StartReadBuffers() now uses the AIO routines to issue IO. As the implementation of StartReadBuffers() is also used by the functions for reading individual blocks (StartReadBuffer() and through that ReadBufferExtended()) this means all buffered read IO passes through the AIO paths. However, as those are synchronous reads, actually performing the IO asynchronously would be rarely beneficial. Instead such IOs are flagged to always be executed synchronously. This way we don't have to duplicate a fair bit of code. When io_method=sync is used, the IO patterns generated after this change are the same as before, i.e. actual reads are only issued in WaitReadBuffers() and StartReadBuffers() may issue prefetch requests. This allows to bypass most of the actual asynchronicity, which is important to make a change as big as this less risky. One thing worth calling out is that, if IO is actually executed asynchronously, the precise meaning of what track_io_timing is measuring has changed. Previously it tracked the time for each IO, but that does not make sense when multiple IOs are executed concurrently. Now it only measures the time actually spent waiting for IO. A subsequent commit will adjust the docs for this. While AIO is now actually used, the logic in read_stream.c will often prevent using sufficiently many concurrent IOs. That will be addressed in the next commit. Reviewed-by: Noah Misch <noah@leadboat.com> Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com> Co-authored-by: Andres Freund <andres@anarazel.de> Co-authored-by: Thomas Munro <thomas.munro@gmail.com> Discussion: https://postgr.es/m/uvrtrknj4kdytuboidbhwclo4gxhswwcpgadptsjvjqcluzmah%40brqs62irg4dt Discussion: https://postgr.es/m/20210223100344.llw5an2aklengrmn@alap3.anarazel.de Discussion: https://postgr.es/m/stj36ea6yyhoxtqkhpieia2z4krnam7qyetc57rfezgk4zgapf@gcnactj4z56m	2025-03-30 18:02:23 -04:00
Andres Freund	047cba7fa0	bufmgr: Implement AIO read support This commit implements the infrastructure to perform asynchronous reads into the buffer pool. To do so, it: - Adds readv AIO callbacks for shared and local buffers It may be worth calling out that shared buffer completions may be run in a different backend than where the IO started. - Adds an AIO wait reference to BufferDesc, to allow backends to wait for in-progress asynchronous IOs - Adapts StartBufferIO(), WaitIO(), TerminateBufferIO(), and their localbuf.c equivalents, to be able to deal with AIO - Moves the code to handle BM_PIN_COUNT_WAITER into a helper function, as it now also needs to be called on IO completion As of this commit, nothing issues AIO on shared/local buffers. A future commit will update StartReadBuffers() to do so. Buffer reads executed through this infrastructure will report invalid page / checksum errors / warnings differently than before: In the error case the error message will cover all the blocks that were included in the read, rather than just the reporting the first invalid block. If more than one block is invalid, the error will include information about the range of the read, the first invalid block and the number of invalid pages, with a HINT towards the server log for per-block details. For the warning case (i.e. zero_damaged_buffers) we would previously emit one warning message for each buffer in a multi-block read. Now there is only a single warning message for the entire read, again referring to the server log for more details in case of multiple checksum failures within a single larger read. Reviewed-by: Noah Misch <noah@leadboat.com> Reviewed-by: Melanie Plageman <melanieplageman@gmail.com> Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com> Discussion: https://postgr.es/m/uvrtrknj4kdytuboidbhwclo4gxhswwcpgadptsjvjqcluzmah%40brqs62irg4dt Discussion: https://postgr.es/m/20210223100344.llw5an2aklengrmn@alap3.anarazel.de Discussion: https://postgr.es/m/stj36ea6yyhoxtqkhpieia2z4krnam7qyetc57rfezgk4zgapf@gcnactj4z56m	2025-03-30 17:28:03 -04:00
Andres Freund	ef64fe26ba	aio: Add WARNING result status If an IO succeeds, but issues a warning, e.g. due to a page verification failure with zero_damaged_pages, we want to issue that warning in the context of the issuer of the IO, not the process that executes the completion (always the case for worker). It's already possible for a completion callback to report a custom error message, we just didn't have a result status that allowed a user of AIO to know that a warning should be emitted even though the IO request succeeded. All that's needed for that is a dedicated PGAIO_RS_ value. Previously there were not enough bits in PgAioResult.id for the new value. Increase. While at that, add defines for the amount of bits and static asserts to check that the widths are appropriate. Reviewed-by: Noah Misch <noah@leadboat.com> Discussion: https://postgr.es/m/20250329212929.a6.nmisch@google.com	2025-03-30 16:27:10 -04:00
Andres Freund	d445990adc	Let caller of PageIsVerified() control ignore_checksum_failure For AIO the completion of a read into shared buffers (i.e. verifying the page including the checksum, updating the BufferDesc to reflect the IO) can happen in a different backend than the backend that started the IO. As ignore_checksum_failure can differ between backends, we need to allow the caller of PageIsVerified() control whether to ignore checksum failures. The commit leaves a gap in the PIV_* values, as an upcoming commit, which depends on this commit, will add PIV_LOG_LOG, which better fits just after PIV_LOG_WARNING. Reviewed-by: Noah Misch <noah@leadboat.com> Discussion: https://postgr.es/m/20250329212929.a6.nmisch@google.com	2025-03-30 16:27:10 -04:00
Andres Freund	b96d3c3897	pgstat: Allow checksum errors to be reported in critical sections For AIO we execute completion callbacks in critical sections (to ensure that AIO can in the future be used for WAL, which in turn requires that we can call completion callbacks in critical sections, to get the resources for WAL io). To report checksum errors a backend now has to call pgstat_prepare_report_checksum_failure(), before entering a critical section, which guarantees the relevant pgstats entry is in shared memory, the relevant DSM segment is mapped into the backend's memory and the address is known via a PgStat_EntryRef. Reviewed-by: Noah Misch <noah@leadboat.com> Discussion: https://postgr.es/m/wkjj4p2rmkevutkwc6tewoovdqznj6c6nvjmvii4oo5wmbh5sr@retq7d6uqs4j	2025-03-30 16:12:04 -04:00
Andres Freund	d6d8054dc7	localbuf: Track pincount in BufferDesc as well For AIO on temporary table buffers the AIO subsystem needs to be able to ensure a pin on a buffer while AIO is going on, even if the IO issuing query errors out. Tracking the buffer in LocalRefCount does not work, as it would cause CheckForLocalBufferLeaks() to assert out. Instead, also track the refcount in BufferDesc.state, not just LocalRefCount. This also makes local buffers behave a bit more akin to shared buffers. Note that we still don't need locking, AIO completion callbacks for local buffers are executed in the issuing session (i.e. nobody else has access to the BufferDesc). Reviewed-by: Noah Misch <noah@leadboat.com> Discussion: https://postgr.es/m/uvrtrknj4kdytuboidbhwclo4gxhswwcpgadptsjvjqcluzmah%40brqs62irg4dt	2025-03-29 16:36:51 -04:00
Andres Freund	08ccd56ac7	aio, bufmgr: Comment fixes/improvements Some of these comments have been wrong for a while (`12f3867f55`), some I recently introduced (`da7226993f`, `55b454d0e1`). This includes an update to a comment in FlushBuffer(), which will be copied in a future commit. These changes seem big enough to be worth doing in separate commits. Suggested-by: Noah Misch <noah@leadboat.com> Discussion: https://postgr.es/m/20250319212530.80.nmisch@google.com	2025-03-29 14:45:42 -04:00
Andres Freund	50cb7505b3	aio: Implement support for reads in smgr/md/fd This implements the following: 1) An smgr AIO target, for AIO on smgr files. This should be usable not just for md.c but also other SMGR implementation if we ever get them. 2) readv support in fd.c, which requires a small bit of infrastructure work in fd.c 3) smgr.c and md.c support for readv There still is nothing performing AIO, but as of this commit it would be possible. As part of this change FileGetRawDesc() actually ensures that the file is opened - previously it was basically not usable. It's used to reopen a file in IO workers. Reviewed-by: Noah Misch <noah@leadboat.com> Discussion: https://postgr.es/m/uvrtrknj4kdytuboidbhwclo4gxhswwcpgadptsjvjqcluzmah%40brqs62irg4dt Discussion: https://postgr.es/m/20210223100344.llw5an2aklengrmn@alap3.anarazel.de Discussion: https://postgr.es/m/stj36ea6yyhoxtqkhpieia2z4krnam7qyetc57rfezgk4zgapf@gcnactj4z56m	2025-03-29 13:38:35 -04:00
Andres Freund	dee8002468	Fix mis-attribution of checksum failure stats to the wrong database Checksum failure stats could be attributed to the wrong database in two cases: - when a read of a shared relation encountered a checksum error , it would be attributed to the current database, instead of the "database" representing shared relations - when using CREATE DATABASE ... STRATEGY WAL_LOG checksum errors in the source database would be attributed to the current database The checksum stats reporting via PageIsVerifiedExtended(PIV_REPORT_STAT) does not have access to the information about what database a page belongs to. This fixes the issue by removing PIV_REPORT_STAT and delegating the responsibility to report stats to the caller, which now can learn about the number of stats via a new optional argument. As this changes the signature of PageIsVerifiedExtended() and all callers should adapt to the new signature, use the occasion to rename the function to PageIsVerified() and remove the compatibility macro. We could instead have fixed this by adding information about the database to the args of PageIsVerified(), but there are soon-to-be-applied patches that need to separate the stats reporting from the PageIsVerified() call anyway. Those patches also include testing for the failure paths, something we inexplicably have not had. As there is no caller of pgstat_report_checksum_failure() left, remove it. It'd be possible, but awkward to fix this in the back branches. We considered doing the work not quite worth it, as mis-attributed stats should still elicit concern. The emitted error messages do allow to attribute the errors correctly. Discussion: https://postgr.es/m/5tyic6epvdlmd6eddgelv47syg2b5cpwffjam54axp25xyq2ga@ptwkinxqo3az Discussion: https://postgr.es/m/mglpvvbhighzuwudjxzu4br65qqcxsnyvio3nl4fbog3qknwhg@e4gt7npsohuz	2025-03-29 13:38:35 -04:00
Peter Eisentraut	a0ed19e0a9	Use PRI?64 instead of "ll?" in format strings (continued). Continuation of work started in commit `15a79c73`, after initial trial. Author: Thomas Munro <thomas.munro@gmail.com> Discussion: https://postgr.es/m/b936d2fb-590d-49c3-a615-92c3a88c6c19%40eisentraut.org	2025-03-29 10:43:57 +01:00
Heikki Linnakangas	51a0382e8d	Fix crash if LockErrorCleanup() is called twice The refactoring in commit `3c0fd64fec` removed the clearing of awaitedLock from LockErrorCleanup(). It's still needed, otherwise LockErrorCleanup() during abort processing will try to update the LOCALLOCK struct even after the lock has already been released. Put it back. Reported-by: Richard Guo <guofenglinux@gmail.com> Reported-by: Robins Tharakan <tharakan@gmail.com> Reported-by: Alexander Lakhin <exclusion@gmail.com> Discussion: https://www.postgresql.org/message-id/flat/CAMbWs4_dNX1SzBmvFdoY-LxJh_4W_BjtVd5i008ihfU-wFF=eg@mail.gmail.com Discussion: https://www.postgresql.org/message-id/18832-38e5575b1bbd7277@postgresql.org Discussion: https://www.postgresql.org/message-id/e11a30e5-c0d8-491d-8546-3a1b50c10ad4@gmail.com	2025-03-28 20:19:17 +02:00
Daniel Gustafsson	058b5152f0	Fix guc_malloc calls for consistency and OOM checks check_createrole_self_grant and check_synchronized_standby_slots were allocating memory on a LOG elevel without checking if the allocation succeeded or not, which would have led to a segfault on allocation failure. On top of that, a number of callsites were using the ERROR level, relying on erroring out rather than returning false to allow the GUC machinery handle it gracefully. Other callsites used WARNING instead of LOG. While neither being not wrong, this changes all check_ functions do it consistently with LOG. init_custom_variable gets a promoted elevel to FATAL to keep the guc_malloc error handling in line with the rest of the error handling in that function which already call FATAL. If we encounter an OOM in this callsite there is no graceful handling to be had, better to error out hard. Backpatch the fix to check_createrole_self_grant down to v16 and the fix to check_synchronized_standby_slots down to v17 where they were introduced. Author: Daniel Gustafsson <daniel@yesql.se> Reported-by: Nikita <pm91.arapov@gmail.com> Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Bug: #18845 Discussion: https://postgr.es/m/18845-582c6e10247377ec@postgresql.org Backpatch-through: 16	2025-03-27 22:57:34 +01:00
Michael Paquier	44fe6ceb51	doc: Correct description of values used in FSM for indexes The implementation of FSM for indexes is simpler than heap, where 0 is used to track if a page is in-use and (BLCKSZ - 1) if a page is free. One comment in indexfsm.c and one description in the documentation of pg_freespacemap were incorrect about that. Author: Alex Friedman <alexf01@gmail.com> Discussion: https://postgr.es/m/71eef655-c192-453f-ac45-2772fec2cb04@gmail.com Backpatch-through: 13	2025-03-27 10:20:41 +09:00
Andres Freund	c325a7633f	aio: Add io_method=io_uring Performing AIO using io_uring can be considerably faster than io_method=worker, particularly when lots of small IOs are issued, as a) the context-switch overhead for worker based AIO becomes more significant b) the number of IO workers can become limiting io_uring, however, is linux specific and requires an additional compile-time dependency (liburing). This implementation is fairly simple and there are substantial optimization opportunities. The description of the existing AIO_IO_COMPLETION wait event is updated to make the difference between it and the new AIO_IO_URING_EXECUTION clearer. Reviewed-by: Noah Misch <noah@leadboat.com> Reviewed-by: Jakub Wartak <jakub.wartak@enterprisedb.com> Discussion: https://postgr.es/m/uvrtrknj4kdytuboidbhwclo4gxhswwcpgadptsjvjqcluzmah%40brqs62irg4dt Discussion: https://postgr.es/m/20210223100344.llw5an2aklengrmn@alap3.anarazel.de Discussion: https://postgr.es/m/stj36ea6yyhoxtqkhpieia2z4krnam7qyetc57rfezgk4zgapf@gcnactj4z56m	2025-03-26 19:49:13 -04:00
Andres Freund	9469d7fdd2	aio: Rename pgaio_io_prep_* to pgaio_io_start_* The old naming pattern (mirroring liburing's naming) was inconsistent with the (not yet introduced) callers. It seems better to get rid of the inconsistency now than to grow more users of the odd naming. Reported-by: Noah Misch <noah@leadboat.com> Discussion: https://postgr.es/m/20250326001915.bc.nmisch@google.com	2025-03-26 16:10:29 -04:00
Andres Freund	f321ec237a	aio: Pass result of local callbacks to ->report_return Otherwise the results of e.g. temp table buffer verification errors will not reach bufmgr.c. Obviously that's not right. Found while expanding the tests for invalid buffer contents. Reviewed-by: Noah Misch <noah@leadboat.com> Discussion: https://postgr.es/m/20250326001915.bc.nmisch@google.com	2025-03-26 16:06:54 -04:00
Andres Freund	96da9050a5	aio: Be more paranoid about interrupts As reported by Noah, it's possible, although practically very unlikely, that interrupts could be processed in between pgaio_io_reopen() and pgaio_io_perform_synchronously(). Prevent that by explicitly holding interrupts. It also seems good to add an assertion to pgaio_io_before_prep() to ensure that interrupts are held, as otherwise FDs referenced by the IO could be closed during interrupt processing. All code in the aio series currently runs the code with interrupts held, but it seems better to be paranoid. Reviewed-by: Noah Misch <noah@leadboat.com> Reported-by: Noah Misch <noah@leadboat.com> Discussion: https://postgr.es/m/20250324002939.5c.nmisch@google.com	2025-03-26 16:06:54 -04:00
Andres Freund	adb5f85fa5	Redefine max_files_per_process to control additionally opened files Until now max_files_per_process=N limited each backend to open N files in total (minus a safety factor), even if there were already more files opened in postmaster and inherited by backends. Change max_files_per_process to control how many additional files each process is allowed to open. The main motivation for this is the patch to add io_method=io_uring, which needs to open one file for each backend. Without this patch, even if RLIMIT_NOFILE is high enough, postmaster will fail in set_max_safe_fds() if started with a high max_connections. The cause of the failure is that, until now, set_max_safe_fds() subtracted the already open files from max_files_per_process. Reviewed-by: Noah Misch <noah@leadboat.com> Discussion: https://postgr.es/m/w6uiicyou7hzq47mbyejubtcyb2rngkkf45fk4q7inue5kfbeo@bbfad3qyubvs Discussion: https://postgr.es/m/CAGECzQQh6VSy3KG4pN1d=h9J=D1rStFCMR+t7yh_Kwj-g87aLQ@mail.gmail.com	2025-03-24 18:20:18 -04:00
Heikki Linnakangas	2817525f0d	Fix rare assertion failure in standby, if primary is restarted During hot standby, ExpireAllKnownAssignedTransactionIds() and ExpireOldKnownAssignedTransactionIds() functions mark old transactions as no-longer running, but they failed to update xactCompletionCount and latestCompletedXid. AFAICS it would not lead to incorrect query results, because those functions effectively turn in-progress transactions into aborted transactions and an MVCC snapshot considers both as "not visible". But it could surprise GetSnapshotDataReuse() and trigger the "TransactionIdPrecedesOrEquals(TransactionXmin, RecentXmin))" assertion in it, if the apparent xmin in a backend would move backwards. We saw this happen when GetCatalogSnapshot() would reuse an older catalog snapshot, when GetTransactionSnapshot() had already advanced TransactionXmin. The bug goes back all the way to commit `623a9ba79b` in v14 that introduced the snapshot reuse mechanism, but it started to happen more frequently with commit `952365cded` which removed a GetTransactionSnapshot() call from backend startup. That made it more likely for ExpireOldKnownAssignedTransactionIds() to be called between GetCatalogSnapshot() and the first GetTransactionSnapshot() in a backend. Andres Freund first spotted this assertion failure on buildfarm member 'skink'. Reproduction and analysis by Tomas Vondra. Backpatch-through: 14 Discussion: https://www.postgresql.org/message-id/oey246mcw43cy4qw2hqjmurbd62lfdpcuxyqiu7botx3typpax%40h7o7mfg5zmdj	2025-03-23 20:41:16 +02:00
Andres Freund	ca3067cc57	aio: Change prefix of PgAioResultStatus values to PGAIO_RS_ The previous prefix wasn't consistent with the naming of other AIO related enum values. It seems best to rename it before the users are introduced. Reported-by: Melanie Plageman <melanieplageman@gmail.com> Discussion: https://postgr.es/m/CAAKRu_Yb+JzQpNsgUxCB0gBi+sE-mi_HmcJF6ALnmO4W+UgwpA@mail.gmail.com	2025-03-22 17:30:44 -04:00
Thomas Munro	e51ca405ed	Fix ps display for IO workers. This code must have missed a memo about the backend type description being supplied automatically these days, and was duplicating that information. Before: "io worker io worker: N" After: "io worker N"	2025-03-22 10:13:23 +13:00
Thomas Munro	ce1a75c4fe	Support buffer forwarding in StartReadBuffers(). StartReadBuffers() reports a short read when it finds a cached block that ends a range needing I/O by updating the caller's nblocks. It doesn't want to have to unpin the trailing hit that it knows the caller wants, so the v17 version used sleight of hand in the name of simplicity: it included it in nblocks as if it were part of the I/O, but internally tracked the shorter real I/O size in io_buffers_len (now removed). This API change "forwards" the delimiting buffer to the next call. It's still pinned, and still stored in the caller's array, but *nblocks no longer includes stray buffers that are not really part of the operation. The expectation is that the caller still wants the rest of the blocks and will call again starting from that point, and now it can pass the already pinned buffer back in (or choose not to and release it). The change is needed for the coming asynchronous I/O version's larger version of the problem: by definition it must move BM_IO_IN_PROGRESS negotiation from WaitReadBuffers() to StartReadBuffers(), but it might already have many buffers pinned before it discovers a need to split an I/O. (The current synchronous I/O version hides that detail from callers by looping over smaller reads if required to make all covered buffers valid in WaitReadBuffers(), so it looks like one operation but it might occasionally be several under the covers.) Aside from avoiding unnecessary pin traffic, this will also be important for later work on out-of-order streams: you can't prioritize data that is already available right now if that fact is hidden from you. The new API is natural for read_stream.c (see `ed0b87ca`). After a short read it leaves forwarded buffers where they fell in its circular queue for the continuing call to pick up. Single-block StartReadBuffer() and traditional ReadBuffer() share code but are not affected by the change. They don't do multi-block I/O. Reviewed-by: Andres Freund <andres@anarazel.de> (earlier versions) Discussion: https://postgr.es/m/CA%2BhUKGK_%3D4CVmMHvsHjOVrK6t4F%3DLBpFzsrr3R%2BaJYN8kcTfWg%40mail.gmail.com	2025-03-21 20:43:59 +13:00
Thomas Munro	ed0b87caac	Support buffer forwarding in read_stream.c. In preparation for a follow-up change to the buffer manager, teach read_stream.c to manage buffers "forwarded" from one StartReadBuffers() call to the next after a short read. This involves a small amount of extra book-keeping, and opens the way for lower levels to split I/O operations without having to drop pins, as required for efficient handling of various edge cases. Concretely, the "buffers" argument will change from an out parameter to an in/out parameter. Buffer queue elements must be initialized on first use and cleared after they're consumed, but forwarded buffers are left where they fall ahead of the current pending read in the queue, ready for use by the operation that continues where a short read left off. The stream also needs to count them for pin limit management and release them on reset/early end. Tested-by: Andres Freund <andres@anarazel.de> (earlier versions) Discussion: https://postgr.es/m/CA%2BhUKGK_%3D4CVmMHvsHjOVrK6t4F%3DLBpFzsrr3R%2BaJYN8kcTfWg%40mail.gmail.com	2025-03-21 18:44:47 +13:00
Andres Freund	202b12774d	bufmgr: Improve stats when a buffer is read in concurrently Previously we would have the following inaccuracies when a backend tried to read in a buffer, but that buffer was read in concurrently by another backend: - the read IO was double-counted in the global buffer access stats (pgBufferUsage) - the buffer hit was not accounted for in: - global buffer access statistics - pg_stat_io - relation level IO stats - vacuum cost balancing While trying to read in a buffer that is concurrently read in by another backend is not a common occurrence, it's also not that rare, e.g. due to concurrent sequential scans on the same relation. This scenario has become more likely in PG 17, due to the introducing of read streams, which can pin multiple buffers before calling StartBufferIO() for all the buffers. This behaviour has historically grown, but there doesn't seem to be any reason to continue with the wrong accounting. Reviewed-by: Melanie Plageman <melanieplageman@gmail.com> Discussion: https://postgr.es/m/CAAKRu_Zk-B08AzPsO-6680LUHLOCGaNJYofaxTFseLa=OepV1g@mail.gmail.com	2025-03-20 19:58:22 -04:00
Andres Freund	fc51a60dd4	smgr: Hold interrupts in most smgr functions We need to hold interrupts across most of the smgr.c/md.c functions, as otherwise interrupt processing, e.g. due to a < ERROR elog/ereport, can trigger procsignal processing, which in turn can trigger smgrreleaseall(). As the relevant code is not reentrant, we quickly end up in a bad situation. The only reason we haven't noticed this before is that there is only one non-error ereport called in affected routines, in register_dirty_segments(), and that one is extremely rarely reached. If one enables fd.c's FDDEBUG it's easy to reproduce crashes. It seems better to put the HOLD_INTERRUPTS()/RESUME_INTERRUPTS() in smgr.c, instead of trying to push them down to md.c where possible: For one, every smgr implementation would be vulnerable, for another, a good bit of smgr.c code itself is affected too. Eventually we might want a more targeted solution, allowing e.g. a networked smgr implementation to be interrupted, but many other, more complicated, problems would need to be fixed for that to be viable (e.g. smgr.c is often called with interrupts already held). One could argue this should be backpatched, but the existing < ERROR elog/ereports that can be reached with unmodified sources are unlikely to be reached. On balance the risk of backpatching seems higher than the gain - at least for now. Reviewed-by: Noah Misch <noah@leadboat.com> Reviewed-by: Thomas Munro <thomas.munro@gmail.com> Discussion: https://postgr.es/m/3vae7l5ozvqtxmd7rr7zaeq3qkuipz365u3rtim5t5wdkr6f4g@vkgf2fogjirl	2025-03-20 17:33:57 -04:00
Peter Eisentraut	618c64ffd3	Revert workarounds for -Wmissing-braces false positives on old GCC We have collected several instances of a workaround for GCC bug 53119, which caused false-positive compiler warnings. This bug has long been fixed, but was still seen on the buildfarm, most recently on lapwing with gcc (Debian 4.7.2-5). (The GCC bug tracker mentions that a fix was backported to 4.7.4 and 4.8.3.) That compiler no longer runs warning-free since commit `6fdd5d9563`, so we don't need to keep these workarounds. And furthermore, the consensus appears to be that we don't want to keep supporting that era of platform anymore at all. This reverts the following commits: `d937904cce` `506428d091` `b449afb582` `6392f2a096` `bad0763a4d` `5e0c761d0a` and makes a few similar fixes to newer code. Discussion: https://www.postgresql.org/message-id/flat/e170d61f-01ab-4cf9-ab68-91cd1fac62c5%40eisentraut.org Discussion: https://www.postgresql.org/message-id/flat/CA%2BTgmoYEAm-KKZibAP3hSqbTFTjUd47XtVcf3xSFDpyecXX9uQ%40mail.gmail.com	2025-03-20 11:25:58 +01:00
Peter Eisentraut	47929324c5	Fix typo in comment	2025-03-20 10:44:12 +01:00
Thomas Munro	06fb5612c9	Increase io_combine_limit range to 1MB. The default of 128kB is unchanged, but the upper limit is changed from 32 blocks to 128 blocks, unless the operating system's IOV_MAX is too low. Some other RDBMSes seem to cap their multi-block buffer pool I/O around this number, and it seems useful to allow experimentation. The concrete change is to our definition of PG_IOV_MAX, which provides the maximum for io_combine_limit and io_max_combine_limit. It also affects a couple of other places that work with arrays of struct iovec or smaller objects on the stack, so we still don't want to use the system IOV_MAX directly without a clamp: it is not under our control and likely to be 1024. 128 seems acceptable for our current usage. For Windows, we can't use real scatter/gather yet, so we continue to define our own IOV_MAX value of 16 and emulate preadv()/pwritev() with loops. Someone would need to research the trade-offs of raising that number. NB if trying to see this working: you might temporarily need to hack BAS_BULKREAD to be bigger, since otherwise the obvious way of "a very big SELECT" is limited by that for now. Suggested-by: Tomas Vondra <tomas@vondra.me> Discussion: https://postgr.es/m/CA%2BhUKG%2B2T9p-%2BzM6Eeou-RAJjTML6eit1qn26f9twznX59qtCA%40mail.gmail.com	2025-03-19 15:40:35 +13:00
Thomas Munro	10f6646847	Introduce io_max_combine_limit. The existing io_combine_limit can be changed by users. The new io_max_combine_limit is fixed at server startup time, and functions as a silent clamp on the user setting. That in itself is probably quite useful, but the primary motivation is: aio_init.c allocates shared memory for all asynchronous IOs including some per-block data, and we didn't want to waste memory you'd never used by assuming they could be up to PG_IOV_MAX. This commit already halves the size of 'AioHandleIov' and 'AioHandleData'. A follow-up commit can now expand PG_IOV_MAX without affecting that. Since our GUC system doesn't support dependencies or cross-checks between GUCs, the user-settable one now assigns a "raw" value to io_combine_limit_guc, and the lower of io_combine_limit_guc and io_max_combine_limit is maintained in io_combine_limit. Reviewed-by: Andres Freund <andres@anarazel.de> (earlier version) Discussion: https://postgr.es/m/CA%2BhUKG%2B2T9p-%2BzM6Eeou-RAJjTML6eit1qn26f9twznX59qtCA%40mail.gmail.com	2025-03-19 15:23:54 +13:00
Andres Freund	499faf9063	smgr: Make SMgrRelation initialization safer against errors In case the smgr_open callback failed, the ->pincount field would not be initialized and the relation would not be put onto the unpinned_relns list. This buglet was introduced in `21d9c3ee4e`, in 17. Discussion: https://postgr.es/m/3vae7l5ozvqtxmd7rr7zaeq3qkuipz365u3rtim5t5wdkr6f4g@vkgf2fogjirl Backpatch-through: 17	2025-03-18 14:04:44 -04:00
Andres Freund	247ce06b88	aio: Add io_method=worker The previous commit introduced the infrastructure to start io_workers. This commit actually makes the workers execute IOs. IO workers consume IOs from a shared memory submission queue, run traditional synchronous system calls, and perform the shared completion handling immediately. Client code submits most requests by pushing IOs into the submission queue, and waits (if necessary) using condition variables. Some IOs cannot be performed in another process due to lack of infrastructure for reopening the file, and must processed synchronously by the client code when submitted. For now the default io_method is changed to "worker". We should re-evaluate that around beta1, we might want to be careful and set the default to "sync" for 18. Reviewed-by: Noah Misch <noah@leadboat.com> Co-authored-by: Thomas Munro <thomas.munro@gmail.com> Co-authored-by: Andres Freund <andres@anarazel.de> Discussion: https://postgr.es/m/uvrtrknj4kdytuboidbhwclo4gxhswwcpgadptsjvjqcluzmah%40brqs62irg4dt Discussion: https://postgr.es/m/20210223100344.llw5an2aklengrmn@alap3.anarazel.de Discussion: https://postgr.es/m/stj36ea6yyhoxtqkhpieia2z4krnam7qyetc57rfezgk4zgapf@gcnactj4z56m	2025-03-18 11:54:01 -04:00
Andres Freund	55b454d0e1	aio: Infrastructure for io_method=worker This commit contains the basic, system-wide, infrastructure for io_method=worker. It does not yet actually execute IO, this commit just provides the infrastructure for running IO workers, kept separate for easier review. The number of IO workers can be adjusted with a PGC_SIGHUP GUC. Eventually we'd like to make the number of workers dynamically scale up/down based on the current "IO load". To allow the number of IO workers to be increased without a restart, we need to reserve PGPROC entries for the workers unconditionally. This has been judged to be worth the cost. If it turns out to be problematic, we can introduce a PGC_POSTMASTER GUC to control the maximum number. As io workers might be needed during shutdown, e.g. for AIO during the shutdown checkpoint, a new PMState phase is added. IO workers are shut down after the shutdown checkpoint has been performed and walsender/archiver have shut down, but before the checkpointer itself shuts down. See also `87a6690cc6`. Updates PGSTAT_FILE_FORMAT_ID due to the addition of a new BackendType. Reviewed-by: Noah Misch <noah@leadboat.com> Co-authored-by: Thomas Munro <thomas.munro@gmail.com> Co-authored-by: Andres Freund <andres@anarazel.de> Discussion: https://postgr.es/m/uvrtrknj4kdytuboidbhwclo4gxhswwcpgadptsjvjqcluzmah%40brqs62irg4dt Discussion: https://postgr.es/m/20210223100344.llw5an2aklengrmn@alap3.anarazel.de Discussion: https://postgr.es/m/stj36ea6yyhoxtqkhpieia2z4krnam7qyetc57rfezgk4zgapf@gcnactj4z56m	2025-03-18 11:54:01 -04:00

1 2 3 4 5 ...

2898 Commits