This cleans up some loose ends left by commit e8ca9ed1d. I hadn't
looked closely enough at these places before, but now I have.
The use of double-quoted #includes for Perl headers in plperl_system.h
seems to be simply a mistake introduced in 6c944bf3c and faithfully
copied forward since then. (I had thought possibly it was required
by some weird Windows build setup, but there's no evidence of that in
our history.)
The occurrences in SectionMemoryManager.h and SectionMemoryManager.cpp
evidently stem from those files' origin as LLVM code. It's
understandable that LLVM would treat their own files as needing
double-quoted #includes; but they're still system headers to us.
I also applied the same check to *.c files, and found a few other
random incorrect usages in both directions.
Our ECPG headers and test files routinely use angle brackets to refer
to ECPG headers. I left those usages alone, since it seems reasonable
for an ECPG user to regard those headers as system headers.
c4d5cb71d2 adjusted the fast-path locking code to allow some
configuration of the number of fast-path locking slots via the
max_locks_per_transaction GUC. In that commit the FAST_PATH_REL_GROUP()
macro used integer division to determine the fast-path locking group slot
to use for the lock.
The divisor in this case is always a power-of-two value. Here we swap
out the divide by a bitwise-AND, which is a significantly faster
operation to perform.
In passing, adjust the code that's setting FastPathLockGroupsPerBackend
so that it's more clear that the value being set is a power-of-two.
Also, adjust some comments in the area which contained some magic
numbers. It seems better to justify the 1024 upper limit in the
location where the #define is made instead of where it is used.
Author: David Rowley <drowleyml@gmail.com>
Reviewed-by: Tomas Vondra <tomas@vondra.me>
Discussion: https://postgr.es/m/CAApHDvodr3bcnpxcs7+k-3cFwYR0tP-BYhyd2PpDhe-bCx9i=g@mail.gmail.com
10f6646847 intended to limit the value of io_combine_limit to the minimum of
io_combine_limit and io_max_combine_limit. To avoid issues with interdependent
GUCs, it introduced io_combine_limit_guc and set io_combine_limit in assign
hooks. That plan was thwarted by guc_tables.c accidentally still referencing
io_combine_limit, instead of io_combine_limit_guc. That lead to the GUC
machinery overriding the work done in the assign hooks, potentially leaving
io_combine_limit with a too high value.
The consequence of this bug was that when running with io_combine_limit >
io_combine_limit_guc the AIO machinery would not have reserved large enough
iovec and IO data arrays, with one IO's arrays overlapping with another IO's,
leading to total confusion.
To make such a problem easier to detect in the future, add assertions to
pgaio_io_set_handle_data_* checking the length is smaller than
io_max_combine_limit (not just PG_IOV_MAX).
It'd be nice to have a few tests for this, but it's not entirely obvious how
to do so portably.
As remarked upon by Tom, the GUC assignment hooks really shouldn't set the
underlying variable, that's the job of the GUC machinery. Change that as well.
Discussion: https://postgr.es/m/c5jyqnuwrpigd35qe7xdypxsisdjrdba5iw63mhcse4mzjogxo@qdjpv22z763f
pgaio_io_reclaim() reset the fields in PgAioHandle before updating the state
to IDLE or incrementing the generation. For most things that's OK, but for
pg_get_aios() it is not - if it copied the PgAioHandle while fields were being
reset, we wouldn't detect that and could call
pgaio_io_get_target_description() with ioh->target == PGAIO_TID_INVALID,
leading to a crash.
Fix this issue by incrementing the generation and state earlier, before
resetting.
Also add an assertion to pgaio_io_get_target_description() for the target to
be valid - that'd have made this case a bit easier to debug. While at it,
add/update a few related assertions.
Author: Alexander Lakhin <exclusion@gmail.com>
Discussion: https://postgr.es/m/062daca9-dfad-4750-9da8-b13388301ad9@gmail.com
The format of the injection point names used by the AIO code does not
match the existing naming convention used everywhere else in the code,
so let's be consistent. These points are used in test_aio.
Reviewed-by: Hayato Kuroda <kuroda.hayato@fujitsu.com>
Discussion: https://postgr.es/m/Z_yTB80bdu1sYDqJ@paquier.xyz
Commit 0bada39c83 fixed a bug of this kind,
which existed in all branches for six days before detection. While the
probability of reaching the trouble was low, the disruption was extreme. No
new backends could start, and service restoration needed an immediate
shutdown. Hence, add this to catch the next bug like it.
The new check in RelationIdGetRelation() suffices to make autovacuum detect
the bug in commit 243e9b40f1 that led to commit
0bada39. This also checks in a number of similar places. It replaces each
Assert(IsTransactionState()) that pertained to a conditional catalog read.
No back-patch for now, but a back-patch of commit 243e9b4 should back-patch
this, too. A back-patch could omit the src/test/regress changes, since back
branches won't gain new index columns.
Reported-by: Alexander Lakhin <exclusion@gmail.com>
Discussion: https://postgr.es/m/20250410191830.0e.nmisch@google.com
Discussion: https://postgr.es/m/10ec0bc3-5933-1189-6bb8-5dec4114558e@gmail.com
Make sure that function declarations use names that exactly match the
corresponding names from function definitions in a few places. These
inconsistencies were all introduced during Postgres 18 development.
This commit was written with help from clang-tidy, by mechanically
applying the same rules as similar clean-up commits (the earliest such
commit was commit 035ce1fe).
This moves/renames some of the functions defined in pg_numa.c:
* pg_numa_get_pagesize() is renamed to pg_get_shmem_pagesize(), and
moved to src/backend/storage/ipc/shmem.c. The new name better reflects
that the page size is not related to NUMA, and it's specifically about
the page size used for the main shared memory segment.
* move pg_numa_available() to src/backend/storage/ipc/shmem.c, i.e. into
the backend (which more appropriate for functions callable from SQL).
While at it, improve the comment to explain what page size it returns.
* remove unnecessary includes from src/port/pg_numa.c, adding
unnecessary dependencies (src/port should be suitable for frontent).
These were either leftovers or unnecessary thanks to the other changes
in this commit.
This eliminates unnecessary dependencies on backend symbols, which we
don't want in src/port.
Reported-by: Kirill Reshke <reshkekirill@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
https://postgr.es/m/CALdSSPi5fj0a7UG7Fmw2cUD1uWuckU_e8dJ+6x-bJEokcSXzqA@mail.gmail.com
It can be set to either COPY (the default) or CLONE if the system
supports it. CLONE causes callers of copydir(), currently CREATE
DATABASE ... STRATEGY=FILE_COPY and ALTER DATABASE ... SET TABLESPACE =
..., to use copy_file_range (Linux, FreeBSD) or copyfile (macOS) to copy
files instead of a read-write loop over the contents.
CLONE gives the kernel the opportunity to share block ranges on
copy-on-write file systems and push copying down to storage on others,
depending on configuration. On some systems CLONE can be used to clone
large databases quickly with CREATE DATABASE ... TEMPLATE=source
STRATEGY=FILE_COPY.
Other operating systems could be supported; patches welcome.
Co-authored-by: Nazir Bilal Yavuz <byavuz81@gmail.com>
Reviewed-by: Robert Haas <robertmhaas@gmail.com>
Reviewed-by: Ranier Vilela <ranier.vf@gmail.com>
Discussion: https://postgr.es/m/CA%2BhUKGLM%2Bt%2BSwBU-cHeMUXJCOgBxSHLGZutV5zCwY4qrCcE02w%40mail.gmail.com
This adds a function for retrieving memory context statistics
and information from backends as well as auxiliary processes.
The intended usecase is cluster debugging when under memory
pressure or unanticipated memory usage characteristics.
When calling the function it sends a signal to the specified
process to submit statistics regarding its memory contexts
into dynamic shared memory. Each memory context is returned
in detail, followed by a cumulative total in case the number
of contexts exceed the max allocated amount of shared memory.
Each process is limited to use at most 1Mb memory for this.
A summary can also be explicitly requested by the user, this
will return the TopMemoryContext and a cumulative total of
all lower contexts.
In order to not block on busy processes the caller specifies
the number of seconds during which to retry before timing out.
In the case where no statistics are published within the set
timeout, the last known statistics are returned, or NULL if
no previously published statistics exist. This allows dash-
board type queries to continually publish even if the target
process is temporarily congested. Context records contain a
timestamp to indicate when they were submitted.
Author: Rahila Syed <rahilasyed90@gmail.com>
Reviewed-by: Daniel Gustafsson <daniel@yesql.se>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Tomas Vondra <tomas@vondra.me>
Reviewed-by: Atsushi Torikoshi <torikoshia@oss.nttdata.com>
Reviewed-by: Fujii Masao <masao.fujii@oss.nttdata.com>
Reviewed-by: Alexander Korotkov <aekorotkov@gmail.com>
Discussion: https://postgr.es/m/CAH2L28v8mc9HDt8QoSJ8TRmKau_8FM_HKS41NeO9-6ZAkuZKXw@mail.gmail.com
Before, BAS_BULKREAD was always of size 256kB. With the default
io_combine_limit of 16, that only allowed 1-2 IOs to be in flight -
insufficient even on very low latency storage.
We don't just want to increase the size to a much larger hardcoded value, as
very large rings (10s of MBs of of buffers), appear to have negative
performance effects when reading in data that the OS has cached (but not when
actually needing to do IO).
To address this, increase the size of BAS_BULKREAD to allow for
io_combine_limit * effective_io_concurrency buffers getting read in. To
prevent the ring being much larger than useful, limit the increased size with
GetPinLimit().
The formula outlined above keeps the ring size to sizes for which we have not
observed performance regressions, unless very large effective_io_concurrency
values are used together with large shared_buffers setting.
Reviewed-by: Thomas Munro <thomas.munro@gmail.com>
Discussion: https://postgr.es/m/lqwghabtu2ak4wknzycufqjm5ijnxhb4k73vzphlt2a3wsemcd@gtftg44kdim6
Discussion: https://postgr.es/m/uvrtrknj4kdytuboidbhwclo4gxhswwcpgadptsjvjqcluzmah@brqs62irg4dt
In addition to the added functions, the pg_buffercache_evict() function now
shows whether the buffer was flushed.
pg_buffercache_evict_relation(): Evicts all shared buffers in a
relation at once.
pg_buffercache_evict_all(): Evicts all shared buffers at once.
Both functions provide mechanism to evict multiple shared buffers at
once. They are designed to address the inefficiency of repeatedly calling
pg_buffercache_evict() for each individual buffer, which can be time-consuming
when dealing with large shared buffer pools. (e.g., ~477ms vs. ~2576ms for
16GB of fully populated shared buffers).
These functions are intended for developer testing and debugging
purposes and are available to superusers only.
Minimal tests for the new functions are included. Also, there was no test for
pg_buffercache_evict(), test for this added too.
No new extension version is needed, as it was already increased this release
by ba2a3c2302.
Author: Nazir Bilal Yavuz <byavuz81@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Aidar Imamov <a.imamov@postgrespro.ru>
Reviewed-by: Joseph Koshakow <koshy44@gmail.com>
Discussion: https://postgr.es/m/CAN55FZ0h_YoSqqutxV6DES1RW8ig6wcA8CR9rJk358YRMxZFmw%40mail.gmail.com
Some tests try to invalidate logical slots on the standby server by
running VACUUM on the primary. The problem is that xl_running_xacts was
getting generated and replayed before the VACUUM command, leading to the
advancement of the active slot's catalog_xmin. Due to this, active slots
were not getting invalidated, leading to test failures.
We fix it by skipping the generation of xl_running_xacts for the required
tests with the help of injection points. As the required interface for
injection points was not present in back branches, we fixed the failing
tests in them by disallowing the slot to become active for the required
cases (where rows_removed conflict could be generated).
Author: Hayato Kuroda <kuroda.hayato@fujitsu.com>
Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Backpatch-through: 16, where it was introduced
Discussion: https://postgr.es/m/Z6oQXc8LmiTLfwLA@ip-10-97-1-34.eu-west-3.compute.internal
Introduce new pg_shmem_alloctions_numa view with information about how
shared memory is distributed across NUMA nodes. For each shared memory
segment, the view returns one row for each NUMA node backing it, with
the total amount of memory allocated from that node.
The view may be relatively expensive, especially when executed for the
first time in a backend, as it has to touch all memory pages to get
reliable information about the NUMA node. This may also force allocation
of the shared memory.
Unlike pg_shmem_allocations, the view does not show anonymous shared
memory allocations. It also does not show memory allocated using the
dynamic shared memory infrastructure.
Author: Jakub Wartak <jakub.wartak@enterprisedb.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Reviewed-by: Tomas Vondra <tomas@vondra.me>
Discussion: https://postgr.es/m/CAKZiRmxh6KWo0aqRqvmcoaX2jUxZYb4kGp3N%3Dq1w%2BDiH-696Xw%40mail.gmail.com
In some edge cases valgrind flags issues with the memory referenced by
IOs. All of the cases addressed in this change are false positives.
Most of the false positives are caused by UnpinBuffer[NoOwner] marking buffer
data as inaccessible. This happens even though the AIO subsystem still holds a
pin. That's good, there shouldn't be accesses to the buffer outside of AIO
related code until it is pinned by "user" code again. But it requires some
explicit work - if the buffer is not pinned by the current backend, we need to
explicitly mark the buffer data accessible/inaccessible while executing
completion callbacks.
That however causes a cascading issue in IO workers: After the completion
callbacks for a buffer is executed, the page is marked as inaccessible. If
subsequently the same worker is executing IO targeting the same buffer, we
would get an error, as the memory is still marked inaccessible. To avoid that,
we need to explicitly mark the memory as accessible in IO workers.
Another issue is that IO executed in workers or via io_uring will not mark
memory as DEFINED. In the case of workers that is because valgrind does not
track memory definedness across processes. For io_uring that is because
valgrind does not understand io_uring, and therefore its IOs never mark memory
as defined, whether the completions are processed in the defining process or
in another context. It's not entirely clear how to best solve that. The
current user of AIO is not affected, as it explicitly marks buffers as DEFINED
& NOACCESS anyway. Defer solving this issue until we have a user with
different needs.
Per buildfarm animal skink.
Reviewed-by: Noah Misch <noah@leadboat.com>
Co-authored-by: Noah Misch <noah@leadboat.com>
Discussion: https://postgr.es/m/3pd4322mogfmdd5nln3zphdwhtmq3rzdldqjwb2sfqzcgs22lf@ok2gletdaoe6
This mirrors 1e0dfd166b (+ 46ef520b95), for temporary table buffers. This
is mainly interesting right now because the AIO work currently triggers
spurious valgrind errors, and the fix for that is cleaner if temp buffers
behave the same as shared buffers.
This requires one change beyond the annotations themselves, namely to pin
local buffers while writing them out in FlushRelationBuffers().
Reviewed-by: Noah Misch <noah@leadboat.com>
Co-authored-by: Noah Misch <noah@leadboat.com>
Discussion: https://postgr.es/m/3pd4322mogfmdd5nln3zphdwhtmq3rzdldqjwb2sfqzcgs22lf@ok2gletdaoe6
If the limit returned by GetAdditionalPinLimit() is large, the buffer_limit
variable in read_stream_start_pending_read() can overflow. While the code is
careful to limit buffer_limit PG_INT16_MAX, we subsequently add the number of
forwarded buffers.
The overflow can lead to assertion failures, crashes or wrong query results
when using large shared buffers.
It seems easier to avoid this if we make the buffer_limit variable an int,
instead of an int16. Do so, and clamp buffer_limit after adding the number of
forwarded buffers.
It's possible we might want to address this and related issues more widely by
changing to int instead of int16 more widely, but since the consequences of
this bug can be confusing, it seems better to fix it now.
This bug was introduced in ed0b87caac.
Discussion: https://postgr.es/m/ewvz3cbtlhrwqk7h6ca6cctiqh7r64ol3pzb3iyjycn2r5nxk5@tnhw3a5zatlr
Various places allocated shared memory by first allocating a small chunk
using ShmemInitStruct(), followed by ShmemAlloc() calls to allocate more
memory. Unfortunately, ShmemAlloc() does not update ShmemIndex, so this
affected pg_shmem_allocations - it only shown the initial chunk.
This commit modifies the following allocations, to allocate everything
as a single chunk, and then split it internally.
- PredXactList
- RWConflictPool
- PGPROC structures
- Fast-Path Lock Array
The fast-path lock array is allocated separately, not as a part of the
PGPROC structures allocation.
Author: Rahila Syed <rahilasyed90@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com>
Reviewed-by: Tomas Vondra <tomas@vondra.me>
Discussion: https://postgr.es/m/CAH2L28vHzRankszhqz7deXURxKncxfirnuW68zD7+hVAqaS5GQ@mail.gmail.com
pg_shmem_allocations tracks the memory allocated by ShmemInitStruct(),
but for shared hash tables that covered only the header and hash
directory. The remaining parts (segments and buckets) were allocated
later using ShmemAlloc(), which does not update the shmem accounting.
Thus, these allocations were not shown in pg_shmem_allocations.
This commit improves the situation by allocating all the hash table
parts at once, using a single ShmemInitStruct() call. This way the
ShmemIndex entries (and thus pg_shmem_allocations) better reflect the
proper size of the hash table.
This affects allocations for private (non-shared) hash tables too, as
the hash_create() code is shared. For non-shared tables this however
makes no practical difference.
This changes the alignment a bit. ShmemAlloc() aligns the chunks using
CACHELINEALIGN(), which means some parts (header, directory, segments)
were aligned this way. Allocating all parts as a single chunk removes
this (implicit) alignment. We've considered adding explicit alignment,
but we've decided not to - it seems to be merely a coincidence due to
using the ShmemAlloc() API, not due to necessity.
Author: Rahila Syed <rahilasyed90@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com>
Reviewed-by: Tomas Vondra <tomas@vondra.me>
Discussion: https://postgr.es/m/CAH2L28vHzRankszhqz7deXURxKncxfirnuW68zD7+hVAqaS5GQ@mail.gmail.com
Currently, the cancel request key is a 32-bit token, which isn't very
much entropy. If you want to cancel another session's query, you can
brute-force it. In most environments, an unauthorized cancellation of
a query isn't very serious, but it nevertheless would be nice to have
more protection from it. Hence make the key longer, to make it harder
to guess.
The longer cancellation keys are generated when using the new protocol
version 3.2. For connections using version 3.0, short 4-bytes keys are
still used.
The new longer key length is not hardcoded in the protocol anymore,
the client is expected to deal with variable length keys, up to 256
bytes. This flexibility allows e.g. a connection pooler to add more
information to the cancel key, which might be useful for finding the
connection.
Reviewed-by: Jelte Fennema-Nio <postgres@jeltef.nl>
Reviewed-by: Robert Haas <robertmhaas@gmail.com> (earlier versions)
Discussion: https://www.postgresql.org/message-id/508d0505-8b7a-4864-a681-e7e5edfe32aa@iki.fi
Push an ErrorContextCallback adding additional detail about the process
performing the I/O and the owner of the I/O when those are not the same.
For io_method worker, this adds context specifying which process owns
the I/O that the I/O worker is processing.
For io_method io_uring, this adds context only when a backend is
*completing* I/O for another backend. It specifies the pid of the owning
process.
Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/rdml3fpukrqnas7qc5uimtl2fyytrnu6ymc2vjf2zuflbsjuul%40hyizyjsexwmm
mdreadv() has a codepath to zero out buffers when a read returns zero bytes,
guarded by a check for zero_damaged_pages || InRecovery.
The InRecovery codepath to zero out buffers in mdreadv() appears to be
unreachable. The only known paths to reach mdreadv()/mdstartreadv() in
recovery are XLogReadBufferExtended(), vm_readbuf(), and fsm_readbuf(), each
of which takes care to extend the relation if necessary. This looks to either
have been the case for a long time, or the code was never reachable.
The zero_damaged_pages path is incomplete, as missing segments are not
created.
Putting blocks into the buffer-pool that do not exist on disk is rather
problematic, as such blocks will, at least initially, not be found by scans
that rely on smgrnblocks(), as they are beyond EOF. It also can cause weird
problems with relation extension, as relation extension does not expect blocks
beyond EOF to exist.
Therefore we would like to remove that path.
mdstartreadv(), which I added in e5fe570b51c, does not implement this zeroing
logic. I had started a discussion about that a while ago (linked below), but
forgot to act on the conclusion of the discussion, namely to disable the
in-memory-zeroing behavior.
We could certainly implement equivalent zeroing logic in mdstartreadv(), but
it would have to be more complicated due to potential differences in the
zero_damaged_pages setting between the definer and completor of IO. Given that
we want to remove the logic, that does not seem worth implementing the
necessary logic.
For now, put an Assert(false) and comments documenting this choice into
mdreadv() and comments documenting the deprecation of the path in mdreadv()
and the non-implementation of it in mdstartreadv(). If we, during testing,
discover that we do need the path, we can implement it at that time.
Reviewed-by: Noah Misch <noah@leadboat.com>
Discussion: https://postgr.es/m/postgr.es/m/20250330024513.ac.nmisch@google.com
Discussion: https://postgr.es/m/postgr.es/m/3qxxsnciyffyf3wyguiz4besdp5t5uxvv3utg75cbcszojlz7p@uibfzmnukkbd
Submitting IO in larger batches can be more efficient than doing so
one-by-one, particularly for many small reads. It does, however, require
the ReadStreamBlockNumberCB callback to abide by the restrictions of AIO
batching (c.f. pgaio_enter_batchmode()). Basically, the callback may not:
a) block without first calling pgaio_submit_staged(), unless a
to-be-waited-on lock cannot be part of a deadlock, e.g. because it is
never held while waiting for IO.
b) directly or indirectly start another batch pgaio_enter_batchmode()
As this requires care and is nontrivial in some cases, batching is only
used with explicit opt-in.
This patch adds an explicit flag (READ_STREAM_USE_BATCHING) to read_stream and
uses it where appropriate.
There are two cases where batching would likely be beneficial, but where we
aren't using it yet:
1) bitmap heap scans, because the callback reads the VM
This should soon be solved, because we are planning to remove the use of
the VM, due to that not being sound.
2) The first phase of heap vacuum
This could be made to support batchmode, but would require some care.
Reviewed-by: Noah Misch <noah@leadboat.com>
Reviewed-by: Thomas Munro <thomas.munro@gmail.com>
Discussion: https://postgr.es/m/uvrtrknj4kdytuboidbhwclo4gxhswwcpgadptsjvjqcluzmah%40brqs62irg4dt
Adapt the read stream logic for real AIO:
- If AIO is enabled, we shouldn't issue advice, but if it isn't, we should
continue issuing advice
- AIO benefits from reading ahead with direct IO
- If effective_io_concurrency=0, pass READ_BUFFERS_SYNCHRONOUSLY to
StartReadBuffers() to ensure synchronous IO execution
There are further improvements we should consider:
- While in read_stream_look_ahead(), we can use AIO batch submission mode for
increased efficiency. That however requires care to avoid deadlocks and thus
done separately.
- It can be beneficial to defer starting new IOs until we can issue multiple
IOs at once. That however requires non-trivial heuristics to decide when to
do so.
Reviewed-by: Noah Misch <noah@leadboat.com>
Co-authored-by: Andres Freund <andres@anarazel.de>
Co-authored-by: Thomas Munro <thomas.munro@gmail.com>
This finally introduces the first actual use of AIO. StartReadBuffers() now
uses the AIO routines to issue IO.
As the implementation of StartReadBuffers() is also used by the functions for
reading individual blocks (StartReadBuffer() and through that
ReadBufferExtended()) this means all buffered read IO passes through the AIO
paths. However, as those are synchronous reads, actually performing the IO
asynchronously would be rarely beneficial. Instead such IOs are flagged to
always be executed synchronously. This way we don't have to duplicate a fair
bit of code.
When io_method=sync is used, the IO patterns generated after this change are
the same as before, i.e. actual reads are only issued in WaitReadBuffers() and
StartReadBuffers() may issue prefetch requests. This allows to bypass most of
the actual asynchronicity, which is important to make a change as big as this
less risky.
One thing worth calling out is that, if IO is actually executed
asynchronously, the precise meaning of what track_io_timing is measuring has
changed. Previously it tracked the time for each IO, but that does not make
sense when multiple IOs are executed concurrently. Now it only measures the
time actually spent waiting for IO. A subsequent commit will adjust the docs
for this.
While AIO is now actually used, the logic in read_stream.c will often prevent
using sufficiently many concurrent IOs. That will be addressed in the next
commit.
Reviewed-by: Noah Misch <noah@leadboat.com>
Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com>
Co-authored-by: Andres Freund <andres@anarazel.de>
Co-authored-by: Thomas Munro <thomas.munro@gmail.com>
Discussion: https://postgr.es/m/uvrtrknj4kdytuboidbhwclo4gxhswwcpgadptsjvjqcluzmah%40brqs62irg4dt
Discussion: https://postgr.es/m/20210223100344.llw5an2aklengrmn@alap3.anarazel.de
Discussion: https://postgr.es/m/stj36ea6yyhoxtqkhpieia2z4krnam7qyetc57rfezgk4zgapf@gcnactj4z56m
This commit implements the infrastructure to perform asynchronous reads into
the buffer pool.
To do so, it:
- Adds readv AIO callbacks for shared and local buffers
It may be worth calling out that shared buffer completions may be run in a
different backend than where the IO started.
- Adds an AIO wait reference to BufferDesc, to allow backends to wait for
in-progress asynchronous IOs
- Adapts StartBufferIO(), WaitIO(), TerminateBufferIO(), and their localbuf.c
equivalents, to be able to deal with AIO
- Moves the code to handle BM_PIN_COUNT_WAITER into a helper function, as it
now also needs to be called on IO completion
As of this commit, nothing issues AIO on shared/local buffers. A future commit
will update StartReadBuffers() to do so.
Buffer reads executed through this infrastructure will report invalid page /
checksum errors / warnings differently than before:
In the error case the error message will cover all the blocks that were
included in the read, rather than just the reporting the first invalid
block. If more than one block is invalid, the error will include information
about the range of the read, the first invalid block and the number of invalid
pages, with a HINT towards the server log for per-block details.
For the warning case (i.e. zero_damaged_buffers) we would previously emit one
warning message for each buffer in a multi-block read. Now there is only a
single warning message for the entire read, again referring to the server log
for more details in case of multiple checksum failures within a single larger
read.
Reviewed-by: Noah Misch <noah@leadboat.com>
Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com>
Discussion: https://postgr.es/m/uvrtrknj4kdytuboidbhwclo4gxhswwcpgadptsjvjqcluzmah%40brqs62irg4dt
Discussion: https://postgr.es/m/20210223100344.llw5an2aklengrmn@alap3.anarazel.de
Discussion: https://postgr.es/m/stj36ea6yyhoxtqkhpieia2z4krnam7qyetc57rfezgk4zgapf@gcnactj4z56m
If an IO succeeds, but issues a warning, e.g. due to a page verification
failure with zero_damaged_pages, we want to issue that warning in the context
of the issuer of the IO, not the process that executes the completion (always
the case for worker).
It's already possible for a completion callback to report a custom error
message, we just didn't have a result status that allowed a user of AIO to
know that a warning should be emitted even though the IO request succeeded.
All that's needed for that is a dedicated PGAIO_RS_ value.
Previously there were not enough bits in PgAioResult.id for the new
value. Increase. While at that, add defines for the amount of bits and static
asserts to check that the widths are appropriate.
Reviewed-by: Noah Misch <noah@leadboat.com>
Discussion: https://postgr.es/m/20250329212929.a6.nmisch@google.com
For AIO the completion of a read into shared buffers (i.e. verifying the page
including the checksum, updating the BufferDesc to reflect the IO) can happen
in a different backend than the backend that started the IO. As
ignore_checksum_failure can differ between backends, we need to allow the
caller of PageIsVerified() control whether to ignore checksum failures.
The commit leaves a gap in the PIV_* values, as an upcoming commit, which
depends on this commit, will add PIV_LOG_LOG, which better fits just after
PIV_LOG_WARNING.
Reviewed-by: Noah Misch <noah@leadboat.com>
Discussion: https://postgr.es/m/20250329212929.a6.nmisch@google.com
For AIO we execute completion callbacks in critical sections (to ensure that
AIO can in the future be used for WAL, which in turn requires that we can call
completion callbacks in critical sections, to get the resources for WAL
io). To report checksum errors a backend now has to call
pgstat_prepare_report_checksum_failure(), before entering a critical section,
which guarantees the relevant pgstats entry is in shared memory, the relevant
DSM segment is mapped into the backend's memory and the address is known via a
PgStat_EntryRef.
Reviewed-by: Noah Misch <noah@leadboat.com>
Discussion: https://postgr.es/m/wkjj4p2rmkevutkwc6tewoovdqznj6c6nvjmvii4oo5wmbh5sr@retq7d6uqs4j
For AIO on temporary table buffers the AIO subsystem needs to be able to
ensure a pin on a buffer while AIO is going on, even if the IO issuing query
errors out. Tracking the buffer in LocalRefCount does not work, as it would
cause CheckForLocalBufferLeaks() to assert out.
Instead, also track the refcount in BufferDesc.state, not just
LocalRefCount. This also makes local buffers behave a bit more akin to shared
buffers.
Note that we still don't need locking, AIO completion callbacks for local
buffers are executed in the issuing session (i.e. nobody else has access to
the BufferDesc).
Reviewed-by: Noah Misch <noah@leadboat.com>
Discussion: https://postgr.es/m/uvrtrknj4kdytuboidbhwclo4gxhswwcpgadptsjvjqcluzmah%40brqs62irg4dt
Checksum failure stats could be attributed to the wrong database in two cases:
- when a read of a shared relation encountered a checksum error , it would be
attributed to the current database, instead of the "database" representing
shared relations
- when using CREATE DATABASE ... STRATEGY WAL_LOG checksum errors in the
source database would be attributed to the current database
The checksum stats reporting via PageIsVerifiedExtended(PIV_REPORT_STAT) does
not have access to the information about what database a page belongs to.
This fixes the issue by removing PIV_REPORT_STAT and delegating the
responsibility to report stats to the caller, which now can learn about the
number of stats via a new optional argument.
As this changes the signature of PageIsVerifiedExtended() and all callers
should adapt to the new signature, use the occasion to rename the function to
PageIsVerified() and remove the compatibility macro.
We could instead have fixed this by adding information about the database to
the args of PageIsVerified(), but there are soon-to-be-applied patches that
need to separate the stats reporting from the PageIsVerified() call
anyway. Those patches also include testing for the failure paths, something we
inexplicably have not had.
As there is no caller of pgstat_report_checksum_failure() left, remove it.
It'd be possible, but awkward to fix this in the back branches. We considered
doing the work not quite worth it, as mis-attributed stats should still elicit
concern. The emitted error messages do allow to attribute the errors
correctly.
Discussion: https://postgr.es/m/5tyic6epvdlmd6eddgelv47syg2b5cpwffjam54axp25xyq2ga@ptwkinxqo3az
Discussion: https://postgr.es/m/mglpvvbhighzuwudjxzu4br65qqcxsnyvio3nl4fbog3qknwhg@e4gt7npsohuz
check_createrole_self_grant and check_synchronized_standby_slots
were allocating memory on a LOG elevel without checking if the
allocation succeeded or not, which would have led to a segfault
on allocation failure.
On top of that, a number of callsites were using the ERROR level,
relying on erroring out rather than returning false to allow the
GUC machinery handle it gracefully. Other callsites used WARNING
instead of LOG. While neither being not wrong, this changes all
check_ functions do it consistently with LOG.
init_custom_variable gets a promoted elevel to FATAL to keep
the guc_malloc error handling in line with the rest of the
error handling in that function which already call FATAL. If
we encounter an OOM in this callsite there is no graceful
handling to be had, better to error out hard.
Backpatch the fix to check_createrole_self_grant down to v16
and the fix to check_synchronized_standby_slots down to v17
where they were introduced.
Author: Daniel Gustafsson <daniel@yesql.se>
Reported-by: Nikita <pm91.arapov@gmail.com>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Bug: #18845
Discussion: https://postgr.es/m/18845-582c6e10247377ec@postgresql.org
Backpatch-through: 16
The implementation of FSM for indexes is simpler than heap, where 0 is
used to track if a page is in-use and (BLCKSZ - 1) if a page is free.
One comment in indexfsm.c and one description in the documentation of
pg_freespacemap were incorrect about that.
Author: Alex Friedman <alexf01@gmail.com>
Discussion: https://postgr.es/m/71eef655-c192-453f-ac45-2772fec2cb04@gmail.com
Backpatch-through: 13
The old naming pattern (mirroring liburing's naming) was inconsistent with
the (not yet introduced) callers. It seems better to get rid of the
inconsistency now than to grow more users of the odd naming.
Reported-by: Noah Misch <noah@leadboat.com>
Discussion: https://postgr.es/m/20250326001915.bc.nmisch@google.com
Otherwise the results of e.g. temp table buffer verification errors will not
reach bufmgr.c. Obviously that's not right. Found while expanding the tests
for invalid buffer contents.
Reviewed-by: Noah Misch <noah@leadboat.com>
Discussion: https://postgr.es/m/20250326001915.bc.nmisch@google.com