1
0
mirror of https://github.com/postgres/postgres.git synced 2025-11-19 13:42:17 +03:00
Commit Graph

723 Commits

Author SHA1 Message Date
Andres Freund
c819d1017d bufmgr: Fix valgrind checking for buffers pinned in StrategyGetBuffer()
In 5e89985928 I made StrategyGetBuffer() pin buffers with a single CAS,
instead of using PinBuffer_Locked(). Unfortunately I missed that
PinBuffer_Locked() marked the page as defined for valgrind.

Fix this oversight by centralizing the valgrind initialization into
TrackNewBufferPin(), which also allows us to reduce the number of places doing
VALGRIND_MAKE_MEM_DEFINED.

Per buildfarm animal skink and Amit Langote.

Discussion: https://postgr.es/m/fvfmkr5kk4nyex56ejgxj3uzi63isfxovp2biecb4bspbjrze7@az2pljabhnff
Discussion: https://postgr.es/m/CA+HiwqGKJ6nEXEPQW7EpykVsEtzxp5-up_xhtcUAkWFtATVQvQ@mail.gmail.com
2025-10-09 19:17:13 -04:00
Andres Freund
5e89985928 bufmgr: Don't lock buffer header in StrategyGetBuffer()
Previously StrategyGetBuffer() acquired the buffer header spinlock for every
buffer, whether it was reusable or not. If reusable, it'd be returned, with
the lock held, to GetVictimBuffer(), which then would pin the buffer with
PinBuffer_Locked(). That's somewhat violating the spirit of the guidelines for
holding spinlocks (i.e. that they are only held for a few lines of consecutive
code) and necessitates using PinBuffer_Locked(), which scales worse than
PinBuffer() due to holding the spinlock.  This alone makes it worth changing
the code.

However, the main reason to change this is that a future commit will make
PinBuffer_Locked() slower (due to making UnlockBufHdr() slower), to gain
scalability for the much more common case of pinning a pre-existing buffer. By
pinning the buffer with a single atomic operation, iff the buffer is reusable,
we avoid any potential regression for miss-heavy workloads. There strictly are
fewer atomic operations for each potential buffer after this change.

The price for this improvement is that freelist.c needs two CAS loops and
needs to be able to set up the resource accounting for pinned buffers. The
latter is achieved by exposing a new function for that purpose from bufmgr.c,
that seems better than exposing the entire private refcount infrastructure.
The improvement seems worth the complexity.

Reviewed-by: Robert Haas <robertmhaas@gmail.com>
Reviewed-by: Matthias van de Meent <boekewurm+postgres@gmail.com>
Discussion: https://postgr.es/m/fvfmkr5kk4nyex56ejgxj3uzi63isfxovp2biecb4bspbjrze7@az2pljabhnff
2025-10-08 17:04:07 -04:00
Andres Freund
3baae90013 bufmgr: fewer calls to BufferDescriptorGetContentLock
We're planning to merge buffer content locks into BufferDesc.state. To reduce
the size of that patch, centralize calls to BufferDescriptorGetContentLock().

The biggest part of the change is in assertions, by introducing
BufferIsLockedByMe[InMode]() (and removing BufferIsExclusiveLocked()). This
seems like an improvement even without aforementioned plans.

Additionally replace some direct calls to LWLockAcquire() with calls to
LockBuffer().

Reviewed-by: Matthias van de Meent <boekewurm+postgres@gmail.com>
Discussion: https://postgr.es/m/fvfmkr5kk4nyex56ejgxj3uzi63isfxovp2biecb4bspbjrze7@az2pljabhnff
2025-10-08 16:06:19 -04:00
Andres Freund
2a2e1b470b bufmgr: Fix signedness of mask variable in BufferSync()
BM_PERMANENT is defined as 1U<<31, which is a negative number when interpreted
as a signed integer. Unfortunately the mask variable in BufferSync() was
signed. This has been wrong for a long time, but failed to fail, due to
integer conversion rules.

However, in an upcoming patch the width of the state variable will be
increased, with the wrong signedness leading to never flushing permanent
buffers - luckily caught in a test.

It seems better to fix this separately, instead of doing so as part of a
large, otherwise mechanical, patch.

Reviewed-by: Matthias van de Meent <boekewurm+postgres@gmail.com>
Discussion: https://postgr.es/m/fvfmkr5kk4nyex56ejgxj3uzi63isfxovp2biecb4bspbjrze7@az2pljabhnff
2025-10-08 14:34:30 -04:00
Andres Freund
3c2b97b29e bufmgr: Introduce FlushUnlockedBuffer
There were several copies of code locking a buffer, flushing its contents, and
unlocking the buffer. It seems worth centralizing that into a helper function.

Reviewed-by: Matthias van de Meent <boekewurm+postgres@gmail.com>
Discussion: https://postgr.es/m/fvfmkr5kk4nyex56ejgxj3uzi63isfxovp2biecb4bspbjrze7@az2pljabhnff
2025-10-08 14:34:30 -04:00
Andres Freund
819dc118c0 Improve ReadRecentBuffer() scalability
While testing a new potential use for ReadRecentBuffer(), Andres
reported that it scales badly when called concurrently for the same
buffer by many backends.  Instead of a naive (but wrong) coding with
PinBuffer(), it used the spinlock, so that it could be careful to pin
only if the buffer was valid and holding the expected block, to avoid
breaking invariants in eg GetVictimBuffer().  Unfortunately that made it
less scalable than PinBuffer(), which uses compare-exchange instead.

We can fix that by giving PinBuffer() a new skip_if_not_valid mode that
doesn't pin invalid buffers.  It might occasionally skip when it
shouldn't due to the unlocked read of the header flags, but that's
unlikely and perfectly acceptable for an opportunistic optimisation
routine, and it can only succeed when it really should due to the
compare-exchange loop.

Note that this fixes ReadRecentBuffer()'s failure to bump the usage
count. While this could be seen as a bug, there currently aren't cases
affected by this in core, so it doesn't seem worth backpatching that portion.

Author: Thomas Munro <thomas.munro@gmail.com>
Reported-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Matthias van de Meent <boekewurm+postgres@gmail.com>
Discussion: https://postgr.es/m/20230627020546.t6z4tntmj7wmjrfh%40awork3.anarazel.de
Discussion: https://postgr.es/m/fvfmkr5kk4nyex56ejgxj3uzi63isfxovp2biecb4bspbjrze7@az2pljabhnff
2025-10-08 13:10:40 -04:00
Andres Freund
0110e2ec5c Mark shared buffer lookup table HASH_FIXED_SIZE
StrategyInitialize() calls InitBufTable() with maximum number of entries that
the buffer lookup table can ever have. Thus there should not be any need to
allocate more element after initialization. Hence mark the hash table as fixed
sized.

Author: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
Discussion: https://postgr.es/m/CAExHW5v0jh3F_wj86yC=qBfWk0uiT94qy=Z41uzAHLHh0SerRA@mail.gmail.com
2025-09-17 20:28:43 -04:00
Andres Freund
2c78940527 bufmgr: Remove freelist, always use clock-sweep
This set of changes removes the list of available buffers and instead simply
uses the clock-sweep algorithm to find and return an available buffer.  This
also removes the have_free_buffer() function and simply caps the
pg_autoprewarm process to at most NBuffers.

While on the surface this appears to be removing an optimization it is in fact
eliminating code that induces overhead in the form of synchronization that is
problematic for multi-core systems.

The main reason for removing the freelist, however, is not the moderate
improvement in scalability, but that having the freelist would require
dedicated complexity in several upcoming patches. As we have not been able to
find a case benefiting from the freelist...

Author: Greg Burd <greg@burd.me>
Reviewed-by: Tomas Vondra <tomas@vondra.me>
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/70C6A5B5-2A20-4D0B-BC73-EB09DD62D61C@getmailspring.com
2025-09-05 12:25:59 -04:00
Andres Freund
50e4c6ace5 bufmgr: Use consistent naming of the clock-sweep algorithm
Minor edits to comments only.

Author: Greg Burd <greg@burd.me>
Reviewed-by: Tomas Vondra <tomas@vondra.me>
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/70C6A5B5-2A20-4D0B-BC73-EB09DD62D61C@getmailspring.com
2025-09-05 12:25:59 -04:00
Peter Eisentraut
e567e22290 Message style improvements
Mostly adding some quoting.
2025-08-26 22:52:11 +02:00
Peter Eisentraut
ff89e182d4 Add missing Datum conversions
Add various missing conversions from and to Datum.  The previous code
mostly relied on implicit conversions or its own explicit casts
instead of using the correct DatumGet*() or *GetDatum() functions.

We think these omissions are harmless.  Some actual bugs that were
discovered during this process have been committed
separately (80c758a2e1, fd2ab03fea).

Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://www.postgresql.org/message-id/flat/8246d7ff-f4b7-4363-913e-827dadfeb145%40eisentraut.org
2025-08-08 22:06:57 +02:00
Thomas Munro
b5cd74612c Remove obsolete comment.
Remove a comment about potential for AIO in StartReadBuffersImpl(),
because that change happened.
2025-08-09 01:46:04 +12:00
Tom Lane
9e9190154e Fix MemoryContextAllocAligned's interaction with Valgrind.
Arrange that only the "aligned chunk" part of the allocated space is
included in a Valgrind vchunk.  This suppresses complaints about that
vchunk being possibly lost because PG is retaining only pointers to
the aligned chunk.  Also make sure that trailing wasted space is
marked NOACCESS.

As a tiny performance improvement, arrange that MCXT_ALLOC_ZERO zeroes
only the returned "aligned chunk", not the wasted padding space.

In passing, fix GetLocalBufferStorage to use MemoryContextAllocAligned
instead of rolling its own implementation, which was equally broken
according to Valgrind.

Author: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/285483.1746756246@sss.pgh.pa.us
2025-08-02 21:59:46 -04:00
Robert Haas
1d1612aec7 Run pgindent.
Per buildfarm member koel, Nathan Bossart, and David Rowley.
2025-07-29 09:10:41 -04:00
Robert Haas
d5b9b2d402 Remove misleading hint for "unexpected data beyond EOF" error.
Commit ffae5cc5a6 added this hint in 2006,
but it's now obsolete and doesn't reflect what users should really check
in this situation. We were not able to agree on a new hint, so just delete
the existing one and update the comments to mention one possibility that
is known to cause problems of this kind: something other than PostgreSQL
is modifying files in the PostgreSQL data directory.

Author: Jakub Wartak <jakub.wartak@enterprisedb.com>
Reviewed-by: Robert Haas <rhaas@postgresql.org>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Christoph Berg <myon@debian.org>
Discussion: https://postgr.es/m/CAKZiRmxNbcaL76x=09Sxf7aUmrRQJBf8drzDdUHo+j9_eM+VMg@mail.gmail.com
2025-07-28 11:15:47 -04:00
Nathan Bossart
bb938e2c3c Rename CHECKPOINT_IMMEDIATE to CHECKPOINT_FAST.
The new name more accurately reflects the effects of this flag on a
requested checkpoint.  Checkpoint-related log messages (i.e., those
controlled by the log_checkpoints configuration parameter) will now
say "fast" instead of "immediate", too.  Likewise, references to
"immediate" checkpoints in the documentation have been updated to
say "fast".  This is preparatory work for a follow-up commit that
will add a MODE option to the CHECKPOINT command.

Author: Christoph Berg <myon@debian.org>
Discussion: https://postgr.es/m/aDnaKTEf-0dLiEfz%40msg.df7cb.de
2025-07-11 11:51:25 -05:00
Nathan Bossart
cd8324cc89 Rename CHECKPOINT_FLUSH_ALL to CHECKPOINT_FLUSH_UNLOGGED.
The new name more accurately relects the effects of this flag on a
requested checkpoint.  Checkpoint-related log messages (i.e., those
controlled by the log_checkpoints configuration parameter) will now
say "flush-unlogged" instead of "flush-all", too.  This is
preparatory work for a follow-up commit that will add a
FLUSH_UNLOGGED option to the CHECKPOINT command.

Author: Christoph Berg <myon@debian.org>
Discussion: https://postgr.es/m/aDnaKTEf-0dLiEfz%40msg.df7cb.de
2025-07-11 11:51:25 -05:00
Fujii Masao
78ebda66bf Speed up truncation of temporary relations.
Previously, truncating a temporary relation required scanning the entire
local buffer pool once per relation fork to invalidate buffers. This could
be slow, especially with a large local buffers, as the scan was repeated
multiple times.

A similar issue with regular tables (shared buffers) was addressed in
commit 6d05086c0a by scanning the buffer pool only once for all forks.

This commit applies the same optimization to temporary relations,
improving truncation performance.

Author: Daniil Davydov <3danissimo@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Fujii Masao <masao.fujii@gmail.com>
Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com>
Reviewed-by: Maxim Orlov <orlovmg@gmail.com>
Discussion: https://postgr.es/m/CAJDiXggNqsJOH7C5co4jA8nDk8vw-=sokyh5s1_TENWnC6Ofcg@mail.gmail.com
2025-07-04 09:03:58 +09:00
Peter Eisentraut
58fbfde152 Fix incorrect format placeholders 2025-06-03 21:38:04 +02:00
Noah Misch
4a4ee0c2c1 Remove GLOBALTABLESPACE_OID assert for locked buffers.
Commit f4ece891fc added the assertion in
an attempt to catch some defects even after VACUUM FULL or REINDEX.
However, IsCatalogTextUniqueIndexOid(tag.relNumber) always returns false
after a relfilenode change, provoking unintended assertion failures.

Reported-by: Adam Guo <adamguo@amazon.com>
Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>
Bug: #18912
Discussion: https://postgr.es/m/18912-a41c9bd0e0ad19b1@postgresql.org
2025-05-10 07:36:27 -07:00
Tom Lane
94b84a6072 Don't use double-quotes in #include's of system headers, redux.
This cleans up some loose ends left by commit e8ca9ed1d.  I hadn't
looked closely enough at these places before, but now I have.

The use of double-quoted #includes for Perl headers in plperl_system.h
seems to be simply a mistake introduced in 6c944bf3c and faithfully
copied forward since then.  (I had thought possibly it was required
by some weird Windows build setup, but there's no evidence of that in
our history.)

The occurrences in SectionMemoryManager.h and SectionMemoryManager.cpp
evidently stem from those files' origin as LLVM code.  It's
understandable that LLVM would treat their own files as needing
double-quoted #includes; but they're still system headers to us.

I also applied the same check to *.c files, and found a few other
random incorrect usages in both directions.

Our ECPG headers and test files routinely use angle brackets to refer
to ECPG headers.  I left those usages alone, since it seems reasonable
for an ECPG user to regard those headers as system headers.
2025-04-27 13:23:19 -04:00
David Rowley
78eda9e264 Fix a few more duplicate words in comments
Similar to 84fd3bc14 but these ones were found using a regex that can span
multiple lines.

Author: David Rowley <dgrowleyml@gmail.com>
Discussion: https://postgr.es/m/CAApHDvrMcr8XD107H3NV=WHgyBcu=sx5+7=WArr-n_cWUqdFXQ@mail.gmail.com
2025-04-21 13:50:50 +12:00
Michael Paquier
88e947136b Fix typos and grammar in the code
The large majority of these have been introduced by recent commits done
in the v18 development cycle.

Author: Alexander Lakhin <exclusion@gmail.com>
Discussion: https://postgr.es/m/9a7763ab-5252-429d-a943-b28941e0e28b@gmail.com
2025-04-19 19:17:42 +09:00
Noah Misch
f4ece891fc Assert lack of hazardous buffer locks before possible catalog read.
Commit 0bada39c83 fixed a bug of this kind,
which existed in all branches for six days before detection.  While the
probability of reaching the trouble was low, the disruption was extreme.  No
new backends could start, and service restoration needed an immediate
shutdown.  Hence, add this to catch the next bug like it.

The new check in RelationIdGetRelation() suffices to make autovacuum detect
the bug in commit 243e9b40f1 that led to commit
0bada39.  This also checks in a number of similar places.  It replaces each
Assert(IsTransactionState()) that pertained to a conditional catalog read.

No back-patch for now, but a back-patch of commit 243e9b4 should back-patch
this, too.  A back-patch could omit the src/test/regress changes, since back
branches won't gain new index columns.

Reported-by: Alexander Lakhin <exclusion@gmail.com>
Discussion: https://postgr.es/m/20250410191830.0e.nmisch@google.com
Discussion: https://postgr.es/m/10ec0bc3-5933-1189-6bb8-5dec4114558e@gmail.com
2025-04-17 05:00:30 -07:00
Peter Geoghegan
a6cab6a78e Harmonize function parameter names for Postgres 18.
Make sure that function declarations use names that exactly match the
corresponding names from function definitions in a few places.  These
inconsistencies were all introduced during Postgres 18 development.

This commit was written with help from clang-tidy, by mechanically
applying the same rules as similar clean-up commits (the earliest such
commit was commit 035ce1fe).
2025-04-12 12:07:36 -04:00
Andres Freund
15f0cb26b5 Increase BAS_BULKREAD based on effective_io_concurrency
Before, BAS_BULKREAD was always of size 256kB. With the default
io_combine_limit of 16, that only allowed 1-2 IOs to be in flight -
insufficient even on very low latency storage.

We don't just want to increase the size to a much larger hardcoded value, as
very large rings (10s of MBs of of buffers), appear to have negative
performance effects when reading in data that the OS has cached (but not when
actually needing to do IO).

To address this, increase the size of BAS_BULKREAD to allow for
io_combine_limit * effective_io_concurrency buffers getting read in. To
prevent the ring being much larger than useful, limit the increased size with
GetPinLimit().

The formula outlined above keeps the ring size to sizes for which we have not
observed performance regressions, unless very large effective_io_concurrency
values are used together with large shared_buffers setting.

Reviewed-by: Thomas Munro <thomas.munro@gmail.com>
Discussion: https://postgr.es/m/lqwghabtu2ak4wknzycufqjm5ijnxhb4k73vzphlt2a3wsemcd@gtftg44kdim6
Discussion: https://postgr.es/m/uvrtrknj4kdytuboidbhwclo4gxhswwcpgadptsjvjqcluzmah@brqs62irg4dt
2025-04-08 02:41:03 -04:00
Andres Freund
dcf7e1697b Add pg_buffercache_evict_{relation,all} functions
In addition to the added functions, the pg_buffercache_evict() function now
shows whether the buffer was flushed.

pg_buffercache_evict_relation(): Evicts all shared buffers in a
relation at once.
pg_buffercache_evict_all(): Evicts all shared buffers at once.

Both functions provide mechanism to evict multiple shared buffers at
once. They are designed to address the inefficiency of repeatedly calling
pg_buffercache_evict() for each individual buffer, which can be time-consuming
when dealing with large shared buffer pools. (e.g., ~477ms vs. ~2576ms for
16GB of fully populated shared buffers).

These functions are intended for developer testing and debugging
purposes and are available to superusers only.

Minimal tests for the new functions are included. Also, there was no test for
pg_buffercache_evict(), test for this added too.

No new extension version is needed, as it was already increased this release
by ba2a3c2302.

Author: Nazir Bilal Yavuz <byavuz81@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Aidar Imamov <a.imamov@postgrespro.ru>
Reviewed-by: Joseph Koshakow <koshy44@gmail.com>
Discussion: https://postgr.es/m/CAN55FZ0h_YoSqqutxV6DES1RW8ig6wcA8CR9rJk358YRMxZFmw%40mail.gmail.com
2025-04-08 02:19:32 -04:00
Andres Freund
8e293e689b aio: Make AIO more compatible with valgrind
In some edge cases valgrind flags issues with the memory referenced by
IOs. All of the cases addressed in this change are false positives.

Most of the false positives are caused by UnpinBuffer[NoOwner] marking buffer
data as inaccessible. This happens even though the AIO subsystem still holds a
pin. That's good, there shouldn't be accesses to the buffer outside of AIO
related code until it is pinned by "user" code again. But it requires some
explicit work - if the buffer is not pinned by the current backend, we need to
explicitly mark the buffer data accessible/inaccessible while executing
completion callbacks.

That however causes a cascading issue in IO workers: After the completion
callbacks for a buffer is executed, the page is marked as inaccessible. If
subsequently the same worker is executing IO targeting the same buffer, we
would get an error, as the memory is still marked inaccessible. To avoid that,
we need to explicitly mark the memory as accessible in IO workers.

Another issue is that IO executed in workers or via io_uring will not mark
memory as DEFINED. In the case of workers that is because valgrind does not
track memory definedness across processes. For io_uring that is because
valgrind does not understand io_uring, and therefore its IOs never mark memory
as defined, whether the completions are processed in the defining process or
in another context.  It's not entirely clear how to best solve that. The
current user of AIO is not affected, as it explicitly marks buffers as DEFINED
& NOACCESS anyway.  Defer solving this issue until we have a user with
different needs.

Per buildfarm animal skink.

Reviewed-by: Noah Misch <noah@leadboat.com>
Co-authored-by: Noah Misch <noah@leadboat.com>
Discussion: https://postgr.es/m/3pd4322mogfmdd5nln3zphdwhtmq3rzdldqjwb2sfqzcgs22lf@ok2gletdaoe6
2025-04-07 15:20:30 -04:00
Andres Freund
8ab4241b9f localbuf: Add Valgrind buffer access instrumentation
This mirrors 1e0dfd166b (+ 46ef520b95), for temporary table buffers. This
is mainly interesting right now because the AIO work currently triggers
spurious valgrind errors, and the fix for that is cleaner if temp buffers
behave the same as shared buffers.

This requires one change beyond the annotations themselves, namely to pin
local buffers while writing them out in FlushRelationBuffers().

Reviewed-by: Noah Misch <noah@leadboat.com>
Co-authored-by: Noah Misch <noah@leadboat.com>
Discussion: https://postgr.es/m/3pd4322mogfmdd5nln3zphdwhtmq3rzdldqjwb2sfqzcgs22lf@ok2gletdaoe6
2025-04-07 15:20:30 -04:00
Andres Freund
57dec20fd4 aio: Avoid spurious coverity warning
PgAioResult.result is never accessed in the relevant path, but coverity
complains about an uninitialized access anyway. So just zero-initialize the
whole thing.  While at it, reduce the scope of the variable.

Reported-by: Ranier Vilela <ranier.vf@gmail.com>
Reviewed-by: Noah Misch <noah@leadboat.com>
Discussion: https://postgr.es/m/CAEudQApsKqd-s+fsUQ0OmxJAMHmBSXxrAz3dCs+uvqb3iRtjSw@mail.gmail.com
2025-04-06 12:07:02 -04:00
Andres Freund
93bc3d75d8 aio: Add test_aio module
To make the tests possible, a few functions from bufmgr.c/localbuf.c had to be
exported, via buf_internals.h.

Reviewed-by: Noah Misch <noah@leadboat.com>
Co-authored-by: Andres Freund <andres@anarazel.de>
Co-authored-by: Nazir Bilal Yavuz <byavuz81@gmail.com>
Discussion: https://postgr.es/m/uvrtrknj4kdytuboidbhwclo4gxhswwcpgadptsjvjqcluzmah%40brqs62irg4dt
2025-04-01 13:47:46 -04:00
Andres Freund
ae3df4b341 read_stream: Introduce and use optional batchmode support
Submitting IO in larger batches can be more efficient than doing so
one-by-one, particularly for many small reads. It does, however, require
the ReadStreamBlockNumberCB callback to abide by the restrictions of AIO
batching (c.f. pgaio_enter_batchmode()). Basically, the callback may not:
a) block without first calling pgaio_submit_staged(), unless a
   to-be-waited-on lock cannot be part of a deadlock, e.g. because it is
   never held while waiting for IO.

b) directly or indirectly start another batch pgaio_enter_batchmode()

As this requires care and is nontrivial in some cases, batching is only
used with explicit opt-in.

This patch adds an explicit flag (READ_STREAM_USE_BATCHING) to read_stream and
uses it where appropriate.

There are two cases where batching would likely be beneficial, but where we
aren't using it yet:

1) bitmap heap scans, because the callback reads the VM

   This should soon be solved, because we are planning to remove the use of
   the VM, due to that not being sound.

2) The first phase of heap vacuum

   This could be made to support batchmode, but would require some care.

Reviewed-by: Noah Misch <noah@leadboat.com>
Reviewed-by: Thomas Munro <thomas.munro@gmail.com>
Discussion: https://postgr.es/m/uvrtrknj4kdytuboidbhwclo4gxhswwcpgadptsjvjqcluzmah%40brqs62irg4dt
2025-03-30 18:36:41 -04:00
Andres Freund
12ce89fd07 bufmgr: Use AIO in StartReadBuffers()
This finally introduces the first actual use of AIO. StartReadBuffers() now
uses the AIO routines to issue IO.

As the implementation of StartReadBuffers() is also used by the functions for
reading individual blocks (StartReadBuffer() and through that
ReadBufferExtended()) this means all buffered read IO passes through the AIO
paths.  However, as those are synchronous reads, actually performing the IO
asynchronously would be rarely beneficial. Instead such IOs are flagged to
always be executed synchronously. This way we don't have to duplicate a fair
bit of code.

When io_method=sync is used, the IO patterns generated after this change are
the same as before, i.e. actual reads are only issued in WaitReadBuffers() and
StartReadBuffers() may issue prefetch requests.  This allows to bypass most of
the actual asynchronicity, which is important to make a change as big as this
less risky.

One thing worth calling out is that, if IO is actually executed
asynchronously, the precise meaning of what track_io_timing is measuring has
changed. Previously it tracked the time for each IO, but that does not make
sense when multiple IOs are executed concurrently. Now it only measures the
time actually spent waiting for IO. A subsequent commit will adjust the docs
for this.

While AIO is now actually used, the logic in read_stream.c will often prevent
using sufficiently many concurrent IOs. That will be addressed in the next
commit.

Reviewed-by: Noah Misch <noah@leadboat.com>
Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com>
Co-authored-by: Andres Freund <andres@anarazel.de>
Co-authored-by: Thomas Munro <thomas.munro@gmail.com>
Discussion: https://postgr.es/m/uvrtrknj4kdytuboidbhwclo4gxhswwcpgadptsjvjqcluzmah%40brqs62irg4dt
Discussion: https://postgr.es/m/20210223100344.llw5an2aklengrmn@alap3.anarazel.de
Discussion: https://postgr.es/m/stj36ea6yyhoxtqkhpieia2z4krnam7qyetc57rfezgk4zgapf@gcnactj4z56m
2025-03-30 18:02:23 -04:00
Andres Freund
047cba7fa0 bufmgr: Implement AIO read support
This commit implements the infrastructure to perform asynchronous reads into
the buffer pool.

To do so, it:

- Adds readv AIO callbacks for shared and local buffers

  It may be worth calling out that shared buffer completions may be run in a
  different backend than where the IO started.

- Adds an AIO wait reference to BufferDesc, to allow backends to wait for
  in-progress asynchronous IOs

- Adapts StartBufferIO(), WaitIO(), TerminateBufferIO(), and their localbuf.c
  equivalents, to be able to deal with AIO

- Moves the code to handle BM_PIN_COUNT_WAITER into a helper function, as it
  now also needs to be called on IO completion

As of this commit, nothing issues AIO on shared/local buffers. A future commit
will update StartReadBuffers() to do so.

Buffer reads executed through this infrastructure will report invalid page /
checksum errors / warnings differently than before:

In the error case the error message will cover all the blocks that were
included in the read, rather than just the reporting the first invalid
block. If more than one block is invalid, the error will include information
about the range of the read, the first invalid block and the number of invalid
pages, with a HINT towards the server log for per-block details.

For the warning case (i.e. zero_damaged_buffers) we would previously emit one
warning message for each buffer in a multi-block read. Now there is only a
single warning message for the entire read, again referring to the server log
for more details in case of multiple checksum failures within a single larger
read.

Reviewed-by: Noah Misch <noah@leadboat.com>
Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com>
Discussion: https://postgr.es/m/uvrtrknj4kdytuboidbhwclo4gxhswwcpgadptsjvjqcluzmah%40brqs62irg4dt
Discussion: https://postgr.es/m/20210223100344.llw5an2aklengrmn@alap3.anarazel.de
Discussion: https://postgr.es/m/stj36ea6yyhoxtqkhpieia2z4krnam7qyetc57rfezgk4zgapf@gcnactj4z56m
2025-03-30 17:28:03 -04:00
Andres Freund
d445990adc Let caller of PageIsVerified() control ignore_checksum_failure
For AIO the completion of a read into shared buffers (i.e. verifying the page
including the checksum, updating the BufferDesc to reflect the IO) can happen
in a different backend than the backend that started the IO. As
ignore_checksum_failure can differ between backends, we need to allow the
caller of PageIsVerified() control whether to ignore checksum failures.

The commit leaves a gap in the PIV_* values, as an upcoming commit, which
depends on this commit, will add PIV_LOG_LOG, which better fits just after
PIV_LOG_WARNING.

Reviewed-by: Noah Misch <noah@leadboat.com>
Discussion: https://postgr.es/m/20250329212929.a6.nmisch@google.com
2025-03-30 16:27:10 -04:00
Andres Freund
b96d3c3897 pgstat: Allow checksum errors to be reported in critical sections
For AIO we execute completion callbacks in critical sections (to ensure that
AIO can in the future be used for WAL, which in turn requires that we can call
completion callbacks in critical sections, to get the resources for WAL
io). To report checksum errors a backend now has to call
pgstat_prepare_report_checksum_failure(), before entering a critical section,
which guarantees the relevant pgstats entry is in shared memory, the relevant
DSM segment is mapped into the backend's memory and the address is known via a
PgStat_EntryRef.

Reviewed-by: Noah Misch <noah@leadboat.com>
Discussion: https://postgr.es/m/wkjj4p2rmkevutkwc6tewoovdqznj6c6nvjmvii4oo5wmbh5sr@retq7d6uqs4j
2025-03-30 16:12:04 -04:00
Andres Freund
d6d8054dc7 localbuf: Track pincount in BufferDesc as well
For AIO on temporary table buffers the AIO subsystem needs to be able to
ensure a pin on a buffer while AIO is going on, even if the IO issuing query
errors out. Tracking the buffer in LocalRefCount does not work, as it would
cause CheckForLocalBufferLeaks() to assert out.

Instead, also track the refcount in BufferDesc.state, not just
LocalRefCount. This also makes local buffers behave a bit more akin to shared
buffers.

Note that we still don't need locking, AIO completion callbacks for local
buffers are executed in the issuing session (i.e. nobody else has access to
the BufferDesc).

Reviewed-by: Noah Misch <noah@leadboat.com>
Discussion: https://postgr.es/m/uvrtrknj4kdytuboidbhwclo4gxhswwcpgadptsjvjqcluzmah%40brqs62irg4dt
2025-03-29 16:36:51 -04:00
Andres Freund
08ccd56ac7 aio, bufmgr: Comment fixes/improvements
Some of these comments have been wrong for a while (12f3867f55), some I
recently introduced (da7226993f, 55b454d0e1). This includes an update to a
comment in FlushBuffer(), which will be copied in a future commit.

These changes seem big enough to be worth doing in separate commits.

Suggested-by: Noah Misch <noah@leadboat.com>
Discussion: https://postgr.es/m/20250319212530.80.nmisch@google.com
2025-03-29 14:45:42 -04:00
Andres Freund
dee8002468 Fix mis-attribution of checksum failure stats to the wrong database
Checksum failure stats could be attributed to the wrong database in two cases:

- when a read of a shared relation encountered a checksum error , it would be
  attributed to the current database, instead of the "database" representing
  shared relations

- when using CREATE DATABASE ... STRATEGY WAL_LOG checksum errors in the
  source database would be attributed to the current database

The checksum stats reporting via PageIsVerifiedExtended(PIV_REPORT_STAT) does
not have access to the information about what database a page belongs to.

This fixes the issue by removing PIV_REPORT_STAT and delegating the
responsibility to report stats to the caller, which now can learn about the
number of stats via a new optional argument.

As this changes the signature of PageIsVerifiedExtended() and all callers
should adapt to the new signature, use the occasion to rename the function to
PageIsVerified() and remove the compatibility macro.

We could instead have fixed this by adding information about the database to
the args of PageIsVerified(), but there are soon-to-be-applied patches that
need to separate the stats reporting from the PageIsVerified() call
anyway. Those patches also include testing for the failure paths, something we
inexplicably have not had.

As there is no caller of pgstat_report_checksum_failure() left, remove it.

It'd be possible, but awkward to fix this in the back branches. We considered
doing the work not quite worth it, as mis-attributed stats should still elicit
concern. The emitted error messages do allow to attribute the errors
correctly.

Discussion: https://postgr.es/m/5tyic6epvdlmd6eddgelv47syg2b5cpwffjam54axp25xyq2ga@ptwkinxqo3az
Discussion: https://postgr.es/m/mglpvvbhighzuwudjxzu4br65qqcxsnyvio3nl4fbog3qknwhg@e4gt7npsohuz
2025-03-29 13:38:35 -04:00
Thomas Munro
ce1a75c4fe Support buffer forwarding in StartReadBuffers().
StartReadBuffers() reports a short read when it finds a cached block
that ends a range needing I/O by updating the caller's *nblocks.  It
doesn't want to have to unpin the trailing hit that it knows the caller
wants, so the v17 version used sleight of hand in the name of
simplicity: it included it in *nblocks as if it were part of the I/O,
but internally tracked the shorter real I/O size in io_buffers_len (now
removed).

This API change "forwards" the delimiting buffer to the next call.  It's
still pinned, and still stored in the caller's array, but *nblocks no
longer includes stray buffers that are not really part of the operation.
The expectation is that the caller still wants the rest of the blocks
and will call again starting from that point, and now it can pass the
already pinned buffer back in (or choose not to and release it).

The change is needed for the coming asynchronous I/O version's larger
version of the problem: by definition it must move BM_IO_IN_PROGRESS
negotiation from WaitReadBuffers() to StartReadBuffers(), but it might
already have many buffers pinned before it discovers a need to split an
I/O.  (The current synchronous I/O version hides that detail from
callers by looping over smaller reads if required to make all covered
buffers valid in WaitReadBuffers(), so it looks like one operation but
it might occasionally be several under the covers.)

Aside from avoiding unnecessary pin traffic, this will also be important
for later work on out-of-order streams: you can't prioritize data that
is already available right now if that fact is hidden from you.

The new API is natural for read_stream.c (see ed0b87ca).  After a short
read it leaves forwarded buffers where they fell in its circular queue
for the continuing call to pick up.

Single-block StartReadBuffer() and traditional ReadBuffer() share code
but are not affected by the change.  They don't do multi-block I/O.

Reviewed-by: Andres Freund <andres@anarazel.de> (earlier versions)
Discussion: https://postgr.es/m/CA%2BhUKGK_%3D4CVmMHvsHjOVrK6t4F%3DLBpFzsrr3R%2BaJYN8kcTfWg%40mail.gmail.com
2025-03-21 20:43:59 +13:00
Andres Freund
202b12774d bufmgr: Improve stats when a buffer is read in concurrently
Previously we would have the following inaccuracies when a backend tried to
read in a buffer, but that buffer was read in concurrently by another backend:
- the read IO was double-counted in the global buffer access stats (pgBufferUsage)
- the buffer hit was not accounted for in:
  - global buffer access statistics
  - pg_stat_io
  - relation level IO stats
  - vacuum cost balancing

While trying to read in a buffer that is concurrently read in by another
backend is not a common occurrence, it's also not that rare, e.g. due to
concurrent sequential scans on the same relation.  This scenario has become
more likely in PG 17, due to the introducing of read streams, which can pin
multiple buffers before calling StartBufferIO() for all the buffers.

This behaviour has historically grown, but there doesn't seem to be any reason
to continue with the wrong accounting.

Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Discussion: https://postgr.es/m/CAAKRu_Zk-B08AzPsO-6680LUHLOCGaNJYofaxTFseLa=OepV1g@mail.gmail.com
2025-03-20 19:58:22 -04:00
Thomas Munro
10f6646847 Introduce io_max_combine_limit.
The existing io_combine_limit can be changed by users.  The new
io_max_combine_limit is fixed at server startup time, and functions as a
silent clamp on the user setting.  That in itself is probably quite
useful, but the primary motivation is:

aio_init.c allocates shared memory for all asynchronous IOs including
some per-block data, and we didn't want to waste memory you'd never used
by assuming they could be up to PG_IOV_MAX.  This commit already halves
the size of 'AioHandleIov' and 'AioHandleData'.  A follow-up commit can
now expand PG_IOV_MAX without affecting that.

Since our GUC system doesn't support dependencies or cross-checks
between GUCs, the user-settable one now assigns a "raw" value to
io_combine_limit_guc, and the lower of io_combine_limit_guc and
io_max_combine_limit is maintained in io_combine_limit.

Reviewed-by: Andres Freund <andres@anarazel.de> (earlier version)
Discussion: https://postgr.es/m/CA%2BhUKG%2B2T9p-%2BzM6Eeou-RAJjTML6eit1qn26f9twznX59qtCA%40mail.gmail.com
2025-03-19 15:23:54 +13:00
Andres Freund
771ba90298 localbuf: Introduce StartLocalBufferIO()
To initiate IO on a shared buffer we have StartBufferIO(). For temporary table
buffers no similar function exists - likely because the code for that
currently is very simple due to the lack of concurrency.

However, the upcoming AIO support will make it possible to re-encounter a
local buffer, while the buffer already is the target of IO. In that case we
need to wait for already in-progress IO to complete. This commit makes it
easier to add the necessary code, by introducing StartLocalBufferIO().

Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Discussion: https://postgr.es/m/CAAKRu_b9anbWzEs5AAF9WCvcEVmgz-1AkHSQ-CLLy-p7WHzvFw@mail.gmail.com
2025-03-15 22:07:48 -04:00
Andres Freund
4b4d33b9ea localbuf: Introduce FlushLocalBuffer()
Previously we had two paths implementing writing out temporary table
buffers. For shared buffers, the logic for that is centralized in
FlushBuffer(). Introduce FlushLocalBuffer() to do the same for local buffers.

Besides being a nice cleanup on its own, it also makes an upcoming change
slightly easier.

Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Discussion: https://postgr.es/m/CAAKRu_b9anbWzEs5AAF9WCvcEVmgz-1AkHSQ-CLLy-p7WHzvFw@mail.gmail.com
2025-03-15 22:07:48 -04:00
Andres Freund
dd6f2618f6 localbuf: Introduce TerminateLocalBufferIO()
Previously TerminateLocalBufferIO() was open-coded in multiple places, which
doesn't seem like a great idea. While TerminateLocalBufferIO() currently is
rather simple, an upcoming patch requires additional code to be added to
TerminateLocalBufferIO(), making this modification particularly worthwhile.

For some reason FlushRelationBuffers() previously cleared BM_JUST_DIRTIED,
even though that's never set for temporary buffers. This is not carried over
as part of this change.

Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Discussion: https://postgr.es/m/CAAKRu_b9anbWzEs5AAF9WCvcEVmgz-1AkHSQ-CLLy-p7WHzvFw@mail.gmail.com
2025-03-15 22:07:48 -04:00
Andres Freund
0762a151b0 localbuf: Introduce InvalidateLocalBuffer()
Previously, there were three copies of this code, two of them
identical. There's no good reason for that.

This change is nice on its own, but the main motivation is the AIO patchset,
which needs to add extra checks the deduplicated code, which of course is
easier if there is only one version.

Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Discussion: https://postgr.es/m/CAAKRu_b9anbWzEs5AAF9WCvcEVmgz-1AkHSQ-CLLy-p7WHzvFw@mail.gmail.com
2025-03-15 22:07:48 -04:00
Andres Freund
fa6af9b25e localbuf: Fix dangerous coding pattern in GetLocalVictimBuffer()
If PinLocalBuffer() were to modify the buf_state, the buf_state in
GetLocalVictimBuffer() would be out of date. Currently that does not happen,
as PinLocalBuffer() only modifies the buf_state if adjust_usagecount=true and
GetLocalVictimBuffer() passes false.

However, it's easy to make this not the case anymore - it cost me a few hours
to debug the consequences.

The minimal fix would be to just refetch the buf_state after after calling
PinLocalBuffer(), but the same danger exists in later parts of the
function. Instead, declare buf_state in the narrower scopes and re-read the
state in conditional branches.  Besides being safer, it also fits well with
an upcoming set of cleanup patches that move the contents of the conditional
branches in GetLocalVictimBuffer() into helper functions.

I "broke" this in 794f259447.

Arguably this should be backpatched, but as the relevant functions are not
exported and there is no actual misbehaviour, I chose to not backpatch, at
least for now.

Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Discussion: https://postgr.es/m/CAAKRu_b9anbWzEs5AAF9WCvcEVmgz-1AkHSQ-CLLy-p7WHzvFw@mail.gmail.com
2025-03-15 22:07:48 -04:00
Thomas Munro
01261fb078 Improve buffer manager API for backend pin limits.
Previously the support functions assumed that the caller needed one pin
to make progress, and could optionally use some more, allowing enough
for every connection to do the same.  Add a couple more functions for
callers that want to know:

* what the maximum possible number could be, irrespective of currently
  held pins, for space planning purposes

* how many additional pins they could acquire right now, without the
  special case allowing one pin, for callers that already hold pins and
  could already make progress even if no extra pins are available

The pin limit logic began in commit 31966b15.  This refactoring is
better suited to read_stream.c, which will be adjusted to respect the
remaining limit as it changes over time in a follow-up commit.  It also
computes MaxProportionalPins up front, to avoid performing divisions
whenever a caller needs to check the balance.

Reviewed-by: Andres Freund <andres@anarazel.de> (earlier versions)
Discussion: https://postgr.es/m/CA%2BhUKGK_%3D4CVmMHvsHjOVrK6t4F%3DLBpFzsrr3R%2BaJYN8kcTfWg%40mail.gmail.com
2025-03-14 17:13:09 +13:00
Michael Paquier
6c349d83b6 Re-add GUC track_wal_io_timing
This commit is a rework of 2421e9a51d, about which Andres Freund has
raised some concerns as it is valuable to have both track_io_timing and
track_wal_io_timing in some cases, as the WAL write and fsync paths can
be a major bottleneck for some workloads.  Hence, it can be relevant to
not calculate the WAL timings in environments where pg_test_timing
performs poorly while capturing some IO data under track_io_timing for
the non-WAL IO paths.  The opposite can be also true: it should be
possible to disable the non-WAL timings and enable the WAL timings (the
previous GUC setups allowed this possibility).

track_wal_io_timing is added back in this commit, controlling if WAL
timings should be calculated in pg_stat_io for the read, fsync and write
paths, as done previously with pg_stat_wal.  pg_stat_wal previously
tracked only the sync and write parts (now removed), read stats is new
data tracked in pg_stat_io, all three are aggregated if
track_wal_io_timing is enabled.  The read part matters during recovery
or if a XLogReader is used.

Extra note: more control over if the types of timings calculated in
pg_stat_io could be done with a GUC that lists pairs of (IOObject,IOOp).

Reported-by: Andres Freund <andres@anarazel.de>
Author: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Co-authored-by: Michael Paquier <michael@paquier.xyz>
Discussion: https://postgr.es/m/3opf2wh2oljco6ldyqf7ukabw3jijnnhno6fjb4mlu6civ5h24@fcwmhsgmlmzu
2025-02-26 09:49:59 +09:00
Andres Freund
37c87e63f9 Change relpath() et al to return path by value
For AIO, and also some other recent patches, we need the ability to call
relpath() in a critical section. Until now that was not feasible, as it
allocated memory.

The fact that relpath() allocated memory also made it awkward to use in log
messages because we had to take care to free the memory afterwards. Which we
e.g. didn't do for when zeroing out an invalid buffer.

We discussed other solutions, e.g. filling a pre-allocated buffer that's
passed to relpath(), but they all came with plenty downsides or were larger
projects. The easiest fix seems to be to make relpath() return the path by
value.

To be able to return the path by value we need to determine the maximum length
of a relation path. This patch adds a long #define that computes the exact
maximum, which is verified to be correct in a regression test.

As this change the signature of relpath(), extensions using it will need to
adapt their code. We discussed leaving a backward-compat shim in place, but
decided it's not worth it given the use of relpath() doesn't seem widespread.

Discussion: https://postgr.es/m/xeri5mla4b5syjd5a25nok5iez2kr3bm26j2qn4u7okzof2bmf@kwdh2vf7npra
2025-02-25 09:02:07 -05:00