mirror of https://github.com/postgres/postgres.git synced 2026-01-27 21:43:08 +03:00
Commit Graph

2996 Commits

Author SHA1 Message Date
Andres Freund
45f658dacb freespace: Don't modify page without any lock
Before this commit fsm_vacuum_page() modified the page without any lock on the
page. Historically that was kind of ok, as we didn't rely on the freespace to
really stay consistent and we did not have checksums. But these days pages are
checksummed and there are ways for FSM pages to be included in WAL records,
even if the FSM itself is still not WAL logged. If a FSM page ever were
modified while a WAL record referenced that page, we'd be in trouble, as the
WAL CRC could end up getting corrupted.

The reason to address this right now is a series of patches with the goal to
only allow modifications of pages with an appropriate lock level. Obviously
not having any lock is not appropriate :)

Reviewed-by: Kirill Reshke <reshkekirill@gmail.com>
Discussion: https://postgr.es/m/4wggb7purufpto6x35fd2kwhasehnzfdy3zdcu47qryubs2hdz@fa5kannykekr
Discussion: https://postgr.es/m/e6a8f734-2198-4958-a028-aba863d4a204@iki.fi
2026-01-12 12:40:00 -05:00
Heikki Linnakangas
ad853bb877 Fix misc typos, mostly in comments
The only user-visible change is the fix in the "malformed
pg_dependencies" error detail. That one is new in commit e1405aa5e3,
so no backpatching required.
2026-01-08 18:10:08 +02:00
Michael Paquier
f1e251be80 Allow bgworkers to be terminated for database-related commands
Background workers gain a new flag, called BGWORKER_INTERRUPTIBLE, that
offers the possibility to terminate the workers when these are connected
to a database that is involved in one of the following commands:
ALTER DATABASE RENAME TO
ALTER DATABASE SET TABLESPACE
CREATE DATABASE
DROP DATABASE

This is useful to give background workers the same behavior as backends
and autovacuum workers, which are stopped when these commands are
executed.  The default behavior, that exists since 9.3, is still to
never terminate bgworkers connected to the database involved in any of
these commands.  The new flag has to be set to terminate the workers.

A couple of tests are added to worker_spi to track the commands that
impact the termination of the workers.  There is a test case for a
non-interruptible worker, additionally, that relies on an injection
point to make the wait time in CountOtherDBBackends() reduced from 5s to
0.3s for faster test runs.  The tests rely on the contents of the server
logs to check if a worker has been started or terminated:
- LOG generated by worker_spi_main() at startup, once connection to
database is done.
- FATAL in bgworker_die() when terminated.
A couple of tests run in the CI have shown that this method is stable
enough.  The safe_psql() calls that scan pg_stat_activity could be
replaced with some poll_query_until() for more stability, if the current
method proves to be an issue in the buildfarm.

Author: Aya Iwata <iwata.aya@fujitsu.com>
Reviewed-by: Hayato Kuroda <kuroda.hayato@fujitsu.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: Ryo Matsumura <matsumura.ryo@fujitsu.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Peter Smith <smithpb2250@gmail.com>
Discussion: https://postgr.es/m/OS7PR01MB11964335F36BE41021B62EAE8EAE4A@OS7PR01MB11964.jpnprd01.prod.outlook.com
2026-01-06 14:24:29 +09:00
David Rowley
4c144e0452 Use the GetPGProcByNumber() macro when possible
A few places were accessing &ProcGlobal->allProcs directly, so adjust
them to use the accessor macro instead.

Author: Maksim Melnikov <m.melnikov@postgrespro.ru>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: David Rowley <dgrowleyml@gmail.com>
Discussion: https://postgr.es/m/80621c00-aba6-483c-88b1-a845461d1165@postgrespro.ru
2026-01-05 21:19:03 +13:00
Michael Paquier
b8cfcb9e00 Fix typos and inconsistencies in code and comments
This change is a cocktail of harmonization of function argument names,
grammar typos, renames for better consistency and unused code (see
ltree).  All of these have been spotted by the author.

Author: Alexander Lakhin <exclusion@gmail.com>
Discussion: https://postgr.es/m/b2c0d0b7-3944-487d-a03d-d155851958ff@gmail.com
2026-01-05 09:19:15 +09:00
Bruce Momjian
451c43974f Update copyright for 2026
Backpatch-through: 14
2026-01-01 13:24:10 -05:00
Masahiko Sawada
85d5bd308b Fix macro name for io_uring_queue_init_mem check.
Commit f54af9f267 added a check for
io_uring_queue_init_mem(). However, it used the macro name
HAVE_LIBURING_QUEUE_INIT_MEM in both meson.build and the C code, while
the Autotools build script defined HAVE_IO_URING_QUEUE_INIT_MEM. As a
result, the optimization was never enabled in builds configured with
Autotools, as the C code checked for the wrong macro name.

This commit changes the macro name to HAVE_IO_URING_QUEUE_INIT_MEM in
meson.build and the C code. This matches the actual function
name (io_uring_queue_init_mem), following the standard HAVE_<FUNCTION>
convention.
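
The mismatch described above can be sketched in a few lines. This is only an illustration under stated assumptions: the #define stands in for what an Autotools-configured build would emit, and the two helper functions are hypothetical, not code from the tree.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumption: stands in for the symbol an Autotools build defines. */
#define HAVE_IO_URING_QUEUE_INIT_MEM 1

/* Sanity check that the simulated build-system symbol is visible here. */
static_assert(HAVE_IO_URING_QUEUE_INIT_MEM == 1, "simulated configure result");

/* Hypothetical helper: checks the name the build system never defined. */
static bool
optimization_enabled_wrong_name(void)
{
#ifdef HAVE_LIBURING_QUEUE_INIT_MEM
    return true;
#else
    return false;               /* silently compiled out */
#endif
}

/* Hypothetical helper: checks the name following HAVE_<FUNCTION>. */
static bool
optimization_enabled_right_name(void)
{
#ifdef HAVE_IO_URING_QUEUE_INIT_MEM
    return true;
#else
    return false;
#endif
}
```

Because an undefined macro in #ifdef is not an error, the wrong spelling compiles cleanly and simply disables the code path, which is why such a bug can go unnoticed.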

Backpatch to 18, where the macro was introduced.

Bug: #19368
Reported-by: Evan Si <evsi@amazon.com>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/19368-016d79a7f3a1c599@postgresql.org
Backpatch-through: 18
2025-12-31 11:18:14 -08:00
Michael Paquier
9adf32da6b Split some long Makefile lists
This change makes code diffs more readable when adding new items or
removing old items, while ensuring that lines do not get excessively
long.  Some SUBDIRS, PROGRAMS and REGRESS lists are split.

Note that there are a few more REGRESS lists that could be split,
particularly in contrib/.

Author: Jelte Fennema-Nio <postgres@jeltef.nl>
Co-Authored-By: Jacob Champion <jacob.champion@enterprisedb.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: Japin Li <japinli@hotmail.com>
Reviewed-by: Man Zeng <zengman@halodbtech.com>
Discussion: https://postgr.es/m/DF6HDGB559U5.3MPRFCWPONEAE@jeltef.nl
2025-12-28 09:17:42 +09:00
Michael Paquier
bde3a46160 Upgrade BufFile to use int64 for byte positions
This change has the advantage of removing some weird type casts, caused
by offset calculations based on pgoff_t but saved as int (on older
branches we use off_t, which could be 4 or 8 bytes depending on the
environment).  These are safe currently because capped by
MAX_PHYSICAL_FILESIZE, but we would run into problems if we were to make
MAX_PHYSICAL_FILESIZE larger or allow callers of these routines to use a
larger physical max size on demand.

While at it, this improves BufFileDumpBuffer() so that we do not use an
offset for "availbytes".  It is not a file offset per se, but a number
of available bytes.

This change should lead to no functional changes.

Author: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/aUStrqoOCDRFAq1M@paquier.xyz
2025-12-26 08:41:56 +09:00
Masahiko Sawada
67c20979ce Toggle logical decoding dynamically based on logical slot presence.
Previously logical decoding required wal_level to be set to 'logical'
at server start. This meant that users had to incur the overhead of
logical-level WAL logging even when no logical replication slots were
in use.

This commit adds functionality to automatically control logical
decoding availability based on logical replication slot presence. The
newly introduced module logicalctl.c allows logical decoding to be
dynamically activated when needed when wal_level is set to
'replica'.

When the first logical replication slot is created, the system
automatically increases the effective WAL level to maintain
logical-level WAL records. Conversely, after the last logical slot is
dropped or invalidated, it decreases back to 'replica' WAL level.

While activation occurs synchronously right after creating the first
logical slot, deactivation happens asynchronously through the
checkpointer process. This design avoids a race condition at the end
of recovery; a concurrent deactivation could happen while the startup
process enables logical decoding at the end of recovery, but WAL
writes are still not permitted until recovery fully completes. The
checkpointer will handle it after recovery is done. Asynchronous
deactivation also avoids excessive toggling of the logical decoding
status in workloads that repeatedly create and drop a single logical
slot. On the other hand, this lazy approach can delay changes to
effective_wal_level and the disabling of logical decoding, especially
when the checkpointer is busy with other tasks. We chose this lazy
approach in all deactivation paths to keep the implementation simple,
even though laziness is strictly required only for end-of-recovery
cases. Future work might address this limitation either by using a
dedicated worker instead of the checkpointer, or by implementing
synchronous waiting during slot drops if workloads are significantly
affected by the lazy deactivation of logical decoding.

The effective WAL level, determined internally by XLogLogicalInfo, is
allowed to change within a transaction until an XID is assigned. Once
an XID is assigned, the value becomes fixed for the remainder of the
transaction. This behavior ensures that the logging mode remains
consistent within a writing transaction, similar to the behavior of
GUC parameters.

A new read-only GUC parameter effective_wal_level is introduced to
monitor the actual WAL level in effect. This parameter reflects the
current operational WAL level, which may differ from the configured
wal_level setting.

Bump PG_CONTROL_VERSION as it adds a new field to CheckPoint struct.

Reviewed-by: Shveta Malik <shveta.malik@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Reviewed-by: Hayato Kuroda <kuroda.hayato@fujitsu.com>
Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Reviewed-by: Peter Smith <smithpb2250@gmail.com>
Reviewed-by: Shlok Kyal <shlok.kyal.oss@gmail.com>
Reviewed-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
Discussion: https://postgr.es/m/CAD21AoCVLeLYq09pQPaWs+Jwdni5FuJ8v2jgq-u9_uFbcp6UbA@mail.gmail.com
2025-12-23 10:13:16 -08:00
Michael Paquier
e5f3839af6 Switch buffile.c/h to use pgoff_t instead of off_t
off_t, which is 4 bytes on Windows, was previously used for offsets,
hence imposing a hard limit of 2GB on file sizes in the backend code.
This leads to some simplification in these files, removing some
casts based on long, also 4 bytes on Windows.

This commit removes one comment introduced in db3c4c3a2d, not relevant
anymore as pgoff_t is a safe 8-byte alternative on Windows.

This change is surprisingly not invasive, as the callers of
BufFileTell(), BufFileSeek() and BufFileTruncateFileSet() (worker.c,
tuplestore.c, etc.) track offsets in local structures that just need to
switch from off_t to pgoff_t for the most part.

The file is still relying on a maximum file size of
MAX_PHYSICAL_FILESIZE (1GB).  This change allows the code to make this
maximum potentially larger in the future, or larger on a per-demand
basis.

Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/aUStrqoOCDRFAq1M@paquier.xyz
2025-12-23 07:41:34 +09:00
Andres Freund
09ae2c8bac bufmgr: Optimize & harmonize LockBufHdr(), LWLockWaitListLock()
The main optimization is for LockBufHdr() to delay initializing
SpinDelayStatus, similar to what LWLockWaitListLock already did. The
initialization is sufficiently expensive & buffer header lock acquisitions are
sufficiently frequent, to make it worthwhile to instead have a fastpath (via a
likely() branch) that does not initialize the SpinDelayStatus.

While LWLockWaitListLock() already had the aforementioned optimization, it did not
use likely(), and inspection of the assembly shows that this indeed leads to
worse code generation (also observed in a microbenchmark). Fix that by adding
the likely().

While the LockBufHdr() improvement is a small gain on its own, it mainly is
aimed at preventing a regression after a future commit, which requires
additional locking to set hint bits.

While touching both, also make the comments more similar to each other.

Reviewed-by: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Discussion: https://postgr.es/m/fvfmkr5kk4nyex56ejgxj3uzi63isfxovp2biecb4bspbjrze7@az2pljabhnff
2025-12-19 13:23:33 -05:00
Nathan Bossart
48d4a1423d Allow passing a pointer to GetNamedDSMSegment()'s init callback.
This commit adds a new "void *arg" parameter to
GetNamedDSMSegment() that is passed to the initialization callback
function.  This is useful for reusing an initialization callback
function for multiple DSM segments.

Author: Zsolt Parragi <zsolt.parragi@percona.com>
Reviewed-by: Sami Imseih <samimseih@gmail.com>
Discussion: https://postgr.es/m/CAN4CZFMjh8TrT9ZhWgjVTzBDkYZi2a84BnZ8bM%2BfLPuq7Cirzg%40mail.gmail.com
2025-12-15 14:27:16 -06:00
Andres Freund
30df61990c bufmgr: Add one-entry cache for private refcount
The private refcount entry for a buffer is often looked up repeatedly for the
same buffer, e.g. to pin and then unpin a buffer. Benchmarking shows that it's
worthwhile to have a one-entry cache for that case. With that cache in place,
it's worth splitting GetPrivateRefCountEntry() into a small inline
portion (for the cache hit case) and an out-of-line helper for the rest.
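
A minimal sketch of the pattern described above, assuming a hypothetical fixed-size table; the names Entry, lookup() and lookup_slow() are illustrative and do not correspond to the bufmgr.c data structures.

```c
#include <assert.h>
#include <stddef.h>

#define NENTRIES 8

/* Hypothetical stand-in for a per-buffer private refcount entry. */
typedef struct Entry
{
    int buffer;
    int refcount;
} Entry;

static Entry entries[NENTRIES];
static int   cached_slot = -1;  /* one-entry cache: slot of the last hit */
static int   slow_lookups = 0;  /* counts out-of-line lookups, for illustration */

/* Out-of-line part: full search, remembers the slot it found. */
static Entry *
lookup_slow(int buffer)
{
    slow_lookups++;
    for (int i = 0; i < NENTRIES; i++)
    {
        if (entries[i].buffer == buffer)
        {
            cached_slot = i;
            return &entries[i];
        }
    }
    return NULL;
}

/* Small inline part: only handles the cache-hit case. */
static inline Entry *
lookup(int buffer)
{
    assert(cached_slot < NENTRIES);
    if (cached_slot >= 0 && entries[cached_slot].buffer == buffer)
        return &entries[cached_slot];
    return lookup_slow(buffer);
}
```

Looking up the same buffer twice in a row (as in pin followed by unpin) takes the slow path once and the inline path afterwards.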

This is helpful for some workloads today, but becomes more important in an
upcoming patch that will utilize the private refcount infrastructure to also
store whether the buffer is currently locked, as that increases the rate of
lookups substantially.

Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Discussion: https://postgr.es/m/6rgb2nvhyvnszz4ul3wfzlf5rheb2kkwrglthnna7qhe24onwr@vw27225tkyar
2025-12-14 13:09:43 -05:00
Andres Freund
edbaaea0a9 bufmgr: Separate keys for private refcount infrastructure
This makes lookups faster, due to allowing auto-vectorized lookups. It is also
beneficial for an upcoming patch, independent of auto-vectorization, as the
upcoming patch wants to track more information for each pinned buffer, making
the existing loop, iterating over an array of PrivateRefCountEntry, more
expensive due to increasing its size.

Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Discussion: https://postgr.es/m/fvfmkr5kk4nyex56ejgxj3uzi63isfxovp2biecb4bspbjrze7@az2pljabhnff
2025-12-14 13:09:43 -05:00
Peter Eisentraut
493eb0da31 Replace most StaticAssertStmt() with StaticAssertDecl()
Similar to commit 75f49221c2, it is preferable to use
StaticAssertDecl() instead of StaticAssertStmt() when possible.

Discussion: https://www.postgresql.org/message-id/flat/CA%2BhUKGKvr0x_oGmQTUkx%3DODgSksT2EtgCA6LmGx_jQFG%3DsDUpg%40mail.gmail.com
2025-12-12 10:06:40 +01:00
Nathan Bossart
b4cbc106a6 Fix some comments.
Like commit 123661427b, these were discovered while reviewing
Aleksander Alekseev's proposed changes to pgindent.
2025-12-11 15:13:04 -06:00
Heikki Linnakangas
b65f1ad9b1 Fix comment on how temp files and subtransactions are handled
The comment was accurate a long time ago, but not any more. I failed
to update the comment in commit ab3148b712.
2025-12-11 15:57:11 +02:00
Michael Paquier
4f7dacc5b8 Use palloc_object() and palloc_array(), the last change
This is the last batch of changes that have been suggested by the
author, this part covering the non-trivial changes.  Some of the changes
suggested have been discarded as they seem to lead to more instructions
generated, leaving the parts that can be qualified as in-place
replacements.

Similar work has been done in 1b105f9472, 0c3c5c3b06 and
31d3847a37.

Author: David Geier <geidav.pg@gmail.com>
Discussion: https://postgr.es/m/ad0748d4-3080-436e-b0bc-ac8f86a3466a@gmail.com
2025-12-11 14:29:12 +09:00
Heikki Linnakangas
fa44b8b7fb Fix some near-bugs related to ResourceOwner function arguments
These functions took a ResourceOwner argument, but only checked if it
was NULL, and then used CurrentResourceOwner for the actual work.
Surely the intention was to use the passed-in resource owner. All
current callers passed CurrentResourceOwner or NULL, so this has no
consequences at the moment, but it's an accident waiting to happen for
future callers and extensions.
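
The shape of this near-bug can be sketched as follows. Owner, current_owner and both functions are hypothetical stand-ins, not the actual ResourceOwner API.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a resource owner. */
typedef struct Owner
{
    int nremembered;
} Owner;

/* Hypothetical stand-in for the global "current" owner. */
static Owner *current_owner = NULL;

/* Near-bug: the argument is only NULL-checked, then ignored. */
static void
remember_ignoring_argument(Owner *owner)
{
    if (owner == NULL)
        return;
    assert(current_owner != NULL);
    current_owner->nremembered++;
}

/* Intended behavior: use the owner that was passed in. */
static void
remember_using_argument(Owner *owner)
{
    if (owner == NULL)
        return;
    owner->nremembered++;
}
```

As long as every caller passes the current owner or NULL, the two variants behave identically; they diverge only once a caller passes a different owner.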

Author: Matthias van de Meent <boekewurm+postgres@gmail.com>
Discussion: https://www.postgresql.org/message-id/CAEze2Whnfv8VuRZaohE-Af+GxBA1SNfD_rXfm84Jv-958UCcJA@mail.gmail.com
Backpatch-through: 17
2025-12-10 11:43:16 +02:00
Heikki Linnakangas
bae9d2f892 Fix typo in comment
Author: Xuneng Zhou <xunengzhou@gmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://www.postgresql.org/message-id/CABPTF7V8CbOXGePqrad6EH3Om7DRhNiO3C0rQ-62UuT7RdU-GQ@mail.gmail.com
2025-12-10 01:06:03 +02:00
Michael Paquier
1b105f9472 Use palloc_object() and palloc_array() in backend code
The idea is to further encourage the use of these new routines across the
tree, as these offer stronger type safety guarantees than palloc().
This batch of changes includes most of the trivial changes suggested by
the author for src/backend/.

A total of 334 files are updated here.  Among these files, 48 of them
have their build change slightly; these are caused by line number
changes as the new allocation formulas are simpler, shaving around 100
lines of code in total.

Similar work has been done in 0c3c5c3b06 and 31d3847a37.

Author: David Geier <geidav.pg@gmail.com>
Discussion: https://postgr.es/m/ad0748d4-3080-436e-b0bc-ac8f86a3466a@gmail.com
2025-12-10 07:36:46 +09:00
Peter Eisentraut
2268f2b91b Remove useless casts in format arguments
There were a number of useless casts in format arguments, either
where the input to the cast was already in the right type, or
seemingly uselessly casting between types instead of just using the
right format placeholder to begin with.

Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/07fa29f9-42d7-4aac-8834-197918cbbab6%40eisentraut.org
2025-12-09 07:33:08 +01:00
Peter Eisentraut
2b117bb014 Remove unnecessary casts in printf format arguments (%zu/%zd)
Many of these are probably left over from before use of %zu/%zd was
portable.

Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/07fa29f9-42d7-4aac-8834-197918cbbab6%40eisentraut.org
2025-12-09 07:33:08 +01:00
Nathan Bossart
80f6e2fb4a Fix key size of PrivateRefCountHash.
The key is the first member of PrivateRefCountEntry, which has type
Buffer.  This commit changes the key size from sizeof(int32) to
sizeof(Buffer).  This appears to be an oversight in commit
4b4b680c3d, but it's of no consequence because Buffer has been a
signed 32-bit integer for a long time.

Author: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Discussion: https://postgr.es/m/aS77DTpl0fOkIKSZ%40ip-10-97-1-34.eu-west-3.compute.internal
2025-12-04 15:42:18 -06:00
Andres Freund
6c5c393b74 Rename BUFFERPIN wait event class to BUFFER
In an upcoming patch more wait events will be added to the wait event
class (for buffer locking), making the current name too
specific. Alternatively we could introduce a dedicated wait event class for
those, but it seems somewhat confusing to have a BUFFERPIN and a BUFFER wait
event class.

Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Discussion: https://postgr.es/m/fvfmkr5kk4nyex56ejgxj3uzi63isfxovp2biecb4bspbjrze7@az2pljabhnff
2025-12-03 18:38:20 -05:00
Andres Freund
156680055d bufmgr: Turn BUFFER_LOCK_* into an enum
It seems cleaner to use an enum to tie the different values together. It also
helps to have a more descriptive type in the argument to various functions.

Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Discussion: https://postgr.es/m/fvfmkr5kk4nyex56ejgxj3uzi63isfxovp2biecb4bspbjrze7@az2pljabhnff
2025-12-03 18:38:20 -05:00
Nathan Bossart
f894acb24a Show size of DSAs and dshashes in pg_dsm_registry_allocations.
Presently, this view reports NULL for the size of DSAs and dshash
tables because 1) the current backend might not be attached to them
and 2) the registry doesn't save the pointers to the dsa_area or
dshash_table in local memory.  Also, the view doesn't show
partially-initialized entries to avoid ambiguity, since those
entries would report a NULL size as well.

This commit introduces a function that looks up the size of a DSA
given its handle (transiently attaching to the control segment if
needed) and teaches pg_dsm_registry_allocations to use it to show
the size of successfully-initialized DSA and dshash entries.
Furthermore, the view now reports partially-initialized entries
with a NULL size.

Reviewed-by: Rahila Syed <rahilasyed90@gmail.com>
Reviewed-by: Robert Haas <robertmhaas@gmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/aSeEDeznAsHR1_YF%40nathan
2025-12-02 10:29:45 -06:00
Peter Eisentraut
4f941d432b Remove useless casting to same type
This removes some casts where the input already has the same type as
the type specified by the cast.  Their presence could cause risks of
hiding actual type mismatches in the future or silently discarding
qualifiers.  It also improves readability.  Same kind of idea as
7f798aca1d and ef8fe69360.  (This does not change all such
instances, but only those hand-picked by the author.)

Author: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Reviewed-by: Nathan Bossart <nathandbossart@gmail.com>
Reviewed-by: Jacob Champion <jacob.champion@enterprisedb.com>
Reviewed-by: Peter Eisentraut <peter@eisentraut.org>
Discussion: https://www.postgresql.org/message-id/flat/aSQy2JawavlVlEB0%40ip-10-97-1-34.eu-west-3.compute.internal
2025-12-02 10:09:32 +01:00
Michael Paquier
9660906dbd Add routines for marking buffers dirty efficiently
This commit introduces new internal bufmgr routines for marking shared
buffers as dirty:
* MarkDirtyUnpinnedBuffer()
* MarkDirtyRelUnpinnedBuffers()
* MarkDirtyAllUnpinnedBuffers()

These functions provide an efficient mechanism to respectively mark one
buffer, all the buffers of a relation, or the entire shared buffer pool
as dirty, something that can be useful to force patterns for the
checkpointer.  MarkDirtyUnpinnedBufferInternal(), an extra routine, is
used by these three, to mark as dirty an unpinned buffer.

They are intended as developer tools to manipulate buffer dirtiness in
bulk, and will be used in a follow-up commit.

Author: Nazir Bilal Yavuz <byavuz81@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Aidar Imamov <a.imamov@postgrespro.ru>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Reviewed-by: Joseph Koshakow <koshy44@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Yuhang Qiu <iamqyh@gmail.com>
Reviewed-by: Xuneng Zhou <xunengzhou@gmail.com>
Discussion: https://postgr.es/m/CAN55FZ0h_YoSqqutxV6DES1RW8ig6wcA8CR9rJk358YRMxZFmw@mail.gmail.com
2025-11-28 07:39:33 +09:00
Nathan Bossart
dbdc717ac6 Teach DSM registry to retry entry initialization if needed.
If DSM registry entry initialization fails, backends could try to
use an uninitialized DSM segment, DSA, or dshash table (since the
entry is still added to the registry).  To fix, restructure the
code so that the registry retries initialization as needed.  This
commit also modifies pg_get_dsm_registry_allocations() to leave out
partially-initialized entries, as they shouldn't have any allocated
memory.

DSM registry entry initialization shouldn't fail often in practice,
but retrying was deemed better than leaving entries in a
permanently failed state (as was done by commit 1165a933aa, which
has since been reverted).

Suggested-by: Robert Haas <robertmhaas@gmail.com>
Reviewed-by: Robert Haas <robertmhaas@gmail.com>
Discussion: https://postgr.es/m/E1vJHUk-006I7r-37%40gemulon.postgresql.org
Backpatch-through: 17
2025-11-26 15:12:25 -06:00
Nathan Bossart
2dd506b859 Revert "Teach DSM registry to ERROR if attaching to an uninitialized entry."
This reverts commit 1165a933aa (and the corresponding commits on
the back-branches).  In a follow-up commit, we'll teach the
registry to retry entry initialization instead of leaving it in a
permanently failed state.

Reviewed-by: Robert Haas <robertmhaas@gmail.com>
Discussion: https://postgr.es/m/E1vJHUk-006I7r-37%40gemulon.postgresql.org
Backpatch-through: 17
2025-11-26 11:37:21 -06:00
Nathan Bossart
9446f918ac Remove a few unused struct members.
Oversights in commits ab9e0e718a, f3049a603a, and 247ce06b88.

Author: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Discussion: https://postgr.es/m/aScUuBSawPWogUxs%40ip-10-97-1-34.eu-west-3.compute.internal
2025-11-26 09:50:00 -06:00
Andres Freund
81f7738953 lwlock: Fix, currently harmless, bug in LWLockWakeup()
Accidentally the code in LWLockWakeup() checked the list of to-be-woken up
processes to see if LW_FLAG_HAS_WAITERS should be unset. That means that
HAS_WAITERS would not get unset immediately, but only during the next,
unnecessary, call to LWLockWakeup().

Luckily, as the code stands, this is just a small efficiency issue.

However, if there were (as in a patch of mine) a case in which LWLockWakeup()
would not find any backend to wake, despite the wait list not being empty,
we'd wrongly unset LW_FLAG_HAS_WAITERS, leading to potentially hanging.

While the consequences in the backbranches are limited, the code as-is is
confusing, and it is possible that there are workloads where the additional
wait list lock acquisitions hurt, therefore backpatch.

Discussion: https://postgr.es/m/fvfmkr5kk4nyex56ejgxj3uzi63isfxovp2biecb4bspbjrze7@az2pljabhnff
Backpatch-through: 14
2025-11-24 18:10:48 -05:00
Jacob Champion
e2ceff13d8 postgres: Use pg_{add,mul}_size_overflow()
The backend implementations of add_size() and mul_size() can now make
use of the APIs provided in common/int.h.
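
For readers unfamiliar with the idiom, here is a portable sketch of overflow-checked size arithmetic. The function names are hypothetical and the "return true on overflow" convention is an assumption of this sketch; the real helpers live in common/int.h and may be implemented differently (for example with compiler builtins).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns true if a + b overflows, else stores the sum. */
static bool
demo_add_size_overflow(size_t a, size_t b, size_t *result)
{
    assert(result != NULL);
    if (a > SIZE_MAX - b)
        return true;
    *result = a + b;
    return false;
}

/* Hypothetical helper: returns true if a * b overflows, else stores the product. */
static bool
demo_mul_size_overflow(size_t a, size_t b, size_t *result)
{
    assert(result != NULL);
    if (a != 0 && b > SIZE_MAX / a)
        return true;
    *result = a * b;
    return false;
}
```

A caller such as add_size() would then report an error when the helper returns true instead of continuing with a wrapped-around value.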

Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Discussion: https://postgr.es/m/CAOYmi%2B%3D%2BpqUd2MUitvgW1pAJuXgG_TKCVc3_Ek7pe8z9nkf%2BAg%40mail.gmail.com
2025-11-24 09:59:54 -08:00
Peter Eisentraut
ef8fe69360 Remove useless casts to (void *)
Their presence causes (small) risks of hiding actual type mismatches
or silently discarding qualifiers.  Some have been missed in
7f798aca1d and some are new ones along the same lines.

Author: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/aR8Yv%2BuATLKbJCgI%40ip-10-97-1-34.eu-west-3.compute.internal
2025-11-21 16:49:40 +01:00
Peter Eisentraut
97e04c74be C11 alignas instead of unions
This changes a few union members that only existed to ensure
alignments and replaces them with the C11 alignas specifier.

This change only uses fundamental alignments (meaning approximately
alignments of basic types), which all C11 compilers must support.
There are opportunities for similar changes using extended alignments,
for example in PGIOAlignedBlock, but these are not necessarily
supported by all compilers, so they are kept as a separate change.
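
The before/after shape of such a change might look like this. OldBuf and NewBuf are hypothetical types made up for this sketch, not structures from the tree.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>

/* Old style: a dummy union member exists only to force alignment. */
typedef union OldBuf
{
    char   data[64];
    double force_align;         /* never read or written */
} OldBuf;

/* C11 style: state the alignment requirement directly. */
typedef struct NewBuf
{
    alignas(double) char data[64];
} NewBuf;

/* Both spellings yield the same (fundamental) alignment. */
static_assert(alignof(NewBuf) == alignof(OldBuf), "alignment must match");
```

Here alignas(double) requests a fundamental alignment, which is the kind this commit limits itself to.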

Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/46f05236-d4d4-4b4e-84d4-faa500f14691%40eisentraut.org
2025-11-21 10:08:24 +01:00
Thomas Munro
017249b828 Add some missing #include <limits.h>.
These files relied on transitive inclusion via port/atomics.h for
constants CHAR_BIT and INT_MAX.

Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Discussion: https://postgr.es/m/536409d2-c9df-4ef3-808d-1ffc3182868c@iki.fi
2025-11-13 22:56:08 +13:00
Michael Paquier
84fb27511d Replace off_t by pgoff_t in I/O routines
PostgreSQL's Windows port has never been able to handle files larger
than 2GB due to the use of off_t for file offsets, only 32-bit on
Windows.  This causes signed integer overflow at exactly 2^31 bytes when
trying to handle files larger than 2GB, for the routines touched by this
commit.
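
The 2^31 boundary can be illustrated with fixed-width types. In this sketch int32_t stands in for a 4-byte off_t and int64_t for an 8-byte pgoff_t; the helper name is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: would a write of nbytes at offset still end at a
 * position representable in a signed 32-bit offset?  The arithmetic is
 * done in 64 bits so the check itself cannot overflow for these inputs.
 */
static bool
fits_in_32bit_offset(int64_t offset, int64_t nbytes)
{
    assert(offset >= 0 && nbytes >= 0);
    return offset + nbytes <= INT32_MAX;
}
```

The largest representable 32-bit position is 2^31 - 1, so any end position of 2^31 bytes or more needs the wider type.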

Note that large files are forbidden by ./configure (3c6248a828) and
meson (recent change, see 79cd66f28c).  This restriction also exists
in v16 and older versions for the now-dead MSVC scripts.

The code base already defines pgoff_t as __int64 (64-bit) on Windows for
this purpose, and some function declarations in headers use it, but many
internals still rely on off_t.  This commit switches more routines to
use pgoff_t, offering more portability, for areas mainly related to file
extensions and storage.

These are not critical for WAL segments yet, which have currently a
maximum size allowed of 1GB (well, this opens the door to allowing a
larger size for them).  This matters more for segment files if we want
to lift the large file restriction in ./configure and meson in the
future, which would make sense to remove once/if all traces of off_t are
gone from the tree.  This can additionally matter for out-of-core code
that may want files larger than 2GB in places where off_t is four bytes
in size.

Note that off_t is still used in other parts of the tree like
buffile.c, WAL sender/receiver, base backup, pg_combinebackup, etc.
These other code paths can be addressed separately, and their update
will be required if we want to remove the large file restriction in the
future.  This commit is a good first cut in itself towards more
portability, hopefully.

On Unix-like systems, pgoff_t is defined as off_t, so this change only
affects Windows behavior.

Author: Bryan Green <dbryan.green@gmail.com>
Reviewed-by: Thomas Munro <thomas.munro@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Discussion: https://postgr.es/m/0f238ff4-c442-42f5-adb8-01b762c94ca1@gmail.com
2025-11-13 12:41:40 +09:00
Nathan Bossart
1165a933aa Teach DSM registry to ERROR if attaching to an uninitialized entry.
If DSM entry initialization fails, backends could try to use an
uninitialized DSM segment, DSA, or dshash table (since the entry is
still added to the registry).  To fix, keep track of whether
initialization completed, and ERROR if a backend tries to attach to
an uninitialized entry.  We could instead retry initialization as
needed, but that seemed complicated, error prone, and unlikely to
help most cases.  Furthermore, such problems probably indicate a
coding error.

Reported-by: Alexander Lakhin <exclusion@gmail.com>
Reviewed-by: Sami Imseih <samimseih@gmail.com>
Discussion: https://postgr.es/m/dd36d384-55df-4fc2-825c-5bc56c950fa9%40gmail.com
Backpatch-through: 17
2025-11-12 14:30:11 -06:00
Bruce Momjian
6204d07ad6 Remove blank line in C code.
Was added in commit 5e89985928.

Reported-by: Ashutosh Bapat

Author: Ashutosh Bapat

Discussion: https://postgr.es/m/CAExHW5tba_biyuMrd_iPVzq-+XvsMdPcEnjQ+d+__V=cjYj8Pg@mail.gmail.com

Backpatch-through: master
2025-11-07 21:54:25 -05:00
Andres Freund
5310fac6e0 bufmgr: Use atomic sub for unpinning buffers
The prior commit made it legal to modify BufferDesc.state while the buffer
header spinlock is held. This allows us to replace the CAS loop
in UnpinBufferNoOwner() with an atomic sub. This improves scalability
significantly. See the prior commits for more background.
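
The difference between the two approaches can be sketched with C11 atomics. This is not the bufmgr.c code: the state layout, LOCKED_FLAG, REFCOUNT_ONE and both functions are assumptions made for illustration.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define REFCOUNT_ONE 1u             /* hypothetical: refcount in the low bits */
#define LOCKED_FLAG  (1u << 31)     /* hypothetical: header-lock bit */

/* CAS-loop variant: must wait until the lock bit is clear before updating. */
static uint32_t
unpin_with_cas(_Atomic uint32_t *state)
{
    uint32_t old = atomic_load(state);

    for (;;)
    {
        if (old & LOCKED_FLAG)
        {
            old = atomic_load(state);   /* spin until the lock is released */
            continue;
        }
        /* On failure, old is refreshed with the current value and we retry. */
        if (atomic_compare_exchange_weak(state, &old, old - REFCOUNT_ONE))
            return old - REFCOUNT_ONE;
    }
}

/* Atomic-sub variant: one operation, legal even while the lock bit is set. */
static uint32_t
unpin_with_sub(_Atomic uint32_t *state)
{
    uint32_t old = atomic_fetch_sub(state, REFCOUNT_ONE);

    assert((old & ~LOCKED_FLAG) >= REFCOUNT_ONE);
    return old - REFCOUNT_ONE;
}
```

The atomic-sub variant never retries, so contention on the state word cannot cause repeated CAS failures.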

Reviewed-by: Matthias van de Meent <boekewurm+postgres@gmail.com>
Discussion: https://postgr.es/m/fvfmkr5kk4nyex56ejgxj3uzi63isfxovp2biecb4bspbjrze7@az2pljabhnff
2025-11-06 16:43:16 -05:00
Andres Freund
c75ebc657f bufmgr: Allow some buffer state modifications while holding header lock
Until now BufferDesc.state was not allowed to be modified while the buffer
header spinlock was held. This meant that operations like unpinning buffers
needed to use a CAS loop, waiting for the buffer header spinlock to be
released before updating.

The benefit of that restriction is that it allowed us to unlock the buffer
header spinlock with just a write barrier and an unlocked write (instead of a
full atomic operation). That was important to avoid regressions in
48354581a4. However, since then the hottest buffer header spinlock uses have
been replaced with atomic operations (in particular, the most common use of
PinBuffer_Locked(), in GetVictimBuffer() (formerly in BufferAlloc()), has been
removed in 5e89985928).

This change will allow a subsequent commit to release buffer pins with a
single atomic-sub operation. That was not possible as long as such
modifications were not allowed while the buffer header spinlock was held, as an
atomic-sub would not have allowed a race-free check for the buffer header lock
being held.

Using atomic-sub to unpin buffers is a nice scalability win; however, it is not
the primary motivation for this change (although it would be sufficient). The
primary motivation is that we would like to merge the buffer content lock into
BufferDesc.state, which will result in more frequent changes of the state
variable, which in some situations can cause a performance regression, due to
an increased CAS failure rate when unpinning buffers.  The regression entirely
vanishes when using atomic-sub.

Naively implementing this would require putting CAS loops in every place
modifying the buffer state while holding the buffer header lock. To avoid
that, introduce UnlockBufHdrExt(), which can set/add flags as well as the
refcount, together with releasing the lock.

Reviewed-by: Robert Haas <robertmhaas@gmail.com>
Reviewed-by: Matthias van de Meent <boekewurm+postgres@gmail.com>
Discussion: https://postgr.es/m/fvfmkr5kk4nyex56ejgxj3uzi63isfxovp2biecb4bspbjrze7@az2pljabhnff
2025-11-06 16:42:10 -05:00
Álvaro Herrera
a2b02293bc Use XLogRecPtrIsValid() in various places
Now that commit 06edbed478 has introduced XLogRecPtrIsValid(), we can
use that instead of:

- XLogRecPtrIsInvalid()
- direct comparisons with InvalidXLogRecPtr
- direct comparisons with literal 0

This makes the code more consistent.
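
A minimal sketch of the pattern follows. The typedef and macros are
stand-ins written for this example, not copied from the PostgreSQL headers,
and sketch_bytes_to_replay() is a hypothetical helper.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in definitions for illustration only. */
typedef uint64_t XLogRecPtr;
#define InvalidXLogRecPtr     ((XLogRecPtr) 0)
#define XLogRecPtrIsValid(r)  ((r) != InvalidXLogRecPtr)

/* Hypothetical helper: how many bytes of WAL remain until 'target' is replayed? */
uint64_t
sketch_bytes_to_replay(XLogRecPtr replayed, XLogRecPtr target)
{
    /* Callers are expected to pass a real replay position. */
    assert(XLogRecPtrIsValid(replayed));

    /*
     * Previously this test might have been spelled in any of three ways:
     * XLogRecPtrIsInvalid(target), target == InvalidXLogRecPtr, or
     * target == 0.  One positive-sense macro covers all of them.
     */
    if (!XLogRecPtrIsValid(target))
        return 0;

    return target > replayed ? target - replayed : 0;
}
```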

Author: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Discussion: https://postgr.es/m/aQB7EvGqrbZXrMlg@ip-10-97-1-34.eu-west-3.compute.internal
2025-11-06 20:33:57 +01:00
Heikki Linnakangas
aa9c5fd3e3 Refactor shared memory allocation for semaphores
Before commit e25626677f, spinlocks were implemented using semaphores
on some platforms (--disable-spinlocks). That made it necessary to
initialize semaphores early, before any spinlocks could be used. Now
that we don't support --disable-spinlocks anymore, we can allocate the
shared memory needed for semaphores the same way as other shared
memory structures. Since the semaphores are used only in the PGPROC
array, move the semaphore shmem size estimation and initialization
calls to ProcGlobalShmemSize() and InitProcGlobal().

Author: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
Discussion: https://www.postgresql.org/message-id/CAExHW5seSZpPx-znjidVZNzdagGHOk06F+Ds88MpPUbxd1kTaA@mail.gmail.com
2025-11-06 14:45:00 +02:00
Heikki Linnakangas
daf3d99d2b Add comment to explain why PGReserveSemaphores() is called early
Before commit e25626677f, PGReserveSemaphores() had to be called
before SpinlockSemaInit() because spinlocks were implemented using
semaphores on some platforms (--disable-spinlocks). Add a comment
explaining that.

Author: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
Discussion: https://www.postgresql.org/message-id/CAExHW5seSZpPx-znjidVZNzdagGHOk06F+Ds88MpPUbxd1kTaA@mail.gmail.com
Backpatch-to: 18
2025-11-06 14:20:48 +02:00
Alexander Korotkov
447aae13b0 Implement WAIT FOR command
WAIT FOR is intended for use on a standby and waits until a specific WAL
location has been replayed.  This is useful when the user makes data changes
on the primary and needs a guarantee that these changes are visible on the
standby.

WAIT FOR needs to wait without any snapshot held.  Otherwise, the snapshot
could prevent the replay of WAL records, implying a kind of self-deadlock.
This is why a separate utility command appears to be the most robust
way to implement this functionality.  It's not possible to implement this as
a function.  Previous experience shows that stored procedures also have
limitations in this respect.

Discussion: https://www.postgresql.org/message-id/flat/CAPpHfdsjtZLVzxjGT8rJHCYbM0D5dwkO+BBjcirozJ6nYbOW8Q@mail.gmail.com
Discussion: https://www.postgresql.org/message-id/flat/CABPTF7UNft368x-RgOXkfj475OwEbp%2BVVO-wEXz7StgjD_%3D6sw%40mail.gmail.com
Author: Kartyshov Ivan <i.kartyshov@postgrespro.ru>
Author: Alexander Korotkov <aekorotkov@gmail.com>
Author: Xuneng Zhou <xunengzhou@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Peter Eisentraut <peter.eisentraut@enterprisedb.com>
Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Reviewed-by: Alexander Lakhin <exclusion@gmail.com>
Reviewed-by: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Reviewed-by: Euler Taveira <euler@eulerto.com>
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reviewed-by: jian he <jian.universality@gmail.com>
Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de>
Reviewed-by: Xuneng Zhou <xunengzhou@gmail.com>
2025-11-05 11:44:13 +02:00
Alexander Korotkov
3b4e53a075 Add infrastructure for efficient LSN waiting
Implement a new facility that allows processes to wait for WAL to reach
specific LSNs, both on primary (waiting for flush) and standby (waiting
for replay) servers.

The implementation uses shared memory with per-backend information
organized into pairing heaps, allowing O(1) access to the minimum
waited LSN. This enables fast-path checks: after replaying or flushing
WAL, the startup process or WAL writer can quickly determine if any
waiters need to be awakened.
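
The fast-path idea can be sketched with a small pairing heap keyed by the
waited-for LSN. This is a simplified, single-process illustration: the
WaiterNode type and the sketch_* functions are hypothetical and do not
reflect the shared-memory layout or locking in xlogwait.c.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical waiter entry; the heap root always has the smallest LSN. */
typedef struct WaiterNode
{
    uint64_t    wait_lsn;
    struct WaiterNode *child;      /* first child */
    struct WaiterNode *sibling;    /* next sibling in the parent's child list */
} WaiterNode;

/* Link two heaps: the root with the smaller LSN adopts the other as a child. */
static WaiterNode *
sketch_merge(WaiterNode *a, WaiterNode *b)
{
    if (a == NULL)
        return b;
    if (b == NULL)
        return a;
    if (b->wait_lsn < a->wait_lsn)
    {
        WaiterNode *tmp = a;

        a = b;
        b = tmp;
    }
    b->sibling = a->child;
    a->child = b;
    return a;
}

/* Merge a sibling list pairwise, then fold the results together. */
static WaiterNode *
sketch_merge_pairs(WaiterNode *first)
{
    WaiterNode *second;
    WaiterNode *rest;

    if (first == NULL || first->sibling == NULL)
        return first;
    second = first->sibling;
    rest = second->sibling;
    first->sibling = NULL;
    second->sibling = NULL;
    return sketch_merge(sketch_merge(first, second), sketch_merge_pairs(rest));
}

/* Insert a waiter; returns the new root. */
WaiterNode *
sketch_heap_add(WaiterNode *root, WaiterNode *node)
{
    node->child = NULL;
    node->sibling = NULL;
    return sketch_merge(root, node);
}

/* Remove the minimum (the root); returns the new root. */
WaiterNode *
sketch_heap_remove_min(WaiterNode *root)
{
    WaiterNode *rest;

    assert(root != NULL);
    rest = sketch_merge_pairs(root->child);
    root->child = NULL;
    return rest;
}

/* After WAL advances to current_lsn, pop every satisfied waiter; returns the count. */
int
sketch_wake_waiters(WaiterNode **root, uint64_t current_lsn)
{
    int woken = 0;

    /* Fast path: if the minimum is still ahead, one comparison and we are done. */
    while (*root != NULL && (*root)->wait_lsn <= current_lsn)
    {
        *root = sketch_heap_remove_min(*root);
        woken++;
    }
    return woken;
}
```

The minimum is always at the root, so the common case of no satisfied
waiters costs a single comparison. Removing a satisfied waiter is the only
step that restructures the heap.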

Key components:
- New xlogwait.c/h module with WaitForLSNReplay() and WaitForLSNFlush()
- Separate pairing heaps for replay and flush waiters
- WaitLSN lightweight lock for coordinating shared state
- Wait events WAIT_FOR_WAL_REPLAY and WAIT_FOR_WAL_FLUSH for monitoring

This infrastructure can be used by features that need to wait for WAL
operations to complete.

Discussion: https://www.postgresql.org/message-id/flat/CAPpHfdsjtZLVzxjGT8rJHCYbM0D5dwkO+BBjcirozJ6nYbOW8Q@mail.gmail.com
Discussion: https://www.postgresql.org/message-id/flat/CABPTF7UNft368x-RgOXkfj475OwEbp%2BVVO-wEXz7StgjD_%3D6sw%40mail.gmail.com
Author: Kartyshov Ivan <i.kartyshov@postgrespro.ru>
Author: Alexander Korotkov <aekorotkov@gmail.com>
Author: Xuneng Zhou <xunengzhou@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Peter Eisentraut <peter.eisentraut@enterprisedb.com>
Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Reviewed-by: Alexander Lakhin <exclusion@gmail.com>
Reviewed-by: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Reviewed-by: Euler Taveira <euler@eulerto.com>
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reviewed-by: Xuneng Zhou <xunengzhou@gmail.com>
2025-11-05 11:44:13 +02:00
Andres Freund
dae00f333b aio: Improve assertions related to io_method
First, the assertions in assign_io_method() were the wrong way round. Second,
the lengthof() assertion checked the length of io_method_options, which is the
wrong array to check and is always longer than pgaio_method_ops_table.

While at it, add a static assert to ensure pgaio_method_ops_table and
io_method_options stay in sync.
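
A sketch of this kind of compile-time check, using stand-in tables, is shown
below. The sketch_* names are hypothetical, and the assumption that the
option list is exactly one entry longer because of a terminating sentinel is
made for this example only.

```c
#include <assert.h>             /* assert() and static_assert (C11) */
#include <stddef.h>

#define lengthof(array) (sizeof(array) / sizeof((array)[0]))

/* Hypothetical stand-in for a per-method operations table. */
typedef struct SketchMethodOps
{
    const char *name;
} SketchMethodOps;

static const SketchMethodOps sketch_ops_table[] = {
    {"sync"},
    {"worker"},
};

/* Hypothetical option list; assumed to end with a NULL sentinel entry. */
static const char *const sketch_options[] = {
    "sync",
    "worker",
    NULL,
};

/* Compilation fails if an entry is added to one table but not the other. */
static_assert(lengthof(sketch_ops_table) + 1 == lengthof(sketch_options),
              "ops table and option list are out of sync");

const SketchMethodOps *
sketch_lookup(int method)
{
    /* Bound the index by the array that is actually indexed. */
    assert(method >= 0 && (size_t) method < lengthof(sketch_ops_table));
    return &sketch_ops_table[method];
}
```

Checking against the longer option list instead would let an out-of-range
index through, which is the kind of mistake described above.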

Per coverity and Tom Lane.

Reported-by: Tom Lane <tgl@sss.pgh.pa.us>
Backpatch-through: 18
2025-11-04 20:03:53 -05:00
Masahiko Sawada
8ae0f6a0c3 Add CHECK_FOR_INTERRUPTS in Evict{Rel,All}UnpinnedBuffers.
This commit adds CHECK_FOR_INTERRUPTS to the shared buffer iteration
loops in EvictRelUnpinnedBuffers and EvictAllUnpinnedBuffers. These
functions, used by pg_buffercache's pg_buffercache_evict_relation and
pg_buffercache_evict_all, can now be interrupted during long-running
operations.
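
The general shape of such a loop is sketched below. This is a
simplification: the flag and function names are hypothetical, and the sketch
merely stops early, whereas the real CHECK_FOR_INTERRUPTS() macro processes
pending interrupts and may raise an error.

```c
#include <assert.h>
#include <signal.h>

/* Hypothetical stand-in for a flag that a signal handler would set. */
volatile sig_atomic_t sketch_interrupt_pending = 0;

/* Walks nbuffers slots; returns how many were visited before stopping. */
int
sketch_evict_all(int nbuffers)
{
    int visited = 0;

    assert(nbuffers >= 0);

    for (int buf = 0; buf < nbuffers; buf++)
    {
        /* Poll once per iteration so a cancel request is noticed promptly. */
        if (sketch_interrupt_pending)
            break;

        /* ... attempt to evict buffer 'buf' here ... */
        visited++;
    }
    return visited;
}
```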

Backpatch to version 18, where these functions and their corresponding
pg_buffercache functions were introduced.

Author: Yuhang Qiu <iamqyh@gmail.com>
Discussion: https://postgr.es/m/8DC280D4-94A2-4E7B-BAB9-C345891D0B78%40gmail.com
Backpatch-through: 18
2025-11-04 15:47:25 -08:00