1
0
mirror of https://github.com/postgres/postgres.git synced 2026-01-27 21:43:08 +03:00
Commit Graph

1559 Commits

Author SHA1 Message Date
Michael Paquier
84fb27511d Replace off_t by pgoff_t in I/O routines
PostgreSQL's Windows port has never been able to handle files larger
than 2GB due to the use of off_t for file offsets, only 32-bit on
Windows.  This causes signed integer overflow at exactly 2^31 bytes when
trying to handle files larger than 2GB, for the routines touched by this
commit.

Note that large files are forbidden by ./configure (3c6248a828) and
meson (recent change, see 79cd66f28c).  This restriction also exists
in v16 and older versions for the now-dead MSVC scripts.

The code base already defines pgoff_t as __int64 (64-bit) on Windows for
this purpose, and some function declarations in headers use it, but many
internals still rely on off_t.  This commit switches more routines to
use pgoff_t, offering more portability, for areas mainly related to file
extensions and storage.

These are not critical for WAL segments yet, which have currently a
maximum size allowed of 1GB (well, this opens the door at allowing a
larger size for them).  This matters more for segment files if we want
to lift the large file restriction in ./configure and meson in the
future, which would make sense to remove once/if all traces of off_t are
gone from the tree.  This can additionally matter for out-of-core code
that may want files larger than 2GB in places where off_t is four bytes
in size.

Note that off_t is still used in other parts of the tree like
buffile.c, WAL sender/receiver, base backup, pg_combinebackup, etc.
These other code paths can be addressed separately, and their update
will be required if we want to remove the large file restriction in the
future.  This commit is a good first cut in itself towards more
portability, hopefully.

On Unix-like systems, pgoff_t is defined as off_t, so this change only
affects Windows behavior.

Author: Bryan Green <dbryan.green@gmail.com>
Reviewed-by: Thomas Munro <thomas.munro@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Discussion: https://postgr.es/m/0f238ff4-c442-42f5-adb8-01b762c94ca1@gmail.com
2025-11-13 12:41:40 +09:00
Andres Freund
c75ebc657f bufmgr: Allow some buffer state modifications while holding header lock
Until now BufferDesc.state was not allowed to be modified while the buffer
header spinlock was held. This meant that operations like unpinning buffers
needed to use a CAS loop, waiting for the buffer header spinlock to be
released before updating.

The benefit of that restriction is that it allowed us to unlock the buffer
header spinlock with just a write barrier and an unlocked write (instead of a
full atomic operation). That was important to avoid regressions in
48354581a4. However, since then the hottest buffer header spinlock uses have
been replaced with atomic operations (in particular, the most common use of
PinBuffer_Locked(), in GetVictimBuffer() (formerly in BufferAlloc()), has been
removed in 5e89985928).

This change will allow, in a subsequent commit, to release buffer pins with a
single atomic-sub operation. This previously was not possible while such
operations were not allowed while the buffer header spinlock was held, as an
atomic-sub would not have allowed a race-free check for the buffer header lock
being held.

Using atomic-sub to unpin buffers is a nice scalability win, however it is not
the primary motivation for this change (although it would be sufficient). The
primary motivation is that we would like to merge the buffer content lock into
BufferDesc.state, which will result in more frequent changes of the state
variable, which in some situations can cause a performance regression, due to
an increased CAS failure rate when unpinning buffers.  The regression entirely
vanishes when using atomic-sub.

Naively implementing this would require putting CAS loops in every place
modifying the buffer state while holding the buffer header lock. To avoid
that, introduce UnlockBufHdrExt(), which can set/add flags as well as the
refcount, together with releasing the lock.

Reviewed-by: Robert Haas <robertmhaas@gmail.com>
Reviewed-by: Matthias van de Meent <boekewurm+postgres@gmail.com>
Discussion: https://postgr.es/m/fvfmkr5kk4nyex56ejgxj3uzi63isfxovp2biecb4bspbjrze7@az2pljabhnff
2025-11-06 16:42:10 -05:00
Heikki Linnakangas
aa9c5fd3e3 Refactor shared memory allocation for semaphores
Before commit e25626677f, spinlocks were implemented using semaphores
on some platforms (--disable-spinlocks). That made it necessary to
initialize semaphores early, before any spinlocks could be used. Now
that we don't support --disable-spinlocks anymore, we can allocate the
shared memory needed for semaphores the same way as other shared
memory structures. Since the semaphores are used only in the PGPROC
array, move the semaphore shmem size estimation and initialization
calls to ProcGlobalShmemSize() and InitProcGlobal().

Author: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
Discussion: https://www.postgresql.org/message-id/CAExHW5seSZpPx-znjidVZNzdagGHOk06F+Ds88MpPUbxd1kTaA@mail.gmail.com
2025-11-06 14:45:00 +02:00
Alexander Korotkov
3b4e53a075 Add infrastructure for efficient LSN waiting
Implement a new facility that allows processes to wait for WAL to reach
specific LSNs, both on primary (waiting for flush) and standby (waiting
for replay) servers.

The implementation uses shared memory with per-backend information
organized into pairing heaps, allowing O(1) access to the minimum
waited LSN. This enables fast-path checks: after replaying or flushing
WAL, the startup process or WAL writer can quickly determine if any
waiters need to be awakened.

Key components:
- New xlogwait.c/h module with WaitForLSNReplay() and WaitForLSNFlush()
- Separate pairing heaps for replay and flush waiters
- WaitLSN lightweight lock for coordinating shared state
- Wait events WAIT_FOR_WAL_REPLAY and WAIT_FOR_WAL_FLUSH for monitoring

This infrastructure can be used by features that need to wait for WAL
operations to complete.

Discussion: https://www.postgresql.org/message-id/flat/CAPpHfdsjtZLVzxjGT8rJHCYbM0D5dwkO+BBjcirozJ6nYbOW8Q@mail.gmail.com
Discussion: https://www.postgresql.org/message-id/flat/CABPTF7UNft368x-RgOXkfj475OwEbp%2BVVO-wEXz7StgjD_%3D6sw%40mail.gmail.com
Author: Kartyshov Ivan <i.kartyshov@postgrespro.ru>
Author: Alexander Korotkov <aekorotkov@gmail.com>
Author: Xuneng Zhou <xunengzhou@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Peter Eisentraut <peter.eisentraut@enterprisedb.com>
Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Reviewed-by: Alexander Lakhin <exclusion@gmail.com>
Reviewed-by: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Reviewed-by: Euler Taveira <euler@eulerto.com>
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reviewed-by: Xuneng Zhou <xunengzhou@gmail.com>
2025-11-05 11:44:13 +02:00
Peter Eisentraut
e1ac846f3d Mark ItemPointer arguments as const throughout
This is a follow up 991295f.  I searched over src/ and made all
ItemPointer arguments as const as much as possible.

Note: We cut out from the original patch the pieces that would have
created incompatibilities in the index or table AM APIs.  Those could
be considered separately.

Author: Chao Li <li.evan.chao@gmail.com>
Discussion: https://www.postgresql.org/message-id/CAEoWx2nBaypg16Z5ciHuKw66pk850RFWw9ACS2DqqJ_AkKeRsw%40mail.gmail.com
2025-10-30 14:12:06 +01:00
Peter Eisentraut
3479a0f823 const-qualify ItemPointer comparison functions
Add const qualifiers to ItemPointerEquals() and ItemPointerCompare().
This will allow further changes up the stack.  It also complements
commit aeb767ca0b, as we now have all of itemptr.h appropriately
const-qualified.

Author: Chao Li <li.evan.chao@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/CAEoWx2nBaypg16Z5ciHuKw66pk850RFWw9ACS2DqqJ_AkKeRsw@mail.gmail.com
2025-10-30 10:13:47 +01:00
Peter Eisentraut
76acf4b722 Remove Item type
This type is just char * underneath, it provides no real value, no
type safety, and just makes the code one level more mysterious.  It is
more idiomatic to refer to blobs of memory by a combination of void *
and size_t, so change it to that.

Also, since this type hides the pointerness, we can't apply qualifiers
to what is pointed to, which requires some unconstify nonsense.  This
change allows fixing that.

Extension code that uses the Item type can change its code to use
void * to be backward compatible.

Reviewed-by: Nathan Bossart <nathandbossart@gmail.com>
Reviewed-by: Peter Geoghegan <pg@bowt.ie>
Discussion: https://www.postgresql.org/message-id/flat/c75cccf5-5709-407b-a36a-2ae6570be766@eisentraut.org
2025-10-27 09:55:59 +01:00
Álvaro Herrera
b7cc6474e9 Make smgr access for a BufferManagerRelation safer in relcache inval
Currently there's no bug, because we have no code path where we
invalidate relcache entries where it'd cause a problem.  But it's more
robust to do it this way in case we introduce such a path later, as some
Postgres forks reportedly already have.

Author: Daniil Davydov <3danissimo@gmail.com>
Reviewed-by: Stepan Neretin <slpmcf@gmail.com>
Discussion: https://postgr.es/m/CAJDiXgj3FNzAhV+jjPqxMs3jz=OgPohsoXFj_fh-L+nS+13CKQ@mail.gmail.com
2025-10-21 10:51:55 +03:00
Andres Freund
5e89985928 bufmgr: Don't lock buffer header in StrategyGetBuffer()
Previously StrategyGetBuffer() acquired the buffer header spinlock for every
buffer, whether it was reusable or not. If reusable, it'd be returned, with
the lock held, to GetVictimBuffer(), which then would pin the buffer with
PinBuffer_Locked(). That's somewhat violating the spirit of the guidelines for
holding spinlocks (i.e. that they are only held for a few lines of consecutive
code) and necessitates using PinBuffer_Locked(), which scales worse than
PinBuffer() due to holding the spinlock.  This alone makes it worth changing
the code.

However, the main reason to change this is that a future commit will make
PinBuffer_Locked() slower (due to making UnlockBufHdr() slower), to gain
scalability for the much more common case of pinning a pre-existing buffer. By
pinning the buffer with a single atomic operation, iff the buffer is reusable,
we avoid any potential regression for miss-heavy workloads. There strictly are
fewer atomic operations for each potential buffer after this change.

The price for this improvement is that freelist.c needs two CAS loops and
needs to be able to set up the resource accounting for pinned buffers. The
latter is achieved by exposing a new function for that purpose from bufmgr.c,
that seems better than exposing the entire private refcount infrastructure.
The improvement seems worth the complexity.

Reviewed-by: Robert Haas <robertmhaas@gmail.com>
Reviewed-by: Matthias van de Meent <boekewurm+postgres@gmail.com>
Discussion: https://postgr.es/m/fvfmkr5kk4nyex56ejgxj3uzi63isfxovp2biecb4bspbjrze7@az2pljabhnff
2025-10-08 17:04:07 -04:00
Andres Freund
3baae90013 bufmgr: fewer calls to BufferDescriptorGetContentLock
We're planning to merge buffer content locks into BufferDesc.state. To reduce
the size of that patch, centralize calls to BufferDescriptorGetContentLock().

The biggest part of the change is in assertions, by introducing
BufferIsLockedByMe[InMode]() (and removing BufferIsExclusiveLocked()). This
seems like an improvement even without aforementioned plans.

Additionally replace some direct calls to LWLockAcquire() with calls to
LockBuffer().

Reviewed-by: Matthias van de Meent <boekewurm+postgres@gmail.com>
Discussion: https://postgr.es/m/fvfmkr5kk4nyex56ejgxj3uzi63isfxovp2biecb4bspbjrze7@az2pljabhnff
2025-10-08 16:06:19 -04:00
Peter Eisentraut
a5b35fcedb Remove PointerIsValid()
This doesn't provide any value over the standard style of checking the
pointer directly or comparing against NULL.

Also remove related:
- AllocPointerIsValid() [unused]
- IndexScanIsValid() [had one user]
- HeapScanIsValid() [unused]
- InvalidRelation [unused]

Leaving HeapTupleIsValid(), ItemIdIsValid(), PortalIsValid(),
RelationIsValid for now, to reduce code churn.

Reviewed-by: Jacob Champion <jacob.champion@enterprisedb.com>
Discussion: https://www.postgresql.org/message-id/flat/ad50ab6b-6f74-4603-b099-1cd6382fb13d%40eisentraut.org
Discussion: https://www.postgresql.org/message-id/CA+hUKG+NFKnr=K4oybwDvT35dW=VAjAAfiuLxp+5JeZSOV3nBg@mail.gmail.com
Discussion: https://www.postgresql.org/message-id/bccf2803-5252-47c2-9ff0-340502d5bd1c@iki.fi
2025-09-24 15:17:20 +02:00
David Rowley
9fc7f6ab72 Fix various incorrect filename references
Author: Chao Li <li.evan.chao@gmail.com>
Author: David Rowley <dgrowleyml@gmail.com>
Discussion: https://postgr.es/m/CAEoWx2=hOBCPm-Z=F15twr_23XjHeoXSbifP5GdEdtWona97wQ@mail.gmail.com
2025-09-22 13:33:17 +12:00
Peter Eisentraut
d4d1fc527b Update various forward declarations to use typedef
There are a number of forward declarations that use struct but not the
customary typedef, because that could have led to repeat typedefs,
which was not allowed.  This is now allowed in C11, so we can update
these to provide the typedefs as well, so that the later uses of the
types look more consistent.

Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/10d32190-f31b-40a5-b177-11db55597355@eisentraut.org
2025-09-15 11:04:10 +02:00
Peter Eisentraut
25f36066dd Remove traces of support for Sun Studio compiler
Per discussion, this compiler suite is no longer maintained, and
it has not been able to compile PostgreSQL since at least PostgreSQL
17.

This removes all the remaining support code for this compiler.

Note that the Solaris operating system continues to be supported, but
using GCC as the compiler.

Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://www.postgresql.org/message-id/flat/a0f817ee-fb86-483a-8a14-b6f7f5991b6e%40eisentraut.org
2025-09-12 07:39:05 +02:00
Nathan Bossart
ed1aad15e0 Move named LWLock tranche requests to shared memory.
In EXEC_BACKEND builds, GetNamedLWLockTranche() can segfault when
called outside of the postmaster process, as it might access
NamedLWLockTrancheRequestArray, which won't be initialized.  Given
the lack of reports, this is apparently unusual, presumably because
it is usually called from a shmem_startup_hook like this:

    mystruct = ShmemInitStruct(..., &found);
    if (!found)
    {
        mystruct->locks = GetNamedLWLockTranche(...);
        ...
    }

This genre of shmem_startup_hook evades the aforementioned
segfaults because the struct is initialized in the postmaster, so
all other callers skip the !found path.

We considered modifying the documentation or requiring
GetNamedLWLockTranche() to be called from the postmaster, but
ultimately we decided to simply move the request array to shared
memory (and add it to the BackendParameters struct), thereby
allowing calls outside postmaster on all platforms.  Since the main
shared memory segment is initialized after accepting LWLock tranche
requests, postmaster builds the request array in local memory first
and then copies it to shared memory later.

Given the lack of reports, back-patching seems unnecessary.

Reported-by: Sami Imseih <samimseih@gmail.com>
Reviewed-by: Sami Imseih <samimseih@gmail.com>
Discussion: https://postgr.es/m/CAA5RZ0v1_15QPg5Sqd2Qz5rh_qcsyCeHHmRDY89xVHcy2yt5BQ%40mail.gmail.com
2025-09-11 16:13:55 -05:00
Andres Freund
2c78940527 bufmgr: Remove freelist, always use clock-sweep
This set of changes removes the list of available buffers and instead simply
uses the clock-sweep algorithm to find and return an available buffer.  This
also removes the have_free_buffer() function and simply caps the
pg_autoprewarm process to at most NBuffers.

While on the surface this appears to be removing an optimization it is in fact
eliminating code that induces overhead in the form of synchronization that is
problematic for multi-core systems.

The main reason for removing the freelist, however, is not the moderate
improvement in scalability, but that having the freelist would require
dedicated complexity in several upcoming patches. As we have not been able to
find a case benefiting from the freelist...

Author: Greg Burd <greg@burd.me>
Reviewed-by: Tomas Vondra <tomas@vondra.me>
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/70C6A5B5-2A20-4D0B-BC73-EB09DD62D61C@getmailspring.com
2025-09-05 12:25:59 -04:00
Andres Freund
50e4c6ace5 bufmgr: Use consistent naming of the clock-sweep algorithm
Minor edits to comments only.

Author: Greg Burd <greg@burd.me>
Reviewed-by: Tomas Vondra <tomas@vondra.me>
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/70C6A5B5-2A20-4D0B-BC73-EB09DD62D61C@getmailspring.com
2025-09-05 12:25:59 -04:00
Nathan Bossart
38b602b028 Move dynamically-allocated LWLock tranche names to shared memory.
There are two ways for shared libraries to allocate their own
LWLock tranches.  One way is to call RequestNamedLWLockTranche() in
a shmem_request_hook, which requires the library to be loaded via
shared_preload_libraries.  The other way is to call
LWLockNewTrancheId(), which is not subject to the same
restrictions.  However, LWLockNewTrancheId() does require each
backend to store the tranche's name in backend-local memory via
LWLockRegisterTranche().  This API is a little cumbersome and leads
to things like unhelpful pg_stat_activity.wait_event values in
backends that haven't loaded the library.

This commit moves these LWLock tranche names to shared memory, thus
eliminating the need for each backend to call
LWLockRegisterTranche().  Instead, the tranche name must be
provided to LWLockNewTrancheId(), which immediately makes the name
available to all backends.  Since the tranche name array is
append-only, lookups can ordinarily avoid locking as long as their
local copy of the LWLock counter is greater than the requested
tranche ID.

One downside of this approach is that we now have a hard limit on
both the length of tranche names (NAMEDATALEN-1 bytes) and the
number of dynamically-allocated tranches (256).  Besides a limit of
NAMEDATALEN-1 bytes for tranche names registered via
RequestNamedLWLockTranche(), no such limits previously existed.  We
could avoid these new limits by using dynamic shared memory, but
the complexity involved didn't seem worth it.  We briefly
considered making the tranche limit user-configurable but
ultimately decided against that, too.  Since there is still a lot
of time left in the v19 development cycle, it's possible we will
revisit this choice.

Author: Sami Imseih <samimseih@gmail.com>
Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Rahila Syed <rahilasyed90@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/CAA5RZ0vvED3naph8My8Szv6DL4AxOVK3eTPS0qXsaKi%3DbVdW2A%40mail.gmail.com
2025-09-03 13:57:48 -05:00
Nathan Bossart
67fcf48c3b Make LWLockCounter a global variable.
Using the LWLockCounter requires first calculating its address in
shared memory like this:

	LWLockCounter = (int *) ((char *) MainLWLockArray - sizeof(int));

Commit 82e861fbe1 started this trend in order to fix EXEC_BACKEND
builds, but it could also be fixed by adding it to the
BackendParameters struct.  The current approach is somewhat
difficult to follow, so this commit switches to the latter.  While
at it, swap around the code in LWLockShmemSize() to match the order
of assignments in CreateLWLocks() for added readability.

Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Discussion: https://postgr.es/m/aLDLnan9gNCS9fHx%40nathan
2025-08-29 12:13:37 -05:00
Peter Eisentraut
991295f387 Mark ItemPointer arguments as const in tuple/table lock functions
The functions LockTuple, ConditionalLockTuple, UnlockTuple, and
XactLockTableWait take an ItemPointer argument that they do not
modify, so the argument can be const-qualified to better convey intent
and allow the compiler to enforce immutability.

Author: Chao Li <li.evan.chao@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/CAEoWx2m9e4rECHBwpRE4%2BGCH%2BpbYZXLh2f4rB1Du5hDfKug%2BOg%40mail.gmail.com
2025-08-29 07:39:58 +02:00
Andres Freund
5865150b6d aio: Stop using enum bitfields due to bad code generation
During an investigation into rather odd aio related errors on macos, observed
by Alexander and Konstantin, we started to wonder if bitfield access is
related to the error. At the moment it looks like it is related, we cannot
reproduce the failures when replacing the bitfields. In addition, the problem
can only be reproduced with some compiler [versions] and not everyone has been
able to reproduce the issue.

The observed problem is that, very rarely, PgAioHandle->{state,target} are in
an inconsistent state, after having been checked to be in a valid state not
long before, triggering an assertion failure. Unfortunately, this could be
caused by wrong compiler code generation or somehow of missing memory barriers
- we don't really know. In theory there should not be any concurrent write
access to the handle in the state the bug is triggered, as the handle was idle
and is just being initialized.

Separately from the bug, we observed that at least gcc and clang generate
rather terrible code for the bitfield access. Even if it's not clear if the
observed assertion failure is actually caused by the bitfield somehow, the bad
code generation alone is sufficient reason to stop using bitfields.

Therefore, replace the enum bitfields with uint8s and instead cast in each
switch statement.

Reported-by: Alexander Lakhin <exclusion@gmail.com>
Reported-by: Konstantin Knizhnik <knizhnik@garret.ru>
Discussion: https://postgr.es/m/1500090.1745443021@sss.pgh.pa.us
Backpatch-through: 18
2025-08-27 19:12:11 -04:00
Alexander Korotkov
c13070a27b Revert "Get rid of WALBufMappingLock"
This reverts commit bc22dc0e0d.
It appears that conditional variables are not suitable for use inside
critical sections.  If WaitLatch()/WaitEventSetWaitBlock() face postmaster
death, they exit, releasing all locks instead of PANIC.  In certain
situations, this leads to data corruption.

Reported-by: Andrey Borodin <x4mmm@yandex-team.ru>
Discussion: https://postgr.es/m/B3C69B86-7F82-4111-B97F-0005497BB745%40yandex-team.ru
Reviewed-by: Andrey Borodin <x4mmm@yandex-team.ru>
Reviewed-by: Aleksander Alekseev <aleksander@tigerdata.com>
Reviewed-by: Kirill Reshke <reshkekirill@gmail.com>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Thomas Munro <thomas.munro@gmail.com>
Reviewed-by: Tomas Vondra <tomas@vondra.me>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Yura Sokolov <y.sokolov@postgrespro.ru>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Backpatch-through: 18
2025-08-22 19:26:38 +03:00
Michael Paquier
13b935cd52 Change dynahash.c and hsearch.h to use int64 instead of long
This code was relying on "long", which is signed 8 bytes everywhere
except on Windows where it is 4 bytes, that could potentially expose it
to overflows, even if the current uses in the code are fine as far as I
know.  This code is now able to rely on the same sizeof() variable
everywhere, with int64.  long was used for sizes, partition counts and
entry counts.

Some callers of the dynahash.c routines used long declarations, that can
be cleaned up to use int64 instead.  There was one shortcut based on
SIZEOF_LONG, that can be removed.  long is entirely removed from
dynahash.c and hsearch.h.

Similar work was done in b1e5c9fa9a.

Reviewed-by: Peter Eisentraut <peter@eisentraut.org>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/aKQYp-bKTRtRauZ6@paquier.xyz
2025-08-22 11:59:02 +09:00
Nathan Bossart
2047ad0681 Cross-check lists of built-in LWLock tranches.
lwlock.c, lwlock.h, and wait_event_names.txt each contain a list of
built-in LWLock tranches.  It is easy to miss one or the other when
adding or removing tranches, and discrepancies have adverse effects
(e.g., breaking JOINs between pg_stat_activity and pg_wait_events).
This commit moves the lists of built-in tranches in lwlock.{c,h} to
lwlocklist.h and adds a cross-check to the script that generates
lwlocknames.h.  If the lists do not match exactly, building will
fail.

Author: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Discussion: https://postgr.es/m/aHpOgwuFQfcFMZ/B%40ip-10-97-1-34.eu-west-3.compute.internal
2025-07-23 12:06:20 -05:00
Amit Kapila
228c370868 Preserve conflict-relevant data during logical replication.
Logical replication requires reliable conflict detection to maintain data
consistency across nodes. To achieve this, we must prevent premature
removal of tuples deleted by other origins and their associated commit_ts
data by VACUUM, which could otherwise lead to incorrect conflict reporting
and resolution.

This patch introduces a mechanism to retain deleted tuples on the
subscriber during the application of concurrent transactions from remote
nodes. Retaining these tuples allows us to correctly ignore concurrent
updates to the same tuple. Without this, an UPDATE might be misinterpreted
as an INSERT during resolutions due to the absence of the original tuple.

Additionally, we ensure that origin metadata is not prematurely removed by
vacuum freeze, which is essential for detecting update_origin_differs and
delete_origin_differs conflicts.

To support this, a new replication slot named pg_conflict_detection is
created and maintained by the launcher on the subscriber. Each apply
worker tracks its own non-removable transaction ID, which the launcher
aggregates to determine the appropriate xmin for the slot, thereby
retaining necessary tuples.

Conflict information retention (deleted tuples and commit_ts) can be
enabled per subscription via the retain_conflict_info option. This is
disabled by default to avoid unnecessary overhead for configurations that
do not require conflict resolution or logging.

During upgrades, if any subscription on the old cluster has
retain_conflict_info enabled, a conflict detection slot will be created to
protect relevant tuples from deletion when the new cluster starts.

This is a foundational work to correctly detect update_deleted conflict
which will be done in a follow-up patch.

Author: Zhijie Hou <houzj.fnst@fujitsu.com>
Reviewed-by: shveta malik <shveta.malik@gmail.com>
Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>
Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com>
Reviewed-by: Nisha Moond <nisha.moond412@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Discussion: https://postgr.es/m/OS0PR01MB5716BE80DAEB0EE2A6A5D1F5949D2@OS0PR01MB5716.jpnprd01.prod.outlook.com
2025-07-23 02:56:00 +00:00
Andres Freund
d3f97fd1dd aio: Fix assertion, clarify README
The assertion wouldn't have triggered for a long while yet, but this won't
accidentally fail to detect the issue if/when it occurs.

Author: Matthias van de Meent <boekewurm+postgres@gmail.com>
Discussion: https://postgr.es/m/CAEze2Wj-43JV4YufW23gm=Uwr7Lkj+p0yKctKHxNm1rwFC+_DQ@mail.gmail.com
Backpatch-through: 18
2025-07-22 08:30:52 -04:00
Andres Freund
f2c87ac04e Remove long-unused TransactionIdIsActive()
TransactionIdIsActive() has not been used since bb38fb0d43, in 2014. There
are no known uses in extensions either and it's hard to see valid uses for
it. Therefore remove TransactionIdIsActive().

Discussion: https://postgr.es/m/odgftbtwp5oq7cxjgf4kjkmyq7ypoftmqy7eqa7w3awnouzot6@hrwnl5tdqrgu
2025-07-12 11:00:44 -04:00
Michael Paquier
62a17a9283 Integrate FullTransactionIds deeper into two-phase code
This refactoring is a follow-up of the work done in 5a1dfde833, that
has switched 2PC file names to use FullTransactionIds when written on
disk.  This will help with the integration of a follow-up solution
related to the handling of two-phase files during recovery, to address
older defects while reading these from disk after a crash.

This change is useful in itself as it reduces the need to build the
file names from epoch numbers and TransactionIds, because we can use
directly FullTransactionIds from which the 2PC file names are guessed.
So this avoids a lot of back-and-forth between the FullTransactionIds
retrieved from the file names and how these are passed around in the
internal 2PC logic.

Note that the core of the change is the use of a FullTransactionId
instead of a TransactionId in GlobalTransactionData, that tracks 2PC
file information in shared memory.  The change in TwoPhaseCallback makes
this commit unfit for stable branches.

Noah has contributed a good chunk of this patch.  I have spent some time
on it as well while working on the issues with two-phase state files and
recovery.

Author: Noah Misch <noah@leadboat.com>
Co-Authored-by: Michael Paquier <michael@paquier.xyz>
Discussion: https://postgr.es/m/Z5sd5O9JO7NYNK-C@paquier.xyz
Discussion: https://postgr.es/m/20250116205254.65.nmisch@google.com
2025-07-07 12:50:40 +09:00
Fujii Masao
78ebda66bf Speed up truncation of temporary relations.
Previously, truncating a temporary relation required scanning the entire
local buffer pool once per relation fork to invalidate buffers. This could
be slow, especially with a large local buffers, as the scan was repeated
multiple times.

A similar issue with regular tables (shared buffers) was addressed in
commit 6d05086c0a by scanning the buffer pool only once for all forks.

This commit applies the same optimization to temporary relations,
improving truncation performance.

Author: Daniil Davydov <3danissimo@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Fujii Masao <masao.fujii@gmail.com>
Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com>
Reviewed-by: Maxim Orlov <orlovmg@gmail.com>
Discussion: https://postgr.es/m/CAJDiXggNqsJOH7C5co4jA8nDk8vw-=sokyh5s1_TENWnC6Ofcg@mail.gmail.com
2025-07-04 09:03:58 +09:00
Nathan Bossart
fe07100e82 Add GetNamedDSA() and GetNamedDSHash().
Presently, the dynamic shared memory (DSM) registry only provides
GetNamedDSMSegment(), which allocates a fixed-size segment.  To use
the DSM registry for more sophisticated things like dynamic shared
memory areas (DSAs) or a hash table backed by a DSA (dshash), users
need to create a DSM segment that stores various handles and LWLock
tranche IDs and to write fairly complicated initialization code.
Furthermore, there is likely little variation in this
initialization code between libraries.

This commit introduces functions that simplify allocating a DSA or
dshash within the DSM registry.  These functions are very similar
to GetNamedDSMSegment().  Notable differences include the lack of
an initialization callback parameter and the prohibition of calling
the functions more than once for a given entry in each backend
(which should be trivially avoidable in most circumstances).  While
at it, this commit bumps the maximum DSM registry entry name length
from 63 bytes to 127 bytes.

Also note that even though one could presumably detach/destroy the
DSAs and dshashes created in the registry, such use-cases are not
yet well-supported, if for no other reason than the associated DSM
registry entries cannot be removed.  Adding such support is left as
a future exercise.

The test_dsm_registry test module contains tests for the new
functions and also serves as a complete usage example.

Reviewed-by: Dagfinn Ilmari Mannsåker <ilmari@ilmari.org>
Reviewed-by: Sami Imseih <samimseih@gmail.com>
Reviewed-by: Florents Tselai <florents.tselai@gmail.com>
Reviewed-by: Rahila Syed <rahilasyed90@gmail.com>
Discussion: https://postgr.es/m/aEC8HGy2tRQjZg_8%40nathan
2025-07-02 11:50:52 -05:00
Andres Freund
f20a347e1a aio: Fix reference to outdated name
Reported-by: Antonin Houska <ah@cybertec.at>
Author: Antonin Houska <ah@cybertec.at>
Discussion: https://postgr.es/m/5250.1751266701@localhost
Backpatch-through: 18, where da7226993f introduced this
2025-06-30 10:22:02 -04:00
Tom Lane
b27644bade Sync typedefs.list with the buildfarm.
Our maintenance of typedefs.list has been a little haphazard
(and apparently we can't alphabetize worth a darn).  Replace
the file with the authoritative list from our buildfarm, and
run pgindent using that.

I also updated the additions/exclusions lists in pgindent where
necessary to keep pgindent from messing things up significantly.
Notably, now that regex_t and some related names are macros not real
typedefs, we have to whitelist them explicitly.  The exclusions list
has also drifted noticeably, presumably due to changes of system
headers on the buildfarm animals that contribute to the list.

Unlike in prior years, I've not manually added typedef names that
are missing from the buildfarm's list because they are not used to
declare any variables or fields.  So there are a few places where
the typedef declaration itself is formatted worse than before,
e.g. typedef enum IoMethod.  I could preserve the names that were
manually added to the list previously, but I'd really prefer to find
a less manual way of dealing with these cases.  A quick grep finds
about 75 such symbols, most of which have never gotten any special
treatment.

Per discussion among pgsql-release, doing this now seems appropriate
even though we're still a week or two away from making the v18 branch.
2025-06-15 13:04:24 -04:00
Fujii Masao
73bdcfab35 Rename log_lock_failure GUC to log_lock_failures for consistency.
This commit renames the GUC log_lock_failure to log_lock_failures
to align with the existing similar setting log_lock_waits, which uses
the plural form. This improves naming consistency across related GUCs.

Suggested-by: Peter Eisentraut <peter@eisentraut.org>
Author: Fujii Masao <masao.fujii@gmail.com
Reviewed-by: Peter Eisentraut <peter@eisentraut.org>
Discussion: https://postgr.es/m/7a8198b6-d5b8-4910-b41e-8d3efcbb015d@eisentraut.org
2025-06-03 10:02:55 +09:00
Daniel Gustafsson
fb844b9f06 Revert function to get memory context stats for processes
Due to concerns raised about the approach, and memory leaks found
in sensitive contexts the functionality is reverted. This reverts
commits 45e7e8ca9, f8c115a6c, d2a1ed172, 55ef7abf8 and 042a66291
for v18 with an intent to revisit this patch for v19.

Discussion: https://postgr.es/m/594293.1747708165@sss.pgh.pa.us
2025-05-23 15:44:54 +02:00
Michael Paquier
2c6469d4cd Fix incorrect year in some copyright notices
A couple of new files have been added in the tree with a copyright year
of 2024 while we were already in 2025.  These should be marked with
2025, so let's fix them.

Reported-by: Shaik Mohammad Mujeeb <mujeeb.sk.dev@gmail.com>
Discussion: https://postgr.es/m/CALa6HA4_Wu7-2PV0xv-Q84cT8eG7rTx6bdjUV0Pc=McAwkNMfQ@mail.gmail.com
2025-05-19 09:46:52 +09:00
Michael Paquier
c259ba881c aio: Use runtime arguments with injections points in tests
This cleans up the code related to the testing infrastructure of AIO
that used injection points, switching the test code to use the new
facility for injection points added by 371f2db8b0 rather than tweaks
to pass and reset arguments to the callbacks run.

This removes all the dependencies to USE_INJECTION_POINTS in the AIO
code.  pgaio_io_call_inj(), pgaio_inj_io_get() and pgaio_inj_cur_handle
are now gone.

Reviewed-by: Greg Burd <greg@burd.me>
Discussion: https://postgr.es/m/Z_y9TtnXubvYAApS@paquier.xyz
2025-05-10 12:36:57 +09:00
Heikki Linnakangas
b28c59a6cd Use 'void *' for arbitrary buffers, 'uint8 *' for byte arrays
A 'void *' argument suggests that the caller might pass an arbitrary
struct, which is appropriate for functions like libc's read/write, or
pq_sendbytes(). 'uint8 *' is more appropriate for byte arrays that
have no structure, like the cancellation keys or SCRAM tokens. Some
places used 'char *', but 'uint8 *' is better because 'char *' is
commonly used for null-terminated strings. Change code around SCRAM,
MD5 authentication, and cancellation key handling to follow these
conventions.

Discussion: https://www.postgresql.org/message-id/61be9e31-7b7d-49d5-bc11-721800d89d64@eisentraut.org
2025-05-08 22:01:25 +03:00
Tom Lane
e8ca9ed1d2 Don't use double-quotes in #include's of system headers.
While few if any C compilers will complain about this, it's
inconsistent with our other #include's of the same headers.

There are some other questionable usages in
src/include/jit/SectionMemoryManager.h and
src/pl/plperl/plperl_system.h, but perhaps those have a
reason to be like that.  I can't see that these do.

Noticed while fooling around with a script to do analysis
of our header cross-inclusions.
2025-04-26 20:30:27 -04:00
David Rowley
936457419d Eliminate divide in new fast-path locking code
c4d5cb71d2 adjusted the fast-path locking code to allow some
configuration of the number of fast-path locking slots via the
max_locks_per_transaction GUC.  In that commit the FAST_PATH_REL_GROUP()
macro used integer division to determine the fast-path locking group slot
to use for the lock.

The divisor in this case is always a power-of-two value.  Here we swap
out the divide by a bitwise-AND, which is a significantly faster
operation to perform.

In passing, adjust the code that's setting FastPathLockGroupsPerBackend
so that it's more clear that the value being set is a power-of-two.

Also, adjust some comments in the area which contained some magic
numbers.  It seems better to justify the 1024 upper limit in the
location where the #define is made instead of where it is used.

Author: David Rowley <drowleyml@gmail.com>
Reviewed-by: Tomas Vondra <tomas@vondra.me>
Discussion: https://postgr.es/m/CAApHDvodr3bcnpxcs7+k-3cFwYR0tP-BYhyd2PpDhe-bCx9i=g@mail.gmail.com
2025-04-27 11:53:40 +12:00
David Rowley
84fd3bc141 Fix a few duplicate words in comments
These are all new to v18

Author: David Rowley <dgrowleyml@gmail.com>
Discussion: https://postgr.es/m/CAApHDvrMcr8XD107H3NV=WHgyBcu=sx5+7=WArr-n_cWUqdFXQ@mail.gmail.com
2025-04-21 10:41:18 +12:00
Noah Misch
8180136652 Comment on need to MarkBufferDirty() if omitting DELAY_CHKPT_START.
Blocking checkpoint phase 2 requires MarkBufferDirty() and
BUFFER_LOCK_EXCLUSIVE; neither suffices by itself.  transam/README documents
this, citing SyncOneBuffer().  Update the DELAY_CHKPT_START documentation to
say this.  Expand the heap_inplace_update_and_unlock() comment that cites
XLogSaveBufferForHint() as precedent, since heap_inplace_update_and_unlock()
could have opted not to use DELAY_CHKPT_START.

Commit 8e7e672cda added DELAY_CHKPT_START to
heap_inplace_update_and_unlock().  Since commit
bc6bad8857 reverted it in non-master branches,
no back-patch.

Discussion: https://postgr.es/m/20250406180054.26.nmisch@google.com
2025-04-20 12:00:17 -07:00
Michael Paquier
88e947136b Fix typos and grammar in the code
The large majority of these have been introduced by recent commits done
in the v18 development cycle.

Author: Alexander Lakhin <exclusion@gmail.com>
Discussion: https://postgr.es/m/9a7763ab-5252-429d-a943-b28941e0e28b@gmail.com
2025-04-19 19:17:42 +09:00
Noah Misch
f4ece891fc Assert lack of hazardous buffer locks before possible catalog read.
Commit 0bada39c83 fixed a bug of this kind,
which existed in all branches for six days before detection.  While the
probability of reaching the trouble was low, the disruption was extreme.  No
new backends could start, and service restoration needed an immediate
shutdown.  Hence, add this to catch the next bug like it.

The new check in RelationIdGetRelation() suffices to make autovacuum detect
the bug in commit 243e9b40f1 that led to commit
0bada39.  This also checks in a number of similar places.  It replaces each
Assert(IsTransactionState()) that pertained to a conditional catalog read.

No back-patch for now, but a back-patch of commit 243e9b4 should back-patch
this, too.  A back-patch could omit the src/test/regress changes, since back
branches won't gain new index columns.

Reported-by: Alexander Lakhin <exclusion@gmail.com>
Discussion: https://postgr.es/m/20250410191830.0e.nmisch@google.com
Discussion: https://postgr.es/m/10ec0bc3-5933-1189-6bb8-5dec4114558e@gmail.com
2025-04-17 05:00:30 -07:00
Peter Geoghegan
a6cab6a78e Harmonize function parameter names for Postgres 18.
Make sure that function declarations use names that exactly match the
corresponding names from function definitions in a few places.  These
inconsistencies were all introduced during Postgres 18 development.

This commit was written with help from clang-tidy, by mechanically
applying the same rules as similar clean-up commits (the earliest such
commit was commit 035ce1fe).
2025-04-12 12:07:36 -04:00
Peter Eisentraut
7d430a5728 Add missing PGDLLIMPORT markings
Discussion: https://www.postgresql.org/message-id/flat/25095db5-b595-4b85-9100-d358907c25b5%40eisentraut.org
2025-04-11 08:59:52 +02:00
Tomas Vondra
3887d0cfeb Cleanup of pg_numa.c
This moves/renames some of the functions defined in pg_numa.c:

* pg_numa_get_pagesize() is renamed to pg_get_shmem_pagesize(), and
  moved to src/backend/storage/ipc/shmem.c. The new name better reflects
  that the page size is not related to NUMA, and it's specifically about
  the page size used for the main shared memory segment.

* move pg_numa_available() to src/backend/storage/ipc/shmem.c, i.e. into
  the backend (which more appropriate for functions callable from SQL).
  While at it, improve the comment to explain what page size it returns.

* remove unnecessary includes from src/port/pg_numa.c, adding
  unnecessary dependencies (src/port should be suitable for frontent).
  These were either leftovers or unnecessary thanks to the other changes
  in this commit.

This eliminates unnecessary dependencies on backend symbols, which we
don't want in src/port.

Reported-by: Kirill Reshke <reshkekirill@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
https://postgr.es/m/CALdSSPi5fj0a7UG7Fmw2cUD1uWuckU_e8dJ+6x-bJEokcSXzqA@mail.gmail.com
2025-04-09 21:50:17 +02:00
Thomas Munro
f78ca6f3eb Introduce file_copy_method setting.
It can be set to either COPY (the default) or CLONE if the system
supports it.  CLONE causes callers of copydir(), currently CREATE
DATABASE ... STRATEGY=FILE_COPY and ALTER DATABASE ... SET TABLESPACE =
..., to use copy_file_range (Linux, FreeBSD) or copyfile (macOS) to copy
files instead of a read-write loop over the contents.

CLONE gives the kernel the opportunity to share block ranges on
copy-on-write file systems and push copying down to storage on others,
depending on configuration.  On some systems CLONE can be used to clone
large databases quickly with CREATE DATABASE ... TEMPLATE=source
STRATEGY=FILE_COPY.

Other operating systems could be supported; patches welcome.

Co-authored-by: Nazir Bilal Yavuz <byavuz81@gmail.com>
Reviewed-by: Robert Haas <robertmhaas@gmail.com>
Reviewed-by: Ranier Vilela <ranier.vf@gmail.com>
Discussion: https://postgr.es/m/CA%2BhUKGLM%2Bt%2BSwBU-cHeMUXJCOgBxSHLGZutV5zCwY4qrCcE02w%40mail.gmail.com
2025-04-08 21:35:38 +12:00
Daniel Gustafsson
042a66291b Add function to get memory context stats for processes
This adds a function for retrieving memory context statistics
and information from backends as well as auxiliary processes.
The intended usecase is cluster debugging when under memory
pressure or unanticipated memory usage characteristics.

When calling the function it sends a signal to the specified
process to submit statistics regarding its memory contexts
into dynamic shared memory.  Each memory context is returned
in detail, followed by a cumulative total in case the number
of contexts exceed the max allocated amount of shared memory.
Each process is limited to use at most 1Mb memory for this.

A summary can also be explicitly requested by the user, this
will return the TopMemoryContext and a cumulative total of
all lower contexts.

In order to not block on busy processes the caller specifies
the number of seconds during which to retry before timing out.
In the case where no statistics are published within the set
timeout,  the last known statistics are returned, or NULL if
no previously published statistics exist.  This allows dash-
board type queries to continually publish even if the target
process is temporarily congested.  Context records contain a
timestamp to indicate when they were submitted.

Author: Rahila Syed <rahilasyed90@gmail.com>
Reviewed-by: Daniel Gustafsson <daniel@yesql.se>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Tomas Vondra <tomas@vondra.me>
Reviewed-by: Atsushi Torikoshi <torikoshia@oss.nttdata.com>
Reviewed-by: Fujii Masao <masao.fujii@oss.nttdata.com>
Reviewed-by: Alexander Korotkov <aekorotkov@gmail.com>
Discussion: https://postgr.es/m/CAH2L28v8mc9HDt8QoSJ8TRmKau_8FM_HKS41NeO9-6ZAkuZKXw@mail.gmail.com
2025-04-08 11:06:56 +02:00
Andres Freund
dcf7e1697b Add pg_buffercache_evict_{relation,all} functions
In addition to the added functions, the pg_buffercache_evict() function now
shows whether the buffer was flushed.

pg_buffercache_evict_relation(): Evicts all shared buffers in a
relation at once.
pg_buffercache_evict_all(): Evicts all shared buffers at once.

Both functions provide mechanism to evict multiple shared buffers at
once. They are designed to address the inefficiency of repeatedly calling
pg_buffercache_evict() for each individual buffer, which can be time-consuming
when dealing with large shared buffer pools. (e.g., ~477ms vs. ~2576ms for
16GB of fully populated shared buffers).

These functions are intended for developer testing and debugging
purposes and are available to superusers only.

Minimal tests for the new functions are included. Also, there was no test for
pg_buffercache_evict(), test for this added too.

No new extension version is needed, as it was already increased this release
by ba2a3c2302.

Author: Nazir Bilal Yavuz <byavuz81@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Aidar Imamov <a.imamov@postgrespro.ru>
Reviewed-by: Joseph Koshakow <koshy44@gmail.com>
Discussion: https://postgr.es/m/CAN55FZ0h_YoSqqutxV6DES1RW8ig6wcA8CR9rJk358YRMxZFmw%40mail.gmail.com
2025-04-08 02:19:32 -04:00
Tomas Vondra
65c298f61f Add support for basic NUMA awareness
Add basic NUMA awareness routines, using a minimal src/port/pg_numa.c
portability wrapper and an optional build dependency, enabled by
--with-libnuma configure option. For now this is Linux-only, other
platforms may be supported later.

A built-in SQL function pg_numa_available() allows checking NUMA
support, i.e. that the server was built/linked with the NUMA library.

The main function introduced is pg_numa_query_pages(), which allows
determining the NUMA node for individual memory pages. Internally the
function uses move_pages(2) syscall, as it allows batching, and is more
efficient than get_mempolicy(2).

Author: Jakub Wartak <jakub.wartak@enterprisedb.com>
Co-authored-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Álvaro Herrera <alvherre@alvh.no-ip.org>
Reviewed-by: Tomas Vondra <tomas@vondra.me>
Discussion: https://postgr.es/m/CAKZiRmxh6KWo0aqRqvmcoaX2jUxZYb4kGp3N%3Dq1w%2BDiH-696Xw%40mail.gmail.com
2025-04-07 23:08:17 +02:00