1
0
mirror of https://github.com/postgres/postgres.git synced 2025-11-18 02:02:55 +03:00
Commit Graph

27434 Commits

Author SHA1 Message Date
Alexander Korotkov
c13070a27b Revert "Get rid of WALBufMappingLock"
This reverts commit bc22dc0e0d.
It appears that conditional variables are not suitable for use inside
critical sections.  If WaitLatch()/WaitEventSetWaitBlock() face postmaster
death, they exit, releasing all locks instead of PANIC.  In certain
situations, this leads to data corruption.

Reported-by: Andrey Borodin <x4mmm@yandex-team.ru>
Discussion: https://postgr.es/m/B3C69B86-7F82-4111-B97F-0005497BB745%40yandex-team.ru
Reviewed-by: Andrey Borodin <x4mmm@yandex-team.ru>
Reviewed-by: Aleksander Alekseev <aleksander@tigerdata.com>
Reviewed-by: Kirill Reshke <reshkekirill@gmail.com>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Thomas Munro <thomas.munro@gmail.com>
Reviewed-by: Tomas Vondra <tomas@vondra.me>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Yura Sokolov <y.sokolov@postgrespro.ru>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Backpatch-through: 18
2025-08-22 19:26:38 +03:00
Heikki Linnakangas
661f821ef0 Use ereport() rather than elog()
Noah pointed this out before I committed 50f770c3d9, but I
accidentally pushed the old version with elog() anyway. Oops.

Reported-by: Noah Misch <noah@leadboat.com>
Discussion: https://www.postgresql.org/message-id/20250820003756.31.nmisch@google.com
2025-08-22 13:35:05 +03:00
Heikki Linnakangas
50f770c3d9 Revert GetTransactionSnapshot() to return historic snapshot during LR
Commit 1585ff7387 changed GetTransactionSnapshot() to throw an error
if it's called during logical decoding, instead of returning the
historic snapshot. I made that change for extra protection, because a
historic snapshot can only be used to access catalog tables while
GetTransactionSnapshot() is usually called when you're executing
arbitrary queries. You might get very subtle visibility problems if
you tried to use the historic snapshot for arbitrary queries.

There's no built-in code in PostgreSQL that calls
GetTransactionSnapshot() during logical decoding, but it turns out
that the pglogical extension does just that, to evaluate row filter
expressions. You would get weird results if the row filter runs
arbitrary queries, but it is sane as long as you don't access any
non-catalog tables. Even though there are no checks to enforce that in
pglogical, a typical row filter expression does not access any tables
and works fine. Accessing tables marked with the user_catalog_table =
true option is also OK.

To fix pglogical with row filters, and any other extensions that might
do similar things, revert GetTransactionSnapshot() to return a
historic snapshot during logical decoding.

To try to still catch the unsafe usage of historic snapshots, add
checks in heap_beginscan() and index_beginscan() to complain if you
try to use a historic snapshot to scan a non-catalog table. We're very
close to the version 18 release however, so add those new checks only
in master.

Backpatch-through: 18
Reported-by: Noah Misch <noah@leadboat.com>
Reviewed-by: Noah Misch <noah@leadboat.com>
Discussion: https://www.postgresql.org/message-id/20250809222338.cc.nmisch@google.com
2025-08-22 13:07:46 +03:00
Peter Eisentraut
16a0039dc0 Reduce lock level for ALTER DOMAIN ... VALIDATE CONSTRAINT
Reduce from ShareLock to ShareUpdateExclusivelock.  Validation during
ALTER DOMAIN ... ADD CONSTRAINT keeps using ShareLock.

Example:

    create domain d1 as int;
    create table t (a d1);
    alter domain d1 add constraint cc10 check (value > 10) not valid;

    begin;
    alter domain d1 validate constraint cc10;

    -- another session
    insert into t values (8);

Now we should still be able to perform DML operations on table t while
the domain constraint is being validated.  The equivalent works
already on table constraints.

Author: jian he <jian.universality@gmail.com>
Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com>
Reviewed-by: wenhui qiu <qiuwenhuifx@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/CACJufxHz92A88NLRTA2msgE2dpXpE-EoZ2QO61od76-6bfqurA%40mail.gmail.com
2025-08-22 08:56:11 +02:00
Michael Paquier
13b935cd52 Change dynahash.c and hsearch.h to use int64 instead of long
This code was relying on "long", which is signed 8 bytes everywhere
except on Windows where it is 4 bytes, that could potentially expose it
to overflows, even if the current uses in the code are fine as far as I
know.  This code is now able to rely on the same sizeof() variable
everywhere, with int64.  long was used for sizes, partition counts and
entry counts.

Some callers of the dynahash.c routines used long declarations, that can
be cleaned up to use int64 instead.  There was one shortcut based on
SIZEOF_LONG, that can be removed.  long is entirely removed from
dynahash.c and hsearch.h.

Similar work was done in b1e5c9fa9a.

Reviewed-by: Peter Eisentraut <peter@eisentraut.org>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/aKQYp-bKTRtRauZ6@paquier.xyz
2025-08-22 11:59:02 +09:00
Michael Paquier
ef03ea01fe Ignore temporary relations in RelidByRelfilenumber()
Temporary relations may share the same RelFileNumber with a permanent
relation, or other temporary relations associated with other sessions.

Being able to uniquely identify a temporary relation would require
RelidByRelfilenumber() to know about the proc number of the temporary
relation it wants to identify, something it is not designed for since
its introduction in f01d1ae3a1.

There are currently three callers of RelidByRelfilenumber():
- autoprewarm.
- Logical decoding, reorder buffer.
- pg_filenode_relation(), that attempts to find a relation OID based on
a tablespace OID and a RelFileNumber.

This makes the situation problematic particularly for the first two
cases, leading to the possibility of random ERRORs due to
inconsistencies that temporary relations can create in the cache
maintained by RelidByRelfilenumber().  The third case should be less of
an issue, as I suspect that there are few direct callers of
pg_filenode_relation().

The window where the ERRORs are happen is very narrow, requiring an OID
wraparound to create a lookup conflict in RelidByRelfilenumber() with a
temporary table reusing the same OID as another relation already cached.
The problem is easier to reach in workloads with a high OID consumption
rate, especially with a higher number of temporary relations created.

We could get pg_filenode_relation() and RelidByRelfilenumber() to work
with temporary relations if provided the means to identify them with an
optional proc number given in input, but the years have also shown that
we do not have a use case for it, yet.  Note that this could not be
backpatched if pg_filenode_relation() needs changes.  It is simpler to
ignore temporary relations.

Reported-by: Shenhao Wang <wangsh.fnst@fujitsu.com>
Author: Vignesh C <vignesh21@gmail.com>
Reviewed-By: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
Reviewed-By: Robert Haas <robertmhaas@gmail.com>
Reviewed-By: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reviewed-By: Takamichi Osumi <osumi.takamichi@fujitsu.com>
Reviewed-By: Michael Paquier <michael@paquier.xyz>
Reviewed-By: Masahiko Sawada <sawada.mshk@gmail.com>
Reported-By: Shenhao Wang <wangsh.fnst@fujitsu.com>
Discussion: https://postgr.es/m/bbaaf9f9-ebb2-645f-54bb-34d6efc7ac42@fujitsu.com
Backpatch-through: 13
2025-08-22 09:03:59 +09:00
Peter Eisentraut
47932f3cdc Use consistent type for pgaio_io_get_id() result
The result of pgaio_io_get_id() was being assigned to a mix of int and
uint32 variables.  This fixes it to use int consistently, which seems
the most correct.  Also change the queue empty special value in
method_worker.c to -1 from UINT32_MAX.

Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://www.postgresql.org/message-id/70c784b3-f60b-4652-b8a6-75e5f051243e%40eisentraut.org
2025-08-21 19:45:25 +02:00
Fujii Masao
12da45742c Disallow server start with sync_replication_slots = on and wal_level < logical.
Replication slot synchronization (sync_replication_slots = on)
requires wal_level to be logical. This commit prevents the server
from starting if sync_replication_slots is enabled but wal_level
is set to minimal or replica.

Failing early during startup helps users catch invalid configurations
immediately, which is important because changing wal_level requires
a server restart.

Author: Fujii Masao <masao.fujii@gmail.com>
Reviewed-by: Shveta Malik <shveta.malik@gmail.com>
Discussion: https://postgr.es/m/CAH0PTU_pc3oHi__XESF9ZigCyzai1Mo3LsOdFyQA4aUDkm01RA@mail.gmail.com
2025-08-21 22:18:11 +09:00
Tom Lane
a67d4847a4 Fix re-execution of a failed SQLFunctionCache entry.
If we error out during execution of a SQL-language function, we will
often leave behind non-null pointers in its SQLFunctionCache's cplan
and eslist fields.  This is problematic if the SQLFunctionCache is
re-used, because those pointers will point at resources that were
released during error cleanup.  This problem escaped detection so far
because ordinarily we won't re-use an FmgrInfo+SQLFunctionCache struct
after a query error.  However, in the rather improbable case that
someone implements an opclass support function in SQL language, there
will be long-lived FmgrInfos for it in the relcache, and then the
problem is reachable after the function throws an error.

To fix, add a flag to SQLFunctionCache that tracks whether execution
escapes out of fmgr_sql, and clear out the relevant fields during
init_sql_fcache if so.  (This is going to need more thought if we ever
try to share FMgrInfos across threads; but it's very far from being
the only problem such a project will encounter, since many functions
regard fn_extra as being query-local state.)

This broke at commit 0313c5dc6; before that we did not try to re-use
SQLFunctionCache state across calls.  Hence, back-patch to v18.

Bug: #19026
Reported-by: Alexander Lakhin <exclusion@gmail.com>
Author: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/19026-90aed5e71d0c8af3@postgresql.org
Backpatch-through: 18
2025-08-20 16:09:18 -04:00
Peter Eisentraut
e9c043a11a Minor error message enhancement
In refuseDupeIndexAttach(), change from

    errdetail("Another index is already attached for partition \"%s\"."...)

to

    errdetail("Another index \"%s\" is already attached for partition \"%s\"."...)

so we can easily understand which index is already attached for
partition \"%s\".

Author: Jian He <jian.universality@gmail.com>
Reviewed-by: Daniel Gustafsson <daniel@yesql.se>
Reviewed-by: Peter Eisentraut <peter@eisentraut.org>
Discussion: https://www.postgresql.org/message-id/flat/CACJufxGBfykJ_1ztk9T%2BL_gLmkOSOF%2BmL9Mn4ZPydz-rh%3DLccQ%40mail.gmail.com
2025-08-20 18:14:24 +02:00
Michael Paquier
1f2e51e3c7 Fix assertion failure with replication slot release in single-user mode
Some replication slot manipulations (logical decoding via SQL,
advancing) were failing an assertion when releasing a slot in
single-user mode, because active_pid was not set in a ReplicationSlot
when its slot is acquired.

ReplicationSlotAcquire() has some logic to be able to work with the
single-user mode.  This commit sets ReplicationSlot->active_pid to
MyProcPid, to let the slot-related logic fall-through, considering the
single process as the one holding the slot.

Some TAP tests are added for various replication slot functions with the
single-user mode, while on it, for slot creation, drop, advancing, copy
and logical decoding with multiple slot types (temporary, physical vs
logical).  These tests are skipped on Windows, as direct calls of
postgres --single would fail on permission failures.  There is no
platform-specific behavior that needs to be checked, so living with this
restriction should be fine.  The CI is OK with that, now let's see what
the buildfarm tells.

Author: Hayato Kuroda <kuroda.hayato@fujitsu.com>
Reviewed-by: Paul A. Jungwirth <pj@illuminatedcomputing.com>
Reviewed-by: Mutaamba Maasha <maasha@gmail.com>
Discussion: https://postgr.es/m/OSCPR01MB14966ED588A0328DAEBE8CB25F5FA2@OSCPR01MB14966.jpnprd01.prod.outlook.com
Backpatch-through: 13
2025-08-20 15:00:04 +09:00
Nathan Bossart
3eec0e6533 Fix comment for MAX_SIMUL_LWLOCKS.
This comment mentions that pg_buffercache locks all buffer
partitions simultaneously, but it hasn't done so since v10.

Oversight in commit 6e654546fb.

Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/aKTuAHVEuYCUmmIy%40nathan
2025-08-19 16:48:22 -05:00
Amit Kapila
aa21e49225 Fix self-deadlock during DROP SUBSCRIPTION.
The DROP SUBSCRIPTION command performs several operations: it stops the
subscription workers, removes subscription-related entries from system
catalogs, and deletes the replication slot on the publisher server.
Previously, this command acquired an AccessExclusiveLock on
pg_subscription before initiating these steps.

However, while holding this lock, the command attempts to connect to the
publisher to remove the replication slot. In cases where the connection is
made to a newly created database on the same server as subscriber, the
cache-building process during connection tries to acquire an
AccessShareLock on pg_subscription, resulting in a self-deadlock.

To resolve this issue, we reduce the lock level on pg_subscription during
DROP SUBSCRIPTION from AccessExclusiveLock to RowExclusiveLock. Earlier,
the higher lock level was used to prevent the launcher from starting a new
worker during the drop operation, as a restarted worker could become
orphaned.

Now, instead of relying on a strict lock, we acquire an AccessShareLock on
the specific subscription being dropped and re-validate its existence
after acquiring the lock. If the subscription is no longer valid, the
worker exits gracefully. This approach avoids the deadlock while still
ensuring that orphan workers are not created.

Reported-by: Alexander Lakhin <exclusion@gmail.com>
Author: Dilip Kumar <dilipbalaut@gmail.com>
Reviewed-by: vignesh C <vignesh21@gmail.com>
Reviewed-by: Hayato Kuroda <kuroda.hayato@fujitsu.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Backpatch-through: 13
Discussion: https://postgr.es/m/18988-7312c868be2d467f@postgresql.org
2025-08-19 05:33:17 +00:00
Michael Paquier
a977e419ee Refactor ReadMultiXactCounts() into GetMultiXactInfo()
This provides a single entry point to access some information about the
state of MultiXacts, able to return some data about multixacts offsets
and counts.  Originally this function was only able to return some
information about the number of multixacts and multixact members,
extended here to provide some data about the oldest multixact ID in use
and the oldest offset, if known.

This change has been proposed in a patch that aims at providing more
monitoring capabilities for multixacts, and it is useful on its own.
GetMultiXactInfo() is added to multixact.h, becoming available for
out-of-core code.

Extracted from a larger patch by the same author.

Author: Naga Appani <nagnrik@gmail.com>
Reviewed-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Discussion: https://postgr.es/m/CA+QeY+AAsYK6WvBW4qYzHz4bahHycDAY_q5ECmHkEV_eB9ckzg@mail.gmail.com
2025-08-19 14:04:09 +09:00
Michael Paquier
9b7eb6f02e Remove useless pointer update in StatsShmemInit()
This pointer was not used after its last update.  This variable
assignment was most likely a vestige artifact of the earlier versions of
the patch set that have led to 5891c7a8ed.

This pointer update is useless, so let's remove it.  It removes one call
to pgstat_dsa_init_size(), making the code slightly easier to grasp.

Author: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Discussion: https://postgr.es/m/aKLsu2sdpnyeuSSc@ip-10-97-1-34.eu-west-3.compute.internal
2025-08-19 09:54:18 +09:00
Richard Guo
bf9ee294e5 Simplify relation_has_unique_index_for()
Now that the only call to relation_has_unique_index_for() that
supplied an exprlist and oprlist has been removed, the loop handling
those lists is effectively dead code.  This patch removes that loop
and simplifies the function accordingly.

Author: Richard Guo <guofenglinux@gmail.com>
Discussion: https://postgr.es/m/CAMbWs4-EBnaRvEs7frTLbsXiweSTUXifsteF-d3rvv01FKO86w@mail.gmail.com
2025-08-19 09:37:04 +09:00
Richard Guo
24225ad9aa Pathify RHS unique-ification for semijoin planning
There are two implementation techniques for semijoins: one uses the
JOIN_SEMI jointype, where the executor emits at most one matching row
per left-hand side (LHS) row; the other unique-ifies the right-hand
side (RHS) and then performs a plain inner join.

The latter technique currently has some drawbacks related to the
unique-ification step.

* Only the cheapest-total path of the RHS is considered during
unique-ification.  This may cause us to miss some optimization
opportunities; for example, a path with a better sort order might be
overlooked simply because it is not the cheapest in total cost.  Such
a path could help avoid a sort at a higher level, potentially
resulting in a cheaper overall plan.

* We currently rely on heuristics to choose between hash-based and
sort-based unique-ification.  A better approach would be to generate
paths for both methods and allow add_path() to decide which one is
preferable, consistent with how path selection is handled elsewhere in
the planner.

* In the sort-based implementation, we currently pay no attention to
the pathkeys of the input subpath or the resulting output.  This can
result in redundant sort nodes being added to the final plan.

This patch improves semijoin planning by creating a new RelOptInfo for
the RHS rel to represent its unique-ified version.  It then generates
multiple paths that represent elimination of distinct rows from the
RHS, considering both a hash-based implementation using the cheapest
total path of the original RHS rel, and sort-based implementations
that either exploit presorted input paths or explicitly sort the
cheapest total path.  All resulting paths compete in add_path(), and
those deemed worthy of consideration are added to the new RelOptInfo.
Finally, the unique-ified rel is joined with the other side of the
semijoin using a plain inner join.

As a side effect, most of the code related to the JOIN_UNIQUE_OUTER
and JOIN_UNIQUE_INNER jointypes -- used to indicate that the LHS or
RHS path should be made unique -- has been removed.  Besides, the
T_Unique path now has the same meaning for both semijoins and upper
DISTINCT clauses: it represents adjacent-duplicate removal on
presorted input.  This patch unifies their handling by sharing the
same data structures and functions.

This patch also removes the UNIQUE_PATH_NOOP related code along the
way, as it is dead code -- if the RHS rel is provably unique, the
semijoin should have already been simplified to a plain inner join by
analyzejoins.c.

Author: Richard Guo <guofenglinux@gmail.com>
Reviewed-by: Alexandra Wang <alexandra.wang.oss@gmail.com>
Reviewed-by: wenhui qiu <qiuwenhuifx@gmail.com>
Discussion: https://postgr.es/m/CAMbWs4-EBnaRvEs7frTLbsXiweSTUXifsteF-d3rvv01FKO86w@mail.gmail.com
2025-08-19 09:35:40 +09:00
Michael Paquier
24e71d53f8 Remove unneeded header declarations in multixact.c
Two header declarations were related to SQL-callable functions, that
should have been cleaned up in df9133fa63.  Some more includes can be
removed on closer inspection, so let's clean up these as well, while on
it.

Reported-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/345438.1755524834@sss.pgh.pa.us
2025-08-19 08:57:20 +09:00
David Rowley
a98ccf727e Remove HASH_DEBUG output from dynahash.c
This existed in a semi broken stated from be0a66666 until 296cba276.
Recent discussion has questioned the value of having this at all as it
only outputs static information from various of the hash table's
properties when the hash table is created.

Author: Hayato Kuroda (Fujitsu) <kuroda.hayato@fujitsu.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/OSCPR01MB1496650D03FA0293AB9C21416F534A@OSCPR01MB14966.jpnprd01.prod.outlook.com
2025-08-19 11:14:21 +12:00
David Rowley
05fcb9667c Use elog(DEBUG4) for dynahash.c statistics output
Previously this was being output to stderr.  This commit adjusts things
to use elog(DEBUG4).  Here we also adjust the format of the message to
add the hash table name and also put the message on a single line.  This
should make grepping the logs for this information easier.

Also get rid of the global hash table statistics.  This seems very dated
and didn't fit very well with trying to put all the statistics for a
specific hash table on a single log line.

The main aim here is to allow it so we can have at least one buildfarm
member build with HASH_STATISTICS to help prevent future changes from
breaking things in that area.  ca3891251 recently fixed some issues
here.

In passing, switch to using uint64 data types rather than longs for the
usage counters.  The long type is 32 bits on some platforms we support.

Author: David Rowley <dgrowleyml@gmail.com>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Discussion: https://postgr.es/m/CAApHDvoccvJ9CG5zx+i-EyCzJbcL5K=CzqrnL_YN59qaL5hiaw@mail.gmail.com
2025-08-19 10:57:44 +12:00
Peter Eisentraut
c61d51d500 Detect buffer underflow in get_th()
Input with zero length can result in a buffer underflow when
accessing *(num + (len - 1)), as (len - 1) would produce a negative
index.  Add an assertion for zero-length input to prevent it.

This was found by ALT Linux Team.

Reviewing the call sites shows that get_th() currently cannot be
applied to an empty string: it is always called on a string containing
a number we've just printed.  Therefore, an assertion rather than a
user-facing error message is sufficient.

Co-authored-by: Alexander Kuznetsov <kuznetsovam@altlinux.org>
Discussion: https://www.postgresql.org/message-id/flat/e22df993-cdb4-4d0a-b629-42211ebed582@altlinux.org
2025-08-18 11:03:22 +02:00
Michael Paquier
df9133fa63 Move SQL-callable code related to multixacts into its own file
A patch is under discussion to add more SQL capabilities related to
multixacts, and this move avoids bloating the file more than necessary.
This affects pg_get_multixact_members().  A side effect of this move is
the requirement to add mxstatus_to_string() to multixact.h.

Extracted from a larger patch by the same author, tweaked by me.

Author: Naga Appani <nagnrik@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
Discussion: https://postgr.es/m/CA+QeY+AAsYK6WvBW4qYzHz4bahHycDAY_q5ECmHkEV_eB9ckzg@mail.gmail.com
2025-08-18 14:57:55 +09:00
Michael Paquier
ba3d93b2e8 Refactor init_params() in sequence.c to not use FormData_pg_sequence_data
init_params() sets up "last_value" and "is_called" for a sequence
relation holdind its metadata, based on the sequence properties in
pg_sequences.  "log_cnt" is the third property that can be updated in
this routine for FormData_pg_sequence_data, tracking when WAL records
should be generated for a sequence after nextval() iterations.  This
routine is called when creating or altering a sequence.

This commit refactors init_params() to not depend anymore on
FormData_pg_sequence_data, removing traces of it in sequence.c, making
easier the manipulation of metadata related to sequences.  The knowledge
about "log_cnt" is replaced with a more general "reset_state" flag, to
let the caller know if the sequence state should be reset.  In the case
of in-core sequences, this relates to WAL logging.  We still need to
depend on FormData_pg_sequence.

Author: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Kirill Reshke <reshkekirill@gmail.com>
Reviewed-by: Tomas Vondra <tomas@vondra.me>
Discussion: https://postgr.es/m/ZWlohtKAs0uVVpZ3@paquier.xyz
2025-08-18 11:38:44 +09:00
Masahiko Sawada
928da6ff12 Fix typos in comments.
Oversight in commit fd5a1a0c3e.

Author: Tender Wang <tndrwang@gmail.com>
Discussion: https://postgr.es/m/CAHewXNmTT3M_w4NngG=6G3mdT3iJ6DdncTqV9YnGXBPHW8XYtA@mail.gmail.com
2025-08-16 01:11:40 -07:00
Masahiko Sawada
37265ca01f Fix constant when extracting timestamp from UUIDv7.
When extracting a timestamp from a UUIDv7, a conversion from
milliseconds to microseconds was using the incorrect constant
NS_PER_US instead of US_PER_MS. Although both constants have the same
value, this fix improves code clarity by using the semantically
correct constant.

Backpatch to v18, where UUIDv7 was introduced.

Author: Erik Nordström <erik@tigerdata.com>
Reviewed-by: Andrey Borodin <x4mmm@yandex-team.ru>
Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>
Discussion: https://postgr.es/m/CACAa4V+i07eaP6h4MHNydZeX47kkLPwAg0sqe67R=M5tLdxNuQ@mail.gmail.com
Backpatch-through: 18
2025-08-15 11:58:53 -07:00
David Rowley
296cba2760 Fix invalid format string in HASH_DEBUG code
This seems to have been broken back in be0a66666.

Reported-by: Hayato Kuroda (Fujitsu) <kuroda.hayato@fujitsu.com>
Author: David Rowley <dgrowleyml@gmail.com>
Discussion: https://postgr.es/m/OSCPR01MB14966E11EEFB37D7857FCEDB7F535A@OSCPR01MB14966.jpnprd01.prod.outlook.com
Backpatch-through: 14
2025-08-15 18:05:44 +12:00
David Rowley
ca38912512 Fix failing -D HASH_STATISTICS builds
This seems to have been broken for a few years by cc5ef90ed.

Author: Hayato Kuroda (Fujitsu) <kuroda.hayato@fujitsu.com>
Reviewed-by: David Rowley <dgrowleyml@gmail.com>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/OSCPR01MB14966E11EEFB37D7857FCEDB7F535A@OSCPR01MB14966.jpnprd01.prod.outlook.com
Backpatch-through: 17
2025-08-15 17:23:45 +12:00
David Rowley
b4632883d4 Add Asserts to validate prevbit values in bms_prev_member
bms_prev_member() could attempt to access memory outside of the words[]
array in cases where the prevbit was a number < -1 or > a->nwords *
BITS_PER_BITMAPWORD + 1.

Here we add the Asserts to help draw attention to bogus callers so we're
more likely to catch them during development.

In passing, fix wording of bms_prev_member's header comment which talks
about how we expect the callers to ensure only valid prevbit values are
used.

Author: Greg Burd <greg@burd.me>
Reviewed-by: David Rowley <dgrowleyml@gmail.com>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/2000A717-1FFE-4031-827B-9330FB2E9065%40getmailspring.com
2025-08-15 16:33:07 +12:00
Álvaro Herrera
d0e7e04ede Avoid including tableam.h and xlogreader.h in nbtree.h
Doing that seems rather random and unnecessary.  This commit removes
those and fixes fallout, which is pretty minimal.  We do need to add a
forward declaration of struct TM_IndexDeleteOp (whose full definition
appears in tableam.h) so that _bt_delitems_delete_check()'s declaration
can use it.

Author: Álvaro Herrera <alvherre@kurilemu.de>
Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Discussion: https://postgr.es/m/202508051109.lzk3lcuzsaxo@alvherre.pgsql
2025-08-14 17:48:46 +02:00
Tom Lane
ed07361721 Don't leak memory during failure exit from SelectConfigFiles().
Make sure the memory allocated by make_absolute_path() is freed
when SelectConfigFiles() fails.  Since all the callers will exit
immediately in that case, there's no practical gain here, but
silencing Valgrind leak complaints seems useful.  In any case,
it was inconsistent that only one of the failure exits did this.

Author: Aleksander Alekseev <aleksander@tigerdata.com>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/CAJ7c6TMByXE8dc7zDvDWTQjk6o-XXAdRg_RAg5CBaUOgFPV3LQ%40mail.gmail.com
2025-08-14 11:39:19 -04:00
Heikki Linnakangas
4ec6e22b43 Fix LSN format in debug message
Commit 2633dae2e4 standardized all existing messages to use `%X/%08X`
for LSNs, but this one crept back in after the commit.
2025-08-14 13:31:18 +03:00
Michael Paquier
6304256e79 Fix compilation warning with SerializeClientConnectionInfo()
This function uses an argument named "maxsize" that is only used in
assertions, being set once outside the assertion area.  Recent gcc
versions with -Wunused-but-set-parameter complain about a warning when
building without assertions enabled, because of that.

In order to fix this issue, PG_USED_FOR_ASSERTS_ONLY is added to the
function argument of SerializeClientConnectionInfo(), which is the first
time we are doing so in the tree.  The CI is fine with the change, but
let's see what the buildfarm has to say on the matter.

Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Jacob Champion <jchampion@postgresql.org>
Discussion: https://postgr.es/m/pevajesswhxafjkivoq3yvwxga77tbncghlf3gq5fvchsvfuda@6uivg25sb3nx
Backpatch-through: 16
2025-08-14 16:21:50 +09:00
Fujii Masao
e9a31c0cc6 Revert logical snapshot filename format change in SnapBuildSnapshotExists().
Commit 2633dae2e4 standardized LSN formatting but mistakenly changed
the logical snapshot filename format in SnapBuildSnapshotExists() from
"%X-%X.snap" to "%08X-%08X.snap". Other code still used the original
"%X-%X.snap" format, causing the replication slot synchronization worker
to fail to find existing snapshot files and produce excessive log messages.

This commit restores the original "%X-%X.snap" format
in SnapBuildSnapshotExists() to resolve the issue.

Author: Shveta Malik <shveta.malik@gmail.com>
Discussion: https://postgr.es/m/CAHGQGwHuHPB-ucAk_Tq3uSs4Fdziu1Jp_AA_RD3m5Ycky7m48w@mail.gmail.com
2025-08-14 12:33:14 +09:00
Tom Lane
ee54046601 Grab the low-hanging fruit from forcing USE_FLOAT8_BYVAL to true.
Remove conditionally-compiled code for the other case.

Replace uses of FLOAT8PASSBYVAL with constant "true", mainly because
it was quite confusing in cases where the type we were dealing with
wasn't float8.

I left the associated pg_control and Pg_magic_struct fields in place.
Perhaps we should get rid of them, but it would save little, so it
doesn't seem worth thinking hard about the compatibility implications.
I just labeled them "vestigial" in places where that seemed helpful.

Author: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Peter Eisentraut <peter@eisentraut.org>
Discussion: https://postgr.es/m/1749799.1752797397@sss.pgh.pa.us
2025-08-13 17:18:22 -04:00
Tom Lane
6aebedc384 Grab the low-hanging fruit from forcing sizeof(Datum) to 8.
Remove conditionally-compiled code for smaller Datum widths,
and simplify comments that describe cases no longer of interest.

I also fixed up a few more places that were not using
DatumGetIntXX where they should, and made some cosmetic
adjustments such as using sizeof(int64) not sizeof(Datum)
in places where that fit better with the surrounding code.

One thing I remembered while preparing this part is that SP-GiST
stores pass-by-value prefix keys as Datums, so that the on-disk
representation depends on sizeof(Datum).  That's even more
unfortunate than the existing commentary makes it out to be,
because now there is a hazard that the change of sizeof(Datum)
will break SP-GiST indexes on 32-bit machines.  It appears that
there are no existing SP-GiST opclasses that are actually
affected; and if there are some that I didn't find, the number
of installations that are using them on 32-bit machines is
doubtless tiny.  So I'm proceeding on the assumption that we
can get away with this, but it's something to worry about.

(gininsert.c looks like it has a similar problem, but it's okay
because the "tuples" it's constructing are just transient data
within the tuplesort step.  That's pretty poorly documented
though, so I added some comments.)

Author: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Peter Eisentraut <peter@eisentraut.org>
Discussion: https://postgr.es/m/1749799.1752797397@sss.pgh.pa.us
2025-08-13 17:18:22 -04:00
Tom Lane
2a600a93c7 Make type Datum be 8 bytes wide everywhere.
This patch makes sizeof(Datum) be 8 on all platforms including
32-bit ones.  The objective is to allow USE_FLOAT8_BYVAL to be true
everywhere, and in consequence to remove a lot of code that is
specific to pass-by-reference handling of float8, int8, etc.  The
code for abbreviated sort keys can be simplified similarly.  In this
way we can reduce the maintenance effort involved in supporting 32-bit
platforms, without going so far as to actually desupport them.  Since
Datum is strictly an in-memory concept, this has no impact on on-disk
storage, though an initdb or pg_upgrade will be needed to fix affected
catalog entries.

We have required platforms to support [u]int64 for ages, so this
breaks no supported platform.  We can expect that this change will
make 32-bit builds a bit slower and more memory-hungry, although being
able to use pass-by-value handling of 8-byte types may buy back some
of that.  But we stopped optimizing for 32-bit cases a long time ago,
and this seems like just another step on that path.

This initial patch simply forces the correct type definition and
USE_FLOAT8_BYVAL setting, and cleans up a couple of minor compiler
complaints that ensued.  This is sufficient for testing purposes.
In the wake of a bunch of Datum-conversion cleanups by Peter
Eisentraut, this now compiles cleanly with gcc on a 32-bit platform.
(I'd only tested the previous version with clang, which it turns out
is less picky than gcc about width-changing coercions.)

There is a good deal of now-dead code that I'll remove in separate
follow-up patches.

A catversion bump is required because this affects initial catalog
contents (on 32-bit machines) in two ways: pg_type.typbyval changes
for some built-in types, and Const nodes in stored views/rules will
now have 8 bytes not 4 for pass-by-value types.

Author: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Peter Eisentraut <peter@eisentraut.org>
Discussion: https://postgr.es/m/1749799.1752797397@sss.pgh.pa.us
2025-08-13 17:18:22 -04:00
Tom Lane
21fddb3d76 Don't treat EINVAL from semget() as a hard failure.
It turns out that on some platforms (at least current macOS, NetBSD,
OpenBSD) semget(2) will return EINVAL if there is a pre-existing
semaphore set with the same key and too few semaphores.  Our code
expects EEXIST in that case and treats EINVAL as a hard failure,
resulting in failure during initdb or postmaster start.

POSIX does document EINVAL for too-few-semaphores-in-set, and is
silent on its priority relative to EEXIST, so this behavior arguably
conforms to spec.  Nonetheless it's quite problematic because EINVAL
is also documented to mean that nsems is greater than the system's
limit on the number of semaphores per set (SEMMSL).  If that is
where the problem lies, retrying would just become an infinite loop.

To resolve this contradiction, retry after EINVAL, but also install a
loop limit that will make us give up regardless of the specific errno
after trying 1000 different keys.  (1000 is a pretty arbitrary number,
but it seems like it should be sufficient.)  I like this better than
the previous infinite-looping behavior, since it will also keep us out
of trouble if (say) we get EACCES due to a system-level permissions
problem rather than anything to do with a specific semaphore set.

This problem has only been observed in the field in PG 17, which uses
a higher nsems value than other branches (cf. 38da05346, 810a8b1c8).
That makes it possible to get the failure if a new v17 postmaster
has a key collision with an existing postmaster of another branch.
In principle though, we might see such a collision against a semaphore
set created by some other application, in which case all branches are
vulnerable on these platforms.  Hence, backpatch.

Reported-by: Gavin Panella <gavinpanella@gmail.com>
Author: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/CALL7chmzY3eXHA7zHnODUVGZLSvK3wYCSP0RmcDFHJY8f28Q3g@mail.gmail.com
Backpatch-through: 13
2025-08-13 12:00:03 -04:00
Andres Freund
b227b0bb4e Reduce ExecSeqScan* code size using pg_assume()
fb9f955025 optimized code generation by using specialized variants of
ExecSeqScan* for [not] having a qual, projection etc. This allowed the
compiler to optimize the code out the code for qual / projection. However, as
observed by David Rowley at the time, the compiler couldn't prove the
opposite, i.e. that the qual etc *are* present.

By using pg_assume(), introduced in d65eb5b1b8, we can tell the compiler that
the relevant variables are non-null.

This reduces the code size to a surprising degree and seems to lead to a small
but reproducible performance gain.

Reviewed-by: Amit Langote <amitlangote09@gmail.com> Discussion:
https://postgr.es/m/CA+HiwqFk-MbwhfX_kucxzL8zLmjEt9MMcHi2YF=DyhPrSjsBEA@mail.gmail.com
2025-08-11 15:41:34 -04:00
Andres Freund
01d6832c10 meson: add and use stamp files for generated headers
Without using stamp files, meson lists the generated headers as the dependency
for every .c file, bloating build.ninja by more than 2x. Processing all the
dependencies also increases the time to generate build.ninja.

The immediate benefit is that this makes re-configuring and clean builds a bit
faster. The main motivation however is that I have other patches that
introduce additional build targets that further would increase the size of
build.ninja, making re-configuring more noticeably slower.

Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com>
Discussion: https://postgr.es/m/cgkdgvzdpinkacf4v33mky7tbmk467oda5dd4dlmucjjockxzi@xkqfvjoq4uiy
2025-08-11 15:18:23 -04:00
Dean Rasheed
22424953cd Fix security checks in selectivity estimation functions.
Commit e2d4ef8de8 (the fix for CVE-2017-7484) added security checks
to the selectivity estimation functions to prevent them from running
user-supplied operators on data obtained from pg_statistic if the user
lacks privileges to select from the underlying table. In cases
involving inheritance/partitioning, those checks were originally
performed against the child RTE (which for plain inheritance might
actually refer to the parent table). Commit 553d2ec271 then extended
that to also check the parent RTE, allowing access if the user had
permissions on either the parent or the child. It turns out, however,
that doing any checks using the child RTE is incorrect, since
securityQuals is set to NULL when creating an RTE for an inheritance
child (whether it refers to the parent table or the child table), and
therefore such checks do not correctly account for any RLS policies or
security barrier views. Therefore, do the security checks using only
the parent RTE. This is consistent with how RLS policies are applied,
and the executor's ACL checks, both of which use only the parent
table's permissions/policies. Similar checks are performed in the
extended stats code, so update that in the same way, centralizing all
the checks in a new function.

In addition, note that these checks by themselves are insufficient to
ensure that the user has access to the table's data because, in a
query that goes via a view, they only check that the view owner has
permissions on the underlying table, not that the current user has
permissions on the view itself. In the selectivity estimation
functions, there is no easy way to navigate from underlying tables to
views, so add permissions checks for all views mentioned in the query
to the planner startup code. If the user lacks permissions on a view,
a permissions error will now be reported at planner-startup, and the
selectivity estimation functions will not be run.

Checking view permissions at planner-startup in this way is a little
ugly, since the same checks will be repeated at executor-startup.
Longer-term, it might be better to move all the permissions checks
from the executor to the planner so that permissions errors can be
reported sooner, instead of creating a plan that won't ever be run.
However, such a change seems too far-reaching to be back-patched.

Back-patch to all supported versions. In v13, there is the added
complication that UPDATEs and DELETEs on inherited target tables are
planned using inheritance_planner(), which plans each inheritance
child table separately, so that the selectivity estimation functions
do not know that they are dealing with a child table accessed via its
parent. Handle that by checking access permissions on the top parent
table at planner-startup, in the same way as we do for views. Any
securityQuals on the top parent table are moved down to the child
tables by inheritance_planner(), so they continue to be checked by the
selectivity estimation functions.

Author: Dean Rasheed <dean.a.rasheed@gmail.com>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Noah Misch <noah@leadboat.com>
Backpatch-through: 13
Security: CVE-2025-8713
2025-08-11 09:03:11 +01:00
Thomas Munro
b421223172 Fix rare bug in read_stream.c's split IO handling.
The internal queue of buffers could become corrupted in a rare edge case
that failed to invalidate an entry, causing a stale buffer to be
"forwarded" to StartReadBuffers().  This is a simple fix for the
immediate problem.

A small API change might be able to remove this and related fragility
entirely, but that will have to wait a bit.

Defect in commit ed0b87ca.

Bug: 19006
Backpatch-through: 18
Reported-by: Alexander Lakhin <exclusion@gmail.com>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Xuneng Zhou <xunengzhou@gmail.com>
Discussion: https://postgr.es/m/19006-80fcaaf69000377e%40postgresql.org
2025-08-09 13:04:38 +12:00
Tom Lane
665c3dbba4 Mop-up for Datum conversion cleanups.
Fix a couple more places where an explicit Datum conversion
is needed (not clear how we missed these in ff89e182d and
previous commits).

Replace the minority usage "(Datum) NULL" with "(Datum) 0".
The former depends on the assumption that Datum is the same
width as Pointer, the latter doesn't.  Anyway consistency
is a good thing.

This is, I believe, the last of the notational mop-up needed
before we can consider changing Datum to uint64 everywhere.
It's also important cleanup for more aggressive ideas such
as making Datum a struct.

Discussion: https://postgr.es/m/1749799.1752797397@sss.pgh.pa.us
Discussion: https://postgr.es/m/8246d7ff-f4b7-4363-913e-827dadfeb145@eisentraut.org
2025-08-08 18:44:57 -04:00
Peter Eisentraut
ff89e182d4 Add missing Datum conversions
Add various missing conversions from and to Datum.  The previous code
mostly relied on implicit conversions or its own explicit casts
instead of using the correct DatumGet*() or *GetDatum() functions.

We think these omissions are harmless.  Some actual bugs that were
discovered during this process have been committed
separately (80c758a2e1, fd2ab03fea).

Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://www.postgresql.org/message-id/flat/8246d7ff-f4b7-4363-913e-827dadfeb145%40eisentraut.org
2025-08-08 22:06:57 +02:00
Peter Eisentraut
dcfc0f8912 Remove useless/superfluous Datum conversions
Remove useless DatumGetFoo() and FooGetDatum() calls.  These are
places where no conversion from or to Datum was actually happening.

We think these extra calls covered here were harmless.  Some actual
bugs that were discovered during this process have been committed
separately (80c758a2e1, 2242b26ce4).

Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://www.postgresql.org/message-id/flat/8246d7ff-f4b7-4363-913e-827dadfeb145%40eisentraut.org
2025-08-08 22:06:57 +02:00
Thomas Munro
b5cd74612c Remove obsolete comment.
Remove a comment about potential for AIO in StartReadBuffersImpl(),
because that change happened.
2025-08-09 01:46:04 +12:00
Etsuro Fujita
9e63f83a7e Fix oversight in FindTriggerIncompatibleWithInheritance.
This function is called from ATExecAttachPartition/ATExecAddInherit,
which prevent tables with row-level triggers with transition tables from
becoming partitions or inheritance children, to check if there is such a
trigger on the given table, but failed to check if a found trigger is
row-level, causing the caller functions to needlessly prevent a table
with only a statement-level trigger with transition tables from becoming
a partition or inheritance child.  Repair.

Oversight in commit 501ed02cf.

Author: Etsuro Fujita <etsuro.fujita@gmail.com>
Discussion: https://postgr.es/m/CAPmGK167mXzwzzmJ_0YZ3EZrbwiCxtM1vogH_8drqsE6PtxRYw%40mail.gmail.com
Backpatch-through: 13
2025-08-08 17:35:00 +09:00
Etsuro Fujita
62a1211d33 Disallow collecting transition tuples from child foreign tables.
Commit 9e6104c66 disallowed transition tables on foreign tables, but
failed to account for cases where a foreign table is a child table of a
partitioned/inherited table on which transition tables exist, leading to
incorrect transition tuples collected from such foreign tables for
queries on the parent table triggering transition capture.  This
occurred not only for inherited UPDATE/DELETE but for partitioned INSERT
later supported by commit 3d956d956, which should have handled it at
least for the INSERT case, but didn't.

To fix, modify ExecAR*Triggers to throw an error if the given relation
is a foreign table requesting transition capture.  Also, this commit
fixes make_modifytable so that in case of an inherited UPDATE/DELETE
triggering transition capture, FDWs choose normal operations to modify
child foreign tables, not DirectModify; which is needed because they
would otherwise skip the calls to ExecAR*Triggers at execution, causing
unexpected behavior.

Author: Etsuro Fujita <etsuro.fujita@gmail.com>
Reviewed-by: Amit Langote <amitlangote09@gmail.com>
Discussion: https://postgr.es/m/CAPmGK14QJYikKzBDCe3jMbpGENnQ7popFmbEgm-XTNuk55oyHg%40mail.gmail.com
Backpatch-through: 13
2025-08-08 10:50:00 +09:00
Michael Paquier
84b32fd228 Add information about "generation" when dropping twice pgstats entry
Dropping twice a pgstats entry should not happen, and the error report
generated was missing the "generation" counter (tracking when an entry
is reused) that has been added in 818119afcc.

Like d92573adcb, backpatch down to v15 where this information is
useful to have, to gather more information from instances where the
problem shows up.  A report has shown that this error path has been
reached on a standby based on 17.3, for a relation stats entry and an
OID close to wraparound.

Author: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Discussion: https://postgr.es/m/CAN4RuQvYth942J2+FcLmJKgdpq6fE5eqyFvb_PuskxF2eL=Wzg@mail.gmail.com
Backpatch-through: 15
2025-08-08 09:07:10 +09:00
Dean Rasheed
d699687b32 Extend int128.h to support more numeric code.
This adds a few more functions to int128.h, allowing more of numeric.c
to use 128-bit integers on all platforms.

Specifically, int64_div_fast_to_numeric() and the following aggregate
functions can now use 128-bit integers for improved performance on all
platforms, rather than just platforms with native support for int128:

- SUM(int8)
- AVG(int8)
- STDDEV_POP(int2 or int4)
- STDDEV_SAMP(int2 or int4)
- VAR_POP(int2 or int4)
- VAR_SAMP(int2 or int4)

In addition to improved performance on platforms lacking native
128-bit integer support, this significantly simplifies this numeric
code by allowing a lot of conditionally compiled code to be deleted.

A couple of numeric functions (div_var_int64() and sqrt_var()) still
contain conditionally compiled 128-bit integer code that only works on
platforms with native 128-bit integer support. Making those work more
generally would require rolling our own higher precision 128-bit
division, which isn't supported for now.

Author: Dean Rasheed <dean.a.rasheed@gmail.com>
Reviewed-by: John Naylor <johncnaylorls@gmail.com>
Discussion: https://postgr.es/m/CAEZATCWgBMc9ZwKMYqQpaQz2X6gaamYRB+RnMsUNcdMcL2Mj_w@mail.gmail.com
2025-08-07 15:49:24 +01:00
Alexander Korotkov
466c5435fd Fix checkpointer shared memory allocation
Use Min(NBuffers, MAX_CHECKPOINT_REQUESTS) instead of NBuffers in
CheckpointerShmemSize() to match the actual array size limit set in
CheckpointerShmemInit().  This prevents wasting shared memory when
NBuffers > MAX_CHECKPOINT_REQUESTS.  Also, fix the comment.

Reported-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/1439188.1754506714%40sss.pgh.pa.us
Author: Xuneng Zhou <xunengzhou@gmail.com>
Co-authored-by: Alexander Korotkov <aekorotkov@gmail.com>
2025-08-07 14:29:02 +03:00