postgres

mirror of https://github.com/postgres/postgres.git synced 2025-12-21 05:21:08 +03:00

Author	SHA1	Message	Date
Nathan Bossart	b26d76f643	Teach DSM registry to ERROR if attaching to an uninitialized entry. If DSM entry initialization fails, backends could try to use an uninitialized DSM segment, DSA, or dshash table (since the entry is still added to the registry). To fix, keep track of whether initialization completed, and ERROR if a backend tries to attach to an uninitialized entry. We could instead retry initialization as needed, but that seemed complicated, error prone, and unlikely to help most cases. Furthermore, such problems probably indicate a coding error. Reported-by: Alexander Lakhin <exclusion@gmail.com> Reviewed-by: Sami Imseih <samimseih@gmail.com> Discussion: https://postgr.es/m/dd36d384-55df-4fc2-825c-5bc56c950fa9%40gmail.com Backpatch-through: 17	2025-11-12 14:30:11 -06:00
Heikki Linnakangas	82fa6b78db	Clear 'xid' in dummy async notify entries written to fill up pages Before we started to freeze async notify entries (commit `8eeb4a0f7c`), no one looked at the 'xid' on an entry with invalid 'dboid'. But now we might actually need to freeze it later. Initialize them with InvalidTransactionId to begin with, to avoid that work later. Álvaro pointed this out in review of commit `8eeb4a0f7c`, but I forgot to include this change there. Author: Álvaro Herrera <alvherre@kurilemu.de> Discussion: https://www.postgresql.org/message-id/202511071410.52ll56eyixx7@alvherre.pgsql Backpatch-through: 14	2025-11-12 21:26:58 +02:00
Heikki Linnakangas	7b069a1876	Fix remaining race condition with CLOG truncation and LISTEN/NOTIFY Previous commit fixed a bug where VACUUM would truncate the CLOG that's still needed to check the commit status of XIDs in the async notify queue, but as mentioned in the commit message, it wasn't a full fix. If a backend is executing asyncQueueReadAllNotifications() and has just made a local copy of an async SLRU page which contains old XIDs, vacuum can concurrently truncate the CLOG covering those XIDs, and the backend still gets an error when it calls TransactionIdDidCommit() on those XIDs in the local copy. This commit fixes that race condition. To fix, hold the SLRU bank lock across the TransactionIdDidCommit() calls in NOTIFY processing. Per Tom Lane's idea. Backpatch to all supported versions. Reviewed-by: Joel Jacobson <joel@compiler.org> Reviewed-by: Arseniy Mukhin <arseniy.mukhin.dev@gmail.com> Discussion: https://www.postgresql.org/message-id/2759499.1761756503@sss.pgh.pa.us Backpatch-through: 14	2025-11-12 21:00:46 +02:00
Heikki Linnakangas	321ec54625	Fix bug where we truncated CLOG that was still needed by LISTEN/NOTIFY The async notification queue contains the XID of the sender, and when processing notifications we call TransactionIdDidCommit() on the XID. But we had no safeguards to prevent the CLOG segments containing those XIDs from being truncated away. As a result, if a backend didn't for some reason process its notifications for a long time, or when a new backend issued LISTEN, you could get an error like: test=# listen c21; ERROR: 58P01: could not access status of transaction 14279685 DETAIL: Could not open file "pg_xact/000D": No such file or directory. LOCATION: SlruReportIOError, slru.c:1087 To fix, make VACUUM "freeze" the XIDs in the async notification queue before truncating the CLOG. Old XIDs are replaced with FrozenTransactionId or InvalidTransactionId. Note: This commit is not a full fix. A race condition remains, where a backend is executing asyncQueueReadAllNotifications() and has just made a local copy of an async SLRU page which contains old XIDs, while vacuum concurrently truncates the CLOG covering those XIDs. When the backend then calls TransactionIdDidCommit() on those XIDs from the local copy, you still get the error. The next commit will fix that remaining race condition. This was first reported by Sergey Zhuravlev in 2021, with many other people hitting the same issue later. Thanks to: - Alexandra Wang, Daniil Davydov, Andrei Varashen and Jacques Combrink for investigating and providing reproducable test cases, - Matheus Alcantara and Arseniy Mukhin for review and earlier proposed patches to fix this, - Álvaro Herrera and Masahiko Sawada for reviews, - Yura Sokolov aka funny-falcon for the idea of marking transactions as committed in the notification queue, and - Joel Jacobson for the final patch version. I hope I didn't forget anyone. Backpatch to all supported versions. I believe the bug goes back all the way to commit `d1e027221d`, which introduced the SLRU-based async notification queue. Discussion: https://www.postgresql.org/message-id/16961-25f29f95b3604a8a@postgresql.org Discussion: https://www.postgresql.org/message-id/18804-bccbbde5e77a68c2@postgresql.org Discussion: https://www.postgresql.org/message-id/CAK98qZ3wZLE-RZJN_Y%2BTFjiTRPPFPBwNBpBi5K5CU8hUHkzDpw@mail.gmail.com Backpatch-through: 14	2025-11-12 21:00:42 +02:00
Heikki Linnakangas	aab4a84bb0	Escalate ERRORs during async notify processing to FATAL Previously, if async notify processing encountered an error, we would report the error to the client and advance our read position past the offending entry to prevent trying to process it over and over again. Trying to continue after an error has a few problems however: - We have no way of telling the client that a notification was lost. They get an ERROR, but that doesn't tell you much. As such, it's not clear if keeping the connection alive after losing a notification is a good thing. Depending on the application logic, missing a notification could cause the application to get stuck waiting, for example. - If the connection is idle, PqCommReadingMsg is set and any ERROR is turned into FATAL anyway. - We bailed out of the notification processing loop on first error without processing any subsequent notifications. The subsequent notifications would not be processed until another notify interrupt arrives. For example, if there were two notifications pending, and processing the first one caused an ERROR, the second notification would not be processed until someone sent a new NOTIFY. This commit changes the behavior so that any ERROR while processing async notifications is turned into FATAL, causing the client connection to be terminated. That makes the behavior more consistent as that's what happened in idle state already, and terminating the connection is a clear signal to the application that it might've missed some notifications. The reason to do this now is that the next commits will change the notification processing code in a way that would make it harder to skip over just the offending notification entry on error. Reviewed-by: Matheus Alcantara <matheusssilv97@gmail.com> Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de> Reviewed-by: Arseniy Mukhin <arseniy.mukhin.dev@gmail.com> Discussion: https://www.postgresql.org/message-id/fedbd908-4571-4bbe-b48e-63bfdcc38f64@iki.fi Backpatch-through: 14	2025-11-12 21:00:25 +02:00
Daniel Gustafsson	6ae5228a13	Fix range for commit_siblings in sample conf The range for commit_siblings was incorrectly listed as starting on 1 instead of 0 in the sample configuration file. Backpatch down to all supported branches. Author: Man Zeng <zengman@halodbtech.com> Reviewed-by: Daniel Gustafsson <daniel@yesql.se> Discussion: https://postgr.es/m/tencent_53B70BA72303AE9C6889E78E@qq.com Backpatch-through: 14	2025-11-12 13:51:53 +01:00
Michael Paquier	84e826567c	Report better object limits in error messages for injection points Previously, error messages for oversized injection point names, libraries, and functions showed buffer sizes (64, 128, 128) instead of the usable character limits (63, 127, 127) as it did not count for the zero-terminated byte, which was confusing. These messages are adjusted to show better the reality. The limit enforced for the private area was also too strict by one byte, as specifying a zone worth exactly INJ_PRIVATE_MAXLEN should be able to work because three is no zero-terminated byte in this case. This is a stylistic change (well, mostly, a private_area size of exactly 1024 bytes can be defined with this change, something that nobody seem to care about based on the lack of complaints). However, this is a testing facility let's keep the logic consistent across all the branches where this code exists, as there is an argument in favor of out-of-core extensions that use injection points. Author: Xuneng Zhou <xunengzhou@gmail.com> Co-authored-by: Michael Paquier <michael@paquier.xyz> Discussion: https://postgr.es/m/CABPTF7VxYp4Hny1h+7ejURY-P4O5-K8WZg79Q3GUx13cQ6B2kg@mail.gmail.com Backpatch-through: 17	2025-11-12 10:19:17 +09:00
Nathan Bossart	00eb646ea4	Check for CREATE privilege on the schema in CREATE STATISTICS. This omission allowed table owners to create statistics in any schema, potentially leading to unexpected naming conflicts. For ALTER TABLE commands that require re-creating statistics objects, skip this check in case the user has since lost CREATE on the schema. The addition of a second parameter to CreateStatistics() breaks ABI compatibility, but we are unaware of any impacted third-party code. Reported-by: Jelte Fennema-Nio <postgres@jeltef.nl> Author: Jelte Fennema-Nio <postgres@jeltef.nl> Co-authored-by: Nathan Bossart <nathandbossart@gmail.com> Reviewed-by: Noah Misch <noah@leadboat.com> Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de> Security: CVE-2025-12817 Backpatch-through: 13	2025-11-10 09:00:00 -06:00
Peter Eisentraut	292b81a0be	Translation updates Source-Git-URL: https://git.postgresql.org/git/pgtranslation/messages.git Source-Git-Hash: 3de351860d82ccc17ca814f59f9013691d751125	2025-11-10 12:58:04 +01:00
Peter Eisentraut	0f9e0068bc	Disallow generated columns in COPY WHERE clause Stored generated columns are not yet computed when the filtering happens, so we need to prohibit them to avoid incorrect behavior. Virtual generated columns currently error out ("unexpected virtual generated column reference"). They could probably work if we expand them in the right place, but for now let's keep them consistent with the stored variant. This doesn't change the behavior, it only gives a nicer error message. Co-authored-by: jian he <jian.universality@gmail.com> Reviewed-by: Kirill Reshke <reshkekirill@gmail.com> Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com> Discussion: https://www.postgresql.org/message-id/flat/CACJufxHb8YPQ095R_pYDr77W9XKNaXg5Rzy-WP525mkq+hRM3g@mail.gmail.com	2025-11-06 13:55:08 +01:00
Heikki Linnakangas	75ec47c38b	Add comment to explain why PGReserveSemaphores() is called early Before commit `e25626677f`, PGReserveSemaphores() had to be called before SpinlockSemaInit() because spinlocks were implemented using semaphores on some platforms (--disable-spinlocks). Add a comment explaining that. Author: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> Discussion: https://www.postgresql.org/message-id/CAExHW5seSZpPx-znjidVZNzdagGHOk06F+Ds88MpPUbxd1kTaA@mail.gmail.com Backpatch-to: 18	2025-11-06 14:20:58 +02:00
Etsuro Fujita	233d79ec4f	Update obsolete comment in ExecScanReScan(). Commit `27cc7cd2b` removed the epqScanDone flag from the EState struct, and instead added an equivalent flag named relsubs_done to the EPQState struct; but it failed to update this comment. Author: Etsuro Fujita <etsuro.fujita@gmail.com> Discussion: https://postgr.es/m/CAPmGK152zJ3fU5avDT5udfL0namrDeVfMTL3dxdOXw28SOrycg%40mail.gmail.com Backpatch-through: 13	2025-11-06 12:25:01 +09:00
Tom Lane	6d8acb7777	Avoid possible crash within libsanitizer. We've successfully used libsanitizer for awhile with the undefined and alignment sanitizers, but with some other sanitizers (at least thread and hwaddress) it crashes due to internal recursion before it's fully initialized itself. It turns out that that's due to the "__ubsan_default_options" hack installed by commit `f686ae82f`, and we can fix it by ensuring that __ubsan_default_options is built without any sanitizer instrumentation hooks. Reported-by: Emmanuel Sibi <emmanuelsibi.mec@gmail.com> Reported-by: Alexander Lakhin <exclusion@gmail.com> Diagnosed-by: Emmanuel Sibi <emmanuelsibi.mec@gmail.com> Fix-suggested-by: Jacob Champion <jacob.champion@enterprisedb.com> Author: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/F7543B04-E56C-4D68-A040-B14CCBAD38F1@gmail.com Discussion: https://postgr.es/m/dbf77bf7-6e54-ed8a-c4ae-d196eeb664ce@gmail.com Backpatch-through: 16	2025-11-05 11:09:30 -05:00
Richard Guo	500f646368	Fix assertion failure in generate_orderedappend_paths() In generate_orderedappend_paths(), there is an assumption that a child relation's row estimate is always greater than zero. There is an Assert verifying this assumption, and the estimate is also used to convert an absolute tuple count into a fraction. However, this assumption is not always valid -- for example, upper relations can have their row estimates unset, resulting in a value of zero. This can cause an assertion failure in debug builds or lead to the tuple fraction being computed as infinity in production builds. To fix, use the row estimate from the cheapest_total path to compute the tuple fraction. The row estimate in this path should already have been forced to a valid value. In passing, update the comment for generate_orderedappend_paths() to note that the function also considers the cheapest-fractional case when not all tuples need to be retrieved. That is, it collects all the cheapest fractional paths and builds an ordered append path for each interesting ordering. Backpatch to v18, where this issue was introduced. Bug: #19102 Reported-by: Kuntal Ghosh <kuntalghosh.2007@gmail.com> Author: Richard Guo <guofenglinux@gmail.com> Reviewed-by: Kuntal Ghosh <kuntalghosh.2007@gmail.com> Reviewed-by: Andrei Lepikhov <lepihov@gmail.com> Discussion: https://postgr.es/m/19102-93480667e1200169@postgresql.org Backpatch-through: 18	2025-11-05 18:15:02 +09:00
Richard Guo	b166a71526	Fix comments for ChangeVarNodes() and related functions The comment for ChangeVarNodes() refers to a parameter named change_RangeTblRef, which does not exist in the code. The comment for ChangeVarNodesExtended() contains an extra space, while the comment for replace_relid_callback() has an awkward line break and a typo. This patch fixes these issues and revises some sentences for smoother wording. Oversights in commits `ab42d643c` and `fc069a3a6`. Author: Richard Guo <guofenglinux@gmail.com> Discussion: https://postgr.es/m/CAMbWs480j16HC1JtjKCgj5WshivT8ZJYkOfTyZAM0POjFomJkg@mail.gmail.com Backpatch-through: 18	2025-11-05 12:31:33 +09:00
Andres Freund	19e391e38a	aio: Improve assertions related to io_method First, the assertions in assign_io_method() were the wrong way round. Second, the lengthof() assertion checked the length of io_method_options, which is the wrong array to check and is always longer than pgaio_method_ops_table. While add it, add a static assert to ensure pgaio_method_ops_table and io_method_options stay in sync. Per coverity and Tom Lane. Reported-by: Tom Lane <tgl@sss.pgh.pa.us> Backpatch-through: 18	2025-11-04 20:03:54 -05:00
Andres Freund	850b7da15b	jit: Fix accidentally-harmless type confusion In `2a0faed9d7`, which added JIT compilation support for expressions, I accidentally used sizeof(LLVMBasicBlockRef *) instead of sizeof(LLVMBasicBlockRef) as part of computing the size of an allocation. That turns out to have no real negative consequences due to LLVMBasicBlockRef being a pointer itself (and thus having the same size). It still is wrong and confusing, so fix it. Reported by coverity. Backpatch-through: 13	2025-11-04 20:03:54 -05:00
Jeff Davis	3ebea75f9a	Special case C_COLLATION_OID in pg_newlocale_from_collation(). Allow pg_newlocale_from_collation(C_COLLATION_OID) to work even if there's no catalog access, which some extensions expect. Not known to be a bug without extensions involved, but backport to 18. Also corrects an issue in master with dummy_c_locale (introduced in commit `5a38104b36`) where deterministic was not set. That wasn't a bug, but could have been if that structure was used more widely. Reported-by: Alexander Kukushkin <cyberdemn@gmail.com> Reviewed-by: Alexander Kukushkin <cyberdemn@gmail.com> Discussion: https://postgr.es/m/CAFh8B=nj966ECv5vi_u3RYij12v0j-7NPZCXLYzNwOQp9AcPWQ@mail.gmail.com Backpatch-through: 18	2025-11-04 16:49:00 -08:00
Masahiko Sawada	71aa2e1147	Add CHECK_FOR_INTERRUPTS in Evict{Rel,All}UnpinnedBuffers. This commit adds CHECK_FOR_INTERRUPTS to the shared buffer iteration loops in EvictRelUnpinnedBuffers and EvictAllUnpinnedBuffers. These functions, used by pg_buffercache's pg_buffercache_evict_relation and pg_buffercache_evict_all, can now be interrupted during long-running operations. Backpatch to version 18, where these functions and their corresponding pg_buffercache functions were introduced. Author: Yuhang Qiu <iamqyh@gmail.com> Discussion: https://postgr.es/m/8DC280D4-94A2-4E7B-BAB9-C345891D0B78%40gmail.com Backpatch-through: 18	2025-11-04 15:47:22 -08:00
Álvaro Herrera	8733f0b54c	Fix snapshot handling bug in recent BRIN fix Commit `a95e3d84c0` added ActiveSnapshot push+pop when processing work-items (BRIN autosummarization), but forgot to handle the case of a transaction failing during the run, which drops the snapshot untimely. Fix by making the pop conditional on an element being actually there. Author: Álvaro Herrera <alvherre@kurilemu.de> Backpatch-through: 13 Discussion: https://postgr.es/m/202511041648.nofajnuddmwk@alvherre.pgsql	2025-11-04 20:31:43 +01:00
Tomas Vondra	a26b753a08	Limit the size of TID lists during parallel GIN build When building intermediate TID lists during parallel GIN builds, split the sorted lists into smaller chunks, to limit the amount of memory needed when merging the chunks later. The leader may need to keep in memory up to one chunk per worker, and possibly one extra chunk (before evicting some of the data). The code processing item pointers uses regular palloc/repalloc calls, which means it's subject to the MaxAllocSize (1GB) limit. We could fix this by allowing huge allocations, but that'd require changes in many places without much benefit. Larger chunks do not actually improve performance, so the memory usage would be wasted. Fixed by limiting the chunk size to not hit MaxAllocSize. Each worker gets a fair share. This requires remembering the number of participating workers, in a place that can be accessed from the callback. Luckily, the bs_worker_id field in GinBuildState was unused, so repurpose that. Report by Greg Smith, investigation and fix by me. Batchpatched to 18, where parallel GIN builds were introduced. Reported-by: Gregory Smith <gregsmithpgsql@gmail.com> Discussion: https://postgr.es/m/CAHLJuCWDwn-PE2BMZE4Kux7x5wWt_6RoWtA0mUQffEDLeZ6sfA@mail.gmail.com Backpatch-through: 18	2025-11-04 18:47:14 +01:00
Peter Eisentraut	ba99c9491c	Tighten check for generated column in partition key expression A generated column may end up being part of the partition key expression, if it's specified as an expression e.g. "(<generated column name>)" or if the partition key expression contains a whole-row reference, even though we do not allow a generated column to be part of partition key expression. Fix this hole. Co-authored-by: jian he <jian.universality@gmail.com> Co-authored-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> Reviewed-by: Fujii Masao <masao.fujii@oss.nttdata.com> Discussion: https://www.postgresql.org/message-id/flat/CACJufxF%3DWDGthXSAQr9thYUsfx_1_t9E6N8tE3B8EqXcVoVfQw%40mail.gmail.com	2025-11-04 14:47:15 +01:00
Álvaro Herrera	419ffde235	BRIN autosummarization may need a snapshot It's possible to define BRIN indexes on functions that require a snapshot to run, but the autosummarization feature introduced by commit `7526e10224` fails to provide one. This causes autovacuum to leave a BRIN placeholder tuple behind after a failed work-item execution, making such indexes less efficient. Repair by obtaining a snapshot prior to running the task, and add a test to verify this behavior. Author: Álvaro Herrera <alvherre@kurilemu.de> Reported-by: Giovanni Fabris <giovanni.fabris@icon.it> Reported-by: Arthur Nascimento <tureba@gmail.com> Backpatch-through: 13 Discussion: https://postgr.es/m/202511031106.h4fwyuyui6fz@alvherre.pgsql	2025-11-04 13:23:26 +01:00
Peter Eisentraut	1baae827ef	Error message stylistic correction Fixup for commit `ef5e60a9d3`: The inconsistent use of articles was a bit awkward.	2025-11-04 12:25:14 +01:00
Michael Paquier	a142010739	Fix unconditional WAL receiver shutdown during stream-archive transition Commit `b4f584f9d2` (affecting v15~, later backpatched down to 13 as of `3635a0a35a`) introduced an unconditional WAL receiver shutdown when switching from streaming to archive WAL sources. This causes problems during a timeline switch, when a WAL receiver enters WALRCV_WAITING state but remains alive, waiting for instructions. The unconditional shutdown can break some monitoring scenarios as the WAL receiver gets repeatedly terminated and re-spawned, causing pg_stat_wal_receiver.status to show a "streaming" instead of "waiting" status, masking the fact that the WAL receiver is waiting for a new TLI and a new LSN to be able to continue streaming. This commit changes the WAL receiver behavior so as the shutdown becomes conditional, with InstallXLogFileSegmentActive being always reset to prevent the regression fixed by `b4f584f9d2`: only terminate the WAL receiver when it is actively streaming (WALRCV_STREAMING, WALRCV_STARTING, or WALRCV_RESTARTING). When in WALRCV_WAITING state, just reset InstallXLogFileSegmentActive flag to allow archive restoration without killing the process. WALRCV_STOPPED and WALRCV_STOPPING are not reachable states in this code path. For the latter, the startup process is the one in charge of setting WALRCV_STOPPING via ShutdownWalRcv(), waiting for the WAL receiver to reach a WALRCV_STOPPED state after switching walRcvState, so WaitForWALToBecomeAvailable() cannot be reached while a WAL receiver is in a WALRCV_STOPPING state. A regression test is added to check that a WAL receiver is not stopped on timeline jump, that fails when the fix of this commit is reverted. Reported-by: Ryan Bird <ryanzxg@gmail.com> Author: Xuneng Zhou <xunengzhou@gmail.com> Reviewed-by: Noah Misch <noah@leadboat.com> Reviewed-by: Michael Paquier <michael@paquier.xyz> Discussion: https://postgr.es/m/19093-c4fff49a608f82a0@postgresql.org Backpatch-through: 13	2025-11-04 10:52:33 +09:00
Noah Misch	46524d519a	Doc: cover index CONCURRENTLY causing errors in INSERT ... ON CONFLICT. Author: Mikhail Nikalayeu <mihailnikalayeu@gmail.com> Reviewed-by: Noah Misch <noah@leadboat.com> Discussion: https://postgr.es/m/CANtu0ojXmqjmEzp-=aJSxjsdE76iAsRgHBoK0QtYHimb_mEfsg@mail.gmail.com Backpatch-through: 13	2025-11-03 12:57:12 -08:00
Álvaro Herrera	d9ffc27291	Prevent setting a column as identity if its not-null constraint is invalid We don't allow null values to appear in identity-generated columns in other ways, so we shouldn't let unvalidated not-null constraints do it either. Oversight in commit `a379061a22`. Author: jian he <jian.universality@gmail.com> Backpatch-through: 18 Discussion: https://postgr.es/m/CACJufxGQM_+vZoYJMaRoZfNyV=L2jxosjv_0TLAScbuLJXWRfQ@mail.gmail.com	2025-11-03 15:58:19 +01:00
Peter Geoghegan	6c3b1df878	Document nbtree row comparison design. Add comments explaining when and where it is safe for nbtree to treat row compare keys as if they were simple scalar inequality keys on the row's most significant column. This is particularly important within _bt_advance_array_keys, which deals with required inequality keys in a general and uniform way, without any special handling for row compares. Also spell out the implications of _bt_check_rowcompare's approach of _conditionally_ evaluating lower-order row compare subkeys, particularly when one of its lower-order subkeys might see NULL index tuple values (these may or may not affect whether the qual as a whole is satisfied). The behavior in this area isn't particularly intuitive, so these issues seem worth going into. In passing, add a few more defensive/documenting row comparison related assertions to _bt_first and _bt_check_rowcompare. Follow-up to commits `bd3f59fd` and `ec986020`. Author: Peter Geoghegan <pg@bowt.ie> Reviewed-By: Victor Yegorov <vyegorov@gmail.com> Reviewed-By: Chao Li <li.evan.chao@gmail.com> Discussion: https://postgr.es/m/CAH2-Wznwkak_K7pcAdv9uH8ZfNo8QO7+tHXOaCUddMeTfaCCFw@mail.gmail.com Backpatch-through: 18	2025-11-02 15:27:03 -05:00
Peter Geoghegan	742bd91a83	Remove obsolete nbtree equality key comments. _bt_first reliably uses the same equality key (on each index column) for initial positioning purposes as the one that _bt_checkkeys can use to end the scan following commit `f09816a0`. _bt_first no longer applies its own independent rules to determine which initial positioning key to use on each column (for equality and inequality keys alike). Preprocessing is now fully in control of determining which keys start and end each scan, ensuring that _bt_first and _bt_checkkeys have symmetric behavior. Remove obsolete comments that described why _bt_first was expected to use at least one of the available required equality keys for initial positioning purposes. The rules in this area are now maximally strict and uniform, so there's no reason to draw attention to equality keys. Any column with a required equality key cannot have a redundant required inequality key (nor can it have a redundant required equality key). Oversight in commit `f09816a0`, which removed similar comments from _bt_first, but missed these comments. Author: Peter Geoghegan <pg@bowt.ie> Backpatch-through: 18	2025-11-02 13:34:16 -05:00
Michael Paquier	bf3dba508e	Fix regression with slot invalidation checks This commit reverts `818fefd8fd`, that has been introduced to address a an instability in some of the TAP tests due to the presence of random standby snapshot WAL records, when slots are invalidated by InvalidatePossiblyObsoleteSlot(). Anyway, this commit had also the consequence of introducing a behavior regression. After `818fefd8fd`, the code may determine that a slot needs to be invalidated while it may not require one: the slot may have moved from a conflicting state to a non-conflicting state between the moment when the mutex is released and the moment when we recheck the slot, in InvalidatePossiblyObsoleteSlot(). Hence, the invalidations may be more aggressive than they actually have to. `105b2cb336` has tackled the test instability in a way that should be hopefully sufficient for the buildfarm, even for slow members: - In v18, the test relies on an injection point that bypasses the creation of the random records generated for standby snapshots, eliminating the random factor that impacted the test. This option was not available when `818fefd8fd` was discussed. - In v16 and v17, the problem was bypassed by disallowing a slot to become active in some of the scenarios tested. While on it, this commit adds a comment to document that it is fine for a recheck to use xmin and LSN values stored in the slot, without storing and reusing them across multiple checks. Reported-by: "suyu.cmj" <mengjuan.cmj@alibaba-inc.com> Author: Bertrand Drouvot <bertranddrouvot.pg@gmail.com> Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com> Reviewed-by: Amit Kapila <amit.kapila16@gmail.com> Discussion: https://postgr.es/m/f492465f-657e-49af-8317-987460cb68b0.mengjuan.cmj@alibaba-inc.com Backpatch-through: 16	2025-10-30 13:13:31 +09:00
Richard Guo	ef6168bafe	Disable parallel plans for RIGHT_SEMI joins RIGHT_SEMI joins rely on the HEAP_TUPLE_HAS_MATCH flag to guarantee that only the first match for each inner tuple is considered. However, in a parallel hash join, the inner relation is stored in a shared global hash table that can be probed by multiple workers concurrently. This allows different workers to inspect and set the match flags of the same inner tuples at the same time. If two workers probe the same inner tuple concurrently, both may see the match flag as unset and emit the same tuple, leading to duplicate output rows and violating RIGHT_SEMI join semantics. For now, we disable parallel plans for RIGHT_SEMI joins. In the long term, it may be possible to support parallel execution by performing atomic operations on the match flag, for example using a CAS or similar mechanism. Backpatch to v18, where RIGHT_SEMI join was introduced. Bug: #19094 Reported-by: Lori Corbani <Lori.Corbani@jax.org> Diagnosed-by: Tom Lane <tgl@sss.pgh.pa.us> Author: Richard Guo <guofenglinux@gmail.com> Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/19094-6ed410eb5b256abd@postgresql.org Backpatch-through: 18	2025-10-30 12:03:15 +09:00
David Rowley	af3a79e083	Fix bogus use of "long" in AllocSetCheck() Because long is 32-bit on 64-bit Windows, it isn't a good datatype to store the difference between 2 pointers. The under-sized type could overflow and lead to scary warnings in MEMORY_CONTEXT_CHECKING builds, such as: WARNING: problem in alloc set ExecutorState: bad single-chunk %p in block %p However, the problem lies only in the code running the check, not from an actual memory accounting bug. Fix by using "Size" instead of "long". This means using an unsigned type rather than the previous signed type. If the block's freeptr was corrupted, we'd still catch that if the unsigned type wrapped. Unsigned allows us to avoid further needless complexities around comparing signed and unsigned types. Author: David Rowley <dgrowleyml@gmail.com> Reviewed-by: Michael Paquier <michael@paquier.xyz> Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Backpatch-through: 13 Discussion: https://postgr.es/m/CAApHDvo-RmiT4s33J=aC9C_-wPZjOXQ232V-EZFgKftSsNRi4w@mail.gmail.com	2025-10-30 14:49:07 +13:00
Peter Eisentraut	74197bdc84	Check that index can return in get_actual_variable_range() Some recent changes were made to remove the explicit dependency on btree indexes in some parts of the code. One of these changes was made in commit `9ef1851685`, which allows non-btree indexes to be used in get_actual_variable_range(). A follow-up commit `ee1ae8b99f` fixes the cases where an index doesn’t have a sortopfamily as this is a prerequisite to be used in get_actual_variable_range(). However, it was found that indexes that have amcanorder = true but do not allow index-only-scans (amcanreturn returns false or is NULL) will pass all of the conditions, while they should be rejected since get_actual_variable_range() uses the index-only-scan machinery in get_actual_variable_endpoint(). Such an index might cause errors like ERROR: no data returned for index-only scan during query planning. The fix is to add a check in get_actual_variable_range() to reject indexes that do not allow index-only scans. Author: Maxime Schoemans <maxime.schoemans@enterprisedb.com> Discussion: https://www.postgresql.org/message-id/flat/20ED852A-C2D9-41EB-8671-8C8B9D418BE9%40enterprisedb.com	2025-10-28 10:11:35 +01:00
Amit Kapila	b45a8d7d8b	Fix GUC check_hook validation for synchronized_standby_slots. Previously, the check_hook for synchronized_standby_slots attempted to validate that each specified slot existed and was physical. However, these checks were not performed during server startup. As a result, if users configured non-existent slots before startup, the misconfiguration would go undetected initially. This could later cause parallel query failures, as newly launched workers would detect the issue and raise an ERROR. This patch improves the check_hook by validating the syntax and format of slot names. Validation of slot existence and type is deferred to the WAL sender process, aligning with the behavior of the check_hook for primary_slot_name. Reported-by: Fabrice Chapuis <fabrice636861@gmail.com> Author: Shlok Kyal <shlok.kyal.oss@gmail.com> Reviewed-by: Hayato Kuroda <kuroda.hayato@fujitsu.com> Reviewed-by: Amit Kapila <amit.kapila16@gmail.com> Reviewed-by: Ashutosh Sharma <ashu.coek88@gmail.com> Reviewed-by: Rahila Syed <rahilasyed90@gmail.com> Backpatch-through: 17, where it was introduced Discussion: https://postgr.es/m/CAA5-nLCeO4MQzWipCXH58qf0arruiw0OeUc1+Q=Z=4GM+=v1NQ@mail.gmail.com	2025-10-27 06:37:35 +00:00
David Rowley	a2387c32f2	Fix incorrect logic for caching ResultRelInfos for triggers When dealing with ResultRelInfos for partitions, there are cases where there are mixed requirements for the ri_RootResultRelInfo. There are cases when the partition itself requires a NULL ri_RootResultRelInfo and in the same query, the same partition may require a ResultRelInfo with its parent set in ri_RootResultRelInfo. This could cause the column mapping between the partitioned table and the partition not to be done which could result in crashes if the column attnums didn't match exactly. The fix is simple. We now check that the ri_RootResultRelInfo matches what the caller passed to ExecGetTriggerResultRel() and only return a cached ResultRelInfo when the ri_RootResultRelInfo matches what the caller wants, otherwise we'll make a new one. Author: David Rowley <dgrowleyml@gmail.com> Author: Amit Langote <amitlangote09@gmail.com> Reported-by: Dmitry Fomin <fomin.list@gmail.com> Discussion: https://postgr.es/m/7DCE78D7-0520-4207-822B-92F60AEA14B4@gmail.com Backpatch-through: 15	2025-10-26 11:01:32 +13:00
Tom Lane	e7a3fae39e	Fix off-by-one Asserts in FreePageBtreeInsertInternal/Leaf. These two functions expect there to be room to insert another item in the FreePageBtree's array, but their assertions were too weak to guarantee that. This has little practical effect granting that the callers are not buggy, but it seems to be misleading late-model Coverity into complaining about possible array overrun. Author: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/799984.1761150474@sss.pgh.pa.us Backpatch-through: 13	2025-10-23 12:32:20 -04:00
Fujii Masao	5f88da5de3	Add comments explaining overflow entries in the replication lag tracker. Commit `883a95646a` introduced overflow entries in the replication lag tracker to fix an issue where lag columns in pg_stat_replication could stall when the replay LSN stopped advancing. This commit adds comments clarifying the purpose and behavior of overflow entries to improve code readability and understanding. Since commit `883a95646a` was recently applied and backpatched to all supported branches, this follow-up commit is also backpatched accordingly. Author: Xuneng Zhou <xunengzhou@gmail.com> Reviewed-by: Fujii Masao <masao.fujii@gmail.com> Discussion: https://postgr.es/m/CABPTF7VxqQA_DePxyZ7Y8V+ErYyXkmwJ1P6NC+YC+cvxMipWKw@mail.gmail.com Backpatch-through: 13	2025-10-23 13:26:15 +09:00
David Rowley	ceb51d09b4	Fix incorrect zero extension of Datum in JIT tuple deform code When JIT deformed tuples (controlled via the jit_tuple_deforming GUC), types narrower than sizeof(Datum) would be zero-extended up to Datum width. This wasn't the same as what fetch_att() does in the standard tuple deforming code. Logically the values are the same when fetching via the DatumGet*() marcos, but negative numbers are not the same in binary form. In the report, the problem was manifesting itself with: ERROR: could not find memoization table entry in a query which had a "Cache Mode: binary" Memoize node. However, it's currently unclear what else is affected. Anything that uses datum_image_eq() or datum_image_hash() on a Datum from a tuple deformed by JIT could be affected, but it may not be limited to that. The fix for this is simple: use signed extension instead of zero extension. Many thanks to Emmanuel Touzery for reporting this issue and providing steps and backup which allowed the problem to easily be recreated. Reported-by: Emmanuel Touzery <emmanuel.touzery@plandela.si> Author: David Rowley <dgrowleyml@gmail.com> Discussion: https://postgr.es/m/DB8P194MB08532256D5BAF894F241C06393F3A@DB8P194MB0853.EURP194.PROD.OUTLOOK.COM Backpatch-through: 13	2025-10-23 13:12:03 +13:00
Fujii Masao	6ff7ba9fe5	Make invalid primary_slot_name follow standard GUC error reporting. Previously, if primary_slot_name was set to an invalid slot name and the configuration file was reloaded, both the postmaster and all other backend processes reported a WARNING. With many processes running, this could produce a flood of duplicate messages. The problem was that the GUC check hook for primary_slot_name reported errors at WARNING level via ereport(). This commit changes the check hook to use GUC_check_errdetail() and GUC_check_errhint() for error reporting. As with other GUC parameters, this causes non-postmaster processes to log the message at DEBUG3, so by default, only the postmaster's message appears in the log file. Backpatch to all supported versions. Author: Fujii Masao <masao.fujii@gmail.com> Reviewed-by: Chao Li <lic@highgo.com> Reviewed-by: Amit Kapila <amit.kapila16@gmail.com> Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de> Reviewed-by: Hayato Kuroda <kuroda.hayato@fujitsu.com> Discussion: https://postgr.es/m/CAHGQGwFud-cvthCTfusBfKHBS6Jj6kdAPTdLWKvP2qjUX6L_wA@mail.gmail.com Backpatch-through: 13	2025-10-22 20:10:58 +09:00
Fujii Masao	9670032cc5	Fix stalled lag columns in pg_stat_replication when replay LSN stops advancing. Previously, when the replay LSN reported in feedback messages from a standby stopped advancing, for example, due to a recovery conflict, the write_lag and flush_lag columns in pg_stat_replication would initially update but then stop progressing. This prevented users from correctly monitoring replication lag. The problem occurred because when any LSN stopped updating, the lag tracker's cyclic buffer became full (the write head reached the slowest read head). In that state, the lag tracker could no longer compute round-trip lag values correctly. This commit fixes the issue by handling the slowest read entry (the one causing the buffer to fill) as a separate overflow entry and freeing space so the write and other read heads can continue advancing in the buffer. As a result, write_lag and flush_lag now continue updating even if the reported replay LSN remains stalled. Backpatch to all supported versions. Author: Fujii Masao <masao.fujii@gmail.com> Reviewed-by: Chao Li <lic@highgo.com> Reviewed-by: Shinya Kato <shinya11.kato@gmail.com> Reviewed-by: Xuneng Zhou <xunengzhou@gmail.com> Discussion: https://postgr.es/m/CAHGQGwGdGQ=1-X-71Caee-LREBUXSzyohkoQJd4yZZCMt24C0g@mail.gmail.com Backpatch-through: 13	2025-10-22 11:28:48 +09:00
Nathan Bossart	2795f5a542	Re-pgindent brin.c. Backpatch-through: 13	2025-10-21 09:56:26 -05:00
David Rowley	715983a81a	Fix BRIN 32-bit counter wrap issue with huge tables A BlockNumber (32-bit) might not be large enough to add bo_pagesPerRange to when the table contains close to 2^32 pages. At worst, this could result in a cancellable infinite loop during the BRIN index scan with power-of-2 pagesPerRange, and slow (inefficient) BRIN index scans and scanning of unneeded heap blocks for non power-of-2 pagesPerRange. Backpatch to all supported versions. Author: sunil s <sunilfeb26@gmail.com> Reviewed-by: David Rowley <dgrowleyml@gmail.com> Reviewed-by: Michael Paquier <michael@paquier.xyz> Discussion: https://postgr.es/m/CAOG6S4-tGksTQhVzJM19NzLYAHusXsK2HmADPZzGQcfZABsvpA@mail.gmail.com Backpatch-through: 13	2025-10-21 20:46:49 +13:00
Michael Paquier	b2abcfa33a	Fix comment in pg_get_shmem_allocations_numa() The comment fixed in this commit described the function as dealing with database blocks, but in reality it processes shared memory allocations. Author: Bertrand Drouvot <bertranddrouvot.pg@gmail.com> Discussion: https://postgr.es/m/aH4DDhdiG9Gi0rG7@ip-10-97-1-34.eu-west-3.compute.internal Backpatch-through: 18	2025-10-21 16:12:34 +09:00
Richard Guo	40c2428307	Fix pushdown of degenerate HAVING clauses `67a54b9e8` taught the planner to push down HAVING clauses even when grouping sets are present, as long as the clause does not reference any columns that are nullable by the grouping sets. However, there was an oversight: if any empty grouping sets are present, the aggregation node can produce a row that did not come from the input, and pushing down a HAVING clause in this case may cause us to fail to filter out that row. Currently, non-degenerate HAVING clauses are not pushed down when empty grouping sets are present, since the empty grouping sets would nullify the vars they reference. However, degenerate (variable-free) HAVING clauses are not subject to this restriction and may be incorrectly pushed down. To fix, explicitly check for the presence of empty grouping sets and retain degenerate clauses in HAVING when they are present. This ensures that we don't emit a bogus aggregated row. A copy of each such clause is also put in WHERE so that query_planner() can use it in a gating Result node. To facilitate this check, this patch expands the groupingSets tree of the query to a flat list of grouping sets before applying the HAVING pushdown optimization. This does not add any additional planning overhead, since we need to do this expansion anyway. In passing, make a small tweak to preprocess_grouping_sets() by reordering its initial operations a bit. Backpatch to v18, where this issue was introduced. Reported-by: Yuhang Qiu <iamqyh@gmail.com> Author: Richard Guo <guofenglinux@gmail.com> Author: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/0879D9C9-7FE2-4A20-9593-B23F7A0B5290@gmail.com Backpatch-through: 18	2025-10-21 12:38:16 +09:00
David Rowley	0b6a02f035	Fix reset of incorrect hash iterator in GROUPING SETS queries This fixes an unlikely issue when fetching GROUPING SET results from their internally stored hash tables. It was possible in rare cases that the hash iterator would be set up incorrectly which could result in a crash. This was introduced in `4d143509c`, so backpatch to v18. Many thanks to Yuri Zamyatin for reporting and helping to debug this issue. Bug: #19078 Reported-by: Yuri Zamyatin <yuri@yrz.am> Author: David Rowley <dgrowleyml@gmail.com> Reviewed-by: Jeff Davis <pgsql@j-davis.com> Discussion: https://postgr.es/m/19078-dfd62f840a2c0766@postgresql.org Backpatch-through: 18	2025-10-18 16:07:41 +13:00
Tomas Vondra	aa151022ec	Fix hashjoin memory balancing logic Commit `a1b4f289be` improved the hashjoin sizing to also consider the memory used by BufFiles for batches. The code however had multiple issues, making it ineffective or not working as expected in some cases. * The amount of memory needed by buffers was calculated using uint32, so it would overflow for nbatch >= 262144. If this happened the loop would exit prematurely and the memory usage would not be reduced. The nbatch overflow is fixed by reworking the condition to not use a multiplication at all, so there's no risk of overflow. An explicit cast was added to a similar calculation in ExecHashIncreaseBatchSize. * The loop adjusting the nbatch value used hash_table_bytes to calculate the old/new size, but then updated only space_allowed. The consequence is the total memory usage was not reduced, but all the memory saved by reducing the number of batches was used for the internal hash table. This was fixed by using only space_allowed. This is also more correct, because hash_table_bytes does not account for skew buckets. * The code was also doubling multiple parameters (e.g. the number of buckets for hash table), but was missing overflow protections. The loop now checks for overflow, and terminates if needed. It'd be possible to cap the value and continue the loop, but it's not worth the complexity. And the overflow implies the in-memory hash table is already very large anyway. While at it, rework the comment explaining how the memory balancing works, to make it more concise and easier to understand. The initial nbatch overflow issue was reported by Vaibhav Jain. The other issues were noticed by me and Melanie Plageman. Fix by me, with a lot of review and feedback by Melanie. Backpatch to 18, where the hashjoin memory balancing was introduced. Reported-by: Vaibhav Jain <jainva@google.com> Reviewed-by: Melanie Plageman <melanieplageman@gmail.com> Backpatch-through: 18 Discussion: https://postgr.es/m/CABa-Az174YvfFq7rLS+VNKaQyg7inA2exvPWmPWqnEn6Ditr_Q@mail.gmail.com	2025-10-17 22:27:49 +02:00
Amit Langote	1296dcf18b	Fix EPQ crash from missing partition directory in EState EvalPlanQualStart() failed to propagate es_partition_directory into the child EState used for EPQ rechecks. When execution time partition pruning ran during the EPQ scan, executor code dereferenced a NULL partition directory and crashed. Previously, propagating es_partition_directory into the EPQ EState was unnecessary because CreatePartitionPruneState(), which sets it on demand, also initialized the exec-pruning context. After commit `d47cbf474`, CreatePartitionPruneState() now initializes only the init- time pruning context, leaving exec-pruning context initialization to ExecInitNode(). Since EvalPlanQualStart() runs only ExecInitNode() and not CreatePartitionPruneState(), it can encounter a NULL es_partition_directory. Other executor fields initialized during CreatePartitionPruneState() are already copied into the child EState thanks to commit `8741e48e5d`, but es_partition_directory was missed. Fix by borrowing the parent estate's es_partition_directory in EvalPlanQualStart(), and by clearing that field in EvalPlanQualEnd() so the parent remains responsible for freeing the directory. Add an isolation test permutation that triggers EPQ with execution- time partition pruning, the case that reproduces this crash. Bug: #19078 Reported-by: Yuri Zamyatin <yuri@yrz.am> Diagnosed-by: David Rowley <dgrowleyml@gmail.com> Author: David Rowley <dgrowleyml@gmail.com> Co-authored-by: Amit Langote <amitlangote09@gmail.com> Discussion: https://postgr.es/m/19078-dfd62f840a2c0766@postgresql.org Backpatch-through: 18	2025-10-16 14:01:50 +09:00
Nathan Bossart	c8af5019be	Fix lookups in pg_{clear,restore}_{attribute,relation}_stats(). Presently, these functions look up the relation's OID, lock it, and then check privileges. Not only does this approach provide no guarantee that the locked relation matches the arguments of the lookup, but it also allows users to briefly lock relations for which they do not have privileges, which might enable denial-of-service attacks. This commit adjusts these functions to use RangeVarGetRelidExtended(), which is purpose-built to avoid both of these issues. The new RangeVarGetRelidCallback function is somewhat complicated because it must handle both tables and indexes, and for indexes, we must check privileges on the parent table and lock it first. Also, it needs to handle a couple of extremely unlikely race conditions involving concurrent OID reuse. A downside of this change is that the coding doesn't allow for locking indexes in AccessShare mode anymore; everything is locked in ShareUpdateExclusive mode. Per discussion, the original choice of lock levels was intended for a now defunct implementation that used in-place updates, so we believe this change is okay. Reviewed-by: Jeff Davis <pgsql@j-davis.com> Discussion: https://postgr.es/m/Z8zwVmGzXyDdkAXj%40nathan Backpatch-through: 18	2025-10-15 12:47:33 -05:00
Tom Lane	b48ae226e6	Fix incorrect message-printing in win32security.c. log_error() would probably fail completely if used, and would certainly print garbage for anything that needed to be interpolated into the message, because it was failing to use the correct printing subroutine for a va_list argument. This bug likely went undetected because the error cases this code is used for are rarely exercised - they only occur when Windows security API calls fail catastrophically (out of memory, security subsystem corruption, etc). The FRONTEND variant can be fixed just by calling vfprintf() instead of fprintf(). However, there was no va_list variant of write_stderr(), so create one by refactoring that function. Following the usual naming convention for such things, call it vwrite_stderr(). Author: Bryan Green <dbryan.green@gmail.com> Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/CAF+pBj8goe4fRmZ0V3Cs6eyWzYLvK+HvFLYEYWG=TzaM+tWPnw@mail.gmail.com Backpatch-through: 13	2025-10-13 17:56:45 -04:00
Peter Geoghegan	9de72704e0	Remove unused nbtree array advancement variable. Remove a variable that is no longer in use following commit `9a2e2a28`. It's not immediately clear why there were no compiler warnings about this oversight. Author: Peter Geoghegan <pg@bowt.ie> Backpatch-through: 18	2025-10-12 14:04:06 -04:00

1 2 3 4 5 ...

27236 Commits