postgres

mirror of https://github.com/postgres/postgres.git synced 2025-09-11 00:12:06 +03:00

Author	SHA1	Message	Date
Tom Lane	e708ffe79d	Fix GIN's shimTriConsistentFn to not corrupt its input. Commit `0f21db36d` made an assumption that GIN triConsistentFns would not modify their input entryRes[] arrays. But in fact, the "shim" triConsistentFn that we use for opclasses that don't supply their own did exactly that, potentially leading to wrong answers from a GIN index search. Through bad luck, none of the test cases that we have for such opclasses exposed the bug. One response to this could be that the assumption of consistency check functions not modifying entryRes[] arrays is a bad one, but it still seems reasonable to me. Notably, shimTriConsistentFn is itself assuming that with respect to the underlying boolean consistentFn, so it's sure being self-centered in supposing that it gets to do so. Fortunately, it's quite simple to fix shimTriConsistentFn to restore the entry-time state of entryRes[], so let's do that instead. This issue doesn't affect any core GIN opclasses, since they all supply their own triConsistentFns. It does affect contrib modules btree_gin, hstore, and intarray. Along the way, I (tgl) noticed that shimTriConsistentFn failed to pick up on a "recheck" flag returned by its first call to the boolean consistentFn. This may be only a latent problem, since it would be unlikely for a consistentFn to set recheck for the all-false case and not any other cases. (Indeed, none of our contrib modules do that.) Nonetheless, it's formally wrong. Reported-by: Vinod Sridharan <vsridh90@gmail.com> Author: Vinod Sridharan <vsridh90@gmail.com> Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/CAFMdLD7XzsXfi1+DpTqTgrD8XU0i2C99KuF=5VHLWjx4C1pkcg@mail.gmail.com Backpatch-through: 13	2025-04-12 12:28:02 -04:00
Peter Geoghegan	a6cab6a78e	Harmonize function parameter names for Postgres 18. Make sure that function declarations use names that exactly match the corresponding names from function definitions in a few places. These inconsistencies were all introduced during Postgres 18 development. This commit was written with help from clang-tidy, by mechanically applying the same rules as similar clean-up commits (the earliest such commit was commit `035ce1fe`).	2025-04-12 12:07:36 -04:00
Daniel Gustafsson	847bbb21f8	Fix recently introduced typos This fixes typos in docs and comments introduced during the v18 development cycle, to keep them from ending up in backbranches. Author: Jacob Brazeal <jacob.brazeal@gmail.com> Reviewed-by: Daniel Gustafsson <daniel@yesql.se> Discussion: https://postgr.es/m/CA+COZaCgGua25f2hSrjrDLJcJJAHkwoKgTTqUy-wyL1=64JNjw@mail.gmail.com	2025-04-11 22:17:12 +02:00
Michael Paquier	2e57790836	Fix race with synchronous_standby_names at startup synchronous_standby_names cannot be reloaded safely by backends, and the checkpointer is in charge of updating a state in shared memory if the GUC is enabled in WalSndCtl, to let the backends know if they should wait or not for a given LSN. This provides a strict control on the timing of the waiting queues if the GUC is enabled or disabled, then reloaded. The checkpointer is also in charge of waking up the backends that could be waiting for a LSN when the GUC is disabled. This logic had a race condition at startup, where it would be possible for backends to not wait for a LSN even if synchronous_standby_names is enabled. This would cause visibility issues with transactions that we should be waiting for but they were not. The problem lasts until the checkpointer does its initial update of the shared memory state when it loads synchronous_standby_names. In order to take care of this problem, the shared memory state in WalSndCtl is extended to detect if it has been initialized by the checkpointer, and not only check if synchronous_standby_names is defined. In WalSndCtlData, sync_standbys_defined is renamed to sync_standbys_status, a bits8 able to know about two states: - If the shared memory state has been initialized. This flag is set by the checkpointer at startup once, and never removed. - If synchronous_standby_names is known as defined in the shared memory state. This is the same as the previous sync_standbys_defined in WalSndCtl. This method gives a way for backends to decide what they should do until the shared memory area is initialized, and they now ultimately fall back to a check on the GUC value in this case, which is the best thing that can be done. Fortunately, SyncRepUpdateSyncStandbysDefined() is called immediately by the checkpointer when this process starts, so the window is very narrow. It is possible to enlarge the problematic window by making the checkpointer wait at the beginning of SyncRepUpdateSyncStandbysDefined() with a hardcoded sleep for example, and doing so has showed that a 2PC visibility test is indeed failing. On machines slow enough, this bug would cause spurious failures. In 17~, we have looked at the possibility of adding an injection point to have a reproducible test, but as the problematic window happens at early startup, we would need to invent a way to make an injection point optionally persistent across restarts when attached, something that would be fine for this case as it would involve the checkpointer. This issue is quite old, and can be reproduced on all the stable branches. Author: Melnikov Maksim <m.melnikov@postgrespro.ru> Co-authored-by: Michael Paquier <michael@paquier.xyz> Discussion: https://postgr.es/m/163fcbec-900b-4b07-beaa-d2ead8634bec@postgrespro.ru Backpatch-through: 13	2025-04-11 10:00:21 +09:00
David Rowley	530050d8d2	Add code comment explaining ins_since_vacuum and aborted inserts Sami complained that there's a discrepancy between n_mod_since_analyze and n_ins_since_vacuum, as the former only accounts for committed changes and the latter tracks committed and aborted inserts. Nobody seemed overly concerned that this would cause any concerning issues. The repercussions, from what I can tell, are limited to causing an autovacuum to trigger for inserts sooner than it otherwise might. For typical ratios of commits to aborts, it's unlikely to ever be noticed. Fixing things to make it so n_ins_since_vacuum only displays committed inserts would require an additional field in PgStat_TableCounts, which does not quite seem worthwhile at this stage. This commit just adds a comment with some details to mention that we know about it, which will hopefully prevent repeat discussions. Reported-by: Sami Imseih <samimseih@gmail.com> Author: David Rowley <drowleyml@gmail.com> Reviewed-by: Sami Imseih <samimseih@gmail.com> Discussion: https://postgr.es/m/CAApHDvpgV3a-R2EGmPOh0L-x3pHbZpM3y4dySWfy+UqUazwDQA@mail.gmail.com	2025-04-11 11:36:21 +12:00
David Rowley	928394b664	Improve various new-to-v18 appendStringInfo calls Similar to `8461424fd`, here we adjust a few new locations which were not using the most suitable appendStringInfo* function for the intended purpose. Author: David Rowley <drowleyml@gmail.com Discussion: https://postgr.es/m/CAApHDvqJnNjueb=Eoj8K+8n0g7nj_AcPWSiCj5RNV4fDejAfqA@mail.gmail.com	2025-04-11 10:07:22 +12:00
Daniel Gustafsson	55ef7abf88	Rename global variable backing DSA area The global variable backing the DSA area for Memory Context stats reporting had a too generic name, rename to be more descriptive. Independently reported by Peter and Laurenz. Author: Daniel Gustafsson <daniel@yesql.se> Reported-by: Peter Eisentraut <peter@eisentraut.org> Reported-by: Laurenz Albe <laurenz.albe@cybertec.at> Discussion: https://postgr.es/m/d51172bd4e7f4b07a18a0288ca1b1c28a71a5f6a.camel@cybertec.at Discussion: https://postgr.es/m/25095db5-b595-4b85-9100-d358907c25b5@eisentraut.org	2025-04-10 22:40:27 +02:00
Tom Lane	f27eb0325b	Remove useless check for negative result of ip_addrsize(). By inspection, ip_addrsize() can't return a negative result. (If it could, we'd have way bigger problems elsewhere.) So delete useless check in network_send(). Most C compilers are probably perfectly capable of removing this code by themselves, but it's confusing/misleading. Bug: #18889 Reported-by: Daniel Elishakov <dan-eli@mail.ru> Discussion: https://postgr.es/m/18889-73d4f19e953a629e@postgresql.org	2025-04-10 14:18:07 -04:00
Amit Kapila	4909b38af0	Fix data loss in logical replication. Data loss can happen when the DDLs like ALTER PUBLICATION ... ADD TABLE ... or ALTER TYPE ... that don't take a strong lock on table happens concurrently to DMLs on the tables involved in the DDL. This happens because logical decoding doesn't distribute invalidations to concurrent transactions and those transactions use stale cache data to decode the changes. The problem becomes bigger because we keep using the stale cache even after those in-progress transactions are finished and skip the changes required to be sent to the client. This commit fixes the issue by distributing invalidation messages from catalog-modifying transactions to all concurrent in-progress transactions. This allows the necessary rebuild of the catalog cache when decoding new changes after concurrent DDL. We observed performance regression primarily during frequent execution of publication DDL statements that modify the published tables. The regression is minor or nearly nonexistent for DDLs that do not affect the published tables or occur infrequently, making this a worthwhile cost to resolve a longstanding data loss issue. An alternative approach considered was to take a strong lock on each affected table during publication modification. However, this would only address issues related to publication DDLs (but not the ALTER TYPE ...) and require locking every relation in the database for publications created as FOR ALL TABLES, which is impractical. The bug exists in all supported branches, but we are backpatching till 14. The fix for 13 requires somewhat bigger changes than this fix, so the fix for that branch is still under discussion. Reported-by: hubert depesz lubaczewski <depesz@depesz.com> Reported-by: Tomas Vondra <tomas.vondra@enterprisedb.com> Author: Shlok Kyal <shlok.kyal.oss@gmail.com> Author: Hayato Kuroda <kuroda.hayato@fujitsu.com> Reviewed-by: Zhijie Hou <houzj.fnst@fujitsu.com> Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com> Reviewed-by: Amit Kapila <amit.kapila16@gmail.com> Tested-by: Benoit Lobréau <benoit.lobreau@dalibo.com> Backpatch-through: 14 Discussion: https://postgr.es/m/de52b282-1166-1180-45a2-8d8917ca74c6@enterprisedb.com Discussion: https://postgr.es/m/CAD21AoAenVqiMjpN-PvGHL1N9DWnHSq673bfgr6phmBUzx=kLQ@mail.gmail.com	2025-04-10 13:14:40 +05:30
David Rowley	d7c04db27a	Update wording in optimizer/README for EquivalenceClasses `d69d45a5a` changed how em_is_child members are stored in EquivalenceClasses. Children are no longer stored in the ec_members list. optimizer/README mentioned that most operations "should ignore child members", but that felt a little untrue now since child members are now stored in a separate place, they simply won't be found by the normal means of looking (a foreach loop over ec_members), and if you don't find them, there's technically no need to "ignore" them. Here we tweak the wording slightly to reflect the new storage location for child members. Reported-by: Amit Langote <amitlangote09@gmail.com> Author: Amit Langote <amitlangote09@gmail.com> Author: David Rowley <dgrowleyml@gmail.com> Discussion: https://postgr.es/m/CA+HiwqE8v=EuAP_3F_A2xn8zWx+nG_etW_Fe_DvKO-Fkx=+DdQ@mail.gmail.com	2025-04-10 17:33:58 +12:00
Tomas Vondra	3887d0cfeb	Cleanup of pg_numa.c This moves/renames some of the functions defined in pg_numa.c: * pg_numa_get_pagesize() is renamed to pg_get_shmem_pagesize(), and moved to src/backend/storage/ipc/shmem.c. The new name better reflects that the page size is not related to NUMA, and it's specifically about the page size used for the main shared memory segment. * move pg_numa_available() to src/backend/storage/ipc/shmem.c, i.e. into the backend (which more appropriate for functions callable from SQL). While at it, improve the comment to explain what page size it returns. * remove unnecessary includes from src/port/pg_numa.c, adding unnecessary dependencies (src/port should be suitable for frontent). These were either leftovers or unnecessary thanks to the other changes in this commit. This eliminates unnecessary dependencies on backend symbols, which we don't want in src/port. Reported-by: Kirill Reshke <reshkekirill@gmail.com> Reviewed-by: Andres Freund <andres@anarazel.de> https://postgr.es/m/CALdSSPi5fj0a7UG7Fmw2cUD1uWuckU_e8dJ+6x-bJEokcSXzqA@mail.gmail.com	2025-04-09 21:50:17 +02:00
Heikki Linnakangas	0f1433f053	Fix a few oversights in the longer cancel keys patch Change MyCancelKeyLength's type from uint8 to int. While it always fits in a uint8, plain int is less surprising, as there's no particular reason for it to be uint8. Fix one ProcSignalInit caller that passed 'false' instead of NULL for the pointer argument. Author: Peter Eisentraut <peter@eisentraut.org> Discussion: https://www.postgresql.org/message-id/61be9e31-7b7d-49d5-bc11-721800d89d64@eisentraut.org	2025-04-09 13:11:42 +03:00
Tom Lane	dd496eedea	Doc: note that two examples in optimizer/README are oversimplified. These examples fail to account for join clauses generated by EquivalenceClasses, but since we haven't mentioned EquivalenceClasses yet it seems like it'd just add confusion to make them fully accurate. Instead, parenthetically note that they're oversimplified. Reported-by: Zeyuan Hu <ferrishu3886@gmail.com> Co-authored-by: David Rowley <dgrowleyml@gmail.com> Co-authored-by: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/CACvHWmYFo+60yMqKJajDDvKN5EM41YHrCT3oxukwXmGAqpWvyw@mail.gmail.com	2025-04-08 23:03:33 -04:00
Amit Kapila	12eece5fd5	Fix uninitialized index information access during apply. The issue happens when building conflict information during apply of INSERT or UPDATE operations that violate unique constraints on leaf partitions. The problem was introduced in commit `9ff68679b5`, which removed the redundant calls to ExecOpenIndices/ExecCloseIndices. The previous code was relying on the redundant ExecOpenIndices call in apply_handle_tuple_routing() to build the index information required for unique key conflict detection. The fix is to delay building the index information until a conflict is detected instead of relying on ExecOpenIndices to do the same. The additional benefit of this approach is that it avoids building index information when there is no conflict. Author: Hou Zhijie <houzj.fnst@fujitsu.com> Reviewed-by:Reviewed-by: Amit Kapila <amit.kapila16@gmail.com> Discussion: https://postgr.es/m/TYAPR01MB57244ADA33DDA57119B9D26494A62@TYAPR01MB5724.jpnprd01.prod.outlook.com	2025-04-08 15:35:42 +05:30
Thomas Munro	f78ca6f3eb	Introduce file_copy_method setting. It can be set to either COPY (the default) or CLONE if the system supports it. CLONE causes callers of copydir(), currently CREATE DATABASE ... STRATEGY=FILE_COPY and ALTER DATABASE ... SET TABLESPACE = ..., to use copy_file_range (Linux, FreeBSD) or copyfile (macOS) to copy files instead of a read-write loop over the contents. CLONE gives the kernel the opportunity to share block ranges on copy-on-write file systems and push copying down to storage on others, depending on configuration. On some systems CLONE can be used to clone large databases quickly with CREATE DATABASE ... TEMPLATE=source STRATEGY=FILE_COPY. Other operating systems could be supported; patches welcome. Co-authored-by: Nazir Bilal Yavuz <byavuz81@gmail.com> Reviewed-by: Robert Haas <robertmhaas@gmail.com> Reviewed-by: Ranier Vilela <ranier.vf@gmail.com> Discussion: https://postgr.es/m/CA%2BhUKGLM%2Bt%2BSwBU-cHeMUXJCOgBxSHLGZutV5zCwY4qrCcE02w%40mail.gmail.com	2025-04-08 21:35:38 +12:00
Daniel Gustafsson	042a66291b	Add function to get memory context stats for processes This adds a function for retrieving memory context statistics and information from backends as well as auxiliary processes. The intended usecase is cluster debugging when under memory pressure or unanticipated memory usage characteristics. When calling the function it sends a signal to the specified process to submit statistics regarding its memory contexts into dynamic shared memory. Each memory context is returned in detail, followed by a cumulative total in case the number of contexts exceed the max allocated amount of shared memory. Each process is limited to use at most 1Mb memory for this. A summary can also be explicitly requested by the user, this will return the TopMemoryContext and a cumulative total of all lower contexts. In order to not block on busy processes the caller specifies the number of seconds during which to retry before timing out. In the case where no statistics are published within the set timeout, the last known statistics are returned, or NULL if no previously published statistics exist. This allows dash- board type queries to continually publish even if the target process is temporarily congested. Context records contain a timestamp to indicate when they were submitted. Author: Rahila Syed <rahilasyed90@gmail.com> Reviewed-by: Daniel Gustafsson <daniel@yesql.se> Reviewed-by: Andres Freund <andres@anarazel.de> Reviewed-by: Tomas Vondra <tomas@vondra.me> Reviewed-by: Atsushi Torikoshi <torikoshia@oss.nttdata.com> Reviewed-by: Fujii Masao <masao.fujii@oss.nttdata.com> Reviewed-by: Alexander Korotkov <aekorotkov@gmail.com> Discussion: https://postgr.es/m/CAH2L28v8mc9HDt8QoSJ8TRmKau_8FM_HKS41NeO9-6ZAkuZKXw@mail.gmail.com	2025-04-08 11:06:56 +02:00
Andres Freund	15f0cb26b5	Increase BAS_BULKREAD based on effective_io_concurrency Before, BAS_BULKREAD was always of size 256kB. With the default io_combine_limit of 16, that only allowed 1-2 IOs to be in flight - insufficient even on very low latency storage. We don't just want to increase the size to a much larger hardcoded value, as very large rings (10s of MBs of of buffers), appear to have negative performance effects when reading in data that the OS has cached (but not when actually needing to do IO). To address this, increase the size of BAS_BULKREAD to allow for io_combine_limit * effective_io_concurrency buffers getting read in. To prevent the ring being much larger than useful, limit the increased size with GetPinLimit(). The formula outlined above keeps the ring size to sizes for which we have not observed performance regressions, unless very large effective_io_concurrency values are used together with large shared_buffers setting. Reviewed-by: Thomas Munro <thomas.munro@gmail.com> Discussion: https://postgr.es/m/lqwghabtu2ak4wknzycufqjm5ijnxhb4k73vzphlt2a3wsemcd@gtftg44kdim6 Discussion: https://postgr.es/m/uvrtrknj4kdytuboidbhwclo4gxhswwcpgadptsjvjqcluzmah@brqs62irg4dt	2025-04-08 02:41:03 -04:00
Andres Freund	dcf7e1697b	Add pg_buffercache_evict_{relation,all} functions In addition to the added functions, the pg_buffercache_evict() function now shows whether the buffer was flushed. pg_buffercache_evict_relation(): Evicts all shared buffers in a relation at once. pg_buffercache_evict_all(): Evicts all shared buffers at once. Both functions provide mechanism to evict multiple shared buffers at once. They are designed to address the inefficiency of repeatedly calling pg_buffercache_evict() for each individual buffer, which can be time-consuming when dealing with large shared buffer pools. (e.g., ~477ms vs. ~2576ms for 16GB of fully populated shared buffers). These functions are intended for developer testing and debugging purposes and are available to superusers only. Minimal tests for the new functions are included. Also, there was no test for pg_buffercache_evict(), test for this added too. No new extension version is needed, as it was already increased this release by `ba2a3c2302`. Author: Nazir Bilal Yavuz <byavuz81@gmail.com> Reviewed-by: Andres Freund <andres@anarazel.de> Reviewed-by: Aidar Imamov <a.imamov@postgrespro.ru> Reviewed-by: Joseph Koshakow <koshy44@gmail.com> Discussion: https://postgr.es/m/CAN55FZ0h_YoSqqutxV6DES1RW8ig6wcA8CR9rJk358YRMxZFmw%40mail.gmail.com	2025-04-08 02:19:32 -04:00
David Rowley	d69d45a5a9	Speedup child EquivalenceMember lookup in planner When planning queries to partitioned tables, we clone all EquivalenceMembers belonging to the partitioned table into em_is_child EquivalenceMembers for each non-pruned partition. For partitioned tables with large numbers of partitions, this meant the ec_members list could become large and code searching that list would become slow. Effectively, the more partitions which were present, the more searches needed to be performed for operations such as find_ec_member_matching_expr() during create_plan() and the more partitions present, the longer these searches would take, i.e., a quadratic slowdown. To fix this, here we adjust how we store EquivalenceMembers for em_is_child members. Instead of storing these directly in ec_members, these are now stored in a new array of Lists in the EquivalenceClass, which is indexed by the relid. When we want to find EquivalenceMembers belonging to a certain child relation, we can narrow the search to the array element for that relation. To make EquivalenceMember lookup easier and to reduce the amount of code change, this commit provides a pair of functions to allow iteration over the EquivalenceMembers of an EC which also handles finding the child members, if required. Callers that never need to look at child members can remain using the foreach loop over ec_members, which will now often be faster due to only parent-level members being stored there. The actual performance increases here are highly dependent on the number of partitions and the query being planned. Performance increases can be visible with as few as 8 partitions, but the speedup is marginal for such low numbers of partitions. The speedups become much more visible with a few dozen to hundreds of partitions. With some tested queries using 56 partitions, the planner was around 3x faster than before. For use cases with thousands of partitions, these are likely to become significantly faster. Some testing has shown planner speedups of 60x or more with 8192 partitions. Author: Yuya Watari <watari.yuya@gmail.com> Co-authored-by: David Rowley <dgrowleyml@gmail.com> Reviewed-by: David Rowley <dgrowleyml@gmail.com> Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Reviewed-by: Andrey Lepikhov <a.lepikhov@postgrespro.ru> Reviewed-by: Alena Rybakina <lena.ribackina@yandex.ru> Reviewed-by: Dmitry Dolgov <9erthalion6@gmail.com> Reviewed-by: Amit Langote <amitlangote09@gmail.com> Reviewed-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> Tested-by: Thom Brown <thom@linux.com> Tested-by: newtglobal postgresql_contributors <postgresql_contributors@newtglobalcorp.com> Discussion: https://postgr.es/m/CAJ2pMkZNCgoUKSE%2B_5LthD%2BKbXKvq6h2hQN8Esxpxd%2Bcxmgomg%40mail.gmail.com	2025-04-08 18:09:57 +12:00
Amit Kapila	105b2cb336	Stabilize 035_standby_logical_decoding.pl. Some tests try to invalidate logical slots on the standby server by running VACUUM on the primary. The problem is that xl_running_xacts was getting generated and replayed before the VACUUM command, leading to the advancement of the active slot's catalog_xmin. Due to this, active slots were not getting invalidated, leading to test failures. We fix it by skipping the generation of xl_running_xacts for the required tests with the help of injection points. As the required interface for injection points was not present in back branches, we fixed the failing tests in them by disallowing the slot to become active for the required cases (where rows_removed conflict could be generated). Author: Hayato Kuroda <kuroda.hayato@fujitsu.com> Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com> Reviewed-by: Amit Kapila <amit.kapila16@gmail.com> Backpatch-through: 16, where it was introduced Discussion: https://postgr.es/m/Z6oQXc8LmiTLfwLA@ip-10-97-1-34.eu-west-3.compute.internal	2025-04-08 09:38:02 +05:30
Bruce Momjian	46b4ba533c	Fix PG 17 [NOT] NULL optimization bug for domains A PG 17 optimization allowed columns with NOT NULL constraints to skip table scans for IS NULL queries, and to skip IS NOT NULL checks for IS NOT NULL queries. This didn't work for domain types, since domain types don't follow the IS NULL/IS NOT NULL constraint logic. To fix, disable this optimization for domains for PG 17+. Reported-by: Jan Behrens Diagnosed-by: Tom Lane Discussion: https://postgr.es/m/Z37p0paENWWUarj-@momjian.us Backpatch-through: 17	2025-04-07 21:33:42 -04:00
Michael Paquier	039549d70f	Flush the IO statistics of active WAL senders more frequently WAL senders do not flush their statistics until they exit, limiting the monitoring possible for live processes. This is penalizing when WAL senders are running for a long time, like in streaming or logical replication setups, because it is not possible to know the amount of IO they generate while running. This commit makes WAL senders more aggressive with their statistics flush, using an internal of 1 second, with the flush timing calculated based on the existing GetCurrentTimestamp() done before the sleeps done to wait for some activity. Note that the sleep done for logical and physical WAL senders happens in two different code paths, so the stats flushes need to happen in these two places. One test is added for the physical WAL sender case, and one for the logical WAL sender case. This can be done in a stable fashion by relying on the WAL generated by the TAP tests in combination with a stats reset while a server is running, but only on HEAD as WAL data has been added to pg_stat_io in `a051e71e28`. This issue exists since `a9c70b46db` and the introduction of pg_stat_io, so backpatch down to v16. Author: Bertrand Drouvot <bertranddrouvot.pg@gmail.com> Reviewed-by: vignesh C <vignesh21@gmail.com> Reviewed-by: Xuneng Zhou <xunengzhou@gmail.com> Discussion: https://postgr.es/m/Z73IsKBceoVd4t55@ip-10-97-1-34.eu-west-3.compute.internal Backpatch-through: 16	2025-04-08 07:57:19 +09:00
Tomas Vondra	8cc139bec3	Introduce pg_shmem_allocations_numa view Introduce new pg_shmem_alloctions_numa view with information about how shared memory is distributed across NUMA nodes. For each shared memory segment, the view returns one row for each NUMA node backing it, with the total amount of memory allocated from that node. The view may be relatively expensive, especially when executed for the first time in a backend, as it has to touch all memory pages to get reliable information about the NUMA node. This may also force allocation of the shared memory. Unlike pg_shmem_allocations, the view does not show anonymous shared memory allocations. It also does not show memory allocated using the dynamic shared memory infrastructure. Author: Jakub Wartak <jakub.wartak@enterprisedb.com> Reviewed-by: Andres Freund <andres@anarazel.de> Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com> Reviewed-by: Tomas Vondra <tomas@vondra.me> Discussion: https://postgr.es/m/CAKZiRmxh6KWo0aqRqvmcoaX2jUxZYb4kGp3N%3Dq1w%2BDiH-696Xw%40mail.gmail.com	2025-04-07 23:08:17 +02:00
Tomas Vondra	65c298f61f	Add support for basic NUMA awareness Add basic NUMA awareness routines, using a minimal src/port/pg_numa.c portability wrapper and an optional build dependency, enabled by --with-libnuma configure option. For now this is Linux-only, other platforms may be supported later. A built-in SQL function pg_numa_available() allows checking NUMA support, i.e. that the server was built/linked with the NUMA library. The main function introduced is pg_numa_query_pages(), which allows determining the NUMA node for individual memory pages. Internally the function uses move_pages(2) syscall, as it allows batching, and is more efficient than get_mempolicy(2). Author: Jakub Wartak <jakub.wartak@enterprisedb.com> Co-authored-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com> Reviewed-by: Andres Freund <andres@anarazel.de> Reviewed-by: Álvaro Herrera <alvherre@alvh.no-ip.org> Reviewed-by: Tomas Vondra <tomas@vondra.me> Discussion: https://postgr.es/m/CAKZiRmxh6KWo0aqRqvmcoaX2jUxZYb4kGp3N%3Dq1w%2BDiH-696Xw%40mail.gmail.com	2025-04-07 23:08:17 +02:00
Andres Freund	8e293e689b	aio: Make AIO more compatible with valgrind In some edge cases valgrind flags issues with the memory referenced by IOs. All of the cases addressed in this change are false positives. Most of the false positives are caused by UnpinBuffer[NoOwner] marking buffer data as inaccessible. This happens even though the AIO subsystem still holds a pin. That's good, there shouldn't be accesses to the buffer outside of AIO related code until it is pinned by "user" code again. But it requires some explicit work - if the buffer is not pinned by the current backend, we need to explicitly mark the buffer data accessible/inaccessible while executing completion callbacks. That however causes a cascading issue in IO workers: After the completion callbacks for a buffer is executed, the page is marked as inaccessible. If subsequently the same worker is executing IO targeting the same buffer, we would get an error, as the memory is still marked inaccessible. To avoid that, we need to explicitly mark the memory as accessible in IO workers. Another issue is that IO executed in workers or via io_uring will not mark memory as DEFINED. In the case of workers that is because valgrind does not track memory definedness across processes. For io_uring that is because valgrind does not understand io_uring, and therefore its IOs never mark memory as defined, whether the completions are processed in the defining process or in another context. It's not entirely clear how to best solve that. The current user of AIO is not affected, as it explicitly marks buffers as DEFINED & NOACCESS anyway. Defer solving this issue until we have a user with different needs. Per buildfarm animal skink. Reviewed-by: Noah Misch <noah@leadboat.com> Co-authored-by: Noah Misch <noah@leadboat.com> Discussion: https://postgr.es/m/3pd4322mogfmdd5nln3zphdwhtmq3rzdldqjwb2sfqzcgs22lf@ok2gletdaoe6	2025-04-07 15:20:30 -04:00
Andres Freund	8ab4241b9f	localbuf: Add Valgrind buffer access instrumentation This mirrors `1e0dfd166b` (+ `46ef520b95`), for temporary table buffers. This is mainly interesting right now because the AIO work currently triggers spurious valgrind errors, and the fix for that is cleaner if temp buffers behave the same as shared buffers. This requires one change beyond the annotations themselves, namely to pin local buffers while writing them out in FlushRelationBuffers(). Reviewed-by: Noah Misch <noah@leadboat.com> Co-authored-by: Noah Misch <noah@leadboat.com> Discussion: https://postgr.es/m/3pd4322mogfmdd5nln3zphdwhtmq3rzdldqjwb2sfqzcgs22lf@ok2gletdaoe6	2025-04-07 15:20:30 -04:00
Tom Lane	b73e6d71a8	Fix erroneous construction of functions' dependencies on transforms. The list of transform objects that a function should use is specified in CREATE FUNCTION's TRANSFORM clause, and then represented indirectly in pg_proc.protrftypes. However, ProcedureCreate completely ignored that for purposes of constructing pg_depend entries, and instead made the function depend on any transforms that exist for its parameter or return data types. This is bad in both directions: the function could be made dependent on a transform it does not actually use, or it could try to use a transform that's since been dropped. (The latter scenario would require use of a transform that's not for any of the parameter or return types, but that seems legit for cases where the function performs SQL operations internally.) To fix, pass in the list of transform objects that CreateFunction identified, and build pg_depend entries from that not from the parameter/return types. This results in changes in the expected test outputs in contrib/bool_plperl, which I guess are due to different ordering of pg_depend entries -- that test case is surely not exercising either of the problem scenarios. This fix is not back-patchable as-is: changing the signature of ProcedureCreate seems too risky in stable branches. We could do something like making ProcedureCreate a wrapper around ProcedureCreateExt or so. However, I'm more inclined to do nothing in the back branches. We had no field complaints up to now, so the hazards don't seem to be a big issue in practice. And we couldn't do anything about existing pg_depend entries, so a back-patched fix would result in a mishmash of dependencies created according to different rules. That cure could be worse than the disease, perhaps. I bumped catversion just to lay down a marker that the expected contents of pg_depend are a bit different than before. Reported-by: Chapman Flack <jcflack@acm.org> Author: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/3112950.1743984111@sss.pgh.pa.us	2025-04-07 13:31:37 -04:00
Álvaro Herrera	a379061a22	Allow NOT NULL constraints to be added as NOT VALID This allows them to be added without scanning the table, and validating them afterwards without holding access exclusive lock on the table after any violating rows have been deleted or fixed. Doing ALTER TABLE ... SET NOT NULL for a column that has an invalid not-null constraint validates that constraint. ALTER TABLE .. VALIDATE CONSTRAINT is also supported. There are various checks on whether an invalid constraint is allowed in a child table when the parent table has a valid constraint; this should match what we do for enforced/not enforced constraints. pg_attribute.attnotnull is now only an indicator for whether a not-null constraint exists for the column; whether it's valid or invalid must be queried in pg_constraint. Applications can continue to query pg_attribute.attnotnull as before, but now it's possible that NULL rows are present in the column even when that's set to true. For backend internal purposes, we cache the nullability status in CompactAttribute->attnullability that each tuple descriptor carries (replacing CompactAttribute.attnotnull, which was a mirror of Form_pg_attribute.attnotnull). During the initial tuple descriptor creation, based on the pg_attribute scan, we set this to UNRESTRICTED if pg_attribute.attnotnull is false, or to UNKNOWN if it's true; then we update the latter to VALID or INVALID depending on the pg_constraint scan. This flag is also copied when tupledescs are copied. Comparing tuple descs for equality must also compare the CompactAttribute.attnullability flag and return false in case of a mismatch. pg_dump deals with these constraints by storing the OIDs of invalid not-null constraints in a separate array, and running a query to obtain their properties. The regular table creation SQL omits them entirely. They are then dealt with in the same way as "separate" CHECK constraints, and dumped after the data has been loaded. Because no additional pg_dump infrastructure was required, we don't bump its version number. I decided not to bump catversion either, because the old catalog state works perfectly in the new world. (Trying to run with new catalog state and the old server version would likely run into issues, however.) System catalogs do not support invalid not-null constraints (because commit `14e87ffa5c` didn't allow them to have pg_constraint rows anyway.) Author: Rushabh Lathia <rushabh.lathia@gmail.com> Author: Jian He <jian.universality@gmail.com> Reviewed-by: Álvaro Herrera <alvherre@alvh.no-ip.org> Tested-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> Discussion: https://postgr.es/m/CAGPqQf0KitkNack4F5CFkFi-9Dqvp29Ro=EpcWt=4_hs-Rt+bQ@mail.gmail.com	2025-04-07 19:19:50 +02:00
Tom Lane	3516ea768c	Add local-address escape "%L" to log_line_prefix. This escape shows the numeric server IP address that the client has connected to. Unix-socket connections will show "[local]". Non-client processes (e.g. background processes) will show "[none]". We expect that this option will be of interest to only a fairly small number of users. Therefore the implementation is optimized for the case where it's not used (that is, we don't do the string conversion until we have to), and we've not added the field to csvlog or jsonlog formats. Author: Greg Sabino Mullane <htamfids@gmail.com> Reviewed-by: Cary Huang <cary.huang@highgo.ca> Reviewed-by: David Steele <david@pgmasters.net> Reviewed-by: Jim Jones <jim.jones@uni-muenster.de> Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/CAKAnmmK-U+UicE-qbNU23K--Q5XTLdM6bj+gbkZBZkjyjrd3Ow@mail.gmail.com	2025-04-07 11:06:05 -04:00
Andres Freund	8ce79483dc	read_stream: Fix overflow hazard with large shared buffers If the limit returned by GetAdditionalPinLimit() is large, the buffer_limit variable in read_stream_start_pending_read() can overflow. While the code is careful to limit buffer_limit PG_INT16_MAX, we subsequently add the number of forwarded buffers. The overflow can lead to assertion failures, crashes or wrong query results when using large shared buffers. It seems easier to avoid this if we make the buffer_limit variable an int, instead of an int16. Do so, and clamp buffer_limit after adding the number of forwarded buffers. It's possible we might want to address this and related issues more widely by changing to int instead of int16 more widely, but since the consequences of this bug can be confusing, it seems better to fix it now. This bug was introduced in `ed0b87caac`. Discussion: https://postgr.es/m/ewvz3cbtlhrwqk7h6ca6cctiqh7r64ol3pzb3iyjycn2r5nxk5@tnhw3a5zatlr	2025-04-07 09:45:00 -04:00
Alexander Korotkov	717d0e8dd9	Remove GUC_NOT_IN_SAMPLE from enable_self_join_elimination `fc069a3a63` implements Self-Join Elimination (SJE) and provides a new GUC variable: enable_self_join_elimination. This new GUC variable was marked as GUC_NOT_IN_SAMPLE. However, enable_self_join_elimination is documented and is not different from any other enable_* GUCs. Thus, remove GUC_NOT_IN_SAMPLE from it and add it to the postgresql.conf.sample. Discussion: https://postgr.es/m/CAPpHfdsqMTEsmxk3aQwt6xPz%2BKpUELO%3D6fzmER9ZRGrbs4uMfA%40mail.gmail.com Author: Tender Wang <tndrwang@gmail.com> Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>	2025-04-07 16:28:54 +03:00
Michael Paquier	c36eda2591	Clarify comment for worst-case allocation in quote_literal_cstr() palloc() is invoked with a specific formula for its allocation size in quote_literal_cstr(). This wastes some memory, but the size is large enough to cover even the worst-case scenarios. No explanations were given about the reasons behind these numbers. This commit adds more documentation about all that. Author: Steve Chavez <steve@supabase.io> Discussion: https://postgr.es/m/CAGRrpzZ9bToRWS+fAnjxDJrxwZN1QcJ-y1Pn2yg=Hst6rydLtw@mail.gmail.com	2025-04-07 10:02:12 +09:00
Michael Paquier	3191a593d6	Fix use-after-free in pgstat_fetch_stat_backend_by_pid() stats_fetch_consistency set to "snapshot" causes the backend entry "beentry" retrieved by pgstat_get_beentry_by_proc_number() to be reset at the beginning of pgstat_fetch_stat_backend() when fetching the backend pgstats entry. As coded, "beentry" was being accessed after being freed. This commit moves all the accesses to "beentry" to happen before calling pgstat_fetch_stat_backend(), fixing the problem. This problem could be reached by calling the SQL functions pg_stat_get_backend_io() or pg_stat_get_backend_wal(). Issue caught by valgrind. Reported-by: Alexander Lakhin <exclusion@gmail.com> Author: Bertrand Drouvot <bertranddrouvot.pg@gmail.com> Discussion: https://postgr.es/m/f1788cc0-253a-4a3a-aee0-1b8ab9538736@gmail.com	2025-04-07 09:51:40 +09:00
Fujii Masao	173c97812f	Use XLOG_CONTROL_FILE macro consistently for control file name. The XLOG_CONTROL_FILE macro (defined in access/xlog_internal.h) represents the control file name. While some parts of the codebase already use this macro, others previously hardcoded the file name as a string. This commit replaces those hardcoded strings with the macro, ensuring consistent usage throughout the code. This makes future maintenance easier and improves searchability, for example when grepping for control file usage. Author: Anton A. Melnikov <a.melnikov@postgrespro.ru> Reviewed-by: Daniel Gustafsson <daniel@yesql.se> Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com> Reviewed-by: Masao Fujii <masao.fujii@gmail.com> Discussion: https://postgr.es/m/0841ec77-47e5-452a-adb4-c6fa55d605fc@postgrespro.ru	2025-04-07 09:27:33 +09:00
Andres Freund	57dec20fd4	aio: Avoid spurious coverity warning PgAioResult.result is never accessed in the relevant path, but coverity complains about an uninitialized access anyway. So just zero-initialize the whole thing. While at it, reduce the scope of the variable. Reported-by: Ranier Vilela <ranier.vf@gmail.com> Reviewed-by: Noah Misch <noah@leadboat.com> Discussion: https://postgr.es/m/CAEudQApsKqd-s+fsUQ0OmxJAMHmBSXxrAz3dCs+uvqb3iRtjSw@mail.gmail.com	2025-04-06 12:07:02 -04:00
Tom Lane	2e4ccf1b45	Use "(void)" to mark pgstat_lock_entry(..., false) calls. This should silence Coverity's complaints about the result being sometimes ignored. I'm inclined to think that these routines are simply misdesigned, because sometimes it's okay to ignore the result and sometimes it isn't, and we have no way to enforce the latter. But for now I just added a comment.	2025-04-06 11:37:09 -04:00
Peter Eisentraut	a8025f5448	Relax ordering-related hardcoded btree requirements in planning There were several places in ordering-related planning where a requirement for btree was hardcoded but an amcanorder index could suffice. This fixes that. We just need to do the necessary mapping between strategy numbers and compare types and adjust some related APIs so that this works independent of btree strategy numbers. For instance, non-btree amcanorder indexes can now be used to support sorting and merge joins. Also, predtest.c works independent of btree strategy numbers now. To avoid performance regressions, some details on btree and other built-in index types are still hardcoded as shortcuts, but other index types now have access to the same features by providing the required flags and callbacks. Author: Mark Dilger <mark.dilger@enterprisedb.com> Co-authored-by: Peter Eisentraut <peter@eisentraut.org> Discussion: https://www.postgresql.org/message-id/flat/E72EAA49-354D-4C2E-8EB9-255197F55330@enterprisedb.com	2025-04-06 14:43:51 +02:00
Alexander Korotkov	3a1a7c5a70	Revert "Put enable_self_join_elimination into postgresql.conf.sample" This reverts commit `c2d329260c`. Reported-by: Daniel Gustafsson <daniel@yesql.se> Discussion: https://postgr.es/m/D292EB44-806E-439A-82A4-491A1BA59E7A%40yesql.se	2025-04-06 14:30:20 +03:00
Alexander Korotkov	c2d329260c	Put enable_self_join_elimination into postgresql.conf.sample `fc069a3a63` implements Self-Join Elimination (SJE) and provides a new GUC variable: enable_self_join_elimination. This commit adds enable_self_join_elimination to the postgresql.conf.sample, as it was forgotten in the original commit. Discussion: https://postgr.es/m/CAHewXN%3D%2Bghd6O6im46q7j2u6c3H6vkXtXmF%3D_v4CfGSnjje8PA%40mail.gmail.com Author: Tender Wang <tndrwang@gmail.com>	2025-04-06 13:24:16 +03:00
Tom Lane	691836405f	Fix parse_cte.c's failure to examine sub-WITHs in DML statements. makeDependencyGraphWalker thought that only SelectStmt nodes could contain a WithClause. Which was true in our original implementation of WITH, but astonishingly we missed updating this code when we added the ability to attach WITH to INSERT/UPDATE/DELETE (and later MERGE). Moreover, since it was coded to deliberately block recursion to a WithClause, even updating raw_expression_tree_walker didn't save it. The upshot of this was that we didn't see references to outer CTE names appearing within an inner WITH, and would neither complain about disallowed recursion nor account for such references when sorting CTEs into a usable order. The lack of complaints about this is perhaps not so surprising, because typical usage of WITH wouldn't hit either case. Still, it's pretty broken; failing to detect recursion here leads to assert failures or worse later on. Fix by factoring out the processing of sub-WITHs into a new function WalkInnerWith, and invoking that for all the statement types that can have WITH. Bug: #18878 Reported-by: Yu Liang <luy70@psu.edu> Author: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/18878-a26fa5ab6be2f2cf@postgresql.org Backpatch-through: 13	2025-04-05 15:01:48 -04:00
Tom Lane	e33f2335a9	Avoid double transformation of json_array()'s subquery. transformJsonArrayQueryConstructor() applied transformStmt() to the same subquery tree twice. While this causes no issue in many cases, there are some where it causes a coredump, thanks to the parser's habit of scribbling on its input. Fix by making a copy before the first transformation (compare `0f43083d1`). This is quite brute-force, but then so is the whole business of transforming the input twice. Per discussion in the bug thread, this implementation of json_array() parsing should be replaced completely. But that will take some work and will surely not be back-patchable, so for the moment let's take the easy way out. Oversight in `7081ac46a`. Back-patch to v16 where that came in. Bug: #18877 Reported-by: Yu Liang <luy70@psu.edu> Author: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/18877-c3c3ad75845833bb@postgresql.org Backpatch-through: 16	2025-04-05 12:13:35 -04:00
Tom Lane	43b8e6c4ab	Repair misbehavior with duplicate entries in FK SET column lists. Since v15 we've had an option to apply a foreign key constraint's ON DELETE SET DEFAULT or SET NULL action to just some of the referencing columns. There was not a check for duplicate entries in the list of columns-to-set, though. That caused a potential memory stomp in CreateConstraintEntry(), which incautiously assumed that the list of columns-to-set couldn't be longer than the number of key columns. Even after fixing that, the case doesn't work because you get an error like "multiple assignments to same column" from the SQL command that is generated to do the update. We could either raise an error for duplicate columns or silently suppress the dups, and after a bit of thought I chose to do the latter. This is motivated by the fact that duplicates in the FK column list are legal, so it's not real clear why duplicates in the columns-to-set list shouldn't be. Of course there's no need to actually set the column more than once. I left in the fix in CreateConstraintEntry() too, just because it didn't seem like such low-level code ought to be making assumptions about what it's handed. Bug: #18879 Reported-by: Yu Liang <luy70@psu.edu> Author: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/18879-259fc59d072bd4d7@postgresql.org Backpatch-through: 15	2025-04-04 20:11:48 -04:00
Tom Lane	0f43083d16	functions.c: copy trees from source_list before parse analysis etc. This is yet another bit of fallout from the fact that backend/parser (like other code) feels free to scribble on the parse tree it's handed. In this case that resulted in modifying the relatively-short-lived copy in the cached function's source_list. That would be fine since we only need each source_list tree once ... except that if the parser fails after making some changes, the function cache entry remains as-is and will still be there if the user tries to execute the function again. Then we have problems because we're feeding a non-pristine tree to the parser. The most expedient fix is a quick copyObject(). I considered other answers like somehow marking the cache entry invalid temporarily, but that would add complexity and I'm not sure it's worth it. In typical scenarios we'd only do this once per function query per session. Reported-by: Alexander Lakhin <exclusion@gmail.com> Author: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/6d442183-102c-498a-81d1-eeeb086cdc5a@gmail.com	2025-04-04 18:26:51 -04:00
Peter Geoghegan	b3f1a13f22	Avoid extra index searches through preprocessing. Transform low_compare and high_compare nbtree skip array inequalities (with opclasses that offer skip support) in such a way as to allow _bt_first to consistently apply later keys when it descends the tree. This can lower the number of index searches for multi-column scans that use a ">" key on one of the index's prefix columns (or use a "<" key, when scanning backwards) when it precedes some later lower-order key. For example, an index qual "WHERE a > 5 AND b = 2" will now be converted to "WHERE a >= 6 AND b = 2" by a new preprocessing step that takes place after low_compare and high_compare have been finalized. That way, the initial call to _bt_first can use "WHERE a >= 6 AND b = 2" to find an initial position, rather than just using "WHERE a > 5" -- "b = 2" can be applied during every _bt_first call. There's a decent chance that this will allow such a scan to avoid the extra search that might otherwise be needed to determine the lowest "a" value still satisfying "WHERE a > 5". The transformation process can only lower the total number of index pages read when the use of a more restrictive set of initial positioning keys in _bt_first actually allows the scan to land on some later leaf page directly, relative to the unoptimized case (or on an earlier leaf page directly, when scanning backwards). But the savings can really add up in cases where an affected skip array comes after some other array. For example, a scan indexqual "WHERE x IN (1, 2, 3) AND y > 5 AND z = 2" can save as many as 3 _bt_first calls by applying the new transformation to its "y" array (up to 1 extra search can be avoided per "x" element). Follow-up to commit `92fe23d9`, which added nbtree skip scan. Author: Peter Geoghegan <pg@bowt.ie> Reviewed-By: Matthias van de Meent <boekewurm+postgres@gmail.com> Discussion: https://postgr.es/m/CAH2-Wz=FJ78K3WsF3iWNxWnUCY9f=Jdg3QPxaXE=uYUbmuRz5Q@mail.gmail.com	2025-04-04 14:14:08 -04:00
Peter Geoghegan	21a152b37f	Improve nbtree skip scan primitive scan scheduling. Don't allow nbtree scans with skip arrays to end any primitive scan on its first leaf page without giving some consideration to how many times the scan's arrays advanced while changing at least one skip array (though continue not caring about the number of array advancements that only affected SAOP arrays, even during skip scans with SAOP arrays). Now when a scan performs more than 3 such array advancements in the course of reading a single leaf page, it is taken as a signal that the next page is unlikely to be skippable. We'll therefore continue the ongoing primitive index scan, at least until we can perform a recheck against the next page's finaltup. Testing has shown that this new heuristic occasionally makes all the difference with skip scans that were expected to rely on the "passed first page" heuristic added by commit `9a2e2a28`. Without it, there is a remaining risk that certain kinds of skip scans will never quite manage to clear the initial hurdle of performing a primitive scan that lasts beyond its first leaf page (or that such a skip scan will only clear that initial hurdle when it has already wasted noticeably-many cycles due to inefficient primitive scan scheduling). Follow-up to commits `92fe23d9` and `9a2e2a28`. Author: Peter Geoghegan <pg@bowt.ie> Reviewed-By: Matthias van de Meent <boekewurm+postgres@gmail.com> Discussion: https://postgr.es/m/CAH2-Wz=RVdG3zWytFWBsyW7fWH7zveFvTHed5JKEsuTT0RCO_A@mail.gmail.com	2025-04-04 13:58:05 -04:00
Peter Geoghegan	8a510275dd	Further optimize nbtree search scan key comparisons. Postgres 17 commit `e0b1ee17` added two complementary optimizations to nbtree: the "prechecked" and "firstmatch" optimizations. _bt_readpage was made to avoid needlessly evaluating keys that are guaranteed to be satisfied by applying page-level context. "prechecked" did this for keys required in the current scan direction, while "firstmatch" did it for keys required in the opposite-to-scan direction only. The "prechecked" design had a number of notable issues. It didn't account for the fact that an = array scan key's sk_argument field might need to advance at the point of the page precheck (it didn't check the precheck tuple against the key's array, only the key's sk_argument, which needlessly made it ineffective in cases involving stepping to a page having advanced the scan's arrays using a truncated high key). "prechecked" was also completely ineffective when only one scan key wasn't guaranteed to be satisfied by every tuple (it didn't recognize that it was still safe to avoid evaluating other, earlier keys). The "firstmatch" optimization had similar limitations. It could only be applied after _bt_readpage found its first matching tuple, regardless of why any earlier tuples failed to satisfy the scan's index quals. This allowed unsatisfied non-required scan keys to impede the optimization. Replace both optimizations with a new optimization, without any of these limitations: the "startikey" optimization. Affected _bt_readpage calls generate a page-level key offset ("startikey"), that their _bt_checkkeys calls can then start at. This is an offset to the first key that isn't known to be satisfied by every tuple on the page. Although this is independently useful work, its main goal is to avoid performance regressions with index scans that use skip arrays, but still never manage to skip over irrelevant leaf pages. We must avoid wasting CPU cycles on overly granular skip array maintenance in these cases. The new "startikey" optimization helps with this by selectively disabling array maintenance for the duration of a _bt_readpage call. This has no lasting consequences for the scan's array keys (they'll still reliably track the scan's progress through the index's key space whenever the scan is "between pages"). Skip scan adds skip arrays during preprocessing using simple, static rules, and decides how best to navigate/apply the scan's skip arrays dynamically, at runtime. The "startikey" optimization enables this approach. As a result of all this, the planner doesn't need to generate distinct, competing index paths (one path for skip scan, another for an equivalent traditional full index scan). The overall effect is to make scan runtime close to optimal, even when the planner works off an incorrect cardinality estimate. Scans will also perform well given a skipped column with data skew: individual groups of pages with many distinct values (in respect of a skipped column) can be read about as efficiently as before -- without the scan being forced to give up on skipping over other groups of pages that are provably irrelevant. Many scans that cannot possibly skip will still benefit from the use of skip arrays, since they'll allow the "startikey" optimization to be as effective as possible (by allowing preprocessing to mark all the scan's keys as required). A scan that uses a skip array on "a" for a qual "WHERE a BETWEEN 0 AND 1_000_000 AND b = 42" is often much faster now, even when every tuple read by the scan has its own distinct "a" value. However, there are still some remaining regressions, affecting certain trickier cases. Scans whose index quals have several range skip arrays, each on some high cardinality column, can still be slower than they were before the introduction of skip scan -- even with the new "startikey" optimization. There are also known regressions affecting very selective index scans that use a skip array. The underlying issue with such selective scans is that they never get as far as reading a second leaf page, and so will never get a chance to consider applying the "startikey" optimization. In principle, all regressions could be avoided by teaching preprocessing to not add skip arrays whenever they aren't expected to help, but it seems best to err on the side of robust performance. Follow-up to commit `92fe23d9`, which added nbtree skip scan. Author: Peter Geoghegan <pg@bowt.ie> Reviewed-By: Heikki Linnakangas <heikki.linnakangas@iki.fi> Reviewed-By: Masahiro Ikeda <ikedamsh@oss.nttdata.com> Reviewed-By: Matthias van de Meent <boekewurm+postgres@gmail.com> Discussion: https://postgr.es/m/CAH2-Wz=Y93jf5WjoOsN=xvqpMjRy-bxCE037bVFi-EasrpeUJA@mail.gmail.com Discussion: https://postgr.es/m/CAH2-WznWDK45JfNPNvDxh6RQy-TaCwULaM5u5ALMXbjLBMcugQ@mail.gmail.com	2025-04-04 12:27:52 -04:00
Peter Geoghegan	92fe23d93a	Add nbtree skip scan optimization. Teach nbtree multi-column index scans to opportunistically skip over irrelevant sections of the index given a query with no "=" conditions on one or more prefix index columns. When nbtree is passed input scan keys derived from a predicate "WHERE b = 5", new nbtree preprocessing steps output "WHERE a = ANY(<every possible 'a' value>) AND b = 5" scan keys. That is, preprocessing generates a "skip array" (and an output scan key) for the omitted prefix column "a", which makes it safe to mark the scan key on "b" as required to continue the scan. The scan is therefore able to repeatedly reposition itself by applying both the "a" and "b" keys. A skip array has "elements" that are generated procedurally and on demand, but otherwise works just like a regular ScalarArrayOp array. Preprocessing can freely add a skip array before or after any input ScalarArrayOp arrays. Index scans with a skip array decide when and where to reposition the scan using the same approach as any other scan with array keys. This design builds on the design for array advancement and primitive scan scheduling added to Postgres 17 by commit `5bf748b8`. Testing has shown that skip scans of an index with a low cardinality skipped prefix column can be multiple orders of magnitude faster than an equivalent full index scan (or sequential scan). In general, the cardinality of the scan's skipped column(s) limits the number of leaf pages that can be skipped over. The core B-Tree operator classes on most discrete types generate their array elements with the help of their own custom skip support routine. This infrastructure gives nbtree a way to generate the next required array element by incrementing (or decrementing) the current array value. It can reduce the number of index descents in cases where the next possible indexable value frequently turns out to be the next value stored in the index. Opclasses that lack a skip support routine fall back on having nbtree "increment" (or "decrement") a skip array's current element by setting the NEXT (or PRIOR) scan key flag, without directly changing the scan key's sk_argument. These sentinel values behave just like any other value from an array -- though they can never locate equal index tuples (they can only locate the next group of index tuples containing the next set of non-sentinel values that the scan's arrays need to advance to). A skip array's range is constrained by "contradictory" inequality keys. For example, a skip array on "x" will only generate the values 1 and 2 given a qual such as "WHERE x BETWEEN 1 AND 2 AND y = 66". Such a skip array qual usually has near-identical performance characteristics to a comparable SAOP qual "WHERE x = ANY('{1, 2}') AND y = 66". However, improved performance isn't guaranteed. Much depends on physical index characteristics. B-Tree preprocessing is optimistic about skipping working out: it applies static, generic rules when determining where to generate skip arrays, which assumes that the runtime overhead of maintaining skip arrays will pay for itself -- or lead to only a modest performance loss. As things stand, these assumptions are much too optimistic: skip array maintenance will lead to unacceptable regressions with unsympathetic queries (queries whose scan can't skip over many irrelevant leaf pages). An upcoming commit will address the problems in this area by enhancing _bt_readpage's approach to saving cycles on scan key evaluation, making it work in a way that directly considers the needs of = array keys (particularly = skip array keys). Author: Peter Geoghegan <pg@bowt.ie> Reviewed-By: Masahiro Ikeda <masahiro.ikeda@nttdata.com> Reviewed-By: Heikki Linnakangas <heikki.linnakangas@iki.fi> Reviewed-By: Matthias van de Meent <boekewurm+postgres@gmail.com> Reviewed-By: Tomas Vondra <tomas@vondra.me> Reviewed-By: Aleksander Alekseev <aleksander@timescale.com> Reviewed-By: Alena Rybakina <a.rybakina@postgrespro.ru> Discussion: https://postgr.es/m/CAH2-Wzmn1YsLzOGgjAQZdn1STSG_y8qP__vggTaPAYXJP+G4bw@mail.gmail.com	2025-04-04 12:27:04 -04:00
Nathan Bossart	e1a8b1ad58	Re-pgindent pg_largeobject.c after commit `0d6c477664`.	2025-04-04 09:38:22 -05:00
Alexander Korotkov	c0962a113d	Convert 'x IN (VALUES ...)' to 'x = ANY ...' then appropriate This commit implements the automatic conversion of 'x IN (VALUES ...)' into ScalarArrayOpExpr. That simplifies the query tree, eliminating the appearance of an unnecessary join. Since VALUES describes a relational table, and the value of such a list is a table row, the optimizer will likely face an underestimation problem due to the inability to estimate cardinality through MCV statistics. The cardinality evaluation mechanism can work with the array inclusion check operation. If the array is small enough (< 100 elements), it will perform a statistical evaluation element by element. We perform the transformation in the convert_ANY_sublink_to_join() if VALUES RTE is proper and the transformation is convertible. The conversion is only possible for operations on scalar values, not rows. Also, we currently support the transformation only when it ends up with a constant array. Otherwise, the evaluation of non-hashed SAOP might be slower than the corresponding Hash Join with VALUES. Discussion: https://postgr.es/m/0184212d-1248-4f1f-a42d-f5cb1c1976d2%40tantorlabs.com Author: Alena Rybakina <a.rybakina@postgrespro.ru> Author: Andrei Lepikhov <lepihov@gmail.com> Reviewed-by: Ivan Kush <ivan.kush@tantorlabs.com> Reviewed-by: Alexander Korotkov <aekorotkov@gmail.com>	2025-04-04 16:01:50 +03:00
Alexander Korotkov	d48d2e2dc8	Extract make_SAOP_expr() function from match_orclause_to_indexcol() This commit extracts the code to generate ScalarArrayOpExpr on top of the list of expressions from match_orclause_to_indexcol() into a separate function make_SAOP_expr(). This function was extracted to be used in optimization for conversion of 'x IN (VALUES ...)' to 'x = ANY ...'. make_SAOP_expr() is placed in clauses.c file as only two additional headers were needed there compared with other places. Discussion: https://postgr.es/m/0184212d-1248-4f1f-a42d-f5cb1c1976d2%40tantorlabs.com Author: Alena Rybakina <a.rybakina@postgrespro.ru> Author: Andrei Lepikhov <lepihov@gmail.com> Reviewed-by: Ivan Kush <ivan.kush@tantorlabs.com> Reviewed-by: Alexander Korotkov <aekorotkov@gmail.com>	2025-04-04 16:01:28 +03:00

1 2 3 4 5 ...

26877 Commits