postgres

mirror of https://github.com/postgres/postgres.git synced 2025-11-22 12:22:45 +03:00

Author	SHA1	Message	Date
Andres Freund	96da9050a5	aio: Be more paranoid about interrupts As reported by Noah, it's possible, although practically very unlikely, that interrupts could be processed in between pgaio_io_reopen() and pgaio_io_perform_synchronously(). Prevent that by explicitly holding interrupts. It also seems good to add an assertion to pgaio_io_before_prep() to ensure that interrupts are held, as otherwise FDs referenced by the IO could be closed during interrupt processing. All code in the aio series currently runs the code with interrupts held, but it seems better to be paranoid. Reviewed-by: Noah Misch <noah@leadboat.com> Reported-by: Noah Misch <noah@leadboat.com> Discussion: https://postgr.es/m/20250324002939.5c.nmisch@google.com	2025-03-26 16:06:54 -04:00
Tomas Vondra	818245506c	Keep the decompressed filter in brin_bloom_union The brin_bloom_union() function combines two BRIN summaries, by merging one filter into the other. With bloom, we have to decompress the filters first, but the function failed to update the summary to store the merged filter. As a consequence, the index may be missing some of the data, and return false negatives. This issue exists since BRIN bloom indexes were introduced in Postgres 14, but at that point the union function was called only when two sessions happened to summarize a range concurrently, which is rare. It got much easier to hit in 17, as parallel builds use the union function to merge summaries built by workers. Fixed by storing a pointer to the decompressed filter, and freeing the original one. Free the second filter too, if it was decompressed. The freeing is not strictly necessary, because the union is called in short-lived contexts, but it's tidy. Backpatch to 14, where BRIN bloom indexes were introduced. Reported by Arseniy Mukhin, investigation and fix by me. Reported-by: Arseniy Mukhin Discussion: https://postgr.es/m/18855-1cf1c8bcc22150e6%40postgresql.org Backpatch-through: 14	2025-03-26 17:01:41 +01:00
Tom Lane	55527368bd	Use PG_MODULE_MAGIC_EXT in our installable shared libraries. It seems potentially useful to label our shared libraries with version information, now that a facility exists for retrieving that. This patch labels them with the PG_VERSION string. There was some discussion about using semantic versioning conventions, but that doesn't seem terribly helpful for modules with no SQL-level presence; and for those that do have SQL objects, we typically expect them to support multiple revisions of the SQL definitions, so it'd still not be very helpful. I did not label any of src/test/modules/. It seems unnecessary since we don't install those, and besides there ought to be someplace that still provides test coverage for the original PG_MODULE_MAGIC macro. Author: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/dd4d1b59-d0fe-49d5-b28f-1e463b68fa32@gmail.com	2025-03-26 11:11:02 -04:00
Tom Lane	9324c8c580	Introduce PG_MODULE_MAGIC_EXT macro. This macro allows dynamically loaded shared libraries (modules) to provide a wired-in module name and version, and possibly other compile-time-constant fields in future. This information can be retrieved with the new pg_get_loaded_modules() function. This feature is expected to be particularly useful for modules that do not have any exposed SQL functionality and thus are not associated with a SQL-level extension object. But even for modules that do belong to extensions, being able to verify the actual code version can be useful. Author: Andrei Lepikhov <lepihov@gmail.com> Reviewed-by: Yurii Rashkovskii <yrashk@omnigres.com> Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/dd4d1b59-d0fe-49d5-b28f-1e463b68fa32@gmail.com	2025-03-26 11:06:12 -04:00
Dean Rasheed	a3b6dfd410	Add support for gamma() and lgamma() functions. These are useful general-purpose math functions which are included in POSIX and C99, and are commonly included in other math libraries, so expose them as SQL-callable functions. Author: Dean Rasheed <dean.a.rasheed@gmail.com> Reviewed-by: Stepan Neretin <sncfmgg@gmail.com> Reviewed-by: Peter Eisentraut <peter@eisentraut.org> Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Reviewed-by: Dmitry Koval <d.koval@postgrespro.ru> Reviewed-by: Alexandra Wang <alexandra.wang.oss@gmail.com> Discussion: https://postgr.es/m/CAEZATCXpGyfjXCirFk9au+FvM0y2Ah+2-0WSJx7MO368ysNUPA@mail.gmail.com	2025-03-26 09:35:53 +00:00
Michael Paquier	787514b30b	Use relation name instead of OID in query jumbling for RangeTblEntry custom_query_jumble (introduced in `5ac462e2b7` as a node field attribute) is now assigned to the expanded reference name "eref" of RangeTblEntry, adding in the query jumble computation the non-qualified aliased relation name, without the list of column names. The relation OID is removed from the query jumbling. The effects of this change can be seen in the tests added by `3430215fe3`, where pg_stat_statements (PGSS) entries are now grouped using the relation name, ignoring the relation search_path may point at. For example, these two relations are different, but are now grouped in a single PGSS entry as they are assigned the same query ID: CREATE TABLE foo1.tab (a int); CREATE TABLE foo2.tab (b int); SET search_path = 'foo1'; SELECT count() FROM tab; SET search_path = 'foo2'; SELECT count() FROM tab; SELECT count() FROM foo1.tab; SELECT count() FROM foo2.tab; SELECT query, calls FROM pg_stat_statements WHERE query ~ 'FROM tab'; query \| calls --------------------------+------- SELECT count(*) FROM tab \| 4 (1 row) It is still possible to use an alias in the FROM clause to split these. This behavior is useful for relations re-created with the same name, where queries based on such relations would be grouped in the same PGSS entry. For permanent schemas, it should not really matter in practice. The main benefit is for workloads that use a lot of temporary relations, which are usually re-created with the same name continuously. These can be a heavy source of bloat in PGSS depending on the workload. Such entries can now be grouped together, improving the user experience. The original idea from Christoph Berg used catalog lookups to find temporary relations, something that the query jumble has never done, and it could cause some performance regressions. The idea to use RangeTblEntry.eref and the relation name, applying the same rules for all relations, temporary and not temporary, has been proposed by Tom Lane. The documentation additions have been suggested by Sami Imseih. Author: Michael Paquier <michael@paquier.xyz> Co-authored-by: Sami Imseih <samimseih@gmail.com> Reviewed-by: Christoph Berg <myon@debian.org> Reviewed-by: Lukas Fittl <lukas@fittl.com> Reviewed-by: Sami Imseih <samimseih@gmail.com> Discussion: https://postgr.es/m/Z9iWXKGwkm8RAC93@msg.df7cb.de	2025-03-26 15:21:05 +09:00
Michael Paquier	27ee6ede6b	Fix two issues with custom_query_jumble in gen_node_support.pl A node field marked with custom_query_jumble and query_jumble_ignore would generate some code of a custom routine. The script is changed so as custom_query_jumble behaves like the other options in this case, query_jumble_ignore taking priority, with no code generated. A comment related to the code generated for node types was misplaced. Thinkos introduced in `5ac462e2b7`. Reported-by: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/1324036.1742945060@sss.pgh.pa.us	2025-03-26 09:06:36 +09:00
Jeff Davis	650ab8aaf1	Stats: use schemaname/relname instead of regclass. For import and export, use schemaname/relname rather than regclass. This is more natural during export, fits with the other arguments better, and it gives better control over error handling in case we need to downgrade more errors to warnings. Also, use text for the argument types for schemaname, relname, and attname so that casts to "name" are not required. Author: Corey Huinker <corey.huinker@gmail.com> Discussion: https://postgr.es/m/CADkLM=ceOSsx_=oe73QQ-BxUFR2Cwqum7-UP_fPe22DBY0NerA@mail.gmail.com	2025-03-25 11:16:06 -07:00
Peter Eisentraut	ef7a5af77d	refactor: Pass relation OID instead of Relation to createForeignKeyCheckTriggers() Currently, createForeignKeyCheckTriggers() takes a Relation type as its first argument, but it doesn't use that argument directly. Instead, it fetches the relation OID by calling RelationGetRelid(). Therefore, it would be more consistent with other functions (e.g., createForeignKeyCheckTriggers()) to pass the relation OID directly instead of the whole Relation. Author: Amul Sul <amul.sul@enterprisedb.com> Discussion: https://www.postgresql.org/message-id/flat/CAAJ_b962c5AcYW9KUt_R_ER5qs3fUGbe4az-SP-vuwPS-w-AGA@mail.gmail.com	2025-03-25 17:04:12 +01:00
Peter Eisentraut	639238b978	refactor: Split ATExecAlterConstraintInternal() Split ATExecAlterConstraintInternal() into two functions: ATExecAlterConstrDeferrability() and ATExecAlterConstrInheritability(). This simplifies the code and avoids unnecessary confusion caused by recursive code, which isn't needed for ATExecAlterConstrInheritability(). (This also takes over the changes in commit `64224a834c`, as the new AlterConstrDeferrabilityRecurse() is essentially the old ATExecAlterChildConstr().) Author: Amul Sul <amul.sul@enterprisedb.com> Discussion: https://www.postgresql.org/message-id/flat/CAAJ_b962c5AcYW9KUt_R_ER5qs3fUGbe4az-SP-vuwPS-w-AGA@mail.gmail.com	2025-03-25 16:18:00 +01:00
Peter Eisentraut	a3280e2a49	refactor: Move some code that updates pg_constraint to a separate function This extracts common/duplicate code for different ALTER CONSTRAINT variants into a common function. We plan to add more variants that would use the same code. Author: Amul Sul <amul.sul@enterprisedb.com> Discussion: https://www.postgresql.org/message-id/flat/CAAJ_b962c5AcYW9KUt_R_ER5qs3fUGbe4az-SP-vuwPS-w-AGA@mail.gmail.com	2025-03-25 14:37:22 +01:00
Peter Eisentraut	f4b2a62ae3	Small fixes for Add ALTER TABLE ... ALTER CONSTRAINT ... SET [NO] INHERIT Small fixes for commit `f4e53e10b6`: Add missing calls to InvokeObjectPostAlterHook() and also CacheInvalidateRelcache(). The former change could have a user-visible effect. The latter omission might have caused other bugs, but it is not clear whether one actually existed. With these changes, the code is now more consistent with similar ALTER CONSTRAINT variants, especially the ones that set the deferrability. Reviewed-by: Álvaro Herrera <alvherre@alvh.no-ip.org> Discussion: https://postgr.es/m/CAF1DzPVfOW6Kk=7SSh7LbneQDJWh=PbJrEC_Wkzc24tHOyQWGg@mail.gmail.com	2025-03-25 13:40:24 +01:00
Peter Eisentraut	be1cc9aaf5	Generalize index support in network support function The network (inet) support functions currently only supported a hardcoded btree operator family. With the generalized compare type facility, we can generalize this to support any operator family from any index type that supports the required operators. Author: Mark Dilger <mark.dilger@enterprisedb.com> Co-authored-by: Peter Eisentraut <peter@eisentraut.org> Discussion: https://www.postgresql.org/message-id/flat/E72EAA49-354D-4C2E-8EB9-255197F55330@enterprisedb.com	2025-03-25 07:11:56 +01:00
Michael Paquier	5ac462e2b7	Add support for custom_query_jumble as a node field attribute This option gives the possibility for query jumble to define a custom routine for the field of a Node, extending support for custom_query_jumble as a node field attribute. When dealing with complex node structures, this can be simpler than having to enforce a custom function across a full node. Custom functions need to be defined in queryjumblefuncs.c, named as _jumble${node}_${field}(), and use in input the JumbleState, the node and its field. The field is not really required if we have the Node, but it makes custom implementations somewhat easier to think about. The code generated by gen_node_support.pl uses a macro called JUMBLE_CUSTOM(), hiding the internals of the logic inside queryjumblefuncs.c. This will be used by an upcoming patch manipulating adding a custom routine into a field of RangeTblEntry, but this facility can become useful in more cases. Reviewed-by: Christoph Berg <myon@debian.org> Discussion: https://postgr.es/m/Z9y43-dRvb4EtxQ0@paquier.xyz	2025-03-25 14:18:00 +09:00
Jeff Davis	626df47ad9	Remove 'additional' pointer from TupleHashEntryData. Reduces memory required for hash aggregation by avoiding an allocation and a pointer in the TupleHashEntryData structure. That structure is used for all buckets, whether occupied or not, so the savings is substantial. Discussion: https://postgr.es/m/AApHDvpN4v3t_sdz4dvrv1Fx_ZPw=twSnxuTEytRYP7LFz5K9A@mail.gmail.com Reviewed-by: David Rowley <dgrowleyml@gmail.com>	2025-03-24 22:06:02 -07:00
Jeff Davis	a0942f441e	Add ExecCopySlotMinimalTupleExtra(). Allows an "extra" argument that allocates extra memory at the end of the MinimalTuple. This is important for callers that need to store additional data, but do not want to perform an additional allocation. Suggested-by: David Rowley <dgrowleyml@gmail.com> Discussion: https://postgr.es/m/CAApHDvppeqw2pNM-+ahBOJwq2QmC0hOAGsmCpC89QVmEoOvsdg@mail.gmail.com	2025-03-24 22:05:53 -07:00
Jeff Davis	4d143509cb	Create accessor functions for TupleHashEntry. Refactor for upcoming optimizations. Reviewed-by: David Rowley <dgrowleyml@gmail.com> Discussion: https://postgr.es/m/1cc3b400a0e8eead18ff967436fa9e42c0c14cfb.camel@j-davis.com	2025-03-24 22:05:41 -07:00
Jeff Davis	cc721c459d	HashAgg: use Bump allocator for hash TupleHashTable entries. The entries aren't freed until the entire hash table is destroyed, so use the Bump allocator to improve allocation speed, avoid wasting space on the chunk header, and avoid wasting space due to the power-of-two allocations. Discussion: https://postgr.es/m/CAApHDvqv1aNB4cM36FzRwivXrEvBO_LsG_eQ3nqDXTjECaatOQ@mail.gmail.com Reviewed-by: David Rowley	2025-03-24 22:05:33 -07:00
Amit Kapila	b87ced747d	Fix an oversight in `3abe9dc188`. Forgot to update the comment atop one of the functions. Author: Hayato Kuroda <kuroda.hayato@fujitsu.com> Discussion: https://postgr.es/m/OSCPR01MB1496623BE1125B44614494E7AF5A72@OSCPR01MB14966.jpnprd01.prod.outlook.com	2025-03-25 09:26:23 +05:30
Andres Freund	adb5f85fa5	Redefine max_files_per_process to control additionally opened files Until now max_files_per_process=N limited each backend to open N files in total (minus a safety factor), even if there were already more files opened in postmaster and inherited by backends. Change max_files_per_process to control how many additional files each process is allowed to open. The main motivation for this is the patch to add io_method=io_uring, which needs to open one file for each backend. Without this patch, even if RLIMIT_NOFILE is high enough, postmaster will fail in set_max_safe_fds() if started with a high max_connections. The cause of the failure is that, until now, set_max_safe_fds() subtracted the already open files from max_files_per_process. Reviewed-by: Noah Misch <noah@leadboat.com> Discussion: https://postgr.es/m/w6uiicyou7hzq47mbyejubtcyb2rngkkf45fk4q7inue5kfbeo@bbfad3qyubvs Discussion: https://postgr.es/m/CAGECzQQh6VSy3KG4pN1d=h9J=D1rStFCMR+t7yh_Kwj-g87aLQ@mail.gmail.com	2025-03-24 18:20:18 -04:00
Melanie Plageman	aea916fe55	Fix bitmapheapscan incorrect recheck of NULL tuples The bitmap heap scan skip fetch optimization skips fetching the heap block when a page is set all-visible in the visibility map and no columns from the table are needed to satisfy the query. `2b73a8cd33` and `c3953226a0` changed the control flow of bitmap heap scan to use the read stream API. The read stream API returns buffers containing blocks to the user. To make this work with the skip fetch optimization, we keep a count of the empty tuples we need to emit for all the blocks skipped and only emit the empty tuples after processing the next block fetched from the heap or at the end of the scan. It's incorrect to recheck NULL tuples, so we must set `recheck` to false before yielding control back to BitmapHeapNext(). This was done before emitting any remaining empty tuples at the end of the scan but not for empty tuples emitted during the scan. This meant that if a page fetched from the heap did require recheck and set `recheck` to true and then we emitted empty tuples for subsequent blocks, we would get wrong results. Fix this by always setting `recheck` to false before emitting empty tuples. Reported-by: Alexander Lakhin <exclusion@gmail.com> Tested-by: Andres Freund <andres@anarazel.de> Discussion: https://postgr.es/m/496f7acd-881c-4df3-9bd3-8f8534dfec26%40gmail.com	2025-03-24 16:40:59 -04:00
Amit Kapila	73eba5004a	Detect and Log multiple_unique_conflicts type conflict. Introduce a new conflict type, multiple_unique_conflicts, to handle cases where an incoming row during logical replication violates multiple UNIQUE constraints. Previously, the apply worker detected and reported only the first encountered key conflict (insert_exists/update_exists), causing repeated failures as each constraint violation needs to be handled one by one making the process slow and error-prone. With this patch, the apply worker checks all unique constraints upfront once the first key conflict is detected and reports multiple_unique_conflicts if multiple violations exist. This allows users to resolve all conflicts at once by deleting all conflicting tuples rather than dealing with them individually or skipping the transaction. In the future, this will also allow us to specify different resolution handlers for such a conflict type. Add the stats for this conflict type in pg_stat_subscription_stats. Author: Nisha Moond <nisha.moond412@gmail.com> Author: Zhijie Hou <houzj.fnst@fujitsu.com> Reviewed-by: Amit Kapila <amit.kapila16@gmail.com> Reviewed-by: Peter Smith <smithpb2250@gmail.com> Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com> Discussion: https://postgr.es/m/CABdArM7FW-_dnthGkg2s0fy1HhUB8C3ELA0gZX1kkbs1ZZoV3Q@mail.gmail.com	2025-03-24 12:30:44 +05:30
Michael Paquier	2a0cd38da5	Allow plugins to set a 64-bit plan identifier in PlannedStmt This field can be optionally set in a PlannedStmt through the planner hook, giving extensions the possibility to assign an identifier related to a computed plan. The backend is changed to report it in the backend entry of a process running (including the extended query protocol), with semantics and APIs to set or get it similar to what is used for the existing query ID (introduced in the backend via `4f0b0966c8`). The plan ID is reset at the same timing as the query ID. Currently, this information is not added to the system view pg_stat_activity; extensions can access it through PgBackendStatus. Some patches have been proposed to provide some features in the planning area, where a plan identifier is used as a key to know the plan involved (for statistics, plan storage and manipulations, etc.), and the point of this commit is to provide an anchor in the backend that extensions can rely on for future work. The reset of the plan identifier is controlled by core and follows the same pattern as the query identifier added in `4f0b0966c8`. The contents of this commit are extracted from a larger set proposed originally by Lukas Fittl, that Sami Imseih has proposed as an independent change, with a few tweaks sprinkled by me. Author: Lukas Fittl <lukas@fittl.com> Author: Sami Imseih <samimseih@gmail.com> Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com> Reviewed-by: Michael Paquier <michael@paquier.xyz> Discussion: https://postgr.es/m/CAP53Pkyow59ajFMHGpmb1BK9WHDypaWtUsS_5DoYUEfsa_Hktg@mail.gmail.com Discussion: https://postgr.es/m/CAA5RZ0vyWd4r35uUBUmhngv8XqeiJUkJDDKkLf5LCoWxv-t_pw@mail.gmail.com	2025-03-24 13:23:42 +09:00
Heikki Linnakangas	2817525f0d	Fix rare assertion failure in standby, if primary is restarted During hot standby, ExpireAllKnownAssignedTransactionIds() and ExpireOldKnownAssignedTransactionIds() functions mark old transactions as no-longer running, but they failed to update xactCompletionCount and latestCompletedXid. AFAICS it would not lead to incorrect query results, because those functions effectively turn in-progress transactions into aborted transactions and an MVCC snapshot considers both as "not visible". But it could surprise GetSnapshotDataReuse() and trigger the "TransactionIdPrecedesOrEquals(TransactionXmin, RecentXmin))" assertion in it, if the apparent xmin in a backend would move backwards. We saw this happen when GetCatalogSnapshot() would reuse an older catalog snapshot, when GetTransactionSnapshot() had already advanced TransactionXmin. The bug goes back all the way to commit `623a9ba79b` in v14 that introduced the snapshot reuse mechanism, but it started to happen more frequently with commit `952365cded` which removed a GetTransactionSnapshot() call from backend startup. That made it more likely for ExpireOldKnownAssignedTransactionIds() to be called between GetCatalogSnapshot() and the first GetTransactionSnapshot() in a backend. Andres Freund first spotted this assertion failure on buildfarm member 'skink'. Reproduction and analysis by Tomas Vondra. Backpatch-through: 14 Discussion: https://www.postgresql.org/message-id/oey246mcw43cy4qw2hqjmurbd62lfdpcuxyqiu7botx3typpax%40h7o7mfg5zmdj	2025-03-23 20:41:16 +02:00
Andres Freund	ca3067cc57	aio: Change prefix of PgAioResultStatus values to PGAIO_RS_ The previous prefix wasn't consistent with the naming of other AIO related enum values. It seems best to rename it before the users are introduced. Reported-by: Melanie Plageman <melanieplageman@gmail.com> Discussion: https://postgr.es/m/CAAKRu_Yb+JzQpNsgUxCB0gBi+sE-mi_HmcJF6ALnmO4W+UgwpA@mail.gmail.com	2025-03-22 17:30:44 -04:00
Peter Geoghegan	9a2e2a285a	Improve nbtree array primitive scan scheduling. Add a new scheduling heuristic: don't end the ongoing primitive index scan immediately (at the point where _bt_advance_array_keys notices that the next set of matching tuples must be on a later page) if the primscan already managed to step right/left from its first leaf page. Schedule a recheck against the next sibling leaf page's finaltup instead. The new heuristic tends to avoid scenarios where the top-level scan repeatedly starts and ends primitive index scans that each read only one leaf page from a group of neighboring leaf pages. Affected top-level scans will now tend to step forward (or backward) through the index instead, without wasting cycles on descending the index anew. The recheck mechanism isn't exactly new. But up until now it has only been used to deal with edge cases involving high key finaltups with one or more truncated -inf attributes that _bt_advance_array_keys deemed "provisionally satisfied" (satisfied for the purposes of allowing the scan to step onto the next page, subject to recheck once on that page). The mechanism was added by commit `5bf748b8`, which invented the general concept of primitive scan scheduling. It was later enhanced by commit `79fa7b3b`, which taught it about cases involving -inf attributes that satisfy inequality scan keys required in the opposite-to-scan direction only (arguably, they should have been covered by the earliest version). Now the recheck mechanism can be applied based on scan-level heuristics, which have nothing to do with truncated high keys. Now rechecks might be performed by _bt_readpage when scanning in _either_ scan direction. The theory behind the new heuristic is that any primitive scan that makes it past its first leaf page is one that is already likely to have arrays whose key values match index tuples that are closely clustered together in the index. The rules that determine whether we ever get past the first page are still conservative (that'll still only happen when pstate.finaltup strongly suggests that it's the right thing to do). Surviving past the first leaf page is a strong signal in itself. Preparation for an upcoming patch that will add skip scan optimizations to nbtree. That'll work by adding skip arrays, which behave similarly to SAOP arrays, but generate their elements procedurally and on-demand. Note that this commit isn't specifically concerned with skip arrays; the scheduling logic doesn't (and won't) condition anything on whether the scan uses skip arrays, SAOP arrays, or some combination of the two (which seems like a good general principle for _bt_advance_array_keys). While the problems that this commit ameliorates are more likely with skip arrays (at least in practice), SAOP arrays (or those with very dense, contiguous array elements) are also affected. Author: Peter Geoghegan <pg@bowt.ie> Reviewed-By: Matthias van de Meent <boekewurm+postgres@gmail.com> Discussion: https://postgr.es/m/CAH2-Wzkz0wPe6+02kr+hC+JJNKfGtjGTzpG3CFVTQmKwWNrXNw@mail.gmail.com	2025-03-22 13:02:18 -04:00
Melanie Plageman	e215166c9c	Use streaming read I/O in SP-GiST vacuuming Like `69273b818b` did for GiST vacuuming, make SP-GiST vacuum use the read stream API for vacuuming physically contiguous index pages. Concurrent insertions may cause SP-GiST index tuples to be redirected. While vacuuming, these are added to a pending list which is later processed to ensure no dead tuples are left behind. Pages containing such tuples are still read by directly calling ReadBuffer() and do not use the read stream API. Author: Andrey M. Borodin <x4mmm@yandex-team.ru> Reviewed-by: Melanie Plageman <melanieplageman@gmail.com> Discussion: https://postgr.es/m/37432403-8657-403B-9CDF-5A642BECDD81%40yandex-team.ru	2025-03-21 17:51:22 -04:00
Thomas Munro	e51ca405ed	Fix ps display for IO workers. This code must have missed a memo about the backend type description being supplied automatically these days, and was duplicating that information. Before: "io worker io worker: N" After: "io worker N"	2025-03-22 10:13:23 +13:00
Masahiko Sawada	04ff636cbc	Add GUC option to control maximum active replication origins. This commit introduces a new GUC option max_active_replication_origins to control the maximum number of active replication origins. Previously, this was controlled by 'max_replication_slots'. Having a separate GUC option provides better flexibility for setting up subscribers, as they may not require replication slots (for cascading replication) but always require replication origins. Author: Euler Taveira <euler@eulerto.com> Reviewed-by: Amit Kapila <amit.kapila16@gmail.com> Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com> Reviewed-by: Peter Eisentraut <peter@eisentraut.org> Reviewed-by: vignesh C <vignesh21@gmail.com> Discussion: https://postgr.es/m/b81db436-8262-4575-b7c4-bc0c1551000b@app.fastmail.com	2025-03-21 12:20:15 -07:00
Tom Lane	cd72c1b76e	Label the contents of pg__d.h files a little better. Make genbki.pl emit some boilerplate comments identifying the sections of the pg__d.h files that it generates. This is in hopes of making them slightly more readable, in case people look at those files and not the pg_.h/pg_.dat originals. Discussion: https://postgr.es/m/1134562.1742507765@sss.pgh.pa.us	2025-03-21 15:09:46 -04:00
Melanie Plageman	69273b818b	Use streaming read I/O in GiST vacuuming Like `c5c239e26e` did for btree vacuuming, make GiST vacuum use the read stream API for sequentially processed pages. Because it is possible for concurrent insertions to relocate unprocessed index entries to already vacuumed pages, GiST vacuum must backtrack and reprocess those pages. These pages are still read with explicit ReadBuffer() calls. Author: Andrey M. Borodin <x4mmm@yandex-team.ru> Reviewed-by: Melanie Plageman <melanieplageman@gmail.com> Discussion: https://postgr.es/m/EFEBED92-18D1-4C0F-A4EB-CD47072EF071%40yandex-team.ru	2025-03-21 14:06:45 -04:00
Melanie Plageman	3f850c3fc5	Assorted trivial cleanup of `c5c239e26e` `c5c239e26e` made btree vacuum use the read stream API. Though it used functions declared in read_stream.h, it relied on transitively including it. Explicitly include that file. Also remove an extraneous newline and decrease the scope of one of the local variables in btvacuumscan().	2025-03-21 14:06:40 -04:00
Melanie Plageman	c5c239e26e	Use streaming read I/O in btree vacuuming Btree vacuum processes all index pages in physical order. Now it uses the read stream API to get the next buffer instead of explicitly invoking ReadBuffer(). It is possible for concurrent insertions to cause page splits during index vacuuming. This can lead to index entries that have yet to be vacuumed being moved to pages that have already been vacuumed. Btree vacuum code handles this by backtracking to reprocess those pages. So, while sequentially encountered pages are now read through the read stream API, backtracked pages are still read with explicit ReadBuffer() calls. Author: Andrey Borodin <x4mmm@yandex-team.ru> Reviewed-by: Melanie Plageman <melanieplageman@gmail.com> Reviewed-by: Junwang Zhao <zhjwpku@gmail.com> Reviewed-by: Kirill Reshke <reshkekirill@gmail.com> Discussion: https://postgr.es/m/flat/CAAKRu_bW1UOyup%3DjdFw%2BkOF9bCaAm%3D9UpiyZtbPMn8n_vnP%2Big%40mail.gmail.com#3b3a84132fc683b3ee5b40bc4c2ea2a5	2025-03-21 09:09:39 -04:00
Álvaro Herrera	1d617a2028	Change one loop in ATRewriteTable to use 1-based attnums All TupleDescAttr() calls in tablecmds.c that aren't in loops across all attributes use AttrNumber-style indexes (1-based); there was only one place in ATRewriteTable that was stashing 0-based indexes in a list for later processing. Switch that to use attnums for consistency. Author: jian he <jian.universality@gmail.com> Discussion: https://postgr.es/m/CACJufxEoYA5ScUr2=CmA1xcpaS_1ixneDbEkVU77X1ctGxY2mA@mail.gmail.com	2025-03-21 10:55:06 +01:00
Thomas Munro	ce1a75c4fe	Support buffer forwarding in StartReadBuffers(). StartReadBuffers() reports a short read when it finds a cached block that ends a range needing I/O by updating the caller's nblocks. It doesn't want to have to unpin the trailing hit that it knows the caller wants, so the v17 version used sleight of hand in the name of simplicity: it included it in nblocks as if it were part of the I/O, but internally tracked the shorter real I/O size in io_buffers_len (now removed). This API change "forwards" the delimiting buffer to the next call. It's still pinned, and still stored in the caller's array, but *nblocks no longer includes stray buffers that are not really part of the operation. The expectation is that the caller still wants the rest of the blocks and will call again starting from that point, and now it can pass the already pinned buffer back in (or choose not to and release it). The change is needed for the coming asynchronous I/O version's larger version of the problem: by definition it must move BM_IO_IN_PROGRESS negotiation from WaitReadBuffers() to StartReadBuffers(), but it might already have many buffers pinned before it discovers a need to split an I/O. (The current synchronous I/O version hides that detail from callers by looping over smaller reads if required to make all covered buffers valid in WaitReadBuffers(), so it looks like one operation but it might occasionally be several under the covers.) Aside from avoiding unnecessary pin traffic, this will also be important for later work on out-of-order streams: you can't prioritize data that is already available right now if that fact is hidden from you. The new API is natural for read_stream.c (see `ed0b87ca`). After a short read it leaves forwarded buffers where they fell in its circular queue for the continuing call to pick up. Single-block StartReadBuffer() and traditional ReadBuffer() share code but are not affected by the change. They don't do multi-block I/O. Reviewed-by: Andres Freund <andres@anarazel.de> (earlier versions) Discussion: https://postgr.es/m/CA%2BhUKGK_%3D4CVmMHvsHjOVrK6t4F%3DLBpFzsrr3R%2BaJYN8kcTfWg%40mail.gmail.com	2025-03-21 20:43:59 +13:00
Thomas Munro	ed0b87caac	Support buffer forwarding in read_stream.c. In preparation for a follow-up change to the buffer manager, teach read_stream.c to manage buffers "forwarded" from one StartReadBuffers() call to the next after a short read. This involves a small amount of extra book-keeping, and opens the way for lower levels to split I/O operations without having to drop pins, as required for efficient handling of various edge cases. Concretely, the "buffers" argument will change from an out parameter to an in/out parameter. Buffer queue elements must be initialized on first use and cleared after they're consumed, but forwarded buffers are left where they fall ahead of the current pending read in the queue, ready for use by the operation that continues where a short read left off. The stream also needs to count them for pin limit management and release them on reset/early end. Tested-by: Andres Freund <andres@anarazel.de> (earlier versions) Discussion: https://postgr.es/m/CA%2BhUKGK_%3D4CVmMHvsHjOVrK6t4F%3DLBpFzsrr3R%2BaJYN8kcTfWg%40mail.gmail.com	2025-03-21 18:44:47 +13:00
David Rowley	00b52c3db6	Simplify EXPLAIN code for Memoize This removes a needless special case for Memoize's FORMAT TEXT EXPLAIN output. ExplainPropertyText() outputs the same thing in text mode as the special-case code was doing, so removing the special-case code results in the same EXPLAIN output, just with less code. It seems like a good idea to fix this to help prevent future changes in this area from copying the same pattern. Author: Ilia Evdokimov <ilya.evdokimov@tantorlabs.com> Reported-by: David Rowley <dgrowleyml@gmail.com> Discussion: https://postgr.es/m/88a71bcd-0b5c-4d0b-8107-757e96f402d5@tantorlabs.com	2025-03-21 13:40:05 +13:00
Andres Freund	202b12774d	bufmgr: Improve stats when a buffer is read in concurrently Previously we would have the following inaccuracies when a backend tried to read in a buffer, but that buffer was read in concurrently by another backend: - the read IO was double-counted in the global buffer access stats (pgBufferUsage) - the buffer hit was not accounted for in: - global buffer access statistics - pg_stat_io - relation level IO stats - vacuum cost balancing While trying to read in a buffer that is concurrently read in by another backend is not a common occurrence, it's also not that rare, e.g. due to concurrent sequential scans on the same relation. This scenario has become more likely in PG 17, due to the introducing of read streams, which can pin multiple buffers before calling StartBufferIO() for all the buffers. This behaviour has historically grown, but there doesn't seem to be any reason to continue with the wrong accounting. Reviewed-by: Melanie Plageman <melanieplageman@gmail.com> Discussion: https://postgr.es/m/CAAKRu_Zk-B08AzPsO-6680LUHLOCGaNJYofaxTFseLa=OepV1g@mail.gmail.com	2025-03-20 19:58:22 -04:00
Andres Freund	fc51a60dd4	smgr: Hold interrupts in most smgr functions We need to hold interrupts across most of the smgr.c/md.c functions, as otherwise interrupt processing, e.g. due to a < ERROR elog/ereport, can trigger procsignal processing, which in turn can trigger smgrreleaseall(). As the relevant code is not reentrant, we quickly end up in a bad situation. The only reason we haven't noticed this before is that there is only one non-error ereport called in affected routines, in register_dirty_segments(), and that one is extremely rarely reached. If one enables fd.c's FDDEBUG it's easy to reproduce crashes. It seems better to put the HOLD_INTERRUPTS()/RESUME_INTERRUPTS() in smgr.c, instead of trying to push them down to md.c where possible: For one, every smgr implementation would be vulnerable, for another, a good bit of smgr.c code itself is affected too. Eventually we might want a more targeted solution, allowing e.g. a networked smgr implementation to be interrupted, but many other, more complicated, problems would need to be fixed for that to be viable (e.g. smgr.c is often called with interrupts already held). One could argue this should be backpatched, but the existing < ERROR elog/ereports that can be reached with unmodified sources are unlikely to be reached. On balance the risk of backpatching seems higher than the gain - at least for now. Reviewed-by: Noah Misch <noah@leadboat.com> Reviewed-by: Thomas Munro <thomas.munro@gmail.com> Discussion: https://postgr.es/m/3vae7l5ozvqtxmd7rr7zaeq3qkuipz365u3rtim5t5wdkr6f4g@vkgf2fogjirl	2025-03-20 17:33:57 -04:00
Robert Haas	50ba65e733	Add an additional hook for EXPLAIN option validation. Commit `c65bc2e1d1` made it possible for loadable modules to add EXPLAIN options. Normally, any necessary validation can be performed by the hook function passed to RegisterExtensionExplainOption, but if a loadable module wants to sanity check options against each other, that needs to be done after the entire options list has been processed. So, add an additional hook for that purpose. Author: Sami Imseih <samimseih@gmail.com> Reviewed-by: Robert Haas <robertmhaas@gmail.com> Reviewed-by: Andrei Lepikhov <lepihov@gmail.com> Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Discussion: http://postgr.es/m/CAA5RZ0vOcJF91O2e5AQN+V6guMNLMhJx83dxALf-iUZ-hLGO_Q@mail.gmail.com	2025-03-20 13:47:55 -04:00
Nathan Bossart	0164a0f9ee	Add vacuum_truncate configuration parameter. This new parameter works just like the storage parameter of the same name: if set to true (which is the default), autovacuum and VACUUM attempt to truncate any empty pages at the end of the table. It is primarily intended to help users avoid locking issues on hot standbys. The setting can be overridden with the storage parameter or VACUUM's TRUNCATE option. Since there's presently no way to determine whether a Boolean storage parameter is explicitly set or has just picked up the default value, this commit also introduces an isset_offset member to relopt_parse_elt. Suggested-by: Will Storey <will@summercat.com> Author: Nathan Bossart <nathandbossart@gmail.com> Co-authored-by: Gurjeet Singh <gurjeet@singh.im> Reviewed-by: Laurenz Albe <laurenz.albe@cybertec.at> Reviewed-by: Fujii Masao <masao.fujii@oss.nttdata.com> Reviewed-by: Robert Treat <rob@xzilla.net> Discussion: https://postgr.es/m/Z2DE4lDX4tHqNGZt%40dev.null	2025-03-20 10:16:50 -05:00
Peter Eisentraut	618c64ffd3	Revert workarounds for -Wmissing-braces false positives on old GCC We have collected several instances of a workaround for GCC bug 53119, which caused false-positive compiler warnings. This bug has long been fixed, but was still seen on the buildfarm, most recently on lapwing with gcc (Debian 4.7.2-5). (The GCC bug tracker mentions that a fix was backported to 4.7.4 and 4.8.3.) That compiler no longer runs warning-free since commit `6fdd5d9563`, so we don't need to keep these workarounds. And furthermore, the consensus appears to be that we don't want to keep supporting that era of platform anymore at all. This reverts the following commits: `d937904cce` `506428d091` `b449afb582` `6392f2a096` `bad0763a4d` `5e0c761d0a` and makes a few similar fixes to newer code. Discussion: https://www.postgresql.org/message-id/flat/e170d61f-01ab-4cf9-ab68-91cd1fac62c5%40eisentraut.org Discussion: https://www.postgresql.org/message-id/flat/CA%2BTgmoYEAm-KKZibAP3hSqbTFTjUd47XtVcf3xSFDpyecXX9uQ%40mail.gmail.com	2025-03-20 11:25:58 +01:00
Peter Eisentraut	47929324c5	Fix typo in comment	2025-03-20 10:44:12 +01:00
Peter Eisentraut	190dc27998	Update a code comment The comment explained that ALTER TABLE ADD CONSTRAINT USING INDEX is only supported with a btree index. (This is not being changed.) The reason is to keep upgrades robust, as explained there. The other part of the comment, that btree is the only unique index kind anyway, is somewhat less true as we're trying to enable unique indexes other than btree, and it's irrelevant to this check. There is a check for indisunique earlier already. So just remove this part of the comment. Author: Mark Dilger <mark.dilger@enterprisedb.com> Discussion: https://www.postgresql.org/message-id/flat/E72EAA49-354D-4C2E-8EB9-255197F55330@enterprisedb.com	2025-03-19 10:39:06 +01:00
Peter Eisentraut	4f7f7b0375	extension_control_path The new GUC extension_control_path specifies a path to look for extension control files. The default value is $system, which looks in the compiled-in location, as before. The path search uses the same code and works in the same way as dynamic_library_path. Some use cases of this are: (1) testing extensions during package builds, (2) installing extensions outside security-restricted containers like Python.app (on macOS), (3) adding extensions to PostgreSQL running in a Kubernetes environment using operators such as CloudNativePG without having to rebuild the base image for each new extension. There is also a tweak in Makefile.global so that it is possible to install extensions using PGXS into an different directory than the default, using 'make install prefix=/else/where'. This previously only worked when specifying the subdirectories, like 'make install datadir=/else/where/share pkglibdir=/else/where/lib', for purely implementation reasons. (Of course, without the path feature, installing elsewhere was rarely useful.) Author: Peter Eisentraut <peter@eisentraut.org> Co-authored-by: Matheus Alcantara <matheusssilv97@gmail.com> Reviewed-by: David E. Wheeler <david@justatheory.com> Reviewed-by: Gabriele Bartolini <gabriele.bartolini@enterprisedb.com> Reviewed-by: Marco Nenciarini <marco.nenciarini@enterprisedb.com> Reviewed-by: Niccolò Fei <niccolo.fei@enterprisedb.com> Discussion: https://www.postgresql.org/message-id/flat/E7C7BFFB-8857-48D4-A71F-88B359FADCFD@justatheory.com	2025-03-19 07:03:20 +01:00
Amit Langote	28317de723	Ensure first ModifyTable rel initialized if all are pruned Commit `cbc127917e` introduced tracking of unpruned relids to avoid processing pruned relations, and changed ExecInitModifyTable() to initialize only unpruned result relations. As a result, MERGE statements that prune all target partitions can now lead to crashes or incorrect behavior during execution. The crash occurs because some executor code paths rely on ModifyTableState.resultRelInfo[0] being present and initialized, even when no result relations remain after pruning. For example, ExecMerge() and ExecMergeNotMatched() use the first resultRelInfo to determine the appropriate action. Similarly, ExecInitPartitionInfo() assumes that at least one result relation exists. To preserve these assumptions, ExecInitModifyTable() now includes the first result relation in the initialized result relation list if all result relations for that ModifyTable were pruned. To enable that, ExecDoInitialPruning() ensures the first relation is locked if it was pruned and locking is necessary. To support this exception to the pruning logic, PlannedStmt now includes a list of RT indexes identifying the first result relation of each ModifyTable node in the plan. This allows ExecDoInitialPruning() to check whether each such relation was pruned and, if so, lock it if necessary. Bug: #18830 Reported-by: Robins Tharakan <tharakan@gmail.com> Diagnozed-by: Tender Wang <tndrwang@gmail.com> Diagnozed-by: Dean Rasheed <dean.a.rasheed@gmail.com> Co-authored-by: Dean Rasheed <dean.a.rasheed@gmail.com> Reviewed-by: Tender Wang <tndrwang@gmail.com> Reviewed-by: Dean Rasheed <dean.a.rasheed@gmail.com> Discussion: https://postgr.es/m/18830-1f31ea1dc930d444%40postgresql.org	2025-03-19 12:14:24 +09:00
Thomas Munro	06fb5612c9	Increase io_combine_limit range to 1MB. The default of 128kB is unchanged, but the upper limit is changed from 32 blocks to 128 blocks, unless the operating system's IOV_MAX is too low. Some other RDBMSes seem to cap their multi-block buffer pool I/O around this number, and it seems useful to allow experimentation. The concrete change is to our definition of PG_IOV_MAX, which provides the maximum for io_combine_limit and io_max_combine_limit. It also affects a couple of other places that work with arrays of struct iovec or smaller objects on the stack, so we still don't want to use the system IOV_MAX directly without a clamp: it is not under our control and likely to be 1024. 128 seems acceptable for our current usage. For Windows, we can't use real scatter/gather yet, so we continue to define our own IOV_MAX value of 16 and emulate preadv()/pwritev() with loops. Someone would need to research the trade-offs of raising that number. NB if trying to see this working: you might temporarily need to hack BAS_BULKREAD to be bigger, since otherwise the obvious way of "a very big SELECT" is limited by that for now. Suggested-by: Tomas Vondra <tomas@vondra.me> Discussion: https://postgr.es/m/CA%2BhUKG%2B2T9p-%2BzM6Eeou-RAJjTML6eit1qn26f9twznX59qtCA%40mail.gmail.com	2025-03-19 15:40:35 +13:00
Thomas Munro	10f6646847	Introduce io_max_combine_limit. The existing io_combine_limit can be changed by users. The new io_max_combine_limit is fixed at server startup time, and functions as a silent clamp on the user setting. That in itself is probably quite useful, but the primary motivation is: aio_init.c allocates shared memory for all asynchronous IOs including some per-block data, and we didn't want to waste memory you'd never used by assuming they could be up to PG_IOV_MAX. This commit already halves the size of 'AioHandleIov' and 'AioHandleData'. A follow-up commit can now expand PG_IOV_MAX without affecting that. Since our GUC system doesn't support dependencies or cross-checks between GUCs, the user-settable one now assigns a "raw" value to io_combine_limit_guc, and the lower of io_combine_limit_guc and io_max_combine_limit is maintained in io_combine_limit. Reviewed-by: Andres Freund <andres@anarazel.de> (earlier version) Discussion: https://postgr.es/m/CA%2BhUKG%2B2T9p-%2BzM6Eeou-RAJjTML6eit1qn26f9twznX59qtCA%40mail.gmail.com	2025-03-19 15:23:54 +13:00
Michael Paquier	17d8bba6da	Fix copy-paste error related to the autovacuum launcher in pgstat_io.c Autovacuum launchers perform no WAL IO reads, but pgstat_tracks_io_op() was tracking them as an allowed combination for the "init" and "normal" contexts. This caused the "read", "read_bytes" and "read_time" attributes of pg_stat_io to show zeros for the autovacuum launcher rather than NULL. NULL means that a combination of IO object, IO context and IO operation has no meaning for a backend type. Zero is the same as telling that a combination is relevant, and that WAL reads are possible in an autovacuum launcher, but it is not relevant. Copy-pasto introduced in `a051e71e28`. Author: Ranier Vilela <ranier.vf@gmail.com> Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com> Discussion: https://postgr.es/m/CAEudQAopEMAPiUqE7BvDV+x2fUPmKmb9RrsaoDR+hhQzLKg4PQ@mail.gmail.com	2025-03-19 08:52:10 +09:00
Masahiko Sawada	f4290f20dd	Fix assertion failure in parallel vacuum with minimal maintenance_work_mem setting. `bbf668d66f` lowered the minimum value of maintenance_work_mem to 64kB. However, in parallel vacuum cases, since the initial underlying DSA size is 256kB, it attempts to perform a cycle of index vacuuming and table vacuuming with an empty TID store, resulting in an assertion failure. This commit ensures that at least one page is processed before index vacuuming and table vacuuming begins. Backpatch to 17, where the minimum maintenance_work_mem value was lowered. Reviewed-by: David Rowley <dgrowleyml@gmail.com> Discussion: https://postgr.es/m/CAD21AoCEAmbkkXSKbj4dB+5pJDRL4ZHxrCiLBgES_g_g8mVi1Q@mail.gmail.com Backpatch-through: 17	2025-03-18 16:37:02 -07:00

1 2 3 4 5 ...

26855 Commits