The startup process does not process shared invalidation messages, only
sending them, and never calls AtEOXact_SMgr() which clean up any
unpinned SMgrRelations. Hence, it is never able to free SMgrRelations
on a periodic basis, bloating its hashtable over time.
Like the checkpointer and the bgwriter, this commit takes a conservative
approach by freeing periodically SMgrRelations when replaying a
checkpoint record, either online or shutdown, so as the startup process
has a way to perform a periodic cleanup.
Issue caused by 21d9c3ee4e, so backpatch down to v17.
Author: Jingtang Zhang <mrdrivingduck@gmail.com>
Reviewed-by: Yuhang Qiu <iamqyh@gmail.com>
Discussion: https://postgr.es/m/28C687D4-F335-417E-B06C-6612A0BD5A10@gmail.com
Backpatch-through: 17
SLRU bank locks are referred as "bank locks" or "SLRU bank locks" in the
code comments. The comments updated in this commit use the latter term.
Oversight in 53c2a97a92, that has replaced the single ControlLock by
the bank control locks.
Author: Julien Rouhaud <julien.rouhaud@free.fr>
Discussion: https://postgr.es/m/aLUT2UO8RjJOzZNq@jrouhaud
Backpatch-through: 17
Commit 4b754d6c1 introduced the concept of an excludeOnly scan key,
which cannot select matching index entries but can reject
non-matching tuples, for example a tsquery such as '!term'. There are
poorly-documented assumptions that such scan keys do not appear as the
first scan key. ginNewScanKey did nothing to ensure that, however,
with the result that certain GIN index searches could go into an
infinite loop while apparently-equivalent queries with the clauses in
a different order were fine.
Fix by teaching ginNewScanKey to place all excludeOnly scan keys
after all not-excludeOnly ones. So far as we know at present,
it might be sufficient to avoid the case where the very first
scan key is excludeOnly; but I'm not very convinced that there
aren't other dependencies on the ordering.
Bug: #19031
Reported-by: Tim Wood <washwithcare@gmail.com>
Author: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/19031-0638148643d25548@postgresql.org
Backpatch-through: 13
The CHECK_FOR_INTERRUPTS call in gingetbitmap turns out to be
inadequate to prevent a long uninterruptible loop, because
we now know a case where looping occurs within scanGetItem.
While the next patch will fix the bug that caused that, it
seems foolish to assume that no similar patterns are possible.
Let's do the CFI within scanGetItem's retry loop, instead.
This demonstrably allows canceling out of the loop exhibited
in bug #19031.
Bug: #19031
Reported-by: Tim Wood <washwithcare@gmail.com>
Author: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/19031-0638148643d25548@postgresql.org
Backpatch-through: 13
The result of "DirectFunctionCall1(numeric_float8, d)" is already in
Datum form, but the code was incorrectly applying PG_RETURN_FLOAT8()
to it. On machines where float8 is pass-by-reference, this would
result in complete garbage, since an unpredictable pointer value
would be treated as an integer and then converted to float. It's not
entirely clear how much of a problem would ensue on 64-bit hardware,
but certainly interpreting a float8 bitpattern as uint64 and then
converting that to float isn't the intended behavior.
As luck would have it, even the complete-garbage case doesn't break
BRIN indexes, since the results are only used to make choices about
how to merge values into ranges: at worst, we'd make poor choices
resulting in an inefficient index. Doubtless that explains the lack
of field complaints. However, users with BRIN indexes that use the
numeric_minmax_multi_ops opclass may wish to reindex in hopes of
making their indexes more efficient.
Author: Peter Eisentraut <peter@eisentraut.org>
Co-authored-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/2093712.1753983215@sss.pgh.pa.us
Backpatch-through: 14
If a crash occurs while writing a WAL record that spans multiple pages, the
recovery process marks the page with the XLP_FIRST_IS_OVERWRITE_CONTRECORD
flag. However, logical decoding currently attempts to read the full WAL
record based on its expected size before checking this flag, which can lead
to an infinite wait if the remaining data is never written (e.g., no activity
after crash).
This patch updates the logic first to read the page header and check for
the XLP_FIRST_IS_OVERWRITE_CONTRECORD flag before attempting to reconstruct
the full WAL record. If the flag is set, decoding correctly identifies
the record as incomplete and avoids waiting for WAL data that will never
arrive.
Discussion: https://postgr.es/m/CAAKRu_ZCOzQpEumLFgG_%2Biw3FTa%2BhJ4SRpxzaQBYxxM_ZAzWcA%40mail.gmail.com
Discussion: https://postgr.es/m/CALDaNm34m36PDHzsU_GdcNXU0gLTfFY5rzh9GSQv%3Dw6B%2BQVNRQ%40mail.gmail.com
Author: Vignesh C <vignesh21@gmail.com>
Reviewed-by: Hayato Kuroda <kuroda.hayato@fujitsu.com>
Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Alexander Korotkov <aekorotkov@gmail.com>
Backpatch-through: 13
Previously, the check_hook functions for max_slot_wal_keep_size and
idle_replication_slot_timeout would incorrectly raise an ERROR for values
set in postgresql.conf during upgrade, even though those values were not
actively used in the upgrade process.
To prevent logical slot invalidation during upgrade, we used to set
special values for these GUCs. Now, instead of relying on those values, we
directly prevent WAL removal and logical slot invalidation caused by
max_slot_wal_keep_size and idle_replication_slot_timeout.
Note: PostgreSQL 17 does not include the idle_replication_slot_timeout
GUC, so related changes were not backported.
BUG #18979
Reported-by: jorsol <jorsol@gmail.com>
Author: Dilip Kumar <dilipbalaut@gmail.com>
Reviewed by: vignesh C <vignesh21@gmail.com>
Reviewed by: Alvaro Herrera <alvherre@alvh.no-ip.org>
Backpatch-through: 17, where it was introduced
Discussion: https://postgr.es/m/219561.1751826409@sss.pgh.pa.us
Discussion: https://postgr.es/m/18979-a1b7fdbb7cd181c6@postgresql.org
Attempting to use commit timestamps during bootstrapping leads to an
assertion failure, that can be reached for example with an initdb -c
that enables track_commit_timestamp. It makes little sense to register
a commit timestamp for a BootstrapTransactionId, so let's disable the
activation of the module in this case.
This problem has been independently reported once by each author of this
commit. Each author has proposed basically the same patch, relying on
IsBootstrapProcessingMode() to skip the use of commit_ts during
bootstrap. The test addition is a suggestion by me, and is applied down
to v16.
Author: Hayato Kuroda <kuroda.hayato@fujitsu.com>
Author: Andy Fan <zhihuifan1213@163.com>
Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Reviewed-by: Fujii Masao <masao.fujii@oss.nttdata.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Discussion: https://postgr.es/m/OSCPR01MB14966FF9E4C4145F37B937E52F5102@OSCPR01MB14966.jpnprd01.prod.outlook.com
Discussion: https://postgr.es/m/87plejmnpy.fsf@163.com
Backpatch-through: 13
Commit c120550edb optimized the vacuuming of relations without
indexes (a.k.a. one-pass strategy) by directly marking dead item IDs
as LP_UNUSED. However, the periodic FSM vacuum was still checking if
dead item IDs had been marked as LP_DEAD when attempting to vacuum the
FSM every VACUUM_FSM_EVERY_PAGES blocks. This condition was never met
due to the optimization, resulting in missed FSM vacuum
opportunities.
This commit modifies the periodic FSM vacuum condition to use the
number of tuples deleted during HOT pruning. This count includes items
marked as either LP_UNUSED or LP_REDIRECT, both of which are expected
to result in new free space to report.
Back-patch to v17 where the vacuum optimization for tables with no
indexes was introduced.
Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Discussion: https://postgr.es/m/CAD21AoBL8m6B9GSzQfYxVaEgvD7-Kr3AJaS-hJPHC+avm-29zw@mail.gmail.com
Backpatch-through: 17
The new tests verify that logical and physical replication slots are still
valid after an immediate restart on checkpoint completion when the slot was
advanced during the checkpoint.
This commit introduces two new injection points to make these tests possible:
* checkpoint-before-old-wal-removal - triggered in the checkpointer process
just before old WAL segments cleanup;
* logical-replication-slot-advance-segment - triggered in
LogicalConfirmReceivedLocation() when restart_lsn was changed enough to
point to the next WAL segment.
Discussion: https://postgr.es/m/flat/1d12d2-67235980-35-19a406a0%4063439497
Author: Vitaly Davydov <v.davydov@postgrespro.ru>
Author: Tomas Vondra <tomas@vondra.me>
Reviewed-by: Alexander Korotkov <aekorotkov@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Backpatch-through: 17
The patch fixes the issue with the unexpected removal of old WAL segments
after checkpoint, followed by an immediate restart. The issue occurs when
a slot is advanced after the start of the checkpoint and before old WAL
segments are removed at the end of the checkpoint.
The idea of the patch is to get the minimal restart_lsn at the beginning
of checkpoint (or restart point) creation and use this value when calculating
the oldest LSN for WAL segments removal at the end of checkpoint. This idea
was proposed by Tomas Vondra in the discussion. Unlike 291221c46575, this
fix doesn't affect ABI and is intended for back branches.
Discussion: https://postgr.es/m/flat/1d12d2-67235980-35-19a406a0%4063439497
Author: Vitaly Davydov <v.davydov@postgrespro.ru>
Reviewed-by: Tomas Vondra <tomas@vondra.me>
Reviewed-by: Alexander Korotkov <aekorotkov@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Backpatch-through: 13
Teach nbtree's _bt_killitems to leave the so->currPos page that it sets
LP_DEAD items on in whatever state it was in when _bt_killitems was
called. In particular, make sure that so->dropPin scans don't acquire a
pin whose reference is saved in so->currPos.buf.
Allowing _bt_killitems to change so->currPos.buf like this is wrong.
The immediate consequence of allowing it is that code in _bt_steppage
(that copies so->currPos into so->markPos) will behave as if the scan is
a !so->dropPin scan. so->markPos will therefore retain the buffer pin
indefinitely, even though _bt_killitems only needs to acquire a pin
(along with a lock) for long enough to mark known-dead items LP_DEAD.
This issue came to light following a report of a failure of an assertion
from recent commit e6eed40e. The test case in question involves the use
of mark and restore. An initial call to _bt_killitems takes place that
leaves so->currPos.buf in a state that is inconsistent with the scan
being so->dropPin. A subsequent call to _bt_killitems for the same
position (following so->currPos being saved in so->markPos, and then
restored as so->currPos) resulted in the failure of an assertion that
tests that so->currPos.buf is InvalidBuffer when the scan is so->dropPin
(non-assert builds got a "resource was not closed" WARNING instead).
The same problem exists on earlier releases, though the issue is far
more subtle there. Recent commit e6eed40e introduced the so->dropPin
field as a partial replacement for testing so->currPos.buf directly.
Earlier releases won't get an assertion failure (or buffer pin leak),
but they will allow the second _bt_killitems call from the test case to
behave as if a buffer pin was consistently held since the original call
to _bt_readpage. This is wrong; there will have been an initial window
during which no pin was held on the so->currPos page, and yet the second
_bt_killitems call will neglect to check if so->currPos.lsn continues to
match the page's now-current LSN.
As a result of all this, it's just about possible that _bt_killitems
will set the wrong items LP_DEAD (on release branches). This could only
happen with merge joins (the sole user of nbtree mark/restore support),
when a concurrently inserted index tuple used a recently-recycled TID
(and only when the new tuple was inserted onto the same page as a
distinct concurrently-removed tuple with the same TID). This is exactly
the scenario that _bt_killitems' check of the page's now-current LSN
against the LSN stashed in currPos was supposed to prevent.
A follow-up commit will make nbtree completely stop conditioning whether
or not a position's pin needs to be dropped on whether the 'buf' field
is set. All call sites that might need to drop a still-held pin will be
taught to rely on the scan-level so->dropPin field recently introduced
by commit e6eed40e. That will make bugs of the same general nature as
this one impossible (or make them much easier to detect, at least).
Author: Peter Geoghegan <pg@bowt.ie>
Reported-By: Alexander Lakhin <exclusion@gmail.com>
Discussion: https://postgr.es/m/545be1e5-3786-439a-9257-a90d30f8b849@gmail.com
Backpatch-through: 13
We store values for these options as array elements with the syntax
"name=value", hence a name containing "=" confuses matters when
it's time to read the array back in. Since validation of the
options is often done (long) after this conversion to array format,
that leads to confusing and off-point error messages. We can
improve matters by rejecting names containing "=" up-front.
(Probably a better design would have involved pairs of array
elements, but it's too late now --- and anyway, there's no
evident use-case for option names like this. We already
reject such names in some other contexts such as GUCs.)
Reported-by: Chapman Flack <jcflack@acm.org>
Author: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Chapman Flack <jcflack@acm.org>
Discussion: https://postgr.es/m/6830EB30.8090904@acm.org
Backpatch-through: 13
A few places that access system catalogs don't set up an active
snapshot before potentially accessing their TOAST tables. To fix,
push an active snapshot just before each section of code that might
require accessing one of these TOAST tables, and pop it shortly
afterwards. While at it, this commit adds some rather strict
assertions in an attempt to prevent such issues in the future.
Commit 16bf24e0e4 recently removed pg_replication_origin's TOAST
table in order to fix the same problem for that catalog. On the
back-branches, those bugs are left in place. We cannot easily
remove a catalog's TOAST table on released major versions, and only
replication origins with extremely long names are affected. Given
the low severity of the issue, fixing older versions doesn't seem
worth the trouble of significantly modifying the patch.
Also, on v13 and v14, the aforementioned strict assertions have
been omitted because commit 2776922201, which added
HaveRegisteredOrActiveSnapshot(), was not back-patched. While we
could probably back-patch it now, I've opted against it because it
seems unlikely that new TOAST snapshot issues will be introduced in
the oldest supported versions.
Reported-by: Alexander Lakhin <exclusion@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Discussion: https://postgr.es/m/18127-fe54b6a667f29658%40postgresql.org
Discussion: https://postgr.es/m/18309-c0bf914950c46692%40postgresql.org
Discussion: https://postgr.es/m/ZvMSUPOqUU-VNADN%40nathan
Backpatch-through: 13
Since commit 8b1dccd37c, the PREPARE TRANSACTION WAL record includes
information about dropped statistics entries. However, the WAL resource
manager description function for PREPARE TRANSACTION record failed to
parse this information correctly and always assumed there were
no such entries.
As a result, for example, pg_waldump could not display the dropped
statistics entries stored in PREPARE TRANSACTION records.
The root cause was that ParsePrepareRecord() did not set the number of
statistics entries to drop on commit or abort. These values remained
zero-initialized and were never updated from the parsed record.
This commit fixes the issue by properly setting those values during parsing.
With this fix, pg_waldump can now correctly report dropped statistics
entries in PREPARE TRANSACTION records.
Back-patch to v15, where commit 8b1dccd37c was introduced.
Author: Daniil Davydov <3danissimo@gmail.com>
Reviewed-by: Fujii Masao <masao.fujii@gmail.com>
Discussion: https://postgr.es/m/CAJDiXgh-6Epb2XiJe4uL0zF-cf0_s_7Lw1TfEHDMLzYjEmfGOw@mail.gmail.com
Backpatch-through: 15
Commit 0f21db36d made an assumption that GIN triConsistentFns
would not modify their input entryRes[] arrays. But in fact,
the "shim" triConsistentFn that we use for opclasses that don't
supply their own did exactly that, potentially leading to wrong
answers from a GIN index search. Through bad luck, none of the
test cases that we have for such opclasses exposed the bug.
One response to this could be that the assumption of consistency check
functions not modifying entryRes[] arrays is a bad one, but it still
seems reasonable to me. Notably, shimTriConsistentFn is itself
assuming that with respect to the underlying boolean consistentFn,
so it's sure being self-centered in supposing that it gets to do so.
Fortunately, it's quite simple to fix shimTriConsistentFn to restore
the entry-time state of entryRes[], so let's do that instead.
This issue doesn't affect any core GIN opclasses, since they all
supply their own triConsistentFns. It does affect contrib modules
btree_gin, hstore, and intarray.
Along the way, I (tgl) noticed that shimTriConsistentFn failed to
pick up on a "recheck" flag returned by its first call to the boolean
consistentFn. This may be only a latent problem, since it would be
unlikely for a consistentFn to set recheck for the all-false case
and not any other cases. (Indeed, none of our contrib modules do
that.) Nonetheless, it's formally wrong.
Reported-by: Vinod Sridharan <vsridh90@gmail.com>
Author: Vinod Sridharan <vsridh90@gmail.com>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/CAFMdLD7XzsXfi1+DpTqTgrD8XU0i2C99KuF=5VHLWjx4C1pkcg@mail.gmail.com
Backpatch-through: 13
Commit 28d3c2ddcf introduced an assertion that if the memorized
downlink location in the insertion stack isn't valid, the parent's
LSN should've changed too. Turns out that was too strict. In
gistFindCorrectParent(), if we walk right, we update the parent's
block number and clear its memorized 'downlinkoffnum'. That triggered
the assertion on next call to gistFindCorrectParent(), if the parent
needed to be split too. Relax the assertion, so that it's OK if
downlinkOffnum is InvalidOffsetNumber.
Backpatch to v13-, all supported versions. The assertion was added in
commit 28d3c2ddcf in v12.
Reported-by: Alexander Lakhin <exclusion@gmail.com>
Reviewed-by: Tender Wang <tndrwang@gmail.com>
Discussion: https://www.postgresql.org/message-id/18396-03cac9beb2f7aac3@postgresql.org
The brin_bloom_union() function combines two BRIN summaries, by merging
one filter into the other. With bloom, we have to decompress the filters
first, but the function failed to update the summary to store the merged
filter. As a consequence, the index may be missing some of the data, and
return false negatives.
This issue exists since BRIN bloom indexes were introduced in Postgres
14, but at that point the union function was called only when two
sessions happened to summarize a range concurrently, which is rare. It
got much easier to hit in 17, as parallel builds use the union function
to merge summaries built by workers.
Fixed by storing a pointer to the decompressed filter, and freeing the
original one. Free the second filter too, if it was decompressed. The
freeing is not strictly necessary, because the union is called in
short-lived contexts, but it's tidy.
Backpatch to 14, where BRIN bloom indexes were introduced.
Reported by Arseniy Mukhin, investigation and fix by me.
Reported-by: Arseniy Mukhin
Discussion: https://postgr.es/m/18855-1cf1c8bcc22150e6%40postgresql.org
Backpatch-through: 14
bbf668d66f lowered the minimum value of maintenance_work_mem to
64kB. However, in parallel vacuum cases, since the initial underlying
DSA size is 256kB, it attempts to perform a cycle of index vacuuming
and table vacuuming with an empty TID store, resulting in an assertion
failure.
This commit ensures that at least one page is processed before index
vacuuming and table vacuuming begins.
Backpatch to 17, where the minimum maintenance_work_mem value was
lowered.
Reviewed-by: David Rowley <dgrowleyml@gmail.com>
Discussion: https://postgr.es/m/CAD21AoCEAmbkkXSKbj4dB+5pJDRL4ZHxrCiLBgES_g_g8mVi1Q@mail.gmail.com
Backpatch-through: 17
With improperly defined operator classes, it's possible to get a
Postgres crash because we'd try to invoke a procedure that doesn't
exist. This is because the code is being a bit too trusting that the
opclass is correctly defined. Add some ereport(ERROR)s for cases where
mandatory support procedures are not defined, transforming the crashes
into errors.
The particular case that was reported is an incomplete opclass in
PostGIS.
Backpatch all the way down to 13.
Reported-by: Tobias Wendorff <tobias.wendorff@tu-dortmund.de>
Diagnosed-by: David Rowley <dgrowleyml@gmail.com>
Reviewed-by: Tomas Vondra <tomas@vondra.me>
Discussion: https://postgr.es/m/fb6d9a35-6c8e-4869-af80-0a4944a793a4@tu-dortmund.de
If a GIN index search had a lot of search keys (for example,
"jsonbcol ?| array[]" with tens of thousands of array elements),
both ginFillScanKey() and startScanKey() took O(N^2) time.
Worse, those loops were uncancelable for lack of CHECK_FOR_INTERRUPTS.
The problem in ginFillScanKey() is the brute-force search key
de-duplication done in ginFillScanEntry(). The most expedient
solution seems to be to just stop trying to de-duplicate once
there are "too many" search keys. We could imagine working harder,
say by using a sort-and-unique algorithm instead of brute force
compare-all-the-keys. But it seems unlikely to be worth the trouble.
There is no correctness issue here, since the code already allowed
duplicate keys if any extra_data is present.
The problem in startScanKey() is the loop that attempts to identify
the first non-required search key. In the submitted test case, that
vainly tests all the key positions, and each iteration takes O(N)
time. One part of that is that it's reinitializing the entryRes[]
array from scratch each time, which is entirely unnecessary given
that the triConsistentFn isn't supposed to scribble on its input.
We can easily adjust the array contents incrementally instead.
The other part of it is that the triConsistentFn may itself take
O(N) time (and does in this test case). This is all extremely
brute force: in simple cases with AND or OR semantics, we could
know without any looping whatever that all or none of the keys
are required. But GIN opclasses don't have any API for exposing
that knowledge, so at least in the short run there is little to
be done about that. Put in a CHECK_FOR_INTERRUPTS so that at
least the loop is cancelable.
These two changes together resolve the primary complaint that
the test query doesn't respond promptly to cancel interrupts.
Also, while they don't completely eliminate the O(N^2) behavior,
they do provide quite a nice speedup for mid-sized examples.
Bug: #18831
Reported-by: Niek <niek.brasa@hitachienergy.com>
Author: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/18831-e845ac44ebc5dd36@postgresql.org
Backpatch-through: 13
If the requested recovery timeline is not reachable, the logged
checkpoint and timeline should to be the values read from the
backup_label when it is defined. The message generated used the values
from the control file in this case, which is fine when recovering from
the control file without a backup_label, but not if there is a
backup_label.
Issue introduced in ee994272ca. v15 has introduced xlogrecovery.c and
more simplifications in this area (4a92a1c3d1, a27048cbcb), making
this change a bit simpler to think about, so backpatch only down to this
version.
Author: David Steele <david@pgbackrest.org>
Reviewed-by: Andrey M. Borodin <x4mmm@yandex-team.ru>
Reviewed-by: Benoit Lobréau <benoit.lobreau@dalibo.com>
Discussion: https://postgr.es/m/c3d617d4-1696-4aa7-8a4d-5a7d19cc5618@pgbackrest.org
Backpatch-through: 15
Since commit 29cf61ade3, table_block_relation_estimate_size() considers
fillfactor when estimating number of rows in a relation before the first
ANALYZE. The formula however did not consider tuples may be larger than
available space determined by fillfactor, ending with density 0. This
ultimately means the relation was estimated to contain a single row.
The executor however places at least one tuple per page, even with very
low fillfactor values, so the density should be at least 1. Fixed by
clamping the density estimate using clamp_row_est().
Reported by Heikki Linnakangas. Fix by me, with regression test inspired
by example provided by Heikki.
Backpatch to 17, where the issue was introduced.
Reported-by: Heikki Linnakangas
Backpatch-through: 17
Discussion: https://postgr.es/m/2bf9d973-7789-4937-a7ca-0af9fb49c71e@iki.fi
Logical replication crashes if the subscriber's partitioned table
has a BRIN index. There are two independently blamable causes,
and this patch fixes both:
1. brininsertcleanup fails if called twice for the same IndexInfo,
because it half-destroys its BrinInsertState but leaves it still
linked from ii_AmCache. brininsert would also fail in that state,
so it's pretty hard to see any advantage to this coding. Fully
remove the BrinInsertState, instead, so that a new brininsert
call would create a new cache.
2. A logical replication subscriber sometimes does ExecOpenIndices
twice on the same ResultRelInfo, followed by doing ExecCloseIndices
twice; the second call reaches the brininsertcleanup bug. Quite
aside from tickling unexpected cases in aminsertcleanup methods,
this seems very wasteful, because the IndexInfos built in the
first ExecOpenIndices call are just lost during the second call,
and have to be rebuilt at possibly-nontrivial cost. We should
establish a coding rule that you don't do that.
The problematic coding is that when the target table is partitioned,
apply_handle_tuple_routing calls ExecFindPartition which does
ExecOpenIndices (and expects that ExecCleanupTupleRouting will
close the indexes again). Using the ResultRelInfo made by
ExecFindPartition, it calls apply_handle_delete_internal or
apply_handle_insert_internal, both of which think they need to do
ExecOpenIndices/ExecCloseIndices for themselves. They do in the main
non-partitioned code paths, but not here. The simplest fix is to pull
their ExecOpenIndices/ExecCloseIndices calls out and put them in the
call sites for the non-partitioned cases. (We could have refactored
apply_handle_update_internal similarly, but I did not do so today
because there's no bug there: the partitioned code path doesn't
call it.)
Also, remove the always-duplicative open/close calls within
apply_handle_tuple_routing itself.
Since brininsertcleanup and indeed the whole aminsertcleanup mechanism
are new in v17, there's no observable bug in older branches. A case
could be made for trying to avoid these duplicative open/close calls
in the older branches, but for now it seems not worth the trouble and
risk of new bugs.
Bug: #18815
Reported-by: Sergey Belyashov <sergey.belyashov@gmail.com>
Discussion: https://postgr.es/m/18815-2a0407cc7f40b327@postgresql.org
Backpatch-through: 17
The right mix of DDL and VACUUM could corrupt a catalog page header such
that PageIsVerified() durably fails, requiring a restore from backup.
This affects only catalogs that both have a syscache and have DDL code
that uses syscache tuples to construct updates. One of the test
permutations shows a variant not yet fixed.
This makes !TransactionIdIsValid(TM_FailureData.xmax) possible with
TM_Deleted. I think core and PGXN are indifferent to that.
Per bug #17821 from Alexander Lakhin. Back-patch to v13 (all supported
versions). The test case is v17+, since it uses INJECTION_POINT.
Discussion: https://postgr.es/m/17821-dd8c334263399284@postgresql.org
XLogPageRead() checks immediately for an invalid WAL record header on a
standby, to be able to handle the case of continuation records that need
to be read across two different sources. As written, the check was too
generic, applying to any target LSN. Based on an analysis by Kyotaro
Horiguchi, what really matters is to make sure that the page header is
checked when attempting to read a LSN at the boundary of a segment, to
handle the case of a continuation record that spawns across multiple
pages when dealing with multiple segments, as WAL receivers are spawned
they request WAL from the beginning of a segment. This fix has been
proposed by Kyotaro Horiguchi.
This could cause standbys to loop infinitely when dealing with a
continuation record during a timeline jump, in the case where the
contents of the record in the follow-up page are invalid.
Some regression tests are added to check such scenarios, able to
reproduce the original problem. In the test, the contents of a
continuation record are overwritten with junk zeros on its follow-up
page, and replayed on standbys. This is inspired by 039_end_of_wal.pl,
and is enough to show how standbys should react on promotion by not
being stuck. Without the fix, the test would fail with a timeout. The
test to reproduce the problem has been written by Alexander Kukushkin.
The original check has been introduced in 0668719801, for a similar
problem.
Author: Kyotaro Horiguchi, Alexander Kukushkin
Reviewed-by: Michael Paquier
Discussion: https://postgr.es/m/CAFh8B=mozC+e1wGJq0H=0O65goZju+6ab5AU7DEWCSUA2OtwDg@mail.gmail.com
Backpatch-through: 13
This commit reverts 8f67f994e8 (down to v13) and c3de0f9eed (down to
v17), as these are proving to not be completely correct regarding two
aspects:
- In v17 and newer branches, c3de0f9eed38's check for epoch handling is
incorrect, and does not correctly handle frozen epochs. A logic closer
to widen_snapshot_xid() should be used. The 2PC code should try to
integrate deeper with FullTransactionIds, 5a1dfde833 being not enough.
- In v13 and newer branches, 8f67f994e8 is a workaround for the real
issue, which is that we should not attempt CLOG lookups without reaching
consistency. This exists since 728bd991c3, and this is reachable with
ProcessTwoPhaseBuffer() called by restoreTwoPhaseData() at the beginning
of recovery.
Per discussion with Noah Misch.
Discussion: https://postgr.es/m/20250116010051.f3.nmisch@google.com
Backpatch-through: 13
This function expects an "int64" as result and stores the number of
pages to add to the index scan bitmap as an "int", multiplying its final
result by 10. For a relation large enough, this can theoretically
overflow if counting more than (INT32_MAX / 10) pages, knowing that the
number of pages is upper-bounded by MaxBlockNumber.
To avoid the overflow, this commit redefines "totalpages", used to
calculate the result, to be an "int64" rather than an "int".
Reported-by: Evgeniy Gorbanyov
Author: James Hunter
Discussion: https://www.postgresql.org/message-id/07704817-6fa0-460c-b1cf-cd18f7647041@basealt.ru
Backpatch-through: 13
The originally submitted code (using bit masking) was correct when the
number of slots was restricted to be a power of two -- but that
limitation was removed during development that led to commit
53c2a97a92, which made the bank selection code incorrect. This led to
always using a smaller number of banks than available. Change said code
to use integer modulo instead, which works correctly with an arbitrary
number of banks.
It's likely that we could improve on this to avoid runtime use of
integer division. But with this change we're, at least, not wasting
memory on unused banks, and more banks mean less contention, which is
likely to have a much higher performance impact than a single
instruction's latency.
Author: Yura Sokolov <y.sokolov@postgrespro.ru>
Reviewed-by: Andrey Borodin <x4mmm@yandex-team.ru>
Discussion: https://postgr.es/m/9444dc46-ca47-43ed-9058-89c456316306@postgrespro.ru
At the beginning of recovery, an orphaned two-phase file in an epoch
different than the one defined in the checkpoint record could not be
removed based on the assumptions that AdjustToFullTransactionId() relies
on, assuming that all files would be either from the current epoch or
from the previous epoch.
If the checkpoint epoch was 0 while the 2PC file was orphaned and in the
future, AdjustToFullTransactionId() would underflow the epoch used to
build the 2PC file path. In non-assert builds, this would create a
WARNING message referring to a 2PC file with an epoch of "FFFFFFFF" (or
UINT32_MAX), as an effect of the underflow calculation, leaving the
orphaned file around.
Some tests are added with dummy 2PC files in the past and the future,
checking that these are properly removed.
Issue introduced by 5a1dfde833, that has switched two-phase state
files to use FullTransactionIds.
Reported-by: Vitaly Davydov
Author: Michael Paquier
Reviewed-by: Vitaly Davydov
Discussion: https://postgr.es/m/13b5b6-676c3080-4d-531db900@47931709
Backpatch-through: 17
Before 728bd991c3, that has improved the support for 2PC files during
recovery, the initial logic scanning files in pg_twophase was done so as
files in the future of the transaction ID horizon were checked first,
followed by a check if a transaction ID is aborted or committed which
could involve a pg_xact lookup. After this commit, these checks have
been done in reverse order.
Files detected as in the future do not have a state that can be checked
in pg_xact, hence this caused recovery to fail abruptly should an
orphaned 2PC file in the future of the transaction ID horizon exist in
pg_twophase at the beginning of recovery.
A test is added to check for this scenario, using an empty 2PC with a
transaction ID large enough to be in the future when running the test.
This test is added in 16 and older versions for now. 17 and newer
versions are impacted by a second bug caused by the addition of the
epoch in the 2PC file names. An equivalent test will be added in these
branches in a follow-up commit, once the second set of issues reported
are fixed.
Author: Vitaly Davydov, Michael Paquier
Discussion: https://postgr.es/m/11e597-676ab680-8d-374f23c0@145466129
Backpatch-through: 13
Cause parallel workers to not check datallowconn, rolcanlogin, and
ACL_CONNECT privileges. The leader already checked these things
(except for rolcanlogin which might have been checked for a different
role). Re-checking can accomplish little except to induce unexpected
failures in applications that might not even be aware that their query
has been parallelized. We already had the principle that parallel
workers rely on their leader to pass a valid set of authorization
information, so this change just extends that a bit further.
Also, modify the ReservedConnections, datconnlimit and rolconnlimit
logic so that these limits are only enforced against regular backends,
and only regular backends are counted while checking if the limits
were already reached. Previously, background processes that had an
assigned database or role were subject to these limits (with rather
random exclusions for autovac workers and walsenders), and the set of
existing processes that counted against each limit was quite haphazard
as well. The point of these limits, AFAICS, is to ensure the
availability of PGPROC slots for regular backends. Since all other
types of processes have their own separate pools of PGPROC slots, it
makes no sense either to enforce these limits against them or to count
them while enforcing the limit.
While edge-case failures of these sorts have been possible for a
long time, the problem got a good deal worse with commit 5a2fed911
(CVE-2024-10978), which caused parallel workers to make some of these
checks using the leader's current role where before we had used its
AuthenticatedUserId, thus allowing parallel queries to fail after
SET ROLE. The previous behavior was fairly accidental and I have
no desire to return to it.
This patch includes reverting 73c9f91a1, which was an emergency hack
to suppress these same checks in some cases. It wasn't complete,
as shown by a recent bug report from Laurenz Albe. We can also revert
fd4d93d26 and 492217301, which hacked around the same problems in one
regression test.
In passing, remove the special case for autovac workers in
CheckMyDatabase; it seems cleaner to have AutoVacWorkerMain pass
the INIT_PG_OVERRIDE_ALLOW_CONNS flag, now that that does what's
needed.
Like 5a2fed911, back-patch to supported branches (which sadly no
longer includes v12).
Discussion: https://postgr.es/m/1808397.1735156190@sss.pgh.pa.us
Consistently reset so->scanBehind at the beginning of nbtree array
advancement, even during sktrig_required=false calls (calls where array
advancement is triggered by an unsatisfied non-required array scan key).
Otherwise, it's possible for queries to fail to return all relevant
tuples to the scan given a low-order required scan key that was
previously deemed "satisfied" by a truncated high key attribute value.
This only happened at the point where a later non-required array scan
key needed to be "advanced" once on the next leaf page (that is, once
the right sibling of the truncated high key page was reached).
The underlying issue was that later code within _bt_advance_array_keys
assumed that the so->scanBehind flag must have been set using the
current page's high key (not the previous page's high key). Any later
successful recheck call to _bt_check_compare would therefore spuriously
be prevented from making _bt_advance_array_keys return true, based on
the faulty belief that the truncated attribute must be from the scan's
current tuple (i.e. the non-pivot tuple at the start of the next page).
_bt_advance_array_keys would return false for the tuple, ultimately
resulting in _bt_checkkeys failing to return a matching tuple.
Oversight in commit 5bf748b8, which enhanced nbtree ScalarArrayOp
execution.
Author: Peter Geoghegan <pg@bowt.ie>
Discussion: https://postgr.es/m/CAH2-WzkJKncfqyAUTeuB5GgRhT1vhsWO2q11dbZNqKmvjopP_g@mail.gmail.com
Backpatch: 17-, where commit 5bf748b8 first appears.
parallel_vacuum_reset_dead_items used a local variable to hold a
pointer from the passed vacrel, purely as a shorthand. This pointer
was later freed and a new allocation was made and stored to the
struct. Then the local pointer was mistakenly referenced again.
This apparently happened not to break anything since the freed chunk
would have been put on the context's freelist, so it was accidentally
the same pointer anyway, in which case the DSA handle was correctly
updated. The minimal fix is to change two places so they access
dead_items through the vacrel. This coding style is a maintenance
hazard, so while at it get rid of most other similar usages, which
were inconsistently used anyway.
Analysis and patch by Vallimaharajan G, with further defensive coding
by me
Backpath to v17, when TidStore came in
Discussion: https://postgr.es/m/1936493cc38.68cb2ef27266.7456585136086197135@zohocorp.com
When using a pipeline, a transaction starts from the first command and
is committed with a Sync message or when the pipeline ends.
Functions like IsInTransactionBlock() or PreventInTransactionBlock()
were already able to understand a pipeline as being in a transaction
block, but it was not the case of CheckTransactionBlock(). This
function is called for example to generate a WARNING for SET LOCAL,
complaining that it is used outside of a transaction block.
The current state of the code caused multiple problems, like:
- SET LOCAL executed at any stage of a pipeline issued a WARNING, even
if the command was at least second in line where the pipeline is in a
transaction state.
- LOCK TABLE failed when invoked at any step of a pipeline, even if it
should be able to work within a transaction block.
The pipeline protocol assumes that the first command of a pipeline is
not part of a transaction block, and that any follow-up commands is
considered as within a transaction block.
This commit changes the backend so as an implicit transaction block is
started each time the first Execute message of a pipeline has finished
processing, with this implicit transaction block ended once a sync is
processed. The checks based on XACT_FLAGS_PIPELINING in the routines
checking if we are in a transaction block are not necessary: it is
enough to rely on the existing ones.
Some tests are added to pgbench, that can be backpatched down to v17
when \syncpipeline is involved and down to v14 where \startpipeline and
\endpipeline are available. This is unfortunately limited regarding the
error patterns that can be checked, but it provides coverage for various
pipeline combinations to check if these succeed or fail. These tests
are able to capture the case of SET LOCAL's WARNING. The author has
proposed a different feature to improve the coverage by adding similar
meta-commands to psql where error messages could be checked, something
more useful for the cases where commands cannot be used in transaction
blocks, like REINDEX CONCURRENTLY or VACUUM. This is considered as
future work for v18~.
Author: Anthonin Bonnefoy
Reviewed-by: Jelte Fennema-Nio, Michael Paquier
Discussion: https://postgr.es/m/CAO6_XqrWO8uNBQrSu5r6jh+vTGi5Oiyk4y8yXDORdE2jbzw8xw@mail.gmail.com
Backpatch-through: 13
The SQL spec mandates that SET SESSION AUTHORIZATION implies
SET ROLE NONE. We tried to implement that within the lowest-level
functions that manipulate these settings, but that was a bad idea.
In particular, guc.c assumes that it doesn't matter in what order
it applies GUC variable updates, but that was not the case for these
two variables. This problem, compounded by some hackish attempts to
work around it, led to some security-grade issues:
* Rolling back a transaction that had done SET SESSION AUTHORIZATION
would revert to SET ROLE NONE, even if that had not been the previous
state, so that the effective user ID might now be different from what
it had been.
* The same for SET SESSION AUTHORIZATION in a function SET clause.
* If a parallel worker inspected current_setting('role'), it saw
"none" even when it should see something else.
Also, although the parallel worker startup code intended to cope
with the current role's pg_authid row having disappeared, its
implementation of that was incomplete so it would still fail.
Fix by fully separating the miscinit.c functions that assign
session_authorization from those that assign role. To implement the
spec's requirement, teach set_config_option itself to perform "SET
ROLE NONE" when it sets session_authorization. (This is undoubtedly
ugly, but the alternatives seem worse. In particular, there's no way
to do it within assign_session_authorization without incompatible
changes in the API for GUC assign hooks.) Also, improve
ParallelWorkerMain to directly set all the relevant user-ID variables
instead of relying on some of them to get set indirectly. That
allows us to survive not finding the pg_authid row during worker
startup.
In v16 and earlier, this includes back-patching 9987a7bf3 which
fixed a violation of GUC coding rules: SetSessionAuthorization
is not an appropriate place to be throwing errors from.
Security: CVE-2024-10978
Commit ac04aa84a put the shutoff for this into the planner, which is
not ideal because it doesn't prevent us from re-using a previously
made parallel plan. Revert the planner change and instead put the
shutoff into InitializeParallelDSM, modeling it on the existing code
there for recovering from failure to allocate a DSM segment.
However, that code path is mostly untested, and testing a bit harder
showed there's at least one bug: ExecHashJoinReInitializeDSM is not
prepared for us to have skipped doing parallel DSM setup. I also
thought the Assert in ReinitializeParallelWorkers is pretty
ill-advised, and replaced it with a silent Min() operation.
The existing test case added by ac04aa84a serves fine to test this
version of the fix, so no change needed there.
Patch by me, but thanks to Noah Misch for the core idea that we
could shut off worker creation when !INTERRUPTS_CAN_BE_PROCESSED.
Back-patch to v12, as ac04aa84a was.
Discussion: https://postgr.es/m/CAC-SaSzHUKT=vZJ8MPxYdC_URPfax+yoA1hKTcF4ROz_Q6z0_Q@mail.gmail.com
This reverts commit 95c5acb3fc (v17) and
counterparts in each other non-master branch. If released, that commit
would have caused a worst-in-years minor release regression, via
undetected LWLock self-deadlock. This commit and its self-deadlock fix
warrant more bake time in the master branch.
Reported by Alexander Lakhin.
Discussion: https://postgr.es/m/10ec0bc3-5933-1189-6bb8-5dec4114558e@gmail.com
A bug in nbtree's handling of primitive index scan scheduling could lead
to wrong answers when a scrollable cursor was used with an index scan
that had a SAOP index qual. Wrong answers were only possible when the
scan direction changed after a primitive scan was scheduled, but before
_bt_next was asked to fetch the next tuple in line (i.e. for things to
break, _bt_next had to be denied the opportunity to step off the page in
the same direction as the one used when the primscan was scheduled).
Furthermore, the issue only occurred when the page in question happened
to be the first page to be visited by the entire top-level scan; the
issue hinged upon the cursor backing up to the absolute beginning of the
key space that it returns tuples from (fetching in the opposite scan
direction across a "primitive scan boundary" always worked correctly).
To fix, make _bt_next unset the "needs primitive index scan" flag when
it detects that the current scan direction is not the one that was used
by _bt_readpage back when the primitive scan in question was scheduled.
This fixes the cases that are known to be faulty, and also seems like a
good idea on general robustness grounds.
Affected scrollable cursor cases now avoid a spurious primitive index
scan when they fetch backwards to the absolute start of the key space to
be visited by their cursor. Fetching backwards now only returns those
tuples at the start of the scan, as expected. It'll also be okay to
once again fetch forwards from the start at that point, since the scan
will be left in a state that's exactly consistent with the state it was
in before any tuples were ever fetched, as expected.
Oversight in commit 5bf748b8, which enhanced nbtree ScalarArrayOp
execution.
Author: Peter Geoghegan <pg@bowt.ie>
Discussion: https://postgr.es/m/CAH2-Wznv49bFsE2jkt4GuZ0tU2C91dEST=50egzjY2FeOcHL4Q@mail.gmail.com
Backpatch: 17-, where commit 5bf748b8 first appears.
Commit a07e03fd8f changed inplace updates
to wait for heap_update() commands like GRANT TABLE and GRANT DATABASE.
By keeping the pin during that wait, a sequence of autovacuum workers
and an uncommitted GRANT starved one foreground LockBufferForCleanup()
for six minutes, on buildfarm member sarus. Prevent, at the cost of a
bit of complexity. Back-patch to v12, like the earlier commit. That
commit and heap_inplace_lock() have not yet appeared in any release.
Discussion: https://postgr.es/m/20241026184936.ae.nmisch@google.com
A buffer lock won't stop a reader having already checked tuple
visibility. If a vac_update_datfrozenid() and then a crash happened
during inplace update of a relfrozenxid value, datfrozenxid could
overtake relfrozenxid. That could lead to "could not access status of
transaction" errors. Back-patch to v12 (all supported versions). In
v14 and earlier, this also back-patches the assertion removal from
commit 7fcf2faf9c.
Discussion: https://postgr.es/m/20240620012908.92.nmisch@google.com
The inplace update survives ROLLBACK. The inval didn't, so another
backend's DDL could then update the row without incorporating the
inplace update. In the test this fixes, a mix of CREATE INDEX and ALTER
TABLE resulted in a table with an index, yet relhasindex=f. That is a
source of index corruption. Back-patch to v12 (all supported versions).
The back branch versions don't change WAL, because those branches just
added end-of-recovery SIResetAll(). All branches change the ABI of
extern function PrepareToInvalidateCacheTuple(). No PGXN extension
calls that, and there's no apparent use case in extensions.
Reviewed by Nitin Motiani and (in earlier versions) Andres Freund.
Discussion: https://postgr.es/m/20240523000548.58.nmisch@google.com
An inplace update's invalidation messages are part of its transaction's
commit record. However, the update survives even if its transaction
aborts or we stop recovery before replaying its transaction commit.
After recovery, a backend that started in recovery could update the row
without incorporating the inplace update. That could result in a table
with an index, yet relhasindex=f. That is a source of index corruption.
This bulk invalidation avoids the functional consequences. A future
change can fix the !RecoveryInProgress() scenario without changing the
WAL format. Back-patch to v17 - v12 (all supported versions). v18 will
instead add invalidations to WAL.
Discussion: https://postgr.es/m/20240618152349.7f.nmisch@google.com