GetSnapshotData() set TransactionXmin = MyProc->xmin, but when
SnapshotResetXmin() advanced MyProc->xmin, it did not advance
TransactionXmin correspondingly. That meant that TransactionXmin could
be older than MyProc->xmin, and XIDs between than TransactionXmin and
the real MyProc->xmin could be vacuumed away. One known consequence is
in pg_subtrans lookups: we might try to look up the status of an XID
that was already truncated away.
Back-patch to all supported versions.
Reviewed-by: Andres Freund
Discussion: https://www.postgresql.org/message-id/d27a046d-a1e4-47d1-a95c-fbabe41debb4@iki.fi
RelationTruncate() does three things, while holding an
AccessExclusiveLock and preventing checkpoints:
1. Logs the truncation.
2. Drops buffers, even if they're dirty.
3. Truncates some number of files.
Step 2 could previously be canceled if it had to wait for I/O, and step
3 could and still can fail in file APIs. All orderings of these
operations have data corruption hazards if interrupted, so we can't give
up until the whole operation is done. When dirty pages were discarded
but the corresponding blocks were left on disk due to ERROR, old page
versions could come back from disk, reviving deleted data (see
pgsql-bugs #18146 and several like it). When primary and standby were
allowed to disagree on relation size, standbys could panic (see
pgsql-bugs #18426) or revive data unknown to visibility management on
the primary (theorized).
Changes:
* WAL is now unconditionally flushed first
* smgrtruncate() is now called in a critical section, preventing
interrupts and causing PANIC on file API failure
* smgrtruncate() has a new parameter for existing fork sizes,
because it can't call smgrnblocks() itself inside a critical section
The changes apply to RelationTruncate(), smgr_redo() and
pg_truncate_visibility_map(). That last is also brought up to date with
other evolutions of the truncation protocol.
The VACUUM FileTruncate() failure mode had been discussed in older
reports than the ones referenced below, with independent analysis from
many people, but earlier theories on how to fix it were too complicated
to back-patch. The more recently invented cancellation bug was
diagnosed by Alexander Lakhin. Other corruption scenarios were spotted
by me while iterating on this patch and earlier commit 75818b3a.
Back-patch to all supported releases.
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Robert Haas <robertmhaas@gmail.com>
Reported-by: rootcause000@gmail.com
Reported-by: Alexander Lakhin <exclusion@gmail.com>
Discussion: https://postgr.es/m/18146-04e908c662113ad5%40postgresql.org
Discussion: https://postgr.es/m/18426-2d18da6586f152d6%40postgresql.org
Consistently reset so->scanBehind at the beginning of nbtree array
advancement, even during sktrig_required=false calls (calls where array
advancement is triggered by an unsatisfied non-required array scan key).
Otherwise, it's possible for queries to fail to return all relevant
tuples to the scan given a low-order required scan key that was
previously deemed "satisfied" by a truncated high key attribute value.
This only happened at the point where a later non-required array scan
key needed to be "advanced" once on the next leaf page (that is, once
the right sibling of the truncated high key page was reached).
The underlying issue was that later code within _bt_advance_array_keys
assumed that the so->scanBehind flag must have been set using the
current page's high key (not the previous page's high key). Any later
successful recheck call to _bt_check_compare would therefore spuriously
be prevented from making _bt_advance_array_keys return true, based on
the faulty belief that the truncated attribute must be from the scan's
current tuple (i.e. the non-pivot tuple at the start of the next page).
_bt_advance_array_keys would return false for the tuple, ultimately
resulting in _bt_checkkeys failing to return a matching tuple.
Oversight in commit 5bf748b8, which enhanced nbtree ScalarArrayOp
execution.
Author: Peter Geoghegan <pg@bowt.ie>
Discussion: https://postgr.es/m/CAH2-WzkJKncfqyAUTeuB5GgRhT1vhsWO2q11dbZNqKmvjopP_g@mail.gmail.com
Backpatch: 17-, where commit 5bf748b8 first appears.
If the non-recursive part of a recursive CTE ended up using
TTSOpsBufferHeapTuple as the table slot type, then a duplicate value
could cause an Assert failure in CheckOpSlotCompatibility() when
checking the hash table for the duplicate value. The expected slot type
for the deform step was TTSOpsMinimalTuple so the Assert failed when the
TTSOpsBufferHeapTuple slot was used.
This is a long-standing bug which we likely didn't notice because it
seems much more likely that the non-recursive term would have required
projection and used a TTSOpsVirtual slot, which CheckOpSlotCompatibility
is ok with.
There doesn't seem to be any harm done here other than the Assert
failure. Both TTSOpsMinimalTuple and TTSOpsBufferHeapTuple slot types
require tuple deformation, so the EEOP_*_FETCHSOME ExprState step would
have properly existed in the ExprState.
The solution is to pass NULL for the ExecBuildGroupingEqual's 'lops'
parameter. This means the ExprState's EEOP_*_FETCHSOME step won't
expect a fixed slot type. This makes CheckOpSlotCompatibility() happy as
no checking is performed when the ExprEvalStep is not expecting a fixed
slot type.
Reported-by: Richard Guo
Reviewed-by: Tom Lane
Discussion: https://postgr.es/m/CAMbWs4-8U9q2LAtf8+ghV11zeUReA3AmrYkxzBEv0vKnDxwkKA@mail.gmail.com
Backpatch-through: 13, all supported versions
If a dshash table grows very large (e.g., the dshash table for
cumulative statistics when there are millions of tables), resizing
it may fail with an error like:
ERROR: invalid DSA memory alloc request size 1073741824
To fix, permit dshash resizing to allocate more than 1 GB by
providing the DSA_ALLOC_HUGE flag.
Reported-by: Andreas Scherbaum
Author: Matthias van de Meent
Reviewed-by: Cédric Villemain, Michael Paquier, Andres Freund
Discussion: https://postgr.es/m/80a12d59-0d5e-4c54-866c-e69cd6536471%40pgug.de
Backpatch-through: 13
Commit b437571714 allowed parallel builds for BRIN, but left behind
two comments claiming only btree indexes support parallel builds.
Reported by Egor Rogov, along with similar issues in SGML docs.
Backpatch to 17, where parallel builds for BRIN were introduced.
Reported-by: Egor Rogov
Backpatch-through: 17
Discussion: https://postgr.es/m/114e2d5d-125e-07d8-94aa-5ad175fb7443@postgrespro.ru
This reverts commit 562bee0fc1.
We received a report from the field about this change in behavior,
so it seems best to revert this commit and to add proper
multibyte-aware truncation as a follow-up exercise.
Fixes bug #18711.
Reported-by: Adam Rauch
Reviewed-by: Tom Lane, Bertrand Drouvot, Bruce Momjian, Thomas Munro
Discussion: https://postgr.es/m/18711-7503ee3e449d2c47%40postgresql.org
Backpatch-through: 17
Since commit 97550c0711, these failed with
"PANIC: proc_exit() called in child process" due to uninitialized or
stale MyProcPid. That was reachable if close() failed in
ClosePostmasterPorts() or setlocale(category, "C") failed, both
unlikely. Back-patch to v13 (all supported versions).
Discussion: https://postgr.es/m/20241208034614.45.nmisch@google.com
The GUC assign and check hooks used "assign_timezone_abbreviations",
which was incorrect.
Issue noticed while browsing this area of the code, introduced in
0a20ff54f5.
Reviewed-by: Tom Lane
Discussion: https://postgr.es/m/Z1eV6Y8yk77GZhZI@paquier.xyz
Backpatch-through: 16
Our parallel-mode code only works when we are executing a query
in full, so ExecutePlan must disable parallel mode when it is
asked to do partial execution. The previous logic for this
involved passing down a flag (variously named execute_once or
run_once) from callers of ExecutorRun or PortalRun. This is
overcomplicated, and unsurprisingly some of the callers didn't
get it right, since it requires keeping state that not all of
them have handy; not to mention that the requirements for it were
undocumented. That led to assertion failures in some corner
cases. The only state we really need for this is the existing
QueryDesc.already_executed flag, so let's just put all the
responsibility in ExecutePlan. (It could have been done in
ExecutorRun too, leading to a slightly shorter patch -- but if
there's ever more than one caller of ExecutePlan, it seems better
to have this logic in the subroutine than the callers.)
This makes those ExecutorRun/PortalRun parameters unnecessary.
In master it seems okay to just remove them, returning the
API for those functions to what it was before parallelism.
Such an API break is clearly not okay in stable branches,
but for them we can just leave the parameters in place after
documenting that they do nothing.
Per report from Yugo Nagata, who also reviewed and tested
this patch. Back-patch to all supported branches.
Discussion: https://postgr.es/m/20241206062549.710dc01cf91224809dd6c0e1@sraoss.co.jp
pgstat_write_statsfile() discards any entries marked as dropped from
being written to the stats file at shutdown, and also included an
assertion based on the same condition.
The intention of the assertion is to track that no pgstats entries
should be left around as terminating backends should drop any entries
they still hold references on before the stats file is written by the
checkpointer, and it not worth taking down the server in this case if
there is a bug making that possible.
Let's improve the comment of this area to document clearly what's
intended.
Based on a discussion with Bertrand Drouvot and Anton A. Melnikov.
Author: Bertrand Drouvot
Discussion: https://postgr.es/m/a13e8cdf-b97a-4ecb-8f42-aaa367974e29@postgrespro.ru
Backpatch-through: 15
818119afcc has introduced the "generation" concept in pgstats entries,
incremented a counter when a pgstats entry is reinitialized, but it did
not count on the fact that backends still holding local references to
such entries need to be refreshed if the cache age is outdated. The
previous logic only updated local references when an entry was dropped,
but it needs also to consider entries that are reinitialized.
This matters for replication slot stats (as well as custom pgstats kinds
in 18~), where concurrent drops and creates of a slot could cause
incorrect stats to be locally referenced. This would lead to an
assertion failure at shutdown when writing out the stats file, as the
backend holding an outdated local reference would not be able to drop
during its shutdown sequence the stats entry that should be dropped, as
the last process holding a reference to the stats entry. The
checkpointer was then complaining about such an entry late in the
shutdown sequence, after the shutdown checkpoint is finished with the
control file updated, causing the stats file to not be generated. In
non-assert builds, the entry would just be skipped with the stats file
written.
Note that only logical replication slots use statistics.
A test case based on TAP is added to test_decoding, where a persistent
connection peeking at a slot's data is kept with concurrent drops and
creates of the same slot. This is based on the isolation test case that
Anton has sent. As it requires a node shutdown with a check to make
sure that the stats file is written with this specific sequence of
events, TAP is used instead.
Reported-by: Anton A. Melnikov
Reviewed-by: Bertrand Drouvot
Discussion: https://postgr.es/m/56bf8ff9-dd8c-47b2-872a-748ede82af99@postgrespro.ru
Backpatch-through: 15
When short-circuiting WindowAgg node evaluation on the top-level
WindowAgg node using quals on monotonic window functions, because the
WindowAgg run condition can mean there's no need to evaluate subsequent
window function results in the same partition once the run condition
becomes false, it was possible that the executor would use stale results
from the previous invocation of the window function in some cases.
A fix for this was partially done by a5832722, but that commit only
fixed the issue for non-top-level WindowAgg nodes. I mistakenly thought
that the top-level WindowAgg didn't have this issue, but Jayesh's example
case clearly shows that's incorrect. At the time, I also thought that
this only affected 32-bit systems as all window functions which then
supported run conditions returned BIGINT, however, that's wrong as
ExecProject is still called and that could cause evaluation of any other
window function belonging to the same WindowAgg node, one of which may
return a byref type.
The only queries affected by this are WindowAggs with a "Run Condition"
which contains at least one window function with a byref result type,
such as lead() or lag() on a byref column. The window clause must also
contain a PARTITION BY clause (without a PARTITION BY, execution of the
WindowAgg stops immediately when the run condition becomes false and
there's no risk of using the stale results).
Reported-by: Jayesh Dehankar
Discussion: https://postgr.es/m/193261e2c4d.3dd3cd7c1842.871636075166132237@zohocorp.com
Backpatch-through: 15, where WindowAgg run conditions were added
Usually an entry in pg_amop or pg_amproc does not need a dependency on
its amoplefttype/amoprighttype/amproclefttype/amprocrighttype types,
because there is an indirect dependency via the argument types of its
referenced operator or procedure, or via the opclass it belongs to.
However, for some support procedures in some index AMs, the argument
types of the support procedure might not mention the column data type
at all. Also, the amop/amproc entry might be treated as "loose" in
the opfamily, in which case it lacks a dependency on any particular
opclass; or it might be a cross-type entry having a reference to a
datatype that is not its opclass' opcintype.
The upshot of all this is that there are cases where a datatype can
be dropped while leaving behind amop/amproc entries that mention it,
because there is no path in pg_depend showing that those entries
depend on that type. Such entries are harmless in normal activity,
because they won't get used, but they cause problems for maintenance
actions such as dropping the operator family. They also cause pg_dump
to produce bogus output. The previous commit put a band-aid on the
DROP OPERATOR FAMILY failure, but a real fix is needed.
To fix, add pg_depend entries showing that a pg_amop/pg_amproc entry
depends on its lefttype/righttype. To avoid bloating pg_depend too
much, skip this if the referenced operator or function has that type
as an input type. (I did not bother with considering the possible
indirect dependency via the opclass' opcintype; at least in the
reported case, that wouldn't help anyway.)
Probably, the reason this has escaped notice for so long is that
add-on datatypes and relevant opclasses/opfamilies are usually
packaged as extensions nowadays, so that there's no way to drop
a type without dropping the referencing opclasses/opfamilies too.
Still, in the absence of pg_depend entries there's nothing that
constrains DROP EXTENSION to drop the opfamily entries before the
datatype, so it seems possible for a DROP failure to occur anyway.
The specific case that was reported doesn't fail in v13, because
v13 prefers to attach the support procedure to the opclass not the
opfamily. But it's surely possible to construct other edge cases
that do fail in v13, so patch that too.
Per report from Yoran Heling. Back-patch to all supported branches.
Discussion: https://postgr.es/m/Z1MVCOh1hprjK5Sf@gmai021
Yoran Heling reported a case where a data type could be dropped
while references to its OID remain behind in pg_amproc. This
causes getObjectDescription to fail, which blocks dropping the
operator family (since our DROP code likes to construct descriptions
of everything it's dropping). The proper fix for this requires
adding more pg_depend entries. But to allow DROP to go through with
already-corrupt catalogs, tweak getObjectDescription to print "???"
for the type instead of failing when it processes such an entry.
I changed the logic for pg_amop similarly, for consistency,
although it is not known that the problem can manifest in pg_amop.
Per report from Yoran Heling. Back-patch to all supported
branches (although the problem may be unreachable in v13).
Discussion: https://postgr.es/m/Z1MVCOh1hprjK5Sf@gmai021
These format codes produce or consume strings of digits, so they
should be labeled with is_digit = true, but they were not.
This has effect in only one place, where is_next_separator()
is checked to see if the preceding format code should slurp up
all the available digits. Thus, with a format such as '...SSFF3'
with remaining input '12345', the 'SS' code would consume all
five digits (and then complain about seconds being out of range)
when it should eat only two digits.
Per report from Nick Davies. This bug goes back to d589f9446
where the FFn codes were introduced, so back-patch to v13.
Discussion: https://postgr.es/m/AM8PR08MB6356AC979252CFEA78B56678B6312@AM8PR08MB6356.eurprd08.prod.outlook.com
parallel_vacuum_reset_dead_items used a local variable to hold a
pointer from the passed vacrel, purely as a shorthand. This pointer
was later freed and a new allocation was made and stored to the
struct. Then the local pointer was mistakenly referenced again.
This apparently happened not to break anything since the freed chunk
would have been put on the context's freelist, so it was accidentally
the same pointer anyway, in which case the DSA handle was correctly
updated. The minimal fix is to change two places so they access
dead_items through the vacrel. This coding style is a maintenance
hazard, so while at it get rid of most other similar usages, which
were inconsistently used anyway.
Analysis and patch by Vallimaharajan G, with further defensive coding
by me
Backpath to v17, when TidStore came in
Discussion: https://postgr.es/m/1936493cc38.68cb2ef27266.7456585136086197135@zohocorp.com
The validate_sync_standby_slots subroutine requires an LWLock, so it
cannot run in processes without PGPROC; skip it there to avoid a crash.
This replaces the current test for ReplicationSlotCtl being not null,
which appears to be a solution for the same problem but less general.
I also rewrote a related comment that mentioned ReplicationSlotCtl in
StandbySlotsHaveCaughtup.
This code came in with commit bf279ddd1c28; backpatch to 17.
Reported-by: Gabriele Bartolini <gabriele.bartolini@enterprisedb.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Reviewed-by: Zhijie Hou <houzj.fnst@fujitsu.com>
Discussion: https://postgr.es/m/202411281216.sutbxtr6idnn@alvherre.pgsql
Previously, it set only DELAY_CHKPT_COMPLETE. That was important,
because it meant that if the XLOG_SMGR_TRUNCATE record preceded a
XLOG_CHECKPOINT_ONLINE record in the WAL, then the truncation would also
happen on disk before the XLOG_CHECKPOINT_ONLINE record was
written.
However, it didn't guarantee that the sync request for the truncation
was processed before the XLOG_CHECKPOINT_ONLINE record was written. By
setting DELAY_CHKPT_START, we guarantee that if an XLOG_SMGR_TRUNCATE
record is written to WAL before the redo pointer of a concurrent
checkpoint, the sync request queued by that operation must be processed
by that checkpoint, rather than being left for the following one.
This is a refinement of commit 412ad7a556. Back-patch to all supported
releases, like that commit.
Author: Robert Haas <robertmhaas@gmail.com>
Reported-by: Thomas Munro <thomas.munro@gmail.com>
Discussion: https://postgr.es/m/CA%2BhUKG%2B-2rjGZC2kwqr2NMLBcEBp4uf59QT1advbWYF_uc%2B0Aw%40mail.gmail.com
As I'd feared, commit 5c9d8636d was still a few bricks shy of a load.
We can't just leave pulled-up lateral-reference Vars with no new
nullingrels: we have to carefully compute what subset of the
to-be-replaced Var's nullingrels apply to them, else we still get
"wrong varnullingrels" errors. This is a bit tedious, but it looks
like we can use the nullingrel data this patch computes for other
purposes, enabling better optimization. We don't want to inject
unnecessary plan changes into stable branches though, so leave that
idea for a later HEAD-only patch.
Patch by me, but thanks to Richard Guo for devising a test case that
broke 5c9d8636d, and for preliminary investigation about how to fix
it. As before, back-patch to v16.
Discussion: https://postgr.es/m/E1tGn4j-0003zi-MP@gemulon.postgresql.org
If we are pulling up a subquery that's under an outer join, and
the subquery's target list contains a strict expression that uses
both a subquery variable and a lateral-reference variable, it's okay
to pull up the expression without wrapping it in a PlaceHolderVar.
That's safe because if the subquery variable is forced to NULL
by the outer join, the expression result will come out as NULL too,
so we don't have to force that outcome by evaluating the expression
below the outer join. It'd be correct to wrap in a PHV, but that can
lead to very significantly worse plans, since we'd then have to use
a nestloop plan to pass down the lateral reference to where the
expression will be evaluated.
However, when we do that, we should not mark the lateral reference
variable as being nulled by the outer join, because it isn't after
we pull up the expression in this way. So the marking logic added
by cb8e50a4a was incorrect in this detail, leading to "wrong
varnullingrels" errors from the consistency-checking logic in
setrefs.c. It seems to be sufficient to just not mark lateral
references at all in this case. (I have a nagging feeling that more
complexity may be needed in cases where there are several levels of
outer join, but some attempts to break it with that didn't succeed.)
Per report from Bertrand Mamasam. Back-patch to v16, as the previous
patch was.
Discussion: https://postgr.es/m/CACZ67_UA_EVrqiFXJu9XK50baEpH=ofEPJswa2kFxg6xuSw-ww@mail.gmail.com
We added pg_constraint rows for all not-null constraints, first for
tables and later for domains; but while the ones for tables were
reverted, the ones for domains were not. However, we did accidentally
revert ruleutils.c support for the ones on domains in 6f8bb7c1e9,
which breaks running pg_get_constraintdef() on them. Put that back.
This is only needed in branch 17, because we've reinstated this code in
branch master with commit 14e87ffa5c. Add some new tests in both
branches.
I couldn't find anything else that needs de-reverting.
Reported-by: Erki Eessaar <erki.eessaar@taltech.ee>
Reviewed-by: Magnus Hagander <magnus@hagander.net>
Discussion: https://postgr.es/m/AS8PR01MB75110350415AAB8BBABBA1ECFE222@AS8PR01MB7511.eurprd01.prod.exchangelabs.com
When using a pipeline, a transaction starts from the first command and
is committed with a Sync message or when the pipeline ends.
Functions like IsInTransactionBlock() or PreventInTransactionBlock()
were already able to understand a pipeline as being in a transaction
block, but it was not the case of CheckTransactionBlock(). This
function is called for example to generate a WARNING for SET LOCAL,
complaining that it is used outside of a transaction block.
The current state of the code caused multiple problems, like:
- SET LOCAL executed at any stage of a pipeline issued a WARNING, even
if the command was at least second in line where the pipeline is in a
transaction state.
- LOCK TABLE failed when invoked at any step of a pipeline, even if it
should be able to work within a transaction block.
The pipeline protocol assumes that the first command of a pipeline is
not part of a transaction block, and that any follow-up commands is
considered as within a transaction block.
This commit changes the backend so as an implicit transaction block is
started each time the first Execute message of a pipeline has finished
processing, with this implicit transaction block ended once a sync is
processed. The checks based on XACT_FLAGS_PIPELINING in the routines
checking if we are in a transaction block are not necessary: it is
enough to rely on the existing ones.
Some tests are added to pgbench, that can be backpatched down to v17
when \syncpipeline is involved and down to v14 where \startpipeline and
\endpipeline are available. This is unfortunately limited regarding the
error patterns that can be checked, but it provides coverage for various
pipeline combinations to check if these succeed or fail. These tests
are able to capture the case of SET LOCAL's WARNING. The author has
proposed a different feature to improve the coverage by adding similar
meta-commands to psql where error messages could be checked, something
more useful for the cases where commands cannot be used in transaction
blocks, like REINDEX CONCURRENTLY or VACUUM. This is considered as
future work for v18~.
Author: Anthonin Bonnefoy
Reviewed-by: Jelte Fennema-Nio, Michael Paquier
Discussion: https://postgr.es/m/CAO6_XqrWO8uNBQrSu5r6jh+vTGi5Oiyk4y8yXDORdE2jbzw8xw@mail.gmail.com
Backpatch-through: 13
Most came in during the 17 cycle, so backpatch there. Some
(particularly reorderbuffer.h) are very old, but backpatching doesn't
seem useful.
Like commits c9d2977519, c4f113e8fe.
If passed a read-write expanded object pointer, the EEOP_NULLIF
code would hand that same pointer to the equality function
and then (unless equality was reported) also return the same
pointer as its value. This is no good, because a function that
receives a read-write expanded object pointer is fully entitled
to scribble on or even delete the object, thus corrupting the
NULLIF output. (This problem is likely unobservable with the
equality functions provided in core Postgres, but it's easy to
demonstrate with one coded in plpgsql.)
To fix, make sure the pointer passed to the equality function
is read-only. We can still return the original read-write
pointer as the NULLIF result, allowing optimization of later
operations.
Per bug #18722 from Alexander Lakhin. This has been wrong
since we invented expanded objects, so back-patch to all
supported branches.
Discussion: https://postgr.es/m/18722-fd9e645448cc78b4@postgresql.org
This WARNING appeared because SearchSysCacheLocked1() read
cc_relisshared before catcache initialization, when the field is false
unconditionally. On the basis of reading false there, it constructed a
locktag as though pg_tablespace weren't relisshared. Only shared
catalogs could be affected, and only GRANT TABLESPACE was affected in
practice. SearchSysCacheLocked1() callers use one other shared-relation
syscache, DATABASEOID. DATABASEOID is initialized by the end of
CheckMyDatabase(), making the problem unreachable for pg_database.
Back-patch to v13 (all supported versions). This has no known impact
before v16, where ExecGrant_common() first appeared. Earlier branches
avoid trouble by having a separate ExecGrant_Tablespace() that doesn't
use LOCKTAG_TUPLE. However, leaving this unfixed in v15 could ensnare a
future back-patch of a SearchSysCacheLocked1() call.
Reported by Aya Iwata.
Discussion: https://postgr.es/m/OS7PR01MB11964507B5548245A7EE54E70EA212@OS7PR01MB11964.jpnprd01.prod.outlook.com
If a user started a bulk write operation on a fork with existing data
to append data in bulk, the bulk_write machinery would zero out all
previously written pages up to the last page written by the new
bulk_write operation.
This is not an issue for PostgreSQL itself, because we never use the
bulk_write facility on a non-empty fork. But there are use cases where
it makes sense. TimescaleDB extension is known to do that to merge
partitions, for example.
Backpatch to v17, where the bulk_write machinery was introduced.
Author: Matthias van de Meent <boekewurm+postgres@gmail.com>
Reported-By: Erik Nordström <erik@timescale.com>
Reviewed-by: Erik Nordström <erik@timescale.com>
Discussion: https://www.postgresql.org/message-id/CACAa4VJ%2BQY4pY7M0ECq29uGkrOygikYtao1UG9yCDFosxaps9g@mail.gmail.com
RelationSyncCache, the hash table in charge of tracking the relation
schemas sent through pgoutput, was forgetting to free the TupleDesc
associated to the two slots used to store the new and old tuples,
causing some memory to be leaked each time a relation is invalidated
when the slots of an existing relation entry are cleaned up.
This is rather hard to notice as the bloat is pretty minimal, but a
long-running WAL sender would be in trouble over time depending on the
workload. sysbench has proved to be pretty good at showing the problem,
coupled with some memory monitoring of the WAL sender.
Issue introduced in 52e4f0cd47, that has added row filters for tables
logically replicated.
Author: Boyu Yang
Reviewed-by: Michael Paquier, Hou Zhijie
Discussion: https://postgr.es/m/DM3PR84MB3442E14B340E553313B5C816E3252@DM3PR84MB3442.NAMPRD84.PROD.OUTLOOK.COM
Backpatch-through: 15
Ordinarily transformSetOperationTree will collect all UNION/
INTERSECT/EXCEPT steps into the setOperations tree of the topmost
Query, so that leaf queries do not contain any setOperations.
However, it cannot thus flatten a subquery that also contains
WITH, ORDER BY, FOR UPDATE, or LIMIT. I (tgl) forgot that in
commit 07b4c48b6 and wrote an assertion in rule deparsing that
a leaf's setOperations would always be empty.
If it were nonempty then we would want to parenthesize the subquery
to ensure that the output represents the setop nesting correctly
(e.g. UNION below INTERSECT had better get parenthesized). So
rather than just removing the faulty Assert, let's change it into
an additional case to check to decide whether to add parens. We
don't expect that the additional case will ever fire, but it's
cheap insurance.
Man Zeng and Tom Lane
Discussion: https://postgr.es/m/tencent_7ABF9B1F23B0C77606FC5FE3@qq.com
In the dim past we figured it was okay to ignore collations
when combining UNION set-operation nodes into a single N-way
UNION operation. I believe that was fine at the time, but
it stopped being fine when we added nondeterministic collations:
the semantics of distinct-ness are affected by those. v17 made
it even less fine by allowing per-child sorting operations to
be merged via MergeAppend, although I think we accidentally
avoided any live bug from that.
Add a check that collations match before deciding that two
UNION nodes are equivalent. I also failed to resist the
temptation to comment plan_union_children() a little better.
Back-patch to all supported branches (v13 now), since they
all have nondeterministic collations.
Discussion: https://postgr.es/m/3605568.1731970579@sss.pgh.pa.us
After commit 5a2fed911a, the catalog state
resulting from these commands ceased to affect sessions. Restore the
longstanding behavior, which is like beginning the session with a SET
ROLE command. If cherry-picking the CVE-2024-10978 fixes, default to
including this, too. (This fixes an unintended side effect of fixing
CVE-2024-10978.) Back-patch to v12, like that commit. The release team
decided to include v12, despite the original intent to halt v12 commits
earlier this week.
Tom Lane and Noah Misch. Reported by Etienne LAFARGE.
Discussion: https://postgr.es/m/CADOZwSb0UsEr4_UTFXC5k7=fyyK8uKXekucd+-uuGjJsGBfxgw@mail.gmail.com
Previously LogicalIncreaseRestartDecodingForSlot() accidentally
accepted any LSN as the candidate_lsn and candidate_valid after the
restart_lsn of the replication slot was updated, so it potentially
caused the restart_lsn to move backwards.
A scenario where this could happen in logical replication is: after a
logical replication restart, based on previous candidate_lsn and
candidate_valid values in memory, the restart_lsn advances upon
receiving a subscriber acknowledgment. Then, logical decoding restarts
from an older point, setting candidate_lsn and candidate_valid based
on an old RUNNING_XACTS record. Subsequent subscriber acknowledgments
then update the restart_lsn to an LSN older than the current value.
In the reported case, after WAL files were removed by a checkpoint,
the retreated restart_lsn prevented logical replication from
restarting due to missing WAL segments.
This change essentially modifies the 'if' condition to 'else if'
condition within the function. The previous code had an asymmetry in
this regard compared to LogicalIncreaseXminForSlot(), which does
almost the same thing for different fields.
The WAL removal issue was reported by Hubert Depesz Lubaczewski.
Backpatch to all supported versions, since the bug exists since 9.4
where logical decoding was introduced.
Reviewed-by: Tomas Vondra, Ashutosh Bapat, Amit Kapila
Discussion: https://postgr.es/m/Yz2hivgyjS1RfMKs%40depesz.com
Discussion: https://postgr.es/m/85fff40e-148b-4e86-b921-b4b846289132%40vondra.me
Backpatch-through: 13
In commit 08c0d6ad6 which introduced "rainbow" arcs in regex NFAs,
I didn't think terribly hard about what to do when creating the color
complement of a rainbow arc. Clearly, the complement cannot match any
characters, and I took the easy way out by just not building any arcs
at all in the complement arc set. That mostly works, but Nikolay
Shaplov found a case where it doesn't: if we decide to delete that
sub-NFA later because it's inside a "{0}" quantifier, delsub()
suffered an assertion failure. That's because delsub() relies on
the target sub-NFA being fully connected. That was always true
before, and the best fix seems to be to restore that property.
Hence, invent a new arc type CANTMATCH that can be generated in
place of an empty color complement, and drop it again later when we
start NFA optimization. (At that point we don't need to do delsub()
any more, and besides there are other cases where NFA optimization can
lead to disconnected subgraphs.)
It appears that this bug has no consequences in a non-assert-enabled
build: there will be some transiently leaked NFA states/arcs, but
they'll get cleaned up eventually. Still, we don't like assertion
failures, so back-patch to v14 where rainbow arcs were introduced.
Per bug #18708 from Nikolay Shaplov.
Discussion: https://postgr.es/m/18708-f94f2599c9d2c005@postgresql.org
This fixes a set of race conditions with cumulative statistics where a
shared stats entry could be dropped while it should still be valid in
the event when it is reused: an entry may refer to a different object
but requires the same hash key. This can happen with various stats
kinds, like:
- Replication slots that compute internally an index number, for
different slot names.
- Stats kinds that use an OID in the object key, where a wraparound
causes the same key to be used if an OID is used for the same object.
- As of PostgreSQL 18, custom pgstats kinds could also be an issue,
depending on their implementation.
This issue is fixed by introducing a counter called "generation" in the
shared entries via PgStatShared_HashEntry, initialized at 0 when an
entry is created and incremented when the same entry is reused, to avoid
concurrent issues on drop because of other backends still holding a
reference to it. This "generation" is copied to the local copy that a
backend holds when looking at an object, then cross-checked with the
shared entry to make sure that the entry is not dropped even if its
"refcount" justifies that if it has been reused.
This problem could show up when a backend shuts down and needs to
discard any entries it still holds, causing statistics to be removed
when they should not, or even an assertion failure. Another report
involved a failure in a standby after an OID wraparound, where the
startup process would FATAL on a "can only drop stats once", stopping
recovery abruptly. The buildfarm has been sporadically complaining
about the problem, as well, but the window is hard to reach with the
in-core tests.
Note that the issue can be reproduced easily by adding a sleep before
dshash_find() in pgstat_release_entry_ref() to enlarge the problematic
window while repeating test_decoding's isolation test oldest_xmin a
couple of times, for example, as pointed out by Alexander Lakhin.
Reported-by: Alexander Lakhin, Peter Smith
Author: Kyotaro Horiguchi, Michael Paquier
Reviewed-by: Bertrand Drouvot
Discussion: https://postgr.es/m/CAA4eK1KxuMVyAryz_Vk5yq3ejgKYcL6F45Hj9ZnMNBS-g+PuZg@mail.gmail.com
Discussion: https://postgr.es/m/17947-b9554521ad963c9c@postgresql.org
Backpatch-through: 15
InjectionPointEntry->name was described as a hash key, which was fine
when introduced in d86d20f0ba, but it is not now.
Oversight in 86db52a506, that has changed the way injection points are
stored in shared memory from a hash table to an array.
Backpatch-through: 17
The current code calls array_eq() and does not provide FmgrInfo. This commit
provides initialization of FmgrInfo and uses C collation as the safe option
for text comparison because we don't know anything about the semantics of
opclass options.
Backpatch to 13, where opclass options were introduced.
Reported-by: Nicolas Maus
Discussion: https://postgr.es/m/18692-72ea398df3ec6712%40postgresql.org
Backpatch-through: 13
Commit 5a2fed911 had an unexpected side-effect: the parallel worker
launched for the new test case would fail if it couldn't use a
superuser-reserved connection slot. The reason that test failed
while all our pre-existing ones worked is that the connection
privilege tests in InitPostgres had been based on the superuserness
of the leader's AuthenticatedUserId, but after the rearrangements
of 5a2fed911 we were testing the superuserness of CurrentUserId,
which the new test case deliberately made to be a non-superuser.
This all seems very accidental and probably not the behavior we really
want, but a security patch is no time to be redesigning things.
Pending some discussion about desirable semantics, hack it so that
InitPostgres continues to pay attention to the superuserness of
AuthenticatedUserId when starting a parallel worker.
Nathan Bossart and Tom Lane, per buildfarm member sawshark.
Security: CVE-2024-10978
The SQL spec mandates that SET SESSION AUTHORIZATION implies
SET ROLE NONE. We tried to implement that within the lowest-level
functions that manipulate these settings, but that was a bad idea.
In particular, guc.c assumes that it doesn't matter in what order
it applies GUC variable updates, but that was not the case for these
two variables. This problem, compounded by some hackish attempts to
work around it, led to some security-grade issues:
* Rolling back a transaction that had done SET SESSION AUTHORIZATION
would revert to SET ROLE NONE, even if that had not been the previous
state, so that the effective user ID might now be different from what
it had been.
* The same for SET SESSION AUTHORIZATION in a function SET clause.
* If a parallel worker inspected current_setting('role'), it saw
"none" even when it should see something else.
Also, although the parallel worker startup code intended to cope
with the current role's pg_authid row having disappeared, its
implementation of that was incomplete so it would still fail.
Fix by fully separating the miscinit.c functions that assign
session_authorization from those that assign role. To implement the
spec's requirement, teach set_config_option itself to perform "SET
ROLE NONE" when it sets session_authorization. (This is undoubtedly
ugly, but the alternatives seem worse. In particular, there's no way
to do it within assign_session_authorization without incompatible
changes in the API for GUC assign hooks.) Also, improve
ParallelWorkerMain to directly set all the relevant user-ID variables
instead of relying on some of them to get set indirectly. That
allows us to survive not finding the pg_authid row during worker
startup.
In v16 and earlier, this includes back-patching 9987a7bf3 which
fixed a violation of GUC coding rules: SetSessionAuthorization
is not an appropriate place to be throwing errors from.
Security: CVE-2024-10978
If a CTE, subquery, sublink, security invoker view, or coercion
projection references a table with row-level security policies, we
neglected to mark the plan as potentially dependent on which role
is executing it. This could lead to later executions in the same
session returning or hiding rows that should have been hidden or
returned instead.
Reported-by: Wolfgang Walther
Reviewed-by: Noah Misch
Security: CVE-2024-10976
Backpatch-through: 12
Commit ac04aa84a put the shutoff for this into the planner, which is
not ideal because it doesn't prevent us from re-using a previously
made parallel plan. Revert the planner change and instead put the
shutoff into InitializeParallelDSM, modeling it on the existing code
there for recovering from failure to allocate a DSM segment.
However, that code path is mostly untested, and testing a bit harder
showed there's at least one bug: ExecHashJoinReInitializeDSM is not
prepared for us to have skipped doing parallel DSM setup. I also
thought the Assert in ReinitializeParallelWorkers is pretty
ill-advised, and replaced it with a silent Min() operation.
The existing test case added by ac04aa84a serves fine to test this
version of the fix, so no change needed there.
Patch by me, but thanks to Noah Misch for the core idea that we
could shut off worker creation when !INTERRUPTS_CAN_BE_PROCESSED.
Back-patch to v12, as ac04aa84a was.
Discussion: https://postgr.es/m/CAC-SaSzHUKT=vZJ8MPxYdC_URPfax+yoA1hKTcF4ROz_Q6z0_Q@mail.gmail.com
If the collation of any join key column doesn’t match the collation of
the corresponding partition key, partitionwise joins can yield incorrect
results. For example, rows that would match under the join key collation
might be located in different partitions due to the partitioning
collation. In such cases, a partitionwise join would yield different
results from a non-partitionwise join, so disallow it in such cases.
Reported-by: Tender Wang <tndrwang@gmail.com>
Author: Jian He <jian.universality@gmail.com>
Reviewed-by: Tender Wang <tndrwang@gmail.com>
Reviewed-by: Junwang Zhao <zhjwpku@gmail.com>
Discussion: https://postgr.es/m/CAHewXNno_HKiQ6PqyLYfuqDtwp7KKHZiH1J7Pqyz0nr+PS2Dwg@mail.gmail.com
Backpatch-through: 12
If the collation of any grouping column doesn’t match the collation of
the corresponding partition key, partitionwise grouping can yield
incorrect results. For example, rows that would be grouped under the
grouping collation may end up in different partitions under the
partitioning collation. In such cases, full partitionwise grouping
would produce results that differ from those without partitionwise
grouping, so disallowed that.
Partial partitionwise aggregation is still allowed, as the Finalize
step reconciles partition-level aggregates with grouping requirements
across all partitions, ensuring that the final output remains
consistent.
This commit also fixes group_by_has_partkey() by ensuring the
RelabelType node is stripped from grouping expressions when matching
them to partition key expressions to avoid false mismatches.
Bug: #18568
Reported-by: Webbo Han <1105066510@qq.com>
Author: Webbo Han <1105066510@qq.com>
Reviewed-by: Tender Wang <tndrwang@gmail.com>
Reviewed-by: Aleksander Alekseev <aleksander@timescale.com>
Reviewed-by: Jian He <jian.universality@gmail.com>
Discussion: https://postgr.es/m/18568-2a9afb6b9f7e6ed3@postgresql.org
Discussion: https://postgr.es/m/tencent_9D9103CDA420C07768349CC1DFF88465F90A@qq.com
Discussion: https://postgr.es/m/CAHewXNno_HKiQ6PqyLYfuqDtwp7KKHZiH1J7Pqyz0nr+PS2Dwg@mail.gmail.com
Backpatch-through: 12
When we generate multiple clones of the same qual condition to cope
with outer join identity 3, we need to ensure that all the clones get
the same serial number. To achieve this, we reset the
root->last_rinfo_serial counter each time we produce RestrictInfo(s)
from the qual list (see deconstruct_distribute_oj_quals). This
approach works only if we ensure that we are not changing the qual
list in any way that'd affect the number of RestrictInfos built from
it.
However, with b262ad440, an IS NULL qual on a NOT NULL column might
result in an additional constant-FALSE RestrictInfo. And different
versions of the same qual clause can lead to different conclusions
about whether it can be reduced to constant-FALSE. This would affect
the number of RestrictInfos built from the qual list for different
versions, causing inconsistent RestrictInfo serial numbers across
multiple clones of the same qual. This inconsistency can confuse
users of these serial numbers, such as rebuild_joinclause_attr_needed,
and lead to planner errors such as "ERROR: variable not found in
subplan target lists".
To fix, reset the root->last_rinfo_serial counter after generating the
additional constant-FALSE RestrictInfo.
Back-patch to v17 where the issue crept in. In v17, I failed to make
a test case that would expose this bug, so no test case for v17.
Author: Richard Guo
Discussion: https://postgr.es/m/CAMbWs4-B6kafn+LmPuh-TYFwFyEm-vVj3Qqv7Yo-69CEv14rRg@mail.gmail.com