When dealing with ResultRelInfos for partitions, there are cases where
there are mixed requirements for the ri_RootResultRelInfo. There are
cases when the partition itself requires a NULL ri_RootResultRelInfo and
in the same query, the same partition may require a ResultRelInfo with
its parent set in ri_RootResultRelInfo. This could cause the column
mapping between the partitioned table and the partition not to be done
which could result in crashes if the column attnums didn't match
exactly.
The fix is simple. We now check that the ri_RootResultRelInfo matches
what the caller passed to ExecGetTriggerResultRel() and only return a
cached ResultRelInfo when the ri_RootResultRelInfo matches what the
caller wants, otherwise we'll make a new one.
Author: David Rowley <dgrowleyml@gmail.com>
Author: Amit Langote <amitlangote09@gmail.com>
Reported-by: Dmitry Fomin <fomin.list@gmail.com>
Discussion: https://postgr.es/m/7DCE78D7-0520-4207-822B-92F60AEA14B4@gmail.com
Backpatch-through: 15
If inside an EPQ recheck, ExecScanFetch would run the recheck method
function for foreign/custom joins even if they aren't descendant nodes
in the EPQ recheck plan tree, which is problematic at least in the
foreign-join case, because such a foreign join isn't guaranteed to have
an alternative local-join plan required for running the recheck method
function; in the postgres_fdw case this could lead to a segmentation
fault or an assert failure in an assert-enabled build when running the
recheck method function.
Even if inside an EPQ recheck, any scan nodes that aren't descendant
ones in the EPQ recheck plan tree should be normally processed by using
the access method function; fix by modifying ExecScanFetch so that if
inside an EPQ recheck, it runs the recheck method function for
foreign/custom joins that are descendant nodes in the EPQ recheck plan
tree as before and runs the access method function for foreign/custom
joins that aren't.
This fix also adds to postgres_fdw an isolation test for an EPQ recheck
that caused issues stated above.
Oversight in commit 385f337c9.
Reported-by: Kristian Lejao <kristianlejao@gmail.com>
Author: Masahiko Sawada <sawada.mshk@gmail.com>
Co-authored-by: Etsuro Fujita <etsuro.fujita@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Etsuro Fujita <etsuro.fujita@gmail.com>
Discussion: https://postgr.es/m/CAD21AoBpo6Gx55FBOW+9s5X=nUw3Xpq64v35fpDEKsTERnc4TQ@mail.gmail.com
Backpatch-through: 13
The EvalPlanQual recheck for TID Range Scan wasn't rechecking the TID qual
still passed after following update chains. This could result in tuples
being updated or deleted by plans using TID Range Scans where the ctid of
the new (updated) tuple no longer matches the clause of the scan. This
isn't desired behavior, and isn't consistent with what would happen if the
chosen plan had used an Index or Seq Scan, and that could lead to hard to
predict behavior for scans that contain TID quals and other quals as the
planner has freedom to choose TID Range or some other non-TID scan method
for such queries, and the chosen plan could change at any moment.
Here we fix this by properly implementing the recheck function for TID
Range Scans.
Backpatch to 14, where TID Range Scans were added
Reported-by: Sophie Alpert <pg@sophiebits.com>
Author: Sophie Alpert <pg@sophiebits.com>
Author: David Rowley <dgrowleyml@gmail.com>
Reviewed-by: David Rowley <dgrowleyml@gmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/4a6268ff-3340-453a-9bf5-c98d51a6f729@app.fastmail.com
Backpatch-through: 14
The EvalPlanQual recheck for TID Scan wasn't rechecking the TID qual
still passed after following update chains. This could result in tuples
being updated or deleted by plans using TID Scans where the ctid of the
new (updated) tuple no longer matches the clause of the scan. This isn't
desired behavior, and isn't consistent with what would happen if the
chosen plan had used an Index or Seq Scan, and that could lead to hard to
predict behavior for scans that contain TID quals and other quals as the
planner has freedom to choose TID or some other scan method for such
queries, and the chosen plan could change at any moment.
Here we fix this by properly implementing the recheck function for TID
Scans.
Backpatch to 13, oldest supported version
Reported-by: Sophie Alpert <pg@sophiebits.com>
Author: Sophie Alpert <pg@sophiebits.com>
Author: David Rowley <dgrowleyml@gmail.com>
Reviewed-by: David Rowley <dgrowleyml@gmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/4a6268ff-3340-453a-9bf5-c98d51a6f729@app.fastmail.com
Backpatch-through: 13
If the hash functions used for hashing tuples leaked any memory,
we failed to clean that up, resulting in query-lifespan memory
leakage in queries using hashed subplans. One way that could
happen is if the values being hashed require de-toasting, since
most of our hash functions don't trouble to clean up de-toasted
inputs.
Prior to commit bf6c614a2, this leakage was largely masked
because TupleHashTableMatch would reset hashtable->tempcxt
(via execTuplesMatch). But it doesn't do that anymore, and
that's not really the right place for this anyway: doing it
there could reset the tempcxt many times per hash lookup,
or not at all. Instead put reset calls into ExecHashSubPlan
and buildSubPlanHash. Along the way to that, rearrange
ExecHashSubPlan so that there's just one place to call
MemoryContextReset instead of several.
This amounts to accepting the de-facto API spec that the caller
of the TupleHashTable routines is responsible for resetting the
tempcxt adequately often. Although the other callers seem to
get this right, it was not documented anywhere, so add a comment
about it.
Bug: #19040
Reported-by: Haiyang Li <mohen.lhy@alibaba-inc.com>
Author: Haiyang Li <mohen.lhy@alibaba-inc.com>
Reviewed-by: Fei Changhong <feichanghong@qq.com>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/19040-c9b6073ef814f48c@postgresql.org
Backpatch-through: 13
When executing a MERGE UPDATE action, if there is more than one
concurrent update of the target row, the lock-and-retry code would
sometimes incorrectly identify the latest version of the target tuple,
leading to incorrect results.
This was caused by using the ctid field from the TM_FailureData
returned by table_tuple_lock() in a case where the result was TM_Ok,
which is unsafe because the TM_FailureData struct is not guaranteed to
be fully populated in that case. Instead, it should use the tupleid
passed to (and updated by) table_tuple_lock().
To reduce the chances of similar errors in the future, improve the
commentary for table_tuple_lock() and TM_FailureData to make it
clearer that table_tuple_lock() updates the tid passed to it, and most
fields of TM_FailureData should not be relied on in non-failure cases.
An exception to this is the "traversed" field, which is set in both
success and failure cases.
Reported-by: Dmitry <dsy.075@yandex.ru>
Author: Yugo Nagata <nagata@sraoss.co.jp>
Reviewed-by: Dean Rasheed <dean.a.rasheed@gmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/1570d30e-2b95-4239-b9c3-f7bf2f2f8556@yandex.ru
Backpatch-through: 15
Per buildfarm member wrasse, void function cannot return a value.
This only affects v13-v17, where an ABI-compatible wrapper function
was added.
Backpatch-through: 13-17
If an INSERT has an ON CONFLICT DO UPDATE clause, the executor must
check that the target relation supports UPDATE as well as INSERT. In
particular, it must check that the target relation has a REPLICA
IDENTITY if it publishes updates. Formerly, it was not doing this
check, which could lead to silently breaking replication.
Fix by adding such a check to CheckValidResultRel(), which requires
adding a new onConflictAction argument. In back-branches, preserve ABI
compatibility by introducing a wrapper function with the original
signature.
Author: Zhijie Hou <houzj.fnst@fujitsu.com>
Reviewed-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
Reviewed-by: Dean Rasheed <dean.a.rasheed@gmail.com>
Tested-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/OS3PR01MB57180C87E43A679A730482DF94B62@OS3PR01MB5718.jpnprd01.prod.outlook.com
Backpatch-through: 13
Commit e2d4ef8de8 (the fix for CVE-2017-7484) added security checks
to the selectivity estimation functions to prevent them from running
user-supplied operators on data obtained from pg_statistic if the user
lacks privileges to select from the underlying table. In cases
involving inheritance/partitioning, those checks were originally
performed against the child RTE (which for plain inheritance might
actually refer to the parent table). Commit 553d2ec271 then extended
that to also check the parent RTE, allowing access if the user had
permissions on either the parent or the child. It turns out, however,
that doing any checks using the child RTE is incorrect, since
securityQuals is set to NULL when creating an RTE for an inheritance
child (whether it refers to the parent table or the child table), and
therefore such checks do not correctly account for any RLS policies or
security barrier views. Therefore, do the security checks using only
the parent RTE. This is consistent with how RLS policies are applied,
and the executor's ACL checks, both of which use only the parent
table's permissions/policies. Similar checks are performed in the
extended stats code, so update that in the same way, centralizing all
the checks in a new function.
In addition, note that these checks by themselves are insufficient to
ensure that the user has access to the table's data because, in a
query that goes via a view, they only check that the view owner has
permissions on the underlying table, not that the current user has
permissions on the view itself. In the selectivity estimation
functions, there is no easy way to navigate from underlying tables to
views, so add permissions checks for all views mentioned in the query
to the planner startup code. If the user lacks permissions on a view,
a permissions error will now be reported at planner-startup, and the
selectivity estimation functions will not be run.
Checking view permissions at planner-startup in this way is a little
ugly, since the same checks will be repeated at executor-startup.
Longer-term, it might be better to move all the permissions checks
from the executor to the planner so that permissions errors can be
reported sooner, instead of creating a plan that won't ever be run.
However, such a change seems too far-reaching to be back-patched.
Back-patch to all supported versions. In v13, there is the added
complication that UPDATEs and DELETEs on inherited target tables are
planned using inheritance_planner(), which plans each inheritance
child table separately, so that the selectivity estimation functions
do not know that they are dealing with a child table accessed via its
parent. Handle that by checking access permissions on the top parent
table at planner-startup, in the same way as we do for views. Any
securityQuals on the top parent table are moved down to the child
tables by inheritance_planner(), so they continue to be checked by the
selectivity estimation functions.
Author: Dean Rasheed <dean.a.rasheed@gmail.com>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Noah Misch <noah@leadboat.com>
Backpatch-through: 13
Security: CVE-2025-8713
When a MERGE's target table is the parent of an inheritance tree, any
INSERT actions insert into the parent table using ModifyTableState's
rootResultRelInfo. However, there are two bugs in the way this is
initialized:
1. ExecInitMerge() incorrectly uses a different ResultRelInfo entry
from ModifyTableState's resultRelInfo array to build the insert
projection, which may not be compatible with rootResultRelInfo.
2. ExecInitModifyTable() does not fully initialize rootResultRelInfo.
Specifically, ri_WithCheckOptions, ri_WithCheckOptionExprs,
ri_returningList, and ri_projectReturning are not initialized.
This can lead to crashes, or incorrect query results due to failing to
check WCO's or process the RETURNING list for INSERT actions.
Fix both these bugs in ExecInitMerge(), noting that it is only
necessary to fully initialize rootResultRelInfo if the MERGE has
INSERT actions and the target table is a plain inheritance parent.
Backpatch to v15, where MERGE was introduced.
Reported-by: Andres Freund <andres@anarazel.de>
Author: Dean Rasheed <dean.a.rasheed@gmail.com>
Reviewed-by: Jian He <jian.universality@gmail.com>
Reviewed-by: Tender Wang <tndrwang@gmail.com>
Discussion: https://postgr.es/m/4rlmjfniiyffp6b3kv4pfy4jw3pciy6mq72rdgnedsnbsx7qe5@j5hlpiwdguvc
Backpatch-through: 15
If a GENERATED column is declared to have a domain data type where
the domain's constraints disallow null values, INSERT commands failed
because we built a targetlist that included coercing a null constant
to the domain's type. The failure occurred even when the generated
value would have been perfectly OK. This is adjacent to the issues
fixed in 0da39aa76, but we didn't notice for lack of testing a domain
with such a constraint.
We aren't going to use the result of the targetlist entry for the
generated column --- ExecComputeStoredGenerated will overwrite it.
So it's not really necessary that it have the exact datatype of
the generated column. This patch fixes the problem by changing
the targetlist entry to be a null Const of the domain's base type,
which should be sufficiently legal. (We do have to tweak
ExecCheckPlanOutput to accept the situation, though.)
This has been broken since we implemented generated columns.
However, this patch only applies easily as far back as v14, partly
because I (tgl) only carried 0da39aa76 back that far, but mostly
because v14 significantly refactored the handling of INSERT/UPDATE
targetlists. Given the lack of field complaints and the short
remaining support lifetime of v13, I judge the cost-benefit ratio
not good for devising a version that would work in v13.
Reported-by: jian he <jian.universality@gmail.com>
Author: jian he <jian.universality@gmail.com>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/CACJufxG59tip2+9h=rEv-ykOFjt0cbsPVchhi0RTij8bABBA0Q@mail.gmail.com
Backpatch-through: 14
The optimization does not take the removal of TIDs by a concurrent vacuum into
account. The concurrent vacuum can remove dead TIDs and make pages ALL_VISIBLE
while those dead TIDs are referenced in the bitmap. This can lead to a
skip_fetch scan returning too many tuples.
It likely would be possible to implement this optimization safely, but we
don't have the necessary infrastructure in place. Nor is it clear that it's
worth building that infrastructure, given how limited the skip_fetch
optimization is.
In the backbranches we just disable the optimization by always passing
need_tuples=true to table_beginscan_bm(). We can't perform API/ABI changes in
the backbranches and we want to make the change as minimal as possible.
Author: Matthias van de Meent <boekewurm+postgres@gmail.com>
Reported-By: Konstantin Knizhnik <knizhnik@garret.ru>
Discussion: https://postgr.es/m/CAEze2Wg3gXXZTr6_rwC+s4-o2ZVFB5F985uUSgJTsECx6AmGcQ@mail.gmail.com
Backpatch-through: 13
ExecInitPartitionInfo() duplicates much of the logic in
ExecInitMerge(), except that it failed to handle DO NOTHING
actions. This would cause an "unknown action in MERGE WHEN clause"
error if a MERGE with any DO NOTHING actions attempted to insert into
a partition not already initialised by ExecInitModifyTable().
Bug: #18871
Reported-by: Alexander Lakhin <exclusion@gmail.com>
Author: Tender Wang <tndrwang@gmail.com>
Reviewed-by: Gurjeet Singh <gurjeet@singh.im>
Discussion: https://postgr.es/m/18871-b44e3c96de3bd2e8%40postgresql.org
Backpatch-through: 15
We did not wake up on interrupts while waiting on async events on an
async-capable append node. For example, if you tried to cancel the
query, nothing would happen until one of the async subplans becomes
readable. To fix, add WL_LATCH_SET to the WaitEventSet.
Backpatch down to v14 where async Append execution was introduced.
Discussion: https://www.postgresql.org/message-id/37a40570-f558-40d3-b5ea-5c2079b3b30b@iki.fi
The function calls GetLatestSnapshot() to acquire a fresh snapshot,
makes it active, and was meant to pass it to table_tuple_lock(), but
instead called GetLatestSnapshot() again to acquire yet another
snapshot. It was harmless because the heap AM and all other known
table AMs ignore the 'snapshot' argument anyway, but let's be tidy.
In the long run, this perhaps should be redesigned so that snapshot
was not needed in the first place. The table AM API uses TID +
snapshot as the unique identifier for the row version, which is
questionable when the row came from an index scan with a Dirty
snapshot. You might lock a different row version when you use a
different snapshot in the table_tuple_lock() call (a fresh MVCC
snapshot) than in the index scan (DirtySnapshot). However, in the heap
AM and other AMs where the TID alone identifies the row version, it
doesn't matter. So for now, just fix the obvious albeit harmless bug.
This has been wrong ever since the table AM API was introduced in
commit 5db6df0c01, so backpatch to all supported versions.
Discussion: https://www.postgresql.org/message-id/83d243d6-ad8d-4307-8b51-2ee5844f6230@iki.fi
Backpatch-through: 13
If the non-recursive part of a recursive CTE ended up using
TTSOpsBufferHeapTuple as the table slot type, then a duplicate value
could cause an Assert failure in CheckOpSlotCompatibility() when
checking the hash table for the duplicate value. The expected slot type
for the deform step was TTSOpsMinimalTuple so the Assert failed when the
TTSOpsBufferHeapTuple slot was used.
This is a long-standing bug which we likely didn't notice because it
seems much more likely that the non-recursive term would have required
projection and used a TTSOpsVirtual slot, which CheckOpSlotCompatibility
is ok with.
There doesn't seem to be any harm done here other than the Assert
failure. Both TTSOpsMinimalTuple and TTSOpsBufferHeapTuple slot types
require tuple deformation, so the EEOP_*_FETCHSOME ExprState step would
have properly existed in the ExprState.
The solution is to pass NULL for the ExecBuildGroupingEqual's 'lops'
parameter. This means the ExprState's EEOP_*_FETCHSOME step won't
expect a fixed slot type. This makes CheckOpSlotCompatibility() happy as
no checking is performed when the ExprEvalStep is not expecting a fixed
slot type.
Reported-by: Richard Guo
Reviewed-by: Tom Lane
Discussion: https://postgr.es/m/CAMbWs4-8U9q2LAtf8+ghV11zeUReA3AmrYkxzBEv0vKnDxwkKA@mail.gmail.com
Backpatch-through: 13, all supported versions
Our parallel-mode code only works when we are executing a query
in full, so ExecutePlan must disable parallel mode when it is
asked to do partial execution. The previous logic for this
involved passing down a flag (variously named execute_once or
run_once) from callers of ExecutorRun or PortalRun. This is
overcomplicated, and unsurprisingly some of the callers didn't
get it right, since it requires keeping state that not all of
them have handy; not to mention that the requirements for it were
undocumented. That led to assertion failures in some corner
cases. The only state we really need for this is the existing
QueryDesc.already_executed flag, so let's just put all the
responsibility in ExecutePlan. (It could have been done in
ExecutorRun too, leading to a slightly shorter patch -- but if
there's ever more than one caller of ExecutePlan, it seems better
to have this logic in the subroutine than the callers.)
This makes those ExecutorRun/PortalRun parameters unnecessary.
In master it seems okay to just remove them, returning the
API for those functions to what it was before parallelism.
Such an API break is clearly not okay in stable branches,
but for them we can just leave the parameters in place after
documenting that they do nothing.
Per report from Yugo Nagata, who also reviewed and tested
this patch. Back-patch to all supported branches.
Discussion: https://postgr.es/m/20241206062549.710dc01cf91224809dd6c0e1@sraoss.co.jp
When short-circuiting WindowAgg node evaluation on the top-level
WindowAgg node using quals on monotonic window functions, because the
WindowAgg run condition can mean there's no need to evaluate subsequent
window function results in the same partition once the run condition
becomes false, it was possible that the executor would use stale results
from the previous invocation of the window function in some cases.
A fix for this was partially done by a5832722, but that commit only
fixed the issue for non-top-level WindowAgg nodes. I mistakenly thought
that the top-level WindowAgg didn't have this issue, but Jayesh's example
case clearly shows that's incorrect. At the time, I also thought that
this only affected 32-bit systems as all window functions which then
supported run conditions returned BIGINT, however, that's wrong as
ExecProject is still called and that could cause evaluation of any other
window function belonging to the same WindowAgg node, one of which may
return a byref type.
The only queries affected by this are WindowAggs with a "Run Condition"
which contains at least one window function with a byref result type,
such as lead() or lag() on a byref column. The window clause must also
contain a PARTITION BY clause (without a PARTITION BY, execution of the
WindowAgg stops immediately when the run condition becomes false and
there's no risk of using the stale results).
Reported-by: Jayesh Dehankar
Discussion: https://postgr.es/m/193261e2c4d.3dd3cd7c1842.871636075166132237@zohocorp.com
Backpatch-through: 15, where WindowAgg run conditions were added
If passed a read-write expanded object pointer, the EEOP_NULLIF
code would hand that same pointer to the equality function
and then (unless equality was reported) also return the same
pointer as its value. This is no good, because a function that
receives a read-write expanded object pointer is fully entitled
to scribble on or even delete the object, thus corrupting the
NULLIF output. (This problem is likely unobservable with the
equality functions provided in core Postgres, but it's easy to
demonstrate with one coded in plpgsql.)
To fix, make sure the pointer passed to the equality function
is read-only. We can still return the original read-write
pointer as the NULLIF result, allowing optimization of later
operations.
Per bug #18722 from Alexander Lakhin. This has been wrong
since we invented expanded objects, so back-patch to all
supported branches.
Discussion: https://postgr.es/m/18722-fd9e645448cc78b4@postgresql.org
If a CTE, subquery, sublink, security invoker view, or coercion
projection references a table with row-level security policies, we
neglected to mark the plan as potentially dependent on which role
is executing it. This could lead to later executions in the same
session returning or hiding rows that should have been hidden or
returned instead.
Reported-by: Wolfgang Walther
Reviewed-by: Noah Misch
Security: CVE-2024-10976
Backpatch-through: 12
Commit ac04aa84a put the shutoff for this into the planner, which is
not ideal because it doesn't prevent us from re-using a previously
made parallel plan. Revert the planner change and instead put the
shutoff into InitializeParallelDSM, modeling it on the existing code
there for recovering from failure to allocate a DSM segment.
However, that code path is mostly untested, and testing a bit harder
showed there's at least one bug: ExecHashJoinReInitializeDSM is not
prepared for us to have skipped doing parallel DSM setup. I also
thought the Assert in ReinitializeParallelWorkers is pretty
ill-advised, and replaced it with a silent Min() operation.
The existing test case added by ac04aa84a serves fine to test this
version of the fix, so no change needed there.
Patch by me, but thanks to Noah Misch for the core idea that we
could shut off worker creation when !INTERRUPTS_CAN_BE_PROCESSED.
Back-patch to v12, as ac04aa84a was.
Discussion: https://postgr.es/m/CAC-SaSzHUKT=vZJ8MPxYdC_URPfax+yoA1hKTcF4ROz_Q6z0_Q@mail.gmail.com
After repartitioning the inner side of a hash join that would have
exceeded the allowed size, we check if all the tuples from a parent
partition moved to one child partition. That is evidence that it
contains duplicate keys and later attempts to repartition will also
fail, so we should give up trying to limit memory (for lack of a better
fallback strategy).
A thinko prevented the check from working correctly in partition 0 (the
one that is partially loaded into memory already). After
repartitioning, we should check for extreme skew if the *parent*
partition's space_exhausted flag was set, not the child partition's.
The consequence was repeated futile repartitioning until per-partition
data exceeded various limits including "ERROR: invalid DSA memory alloc
request size 1811939328", OS allocation failure, or temporary disk space
errors. (We could also do something about some of those symptoms, but
that's material for separate patches.)
This problem only became likely when PostgreSQL 16 introduced support
for Parallel Hash Right/Full Join, allowing NULL keys into the hash
table. Repartitioning always leaves NULL in partition 0, no matter how
many times you do it, because the hash value is all zero bits. That's
unlikely for other hashed values, but they might still have caused
wasted extra effort before giving up.
Back-patch to all supported releases.
Reported-by: Craig Milhiser <craig@milhiser.com>
Reviewed-by: Andrei Lepikhov <lepihov@gmail.com>
Discussion: https://postgr.es/m/CA%2BwnhO1OfgXbmXgC4fv_uu%3DOxcDQuHvfoQ4k0DFeB0Qqd-X-rQ%40mail.gmail.com
Commit 2dc1deaea turns out to have been still a brick shy of a load,
because CALL statements executing within a plpgsql exception block
could still pass the wrong snapshot to stable functions within the
CALL's argument list. That happened because standard_ProcessUtility
forces isAtomicContext to true if IsTransactionBlock is true, which
it always will be inside a subtransaction. Then ExecuteCallStmt
would think it does not need to push a new snapshot --- but
_SPI_execute_plan didn't do so either, since it thought it was in
nonatomic mode.
The best fix for this seems to be for _SPI_execute_plan to operate
in atomic execution mode if IsSubTransaction() is true, even when the
SPI context as a whole is non-atomic. This makes _SPI_execute_plan
have the same rules about when non-atomic execution is allowed as
_SPI_commit/_SPI_rollback have about when COMMIT/ROLLBACK are allowed,
which seems appropriately symmetric. (If anyone ever tries to allow
COMMIT/ROLLBACK inside a subtransaction, this would all need to be
rethought ... but I'm unconvinced that such a thing could be logically
consistent at all.)
For further consistency, also check IsSubTransaction() in
SPI_inside_nonatomic_context. That does not matter for its
one present-day caller StartTransaction, which can't be reached
inside a subtransaction. But if any other callers ever arise,
they'd presumably want this definition.
Per bug #18656 from Alexander Alehin. Back-patch to all
supported branches, like previous fixes in this area.
Discussion: https://postgr.es/m/18656-cade1780866ef66c@postgresql.org
The previous commit fixed some ways of losing an inplace update. It
remained possible to lose one when a backend working toward a
heap_update() copied a tuple into memory just before inplace update of
that tuple. In catalogs eligible for inplace update, use LOCKTAG_TUPLE
to govern admission to the steps of copying an old tuple, modifying it,
and issuing heap_update(). This includes MERGE commands. To avoid
changing most of the pg_class DDL, don't require LOCKTAG_TUPLE when
holding a relation lock sufficient to exclude inplace updaters.
Back-patch to v12 (all supported versions). In v13 and v12, "UPDATE
pg_class" or "UPDATE pg_database" can still lose an inplace update. The
v14+ UPDATE fix needs commit 86dc90056d,
and it wasn't worth reimplementing that fix without such infrastructure.
Reviewed by Nitin Motiani and (in earlier versions) Heikki Linnakangas.
Discussion: https://postgr.es/m/20231027214946.79.nmisch@google.com
This commit adds query ID reports for two code paths when processing
extended query protocol messages:
- When receiving a bind message, setting it to the first Query retrieved
from a cached cache.
- When receiving an execute message, setting it to the first PlannedStmt
stored in a portal.
An advantage of this method is that this is able to cover all the types
of portals handled in the extended query protocol, particularly these
two when the report done in ExecutorStart() is not enough (neither is an
addition in ExecutorRun(), actually, for the second point):
- Multiple execute messages, with multiple ExecutorRun().
- Portal with execute/fetch messages, like a query with a RETURNING
clause and a fetch size that stores the tuples in a first execute
message going though ExecutorStart() and ExecuteRun(), followed by one
or more execute messages doing only fetches from the tuplestore created
in the first message. This corresponds to the case where
execute_is_fetch is set, for example.
Note that the query ID reporting done in ExecutorStart() is still
necessary, as an EXECUTE requires it. Query ID reporting is optimistic
and more calls to pgstat_report_query_id() don't matter as the first
report takes priority except if the report is forced. The comment in
ExecutorStart() is adjusted to reflect better the reality with the
extended query protocol.
The test added in pg_stat_statements is a courtesy of Robert Haas. This
uses psql's \bind metacommand, hence this part is backpatched down to
v16.
Reported-by: Kaido Vaikla, Erik Wienhold
Author: Sami Imseih
Reviewed-by: Jian He, Andrei Lepikhov, Michael Paquier
Discussion: https://postgr.es/m/CA+427g8DiW3aZ6pOpVgkPbqK97ouBdf18VLiHFesea2jUk3XoQ@mail.gmail.com
Discussion: https://postgr.es/m/CA+TgmoZxtnf_jZ=VqBSyaU8hfUkkwoJCJ6ufy4LGpXaunKrjrg@mail.gmail.com
Discussion: https://postgr.es/m/1391613709.939460.1684777418070@office.mailbox.org
Backpatch-through: 14
If the plancache entry for the CALL statement is already stale,
it's possible for us to fetch an old procedure OID out of it,
and then fail with "cache lookup failed for function NNN".
In ordinary usage this never happens because make_callstmt_target
is called just once immediately after building the plancache
entry. It can be forced however by setting up an erroneous CALL
(that causes make_callstmt_target itself to report an error),
then dropping/recreating the target procedure, then repeating
the erroneous CALL.
To fix, use SPI_plan_get_cached_plan() to fetch the plancache's
plan, rather than assuming we can use SPI_plan_get_plan_sources().
This shouldn't add any noticeable overhead in the normal case,
and in the stale-plan case we'd have had to replan anyway a little
further down.
The other callers of SPI_plan_get_plan_sources() seem OK, because
either they don't need up-to-date plans or they know that the
query was just (re) planned. But add some commentary in hopes
of not falling into this trap again.
Per bug #18574 from Song Hongyu. Back-patch to v14 where this coding
was introduced. (Older branches have comparable code, but it's run
after any required replanning, so there's no issue.)
Discussion: https://postgr.es/m/18574-2ce7ba3249221389@postgresql.org
Most comments concern RELKIND_VIEW. One addresses the ExecUpdate()
"tupleid" parameter. A later commit will rely on these facts, but they
hold already. Back-patch to v12 (all supported versions), the plan for
that commit.
Reviewed (in an earlier version) by Robert Haas.
Discussion: https://postgr.es/m/20240512232923.aa.nmisch@google.com
When detaching partition in concurrent mode, it's possible for partition
descriptors to not match the set that was recently seen when the plan
was made, causing an assertion failure or (in production builds) failure
to construct a working plan. The case that was reported involves
prepared statements, but I think it may be possible to hit this bug
without that too.
The problem is that CreatePartitionPruneState is constructing a
PartitionPruneState under the assumption that new partitions can be
added, but never removed, but it turns out that this isn't true: a
prepared statement gets replanned when the DETACH CONCURRENTLY session
sends out its invalidation message, but if the invalidation message
arrives after ExecInitAppend started, we would build a partition
descriptor without the partition, and then CreatePartitionPruneState
would refuse to work with it.
CreatePartitionPruneState already contains code to deal with the new
descriptor having more partitions than before (and behaving for the
extra partitions as if they had been pruned), but doesn't have code to
deal with less partitions than before, and it is naïve about the case
where the number of partitions is the same. We could simply add that a
new stanza for less partitions than before, and in simple testing it
works to do that; but it's possible to press the test scripts even
further and hit the case where one partition is added and a partition is
removed quickly enough that we see the same number of partitions, but
they don't actually match, causing hangs during execution.
To cope with both these problems, we now memcmp() the arrays of
partition OIDs, and do a more elaborate mapping (relying on the fact
that both OID arrays are in partition-bounds order) if they're not
identical.
Backpatch to 14, where DETACH CONCURRENTLY appeared.
Reported-by: yajun Hu <1026592243@qq.com>
Reviewed-by: Tender Wang <tndrwang@gmail.com>
Discussion: https://postgr.es/m/18377-e0324601cfebdfe5@postgresql.org
If the CALL is within an atomic context (e.g. there's an outer
transaction block), _SPI_execute_plan should acquire a fresh snapshot
to execute any such functions with. We failed to do that and instead
passed them the Portal snapshot, which had been acquired at the start
of the current SQL command. This'd lead to seeing stale values of
rows modified since the start of the command.
This is arguably a bug in 84f5c2908: I failed to see that "are we in
non-atomic mode" needs to be defined the same way as it is further
down in _SPI_execute_plan, i.e. check !_SPI_current->atomic not just
options->allow_nonatomic. Alternatively the blame could be laid on
plpgsql, which is unconditionally passing allow_nonatomic = true
for CALL/DO even when it knows it's in an atomic context. However,
fixing it in spi.c seems like a better idea since that will also fix
the problem for any extensions that may have copied plpgsql's coding
pattern.
While here, update an obsolete comment about _SPI_execute_plan's
snapshot management.
Per report from Victor Yegorov. Back-patch to all supported versions.
Discussion: https://postgr.es/m/CAGnEboiRe+fG2QxuBO2390F7P8e2MQ6UyBjZSL_w1Cej+E4=Vw@mail.gmail.com
As an optimization, we store "name" columns as cstrings in btree
indexes.
Here we modify it so that Index Only Scans convert these cstrings back
to names with NAMEDATALEN bytes rather than storing the cstring in the
tuple slot, as was happening previously.
Bug: #17855
Reported-by: Alexander Lakhin
Reviewed-by: Alexander Lakhin, Tom Lane
Discussion: https://postgr.es/m/17855-5f523e0f9769a566@postgresql.org
Backpatch-through: 12, all supported versions
No configured-by-FDW events would result in "return" directly out of a
PG_TRY block, making the exception stack dangling. Repair.
Oversight in commit 501cfd07d; back-patch to v14, like that commit, but
as we do not have this issue in HEAD (cf. commit 50c67c201), no need to
apply this patch to it.
In passing, improve a comment about the handling of in-process requests
in a postgres_fdw.c function called from this function.
Alexander Pyhalov, with comment adjustment/improvement by me.
Discussion: https://postgr.es/m/425fa29a429b21b0332737c42a4fdc70%40postgrespro.ru
When a plain aggregate is used as a window function, and the window
frame start is specified as UNBOUNDED PRECEDING, the frame's head
cannot move so we do not need to use moving-aggregate mode. The check
for that was put into initialize_peragg(), failing to notice that
ExecInitWindowAgg() calls that function before it's filled in
winstate->frameOptions. Since makeNode() would have zeroed the field,
this didn't provoke uninitialized-value complaints, nor would the
erroneous decision have resulted in more than a little inefficiency.
Still, it's wrong, so move the initialization of
winstate->frameOptions earlier to make it work properly.
While here, also fix a thinko in a comment. Both errors crept in in
commit a9d9acbf2 which introduced the moving-aggregate mode.
Spotted by Vallimaharajan G. Back-patch to all supported branches.
Discussion: https://postgr.es/m/18e7f2a5167.fe36253866818.977923893562469143@zohocorp.com
There is a very ancient hack in check_sql_fn_retval that allows a
single SELECT targetlist entry of composite type to be taken as
supplying all the output columns of a function returning composite.
(This is grotty and fundamentally ambiguous, but it's really hard
to do nested composite-returning functions without it.)
As far as I know, that doesn't cause any problems in ordinary
functions. It's disastrous for procedures however. All procedures
that have any output parameters are labeled with prorettype RECORD,
and the CALL code expects it will get back a record with one column
per output parameter, regardless of whether any of those parameters
is composite. Doing something else leads to an assertion failure
or core dump.
This is simple enough to fix: we just need to not apply that rule
when considering procedures. However, that requires adding another
argument to check_sql_fn_retval, which at least in principle might be
getting called by external callers. Therefore, in the back branches
convert check_sql_fn_retval into an ABI-preserving wrapper around a
new function check_sql_fn_retval_ext.
Per report from Yahor Yuzefovich. This has been broken since we
implemented procedures, so back-patch to all supported branches.
Discussion: https://postgr.es/m/CABz5gWHSjj2df6uG0NRiDhZ_Uz=Y8t0FJP-_SVSsRsnrQT76Gg@mail.gmail.com
For pass-by-reference types, the code added in 0b053e78b, which aimed to
resolve a memory leak, was overly aggressive in resetting the per-tuple
memory context which could result in pfree'd memory being accessed
resulting in failing to find previously cached results in the hash
table.
What was happening was prepare_probe_slot() was switching to the
per-tuple memory context and calling ExecEvalExpr(). ExecEvalExpr() may
have required a memory allocation. Both MemoizeHash_hash() and
MemoizeHash_equal() were aggressively resetting the per-tuple context
and after determining the hash value, the context would have gotten reset
before MemoizeHash_equal() was called. This could have resulted in
MemoizeHash_equal() looking at pfree'd memory.
This is less likely to have caused issues on a production build as some
other allocation would have had to have reused the pfree'd memory to
overwrite it. Otherwise, the original contents would have been intact.
However, this clearly caused issues on MEMORY_CONTEXT_CHECKING builds.
Author: Tender Wang, Andrei Lepikhov
Reported-by: Tender Wang (using SQLancer)
Reviewed-by: Andrei Lepikhov, Richard Guo, David Rowley
Discussion: https://postgr.es/m/CAHewXNnT6N6UJkya0z-jLFzVxcwGfeRQSfhiwA+NyLg-x8iGew@mail.gmail.com
Backpatch-through: 14, where Memoize was added
When an UPDATE or DELETE action in MERGE returns TM_SelfModified,
there are 2 possible causes:
1). The target tuple was already updated or deleted by the current
command. This can happen if the target row joins to more than one
source row, and the SQL standard explicitly says that this must be
an error.
2). The target tuple was already updated or deleted by a later command
in the current transaction. This can happen if the tuple is
modified by a BEFORE trigger or a volatile function used in the
query, and should be an error for the same reason that it is in a
plain UPDATE or DELETE command.
In MERGE's primary error handling block, it failed to check for (2),
causing it to return a misleading error message in such cases.
In the secondary error handling block, following a concurrent update
from another session, it failed to check for (1), causing it to
silently ignore target rows joined to more than one source row,
instead of reporting an error.
Fix this, and add tests for both of these cases.
Per report from Wenjiang Zhang. Back-patch to v15, where MERGE was
introduced.
Discussion: https://postgr.es/m/tencent_41DE0FF443FE14B94A5898D373792109E408%40qq.com
In the corner case where a function returning RECORD has been
simplified to a RECORD constant or an inlined ROW() expression,
ExecInitFunctionScan failed to cross-check the function's result
rowtype against the coldeflist provided by the calling query.
That happened because get_expr_result_type is able to extract a
tupdesc from such expressions, which led ExecInitFunctionScan to
ignore the coldeflist. (Instead, it used the extracted tupdesc
to check the function's output, which of course always succeeds.)
I have not been able to demonstrate any really serious consequences
from this, because if some column of the result is of the wrong
type and is directly referenced by a Var of the calling query,
CheckVarSlotCompatibility will catch it. However, we definitely do
fail to report the case where the function returns more columns than
the coldeflist expects, and in the converse case where it returns
fewer columns, we get an assert failure (but, seemingly, no worse
results in non-assert builds).
To fix, always build the expected tupdesc from the coldeflist if there
is one, and consult get_expr_result_type only when there isn't one.
Also remove the failing Assert, even though it is no longer reached
after this fix. It doesn't seem to be adding anything useful, since
later checking will deal with cases with the wrong number of columns.
The only other place I could find that is doing something similar
is inline_set_returning_function. There's no live bug there because
we cannot be looking at a Const or RowExpr, but for consistency
change that code to agree with ExecInitFunctionScan.
Per report from PetSerAl. After some debate I've concluded that
this should be back-patched. There is a small risk that somebody
has been relying on such a case not throwing an error, but I judge
this outweighed by the risk that I've missed some way in which the
failure to cross-check has worse consequences than sketched above.
Discussion: https://postgr.es/m/CAKygsHSerA1eXsJHR9wft3Gn3wfHQ5RfP8XHBzF70_qcrrRvEg@mail.gmail.com
During the calculations of the maximum for the number of buckets, take into
account that later we round that to the next power of 2.
Reported-by: Karen Talarico
Bug: #16925
Discussion: https://postgr.es/m/16925-ec96d83529d0d629%40postgresql.org
Author: Thomas Munro, Andrei Lepikhov, Alexander Korotkov
Reviewed-by: Alena Rybakina
Backpatch-through: 12
Fix a bug during MERGE if a cross-partition update is attempted on a
partitioned table with a BEFORE DELETE ROW trigger that returns NULL,
to prevent the update. This would cause an error to be thrown, or an
assert failure in an assert-enabled build.
This was an oversight in 9321c79c86, which failed to properly
distinguish a DELETE prevented by a trigger from one prevented by a
concurrent update. Fix by having ExecDelete() return the TM_Result
status to ExecCrossPartitionUpdate(), so that it can distinguish the
two cases, and make ExecCrossPartitionUpdate() return the TM_Result
status to ExecUpdateAct(), so that it can return the correct status
from a concurrent update.
In addition, ensure that the command tag is correctly updated by
having ExecMergeMatched() pass canSetTag to ExecUpdateAct(), rather
than passing false, so that it updates the command tag if it does a
cross-partition update, making this code path in ExecMergeMatched()
consistent with ExecUpdate().
Per bug #18238 from Alexander Lakhin. Back-patch to v15, where MERGE
was introduced.
Dean Rasheed, reviewed by Richard Guo and Jian He.
Discussion: https://postgr.es/m/18238-2f2bdc7f720180b9%40postgresql.org
If an error is thrown after calling CreateWaitEventSet(), the memory
of a WaitEventSet is free'd as it's allocated in the short-lived
memory context, but the file descriptor (on epoll- or kqueue-based
systems) or handles (on Windows) that it contains are leaked.
Use PG_TRY-FINALLY to ensure it gets freed. (On master, I will apply a
better fix, using ResourceOwners to track the WaitEventSet, but that's
not backpatchable.)
The added test doesn't check for leaking resources, so it passed even
before this commit. But at least it covers the code path.
In the passing, fix misleading comment on what the 'nevents' argument
to WaitEventSetWait means.
Report by Alexander Lakhin, analysis and suggestion for the fix by Tom
Lane. Fixes bug #17828. Backpatch to v14 where async execution was
introduced, but master gets a different fix.
Discussion: https://www.postgresql.org/message-id/17828-122da8cba23236be@postgresql.org
Discussion: https://www.postgresql.org/message-id/472235.1678387869@sss.pgh.pa.us
When executing a MERGE UPDATE action, if the UPDATE is turned into a
cross-partition DELETE then INSERT, do not attempt to invoke AFTER
UPDATE ROW triggers, or any of the other post-update actions in
ExecUpdateEpilogue().
For consistency with a plain UPDATE command, such triggers should not
be fired (and typically fail anyway), and similarly, other post-update
actions, such as WCO/RLS checks should not be executed, and might also
lead to unexpected failures.
Therefore, as with ExecUpdate(), make ExecMergeMatched() return
immediately if ExecUpdateAct() reports that a cross-partition update
was done, to be sure that no further processing is done for that
tuple.
Back-patch to v15, where MERGE was introduced.
Discussion: https://postgr.es/m/CAEZATCWjBgagyNZs02vgDF0DvASYj-iHTFtXG2-nP3orZhmtcw%40mail.gmail.com
When looping around after finding that the set-returning function
returned zero rows for the current input tuple, ExecProjectSet
neglected to reset either of the two memory contexts it's
responsible for cleaning out. Typically this wouldn't cause much
problem, because once the SRF does return at least one row, the
contexts would get reset on the next call. However, if the SRF
returns no rows for many input tuples in succession, quite a lot
of memory could be transiently consumed.
To fix, make sure we reset both contexts while looping around.
Per bug #18172 from Sergei Kornilov. Back-patch to all supported
branches.
Discussion: https://postgr.es/m/18172-9b8c5fc1d676ded3@postgresql.org