
Don't lock partitions pruned by initial pruning

Before executing a cached generic plan, AcquireExecutorLocks() in
plancache.c locks all relations in a plan's range table to ensure the
plan is safe for execution. However, this locks runtime-prunable
relations that will later be pruned during "initial" runtime pruning,
introducing unnecessary overhead.

This commit defers locking for such relations to executor startup and
ensures that if the CachedPlan is invalidated due to concurrent DDL
during this window, replanning is triggered. Deferring these locks
avoids unnecessary locking overhead for pruned partitions, resulting
in significant speedup, particularly when many partitions are pruned
during initial runtime pruning.

* Changes to locking when executing generic plans:

AcquireExecutorLocks() now locks only unprunable relations, that is,
those found in PlannedStmt.unprunableRelids (introduced in commit
cbc127917e), to avoid locking runtime-prunable partitions
unnecessarily.  The remaining locks are taken by
ExecDoInitialPruning(), which acquires them only for partitions that
survive pruning.
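
For illustration, a minimal sketch of the AcquireExecutorLocks() side
of this split (an assumption based on the description above, not a
verbatim copy of the plancache.c code):

    int         rti = 0;
    ListCell   *lc;

    foreach(lc, plannedstmt->rtable)
    {
        RangeTblEntry *rte = lfirst_node(RangeTblEntry, lc);

        rti++;
        if (rte->rtekind != RTE_RELATION || rte->rellockmode == NoLock)
            continue;
        /* Skip prunable relations; ExecDoInitialPruning() locks the
         * survivors later. */
        if (!bms_is_member(rti, plannedstmt->unprunableRelids))
            continue;
        LockRelationOid(rte->relid, rte->rellockmode);
    }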

This deferral does not affect the locks required for permission
checking in InitPlan(), which takes place before initial pruning.
ExecCheckPermissions() now includes an Assert to verify that all
relations undergoing permission checks, none of which can be in the
set of runtime-prunable relations, are properly locked.

* Plan invalidation handling:

Deferring locks introduces a window where prunable relations may be
altered by concurrent DDL, invalidating the plan. A new function,
ExecutorStartCachedPlan(), wraps ExecutorStart() to detect and handle
invalidation caused by deferred locking. If invalidation occurs,
ExecutorStartCachedPlan() updates CachedPlan using the new
UpdateCachedPlan() function and retries execution with the updated
plan. To ensure that every code path affected by this handles
invalidation properly, all callers of ExecutorStart that may execute a
PlannedStmt from a CachedPlan have been updated to use
ExecutorStartCachedPlan() instead.
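
Callers therefore follow this pattern (this instance is taken from the
spi.c hunk further below):

    if (queryDesc->cplan)
    {
        /* Plan comes from a CachedPlan: may need replanning. */
        ExecutorStartCachedPlan(queryDesc, eflags, plansource, query_index);
        Assert(queryDesc->planstate);
    }
    else
    {
        if (!ExecutorStart(queryDesc, eflags))
            elog(ERROR, "ExecutorStart() failed unexpectedly");
    }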

UpdateCachedPlan() replaces stale plans in CachedPlan.stmt_list. A new
CachedPlan.stmt_context, created as a child of CachedPlan.context,
allows freeing old PlannedStmts while preserving the CachedPlan
structure and its statement list. This ensures that loops over
statements in upstream callers of ExecutorStartCachedPlan() remain
intact.
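
A hedged sketch of that arrangement (names follow this description;
new_stmt_list stands for the freshly replanned statements, and the real
UpdateCachedPlan() may differ in detail):

    /* At plan creation: child context holding only the PlannedStmts. */
    plan->stmt_context = AllocSetContextCreate(plan->context,
                                               "CachedPlan PlannedStmts",
                                               ALLOCSET_START_SMALL_SIZES);

    /* In UpdateCachedPlan(): free stale PlannedStmts, then refill the
     * existing List cells in place so callers' iterators stay valid. */
    MemoryContextReset(plan->stmt_context);
    oldcxt = MemoryContextSwitchTo(plan->stmt_context);
    forboth(lc1, plan->stmt_list, lc2, new_stmt_list)
        lfirst(lc1) = copyObject(lfirst(lc2));
    MemoryContextSwitchTo(oldcxt);

The List header and cells of stmt_list live in plan->context, so only
the statements themselves are recycled.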

ExecutorStart() and ExecutorStart_hook implementations now return true
if plan initialization succeeded, leaving a valid PlanState tree in
QueryDesc.planstate, and false otherwise, in which case
QueryDesc.planstate is NULL. Hook implementations are required to call
standard_ExecutorStart() at the beginning, and if it returns false,
they must likewise return false without proceeding further.

* Testing:

To verify these changes, the delay_execution module tests scenarios
where a cached plan becomes invalid because a prunable relation is
changed concurrently before the deferred locks are taken.

* Note to extension authors:

ExecutorStart_hook implementations must verify plan validity after
calling standard_ExecutorStart(), as explained earlier. For example:

    if (prev_ExecutorStart)
        plan_valid = prev_ExecutorStart(queryDesc, eflags);
    else
        plan_valid = standard_ExecutorStart(queryDesc, eflags);

    if (!plan_valid)
        return false;

    <extension-code>

    return true;

Extensions accessing child relations, especially prunable partitions,
via ExecGetRangeTableRelation() must now ensure their RT indexes are
present in es_unpruned_relids (introduced in commit cbc127917e), or
they will encounter an error. This is a strict requirement after this
change, as only relations in that set are locked.
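
A hedged example of the check an extension might do before touching a
child relation (rti and estate stand for whatever the extension has at
hand):

    if (bms_is_member(rti, estate->es_unpruned_relids))
    {
        /* Survived initial pruning: locked and safe to open. */
        Relation    rel = ExecGetRangeTableRelation(estate, rti);

        /* ... use rel; it will be closed by ExecEndPlan() ... */
    }
    else
    {
        /* Pruned: not locked; ExecGetRangeTableRelation() would error. */
    }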

The idea of deferring some locks to executor startup, allowing locks
for prunable partitions to be skipped, was first proposed by Tom Lane.

Reviewed-by: Robert Haas <robertmhaas@gmail.com> (earlier versions)
Reviewed-by: David Rowley <dgrowleyml@gmail.com> (earlier versions)
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> (earlier versions)
Reviewed-by: Tomas Vondra <tomas@vondra.me>
Reviewed-by: Junwang Zhao <zhjwpku@gmail.com>
Discussion: https://postgr.es/m/CA+HiwqFGkMSge6TgC9KQzde0ohpAycLQuV7ooitEEpbKB0O_mg@mail.gmail.com
Author: Amit Langote
Date:   2025-02-20 17:09:48 +09:00
Parent: 4aa6fa3cd0
Commit: 525392d572

33 changed files with 1014 additions and 95 deletions

src/backend/executor/README

@@ -280,6 +280,28 @@ are typically reset to empty once per tuple. Per-tuple contexts are usually
 associated with ExprContexts, and commonly each PlanState node has its own
 ExprContext to evaluate its qual and targetlist expressions in.
 
+Relation Locking
+----------------
+
+When the executor initializes a plan tree for execution, it doesn't lock
+non-index relations if the plan tree is freshly generated and not derived
+from a CachedPlan. This is because such locks have already been established
+during the query's parsing, rewriting, and planning phases. However, with a
+cached plan tree, some relations may remain unlocked. The function
+AcquireExecutorLocks() only locks unprunable relations in the plan, deferring
+the locking of prunable ones to executor initialization. This avoids
+unnecessary locking of relations that will be pruned during "initial" runtime
+pruning in ExecDoInitialPruning().
+
+This approach creates a window where a cached plan tree with child tables
+could become outdated if another backend modifies these tables before
+ExecDoInitialPruning() locks them. As a result, the executor has the added
+duty to verify the plan tree's validity whenever it locks a child table after
+doing initial pruning. This validation is done by checking the
+CachedPlan.is_valid flag. If the plan tree is outdated (is_valid = false),
+the executor stops further initialization, cleans up anything in EState that
+would have been allocated up to that point, and retries execution after
+recreating the invalid plan in the CachedPlan. See ExecutorStartCachedPlan().
 
 Query Processing Control Flow
 -----------------------------
@@ -288,11 +310,13 @@ This is a sketch of control flow for full query processing:
 
     CreateQueryDesc
 
-    ExecutorStart
+    ExecutorStart or ExecutorStartCachedPlan
         CreateExecutorState
             creates per-query context
-        switch to per-query context to run ExecInitNode
+        switch to per-query context to run ExecDoInitialPruning and ExecInitNode
         AfterTriggerBeginQuery
+        ExecDoInitialPruning
+            does initial pruning and locks surviving partitions if needed
         ExecInitNode --- recursively scans plan tree
             ExecInitNode
                 recurse into subsidiary nodes
@@ -316,7 +340,12 @@ This is a sketch of control flow for full query processing:
 
     FreeQueryDesc
 
-Per above comments, it's not really critical for ExecEndNode to free any
+As mentioned in the "Relation Locking" section, if the plan tree is found to
+be stale after locking partitions in ExecDoInitialPruning(), the control is
+immediately returned to ExecutorStartCachedPlan(), which will create a new plan
+tree and perform the steps starting from CreateExecutorState() again.
+
+Per above comments, it's not really critical for ExecEndPlan to free any
 memory; it'll all go away in FreeExecutorState anyway. However, we do need to
 be careful to close relations, drop buffer pins, etc, so we do need to scan
 the plan state tree to find these sorts of resources.

src/backend/executor/execMain.c

@@ -55,11 +55,13 @@
 #include "parser/parse_relation.h"
 #include "pgstat.h"
 #include "rewrite/rewriteHandler.h"
+#include "storage/lmgr.h"
 #include "tcop/utility.h"
 #include "utils/acl.h"
 #include "utils/backend_status.h"
 #include "utils/lsyscache.h"
 #include "utils/partcache.h"
+#include "utils/plancache.h"
 #include "utils/rls.h"
 #include "utils/snapmgr.h"
@@ -114,11 +116,16 @@ static void EvalPlanQualStart(EPQState *epqstate, Plan *planTree);
  *		get control when ExecutorStart is called. Such a plugin would
  *		normally call standard_ExecutorStart().
  *
+ *		Return value indicates if the plan has been initialized successfully so
+ *		that queryDesc->planstate contains a valid PlanState tree. It may not
+ *		if the plan got invalidated during InitPlan().
  * ----------------------------------------------------------------
  */
-void
+bool
 ExecutorStart(QueryDesc *queryDesc, int eflags)
 {
+    bool        plan_valid;
+
     /*
      * In some cases (e.g. an EXECUTE statement or an execute message with the
      * extended query protocol) the query_id won't be reported, so do it now.
@@ -130,12 +137,14 @@ ExecutorStart(QueryDesc *queryDesc, int eflags)
     pgstat_report_query_id(queryDesc->plannedstmt->queryId, false);
 
     if (ExecutorStart_hook)
-        (*ExecutorStart_hook) (queryDesc, eflags);
+        plan_valid = (*ExecutorStart_hook) (queryDesc, eflags);
     else
-        standard_ExecutorStart(queryDesc, eflags);
+        plan_valid = standard_ExecutorStart(queryDesc, eflags);
+
+    return plan_valid;
 }
 
-void
+bool
 standard_ExecutorStart(QueryDesc *queryDesc, int eflags)
 {
     EState     *estate;
@@ -259,6 +268,64 @@ standard_ExecutorStart(QueryDesc *queryDesc, int eflags)
     InitPlan(queryDesc, eflags);
 
     MemoryContextSwitchTo(oldcontext);
+
+    return ExecPlanStillValid(queryDesc->estate);
+}
+
+/*
+ * ExecutorStartCachedPlan
+ *		Start execution for a given query in the CachedPlanSource, replanning
+ *		if the plan is invalidated due to deferred locks taken during the
+ *		plan's initialization
+ *
+ * This function handles cases where the CachedPlan given in queryDesc->cplan
+ * might become invalid during the initialization of the plan given in
+ * queryDesc->plannedstmt, particularly when prunable relations in it are
+ * locked after performing initial pruning. If the locks invalidate the plan,
+ * the function calls UpdateCachedPlan() to replan all queries in the
+ * CachedPlan, and then retries initialization.
+ *
+ * The function repeats the process until ExecutorStart() successfully
+ * initializes the plan, that is without the CachedPlan becoming invalid.
+ */
+void
+ExecutorStartCachedPlan(QueryDesc *queryDesc, int eflags,
+                        CachedPlanSource *plansource,
+                        int query_index)
+{
+    if (unlikely(queryDesc->cplan == NULL))
+        elog(ERROR, "ExecutorStartCachedPlan(): missing CachedPlan");
+    if (unlikely(plansource == NULL))
+        elog(ERROR, "ExecutorStartCachedPlan(): missing CachedPlanSource");
+
+    /*
+     * Loop and retry with an updated plan until no further invalidation
+     * occurs.
+     */
+    while (1)
+    {
+        if (!ExecutorStart(queryDesc, eflags))
+        {
+            /*
+             * Clean up the current execution state before creating the new
+             * plan to retry ExecutorStart(). Mark execution as aborted to
+             * ensure that AFTER trigger state is properly reset.
+             */
+            queryDesc->estate->es_aborted = true;
+            ExecutorEnd(queryDesc);
+
+            /* Retry ExecutorStart() with an updated plan tree. */
+            queryDesc->plannedstmt = UpdateCachedPlan(plansource, query_index,
+                                                      queryDesc->queryEnv);
+        }
+        else
+
+            /*
+             * Exit the loop if the plan is initialized successfully and no
+             * sinval messages were received that invalidated the CachedPlan.
+             */
+            break;
+    }
+}
 
 /* ----------------------------------------------------------------
@@ -317,6 +384,7 @@ standard_ExecutorRun(QueryDesc *queryDesc,
     estate = queryDesc->estate;
 
     Assert(estate != NULL);
+    Assert(!estate->es_aborted);
     Assert(!(estate->es_top_eflags & EXEC_FLAG_EXPLAIN_ONLY));
 
     /* caller must ensure the query's snapshot is active */
@@ -423,8 +491,11 @@ standard_ExecutorFinish(QueryDesc *queryDesc)
     Assert(estate != NULL);
     Assert(!(estate->es_top_eflags & EXEC_FLAG_EXPLAIN_ONLY));
 
-    /* This should be run once and only once per Executor instance */
-    Assert(!estate->es_finished);
+    /*
+     * This should be run once and only once per Executor instance and never
+     * if the execution was aborted.
+     */
+    Assert(!estate->es_finished && !estate->es_aborted);
 
     /* Switch into per-query memory context */
     oldcontext = MemoryContextSwitchTo(estate->es_query_cxt);
@@ -487,11 +558,10 @@ standard_ExecutorEnd(QueryDesc *queryDesc)
                                           (PgStat_Counter) estate->es_parallel_workers_launched);
 
     /*
-     * Check that ExecutorFinish was called, unless in EXPLAIN-only mode. This
-     * Assert is needed because ExecutorFinish is new as of 9.1, and callers
-     * might forget to call it.
+     * Check that ExecutorFinish was called, unless in EXPLAIN-only mode or if
+     * execution was aborted.
      */
-    Assert(estate->es_finished ||
+    Assert(estate->es_finished || estate->es_aborted ||
            (estate->es_top_eflags & EXEC_FLAG_EXPLAIN_ONLY));
 
     /*
@@ -505,6 +575,14 @@ standard_ExecutorEnd(QueryDesc *queryDesc)
     UnregisterSnapshot(estate->es_snapshot);
     UnregisterSnapshot(estate->es_crosscheck_snapshot);
 
+    /*
+     * Reset AFTER trigger module if the query execution was aborted.
+     */
+    if (estate->es_aborted &&
+        !(estate->es_top_eflags &
+          (EXEC_FLAG_SKIP_TRIGGERS | EXEC_FLAG_EXPLAIN_ONLY)))
+        AfterTriggerAbortQuery();
+
     /*
      * Must switch out of context before destroying it
      */
@@ -603,6 +681,21 @@ ExecCheckPermissions(List *rangeTable, List *rteperminfos,
                (rte->rtekind == RTE_SUBQUERY &&
                 rte->relkind == RELKIND_VIEW));
 
+        /*
+         * Ensure that we have at least an AccessShareLock on relations
+         * whose permissions need to be checked.
+         *
+         * Skip this check in a parallel worker because locks won't be
+         * taken until ExecInitNode() performs plan initialization.
+         *
+         * XXX: ExecCheckPermissions() in a parallel worker may be
+         * redundant with the checks done in the leader process, so this
+         * should be reviewed to ensure it's necessary.
+         */
+        Assert(IsParallelWorker() ||
+               CheckRelationOidLockedByMe(rte->relid, AccessShareLock,
+                                          true));
+
         (void) getRTEPermissionInfo(rteperminfos, rte);
         /* Many-to-one mapping not allowed */
         Assert(!bms_is_member(rte->perminfoindex, indexset));
@@ -828,6 +921,12 @@ ExecCheckXactReadOnly(PlannedStmt *plannedstmt)
  *
  *		Initializes the query plan: open files, allocate storage
  *		and start up the rule manager
+ *
+ *		If the plan originates from a CachedPlan (given in queryDesc->cplan),
+ *		it can become invalid during runtime "initial" pruning when the
+ *		remaining set of locks is taken. The function returns early in that
+ *		case without initializing the plan, and the caller is expected to
+ *		retry with a new valid plan.
  * ----------------------------------------------------------------
  */
 static void
@@ -835,6 +934,7 @@ InitPlan(QueryDesc *queryDesc, int eflags)
 {
     CmdType     operation = queryDesc->operation;
     PlannedStmt *plannedstmt = queryDesc->plannedstmt;
+    CachedPlan *cachedplan = queryDesc->cplan;
     Plan       *plan = plannedstmt->planTree;
     List       *rangeTable = plannedstmt->rtable;
     EState     *estate = queryDesc->estate;
@@ -855,6 +955,7 @@ InitPlan(QueryDesc *queryDesc, int eflags)
                        bms_copy(plannedstmt->unprunableRelids));
 
     estate->es_plannedstmt = plannedstmt;
+    estate->es_cachedplan = cachedplan;
     estate->es_part_prune_infos = plannedstmt->partPruneInfos;
 
     /*
@@ -868,6 +969,9 @@ InitPlan(QueryDesc *queryDesc, int eflags)
      */
     ExecDoInitialPruning(estate);
 
+    if (!ExecPlanStillValid(estate))
+        return;
+
     /*
      * Next, build the ExecRowMark array from the PlanRowMark(s), if any.
      */
@@ -2873,6 +2977,9 @@ EvalPlanQualStart(EPQState *epqstate, Plan *planTree)
      * the snapshot, rangetable, and external Param info. They need their own
      * copies of local state, including a tuple table, es_param_exec_vals,
      * result-rel info, etc.
+     *
+     * es_cachedplan is not copied because EPQ plan execution does not acquire
+     * any new locks that could invalidate the CachedPlan.
      */
     rcestate->es_direction = ForwardScanDirection;
     rcestate->es_snapshot = parentestate->es_snapshot;

src/backend/executor/execParallel.c

@@ -1258,8 +1258,15 @@ ExecParallelGetQueryDesc(shm_toc *toc, DestReceiver *receiver,
     paramspace = shm_toc_lookup(toc, PARALLEL_KEY_PARAMLISTINFO, false);
     paramLI = RestoreParamList(&paramspace);
 
-    /* Create a QueryDesc for the query. */
+    /*
+     * Create a QueryDesc for the query. We pass NULL for cachedplan, because
+     * we don't have a pointer to the CachedPlan in the leader's process. It's
+     * fine because the only reason the executor needs to see it is to decide
+     * if it should take locks on certain relations, but parallel workers
+     * always take locks anyway.
+     */
     return CreateQueryDesc(pstmt,
+                           NULL,
                            queryString,
                            GetActiveSnapshot(), InvalidSnapshot,
                            receiver, paramLI, NULL, instrument_options);
@@ -1440,7 +1447,8 @@ ParallelQueryMain(dsm_segment *seg, shm_toc *toc)
 
     /* Start up the executor */
     queryDesc->plannedstmt->jitFlags = fpes->jit_flags;
-    ExecutorStart(queryDesc, fpes->eflags);
+    if (!ExecutorStart(queryDesc, fpes->eflags))
+        elog(ERROR, "ExecutorStart() failed unexpectedly");
 
     /* Special executor initialization steps for parallel workers */
     queryDesc->planstate->state->es_query_dsa = area;

src/backend/executor/execPartition.c

@@ -26,6 +26,7 @@
 #include "partitioning/partdesc.h"
 #include "partitioning/partprune.h"
 #include "rewrite/rewriteManip.h"
+#include "storage/lmgr.h"
 #include "utils/acl.h"
 #include "utils/lsyscache.h"
 #include "utils/partcache.h"
@@ -1768,7 +1769,8 @@ adjust_partition_colnos_using_map(List *colnos, AttrMap *attrMap)
  * ExecDoInitialPruning:
  *		Perform runtime "initial" pruning, if necessary, to determine the set
  *		of child subnodes that need to be initialized during ExecInitNode() for
- *		all plan nodes that contain a PartitionPruneInfo.
+ *		all plan nodes that contain a PartitionPruneInfo. This also locks the
+ *		leaf partitions whose subnodes will be initialized if needed.
  *
  * ExecInitPartitionExecPruning:
  *		Updates the PartitionPruneState found at given part_prune_index in
@@ -1789,11 +1791,13 @@ adjust_partition_colnos_using_map(List *colnos, AttrMap *attrMap)
  *-------------------------------------------------------------------------
  */
 
 /*
  * ExecDoInitialPruning
  *		Perform runtime "initial" pruning, if necessary, to determine the set
  *		of child subnodes that need to be initialized during ExecInitNode() for
- *		plan nodes that support partition pruning.
+ *		plan nodes that support partition pruning. This also locks the leaf
+ *		partitions whose subnodes will be initialized if needed.
  *
  * This function iterates over each PartitionPruneInfo entry in
  * estate->es_part_prune_infos. For each entry, it creates a PartitionPruneState
@@ -1816,6 +1820,7 @@ void
 ExecDoInitialPruning(EState *estate)
 {
     ListCell   *lc;
+    List       *locked_relids = NIL;
 
     foreach(lc, estate->es_part_prune_infos)
     {
@@ -1841,11 +1846,40 @@ ExecDoInitialPruning(EState *estate)
         else
             validsubplan_rtis = all_leafpart_rtis;
 
+        if (ExecShouldLockRelations(estate))
+        {
+            int         rtindex = -1;
+
+            while ((rtindex = bms_next_member(validsubplan_rtis,
+                                              rtindex)) >= 0)
+            {
+                RangeTblEntry *rte = exec_rt_fetch(rtindex, estate);
+
+                Assert(rte->rtekind == RTE_RELATION &&
+                       rte->rellockmode != NoLock);
+                LockRelationOid(rte->relid, rte->rellockmode);
+                locked_relids = lappend_int(locked_relids, rtindex);
+            }
+        }
         estate->es_unpruned_relids = bms_add_members(estate->es_unpruned_relids,
                                                      validsubplan_rtis);
         estate->es_part_prune_results = lappend(estate->es_part_prune_results,
                                                 validsubplans);
     }
+
+    /*
+     * Release the useless locks if the plan won't be executed. This is the
+     * same as what CheckCachedPlan() in plancache.c does.
+     */
+    if (!ExecPlanStillValid(estate))
+    {
+        foreach(lc, locked_relids)
+        {
+            RangeTblEntry *rte = exec_rt_fetch(lfirst_int(lc), estate);
+
+            UnlockRelationOid(rte->relid, rte->rellockmode);
+        }
+    }
 }
 
 /*

src/backend/executor/execUtils.c

@@ -147,6 +147,7 @@ CreateExecutorState(void)
     estate->es_top_eflags = 0;
     estate->es_instrument = 0;
     estate->es_finished = false;
+    estate->es_aborted = false;
 
     estate->es_exprcontexts = NIL;
@@ -813,6 +814,10 @@ ExecInitRangeTable(EState *estate, List *rangeTable, List *permInfos,
  *		Open the Relation for a range table entry, if not already done
  *
  *		The Relations will be closed in ExecEndPlan().
+ *
+ *		Note: The caller must ensure that 'rti' refers to an unpruned relation
+ *		(i.e., it is a member of estate->es_unpruned_relids) before calling this
+ *		function. Attempting to open a pruned relation will result in an error.
  */
 Relation
 ExecGetRangeTableRelation(EState *estate, Index rti)
@@ -821,6 +826,9 @@ ExecGetRangeTableRelation(EState *estate, Index rti)
     Assert(rti > 0 && rti <= estate->es_range_table_size);
 
+    if (!bms_is_member(rti, estate->es_unpruned_relids))
+        elog(ERROR, "trying to open a pruned relation");
+
     rel = estate->es_relations[rti - 1];
     if (rel == NULL)
     {

src/backend/executor/functions.c

@@ -840,6 +840,7 @@ postquel_start(execution_state *es, SQLFunctionCachePtr fcache)
         dest = None_Receiver;
 
     es->qd = CreateQueryDesc(es->stmt,
+                             NULL,
                              fcache->src,
                              GetActiveSnapshot(),
                              InvalidSnapshot,
@@ -864,7 +865,8 @@ postquel_start(execution_state *es, SQLFunctionCachePtr fcache)
             eflags = EXEC_FLAG_SKIP_TRIGGERS;
         else
             eflags = 0;         /* default run-to-completion flags */
-        ExecutorStart(es->qd, eflags);
+        if (!ExecutorStart(es->qd, eflags))
+            elog(ERROR, "ExecutorStart() failed unexpectedly");
     }
 
     es->status = F_EXEC_RUN;

src/backend/executor/spi.c

@@ -70,7 +70,8 @@ static int	_SPI_execute_plan(SPIPlanPtr plan, const SPIExecuteOptions *options,
 static ParamListInfo _SPI_convert_params(int nargs, Oid *argtypes,
                                          Datum *Values, const char *Nulls);
 
-static int  _SPI_pquery(QueryDesc *queryDesc, bool fire_triggers, uint64 tcount);
+static int  _SPI_pquery(QueryDesc *queryDesc, bool fire_triggers, uint64 tcount,
+                        CachedPlanSource *plansource, int query_index);
 
 static void _SPI_error_callback(void *arg);
@@ -1685,7 +1686,8 @@ SPI_cursor_open_internal(const char *name, SPIPlanPtr plan,
                                     query_string,
                                     plansource->commandTag,
                                     stmt_list,
-                                    cplan);
+                                    cplan,
+                                    plansource);
 
     /*
      * Set up options for portal. Default SCROLL type is chosen the same way
@@ -2500,6 +2502,7 @@ _SPI_execute_plan(SPIPlanPtr plan, const SPIExecuteOptions *options,
         CachedPlanSource *plansource = (CachedPlanSource *) lfirst(lc1);
         List       *stmt_list;
         ListCell   *lc2;
+        int         query_index = 0;
 
         spicallbackarg.query = plansource->query_string;
@@ -2690,14 +2693,16 @@ _SPI_execute_plan(SPIPlanPtr plan, const SPIExecuteOptions *options,
                 snap = InvalidSnapshot;
 
             qdesc = CreateQueryDesc(stmt,
+                                    cplan,
                                     plansource->query_string,
                                     snap, crosscheck_snapshot,
                                     dest,
                                     options->params,
                                     _SPI_current->queryEnv,
                                     0);
-            res = _SPI_pquery(qdesc, fire_triggers,
-                              canSetTag ? options->tcount : 0);
+            res = _SPI_pquery(qdesc, fire_triggers, canSetTag ? options->tcount : 0,
+                              plansource, query_index);
             FreeQueryDesc(qdesc);
         }
         else
@@ -2794,6 +2799,8 @@ _SPI_execute_plan(SPIPlanPtr plan, const SPIExecuteOptions *options,
                 my_res = res;
                 goto fail;
             }
+
+            query_index++;
         }
 
         /* Done with this plan, so release refcount */
@@ -2871,7 +2878,8 @@ _SPI_convert_params(int nargs, Oid *argtypes,
 }
 
 static int
-_SPI_pquery(QueryDesc *queryDesc, bool fire_triggers, uint64 tcount)
+_SPI_pquery(QueryDesc *queryDesc, bool fire_triggers, uint64 tcount,
+            CachedPlanSource *plansource, int query_index)
 {
     int         operation = queryDesc->operation;
     int         eflags;
@@ -2927,7 +2935,16 @@ _SPI_pquery(QueryDesc *queryDesc, bool fire_triggers, uint64 tcount)
     else
         eflags = EXEC_FLAG_SKIP_TRIGGERS;
 
-    ExecutorStart(queryDesc, eflags);
+    if (queryDesc->cplan)
+    {
+        ExecutorStartCachedPlan(queryDesc, eflags, plansource, query_index);
+        Assert(queryDesc->planstate);
+    }
+    else
+    {
+        if (!ExecutorStart(queryDesc, eflags))
+            elog(ERROR, "ExecutorStart() failed unexpectedly");
+    }
 
     ExecutorRun(queryDesc, ForwardScanDirection, tcount);