Mirror of https://github.com/postgres/postgres.git (synced 2025-05-05 09:19:17 +03:00)
Here we add a new executor node type named "Result Cache". The planner can include this node type in the plan to have the executor cache the results from the inner side of parameterized nested loop joins. This allows caching of tuples for sets of parameters so that, in the event that the node sees the same parameter values again, it can just return the cached tuples instead of rescanning the inner side of the join all over again. Internally, the result cache uses a hash table in order to quickly find tuples that have been previously cached.

For certain data sets, this can significantly improve the performance of joins. The best cases for this new node type are join problems where a large portion of the tuples from the inner side of the join have no join partner on the outer side of the join. In such cases, a hash join would have to hash values that are never looked up, bloating the hash table and possibly causing it to multi-batch, while a merge join would have to skip over all of the unmatched rows. If we use a parameterized nested loop join with a result cache, we only cache tuples that have at least one join partner on the outer side of the join. The benefits increase when there are fewer distinct values being looked up and the number of lookups of each value is large. Also, hash probes into the cache can be much faster than the hash probe in a hash join, as it's common for the result cache's hash table to be much smaller than the hash join's, since the result cache stores only useful tuples rather than all tuples from the inner side of the join. This difference in hash probe performance is most significant when the hash join's hash table no longer fits into the CPU's L3 cache but the result cache's hash table does: the apparently random access of hash buckets with each hash probe can cause a poor L3 cache hit ratio for large hash tables, so smaller hash tables generally perform better.

The hash table used for the cache limits itself to not exceeding work_mem * hash_mem_multiplier in size. We maintain a dlist of keys for this cache, and when we're adding new tuples and find that we've exceeded the memory budget, we evict cache entries starting with the least recently used ones until we have enough memory to add the new tuples to the cache.
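The eviction scheme described above is essentially an LRU list with a byte budget. Below is a minimal, self-contained sketch of that idea only, assuming a doubly linked list ordered from most to least recently used; the names (CacheEntry, Cache, cache_evict_until) are invented for illustration and are not the ones used by the actual executor node, which keeps a dlist of keys and a hash table of cached tuples as described above.

/*
 * Illustrative sketch only: a simplified least-recently-used eviction loop
 * in the spirit of the description above.  CacheEntry, Cache and
 * cache_evict_until are invented names, not the real nodeResultCache.c code.
 */
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

typedef struct CacheEntry
{
    struct CacheEntry *prev;    /* more recently used neighbour */
    struct CacheEntry *next;    /* less recently used neighbour */
    size_t      mem_used;       /* bytes consumed by this entry's tuples */
} CacheEntry;

typedef struct Cache
{
    CacheEntry *lru_head;       /* most recently used entry */
    CacheEntry *lru_tail;       /* least recently used entry */
    size_t      mem_used;       /* bytes currently consumed by all entries */
    size_t      mem_limit;      /* budget, e.g. work_mem * hash_mem_multiplier */
} Cache;

/*
 * Evict least-recently-used entries until 'needed' additional bytes fit
 * within the memory budget.  Returns false if they can never fit.
 */
static bool
cache_evict_until(Cache *cache, size_t needed)
{
    while (cache->mem_used + needed > cache->mem_limit)
    {
        CacheEntry *victim = cache->lru_tail;

        if (victim == NULL)
            return false;       /* the new tuples are too large to cache */

        /* unlink the victim from the tail of the LRU list */
        cache->lru_tail = victim->prev;
        if (cache->lru_tail != NULL)
            cache->lru_tail->next = NULL;
        else
            cache->lru_head = NULL;

        cache->mem_used -= victim->mem_used;
        /* a real implementation would also drop the entry from its hash table */
    }
    return true;
}

int
main(void)
{
    CacheEntry  a = {NULL, NULL, 600};
    CacheEntry  b = {NULL, NULL, 600};
    Cache       cache = {&a, &b, 1200, 1500};

    a.next = &b;
    b.prev = &a;

    /* adding 500 more bytes forces the least recently used entry 'b' out */
    if (cache_evict_until(&cache, 500))
        printf("cache now holds %zu of %zu bytes\n",
               cache.mem_used, cache.mem_limit);
    return 0;
}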
For parameterized nested loop joins, we now consider placing one of these result cache nodes between the nested loop node and its inner node. We determine when this might be useful based on cost, which is primarily driven by the expected cache hit ratio. Estimating the cache hit ratio relies on having good distinct estimates for the nested loop's parameters.

For now, the planner will only consider using a result cache for parameterized nested loop joins. This works both for normal joins and for LATERAL-type joins to subqueries. It would be possible to use this new node for other purposes in the future, for example to cache results from correlated subqueries, but that's not done here due to the difficulty of obtaining a distinct estimate on the outer plan to calculate the estimated cache hit ratio: we currently plan the inner plan before planning the outer plan, so there is no good way to know whether a result cache would be useful, since we can't estimate the number of times the subplan will be called until the outer plan is generated.

The functionality being added here newly introduces a dependency on the return value of estimate_num_groups() during the join search. Previously, during the join search, we only ever needed to perform selectivity estimations. With this commit, we need to use estimate_num_groups() in order to estimate what the hit ratio on the result cache will be. In simple terms, if we expect 10 distinct values and 1000 outer rows, then we'll estimate the hit ratio to be 99%. Since cache hits are very cheap compared to scanning the underlying nodes on the inner side of the nested loop join, this will significantly reduce the planner's cost for the join. However, it's fairly easy to see that things will go bad when estimate_num_groups() returns a value that's significantly lower than the actual number of distinct values. If this happens, we may choose a nested loop join with a result cache instead of some other join type, such as a merge or hash join. Our distinct estimations have been known to be a source of trouble in the past, so the extra reliance on them here could cause the planner to choose slower plans than it did before this feature existed. Distinct estimations are also fairly hard to make accurately when several tables have already been joined, or when a WHERE clause filters out a set of values that are correlated to the expressions whose number of distinct values we're estimating.

For now, the costing we perform during query planning for result caches puts quite a bit of faith in the distinct estimations being accurate. When they are accurate, we should generally see faster execution times for plans containing a result cache. However, in the real world we may find that we need to change the costings to put less trust in the distinct estimations, or perhaps even disable this feature by default. There's always a risk, when we teach the query planner a new trick, that it decides to use that trick at the wrong time and causes a regression. Users may opt for the old behavior by turning the feature off with the enable_resultcache GUC. Currently, this is enabled by default; it remains to be seen whether we'll keep that setting for the release.

Additionally, the name "Result Cache" is the best name I could think of for this new node at the time I started writing the patch. Nobody seems to strongly dislike the name. A few people did suggest other names, but no other name dominated in the brief discussion there was about names. Let's allow the beta period to show whether the current name pleases enough people; if there's consensus on a better name, we can change it before the release. Please see the second discussion link below for the discussion on the "Result Cache" name.

Author: David Rowley
Reviewed-by: Andy Fan, Justin Pryzby, Zhihong Yu
Tested-By: Konstantin Knizhnik
Discussion: https://postgr.es/m/CAApHDvrPcQyQdWERGYWx8J%2B2DLUNgXu%2BfOSbQ1UscxrunyXyrQ%40mail.gmail.com
Discussion: https://postgr.es/m/CAApHDvq=yQXr5kqhRviT2RhNKwToaWr9JAN5t+5_PzhuRJ3wvg@mail.gmail.com
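For reference, the 99% figure mentioned in the commit message above follows from simple arithmetic: with N expected lookups and D expected distinct parameter values, and assuming all D entries fit in the cache at once, only the first lookup of each distinct value can miss, so roughly (N - D) / N of the lookups are hits. A minimal sketch of that calculation is below; est_hit_ratio() is a name invented for this illustration, not the function used by the planner's costing code.

/*
 * Illustrative sketch only: the back-of-envelope hit-ratio arithmetic from
 * the commit message above.  The real costing is considerably more involved.
 */
#include <stdio.h>

static double
est_hit_ratio(double est_calls, double est_distinct)
{
    /* if every lookup is expected to see a new value, every lookup misses */
    if (est_calls <= 0.0 || est_distinct >= est_calls)
        return 0.0;

    /* only the first lookup of each distinct value misses */
    return (est_calls - est_distinct) / est_calls;
}

int
main(void)
{
    /* 10 distinct values over 1000 outer rows => 0.99, i.e. a 99% hit ratio */
    printf("%.2f\n", est_hit_ratio(1000.0, 10.0));
    return 0;
}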
/*-------------------------------------------------------------------------
 *
 * execProcnode.c
 *   contains dispatch functions which call the appropriate "initialize",
 *   "get a tuple", and "cleanup" routines for the given node type.
 *   If the node has children, then it will presumably call ExecInitNode,
 *   ExecProcNode, or ExecEndNode on its subnodes and do the appropriate
 *   processing.
 *
 * Portions Copyright (c) 1996-2021, PostgreSQL Global Development Group
 * Portions Copyright (c) 1994, Regents of the University of California
 *
 *
 * IDENTIFICATION
 *    src/backend/executor/execProcnode.c
 *
 *-------------------------------------------------------------------------
 */
/*
 *   NOTES
 *      This used to be three files.  It is now all combined into
 *      one file so that it is easier to keep the dispatch routines
 *      in sync when new nodes are added.
 *
 *   EXAMPLE
 *      Suppose we want the age of the manager of the shoe department and
 *      the number of employees in that department.  So we have the query:
 *
 *              select DEPT.no_emps, EMP.age
 *              from DEPT, EMP
 *              where EMP.name = DEPT.mgr and
 *                    DEPT.name = "shoe"
 *
 *      Suppose the planner gives us the following plan:
 *
 *                      Nest Loop (DEPT.mgr = EMP.name)
 *                      /       \
 *                     /         \
 *                 Seq Scan     Seq Scan
 *                  DEPT          EMP
 *              (name = "shoe")
 *
 *      ExecutorStart() is called first.
 *      It calls InitPlan() which calls ExecInitNode() on
 *      the root of the plan -- the nest loop node.
 *
 *    * ExecInitNode() notices that it is looking at a nest loop and
 *      as the code below demonstrates, it calls ExecInitNestLoop().
 *      Eventually this calls ExecInitNode() on the right and left subplans
 *      and so forth until the entire plan is initialized.  The result
 *      of ExecInitNode() is a plan state tree built with the same structure
 *      as the underlying plan tree.
 *
 *    * Then when ExecutorRun() is called, it calls ExecutePlan() which calls
 *      ExecProcNode() repeatedly on the top node of the plan state tree.
 *      Each time this happens, ExecProcNode() will end up calling
 *      ExecNestLoop(), which calls ExecProcNode() on its subplans.
 *      Each of these subplans is a sequential scan so ExecSeqScan() is
 *      called.  The slots returned by ExecSeqScan() may contain
 *      tuples which contain the attributes ExecNestLoop() uses to
 *      form the tuples it returns.
 *
 *    * Eventually ExecSeqScan() stops returning tuples and the nest
 *      loop join ends.  Lastly, ExecutorEnd() calls ExecEndNode() which
 *      calls ExecEndNestLoop() which in turn calls ExecEndNode() on
 *      its subplans which result in ExecEndSeqScan().
 *
 *      This should show how the executor works by having
 *      ExecInitNode(), ExecProcNode() and ExecEndNode() dispatch
 *      their work to the appropriate node support routines which may
 *      in turn call these routines themselves on their subplans.
 */
#include "postgres.h"

#include "executor/executor.h"
#include "executor/nodeAgg.h"
#include "executor/nodeAppend.h"
#include "executor/nodeBitmapAnd.h"
#include "executor/nodeBitmapHeapscan.h"
#include "executor/nodeBitmapIndexscan.h"
#include "executor/nodeBitmapOr.h"
#include "executor/nodeCtescan.h"
#include "executor/nodeCustom.h"
#include "executor/nodeForeignscan.h"
#include "executor/nodeFunctionscan.h"
#include "executor/nodeGather.h"
#include "executor/nodeGatherMerge.h"
#include "executor/nodeGroup.h"
#include "executor/nodeHash.h"
#include "executor/nodeHashjoin.h"
#include "executor/nodeIncrementalSort.h"
#include "executor/nodeIndexonlyscan.h"
#include "executor/nodeIndexscan.h"
#include "executor/nodeLimit.h"
#include "executor/nodeLockRows.h"
#include "executor/nodeMaterial.h"
#include "executor/nodeMergeAppend.h"
#include "executor/nodeMergejoin.h"
#include "executor/nodeModifyTable.h"
#include "executor/nodeNamedtuplestorescan.h"
#include "executor/nodeNestloop.h"
#include "executor/nodeProjectSet.h"
#include "executor/nodeRecursiveunion.h"
#include "executor/nodeResult.h"
#include "executor/nodeResultCache.h"
#include "executor/nodeSamplescan.h"
#include "executor/nodeSeqscan.h"
#include "executor/nodeSetOp.h"
#include "executor/nodeSort.h"
#include "executor/nodeSubplan.h"
#include "executor/nodeSubqueryscan.h"
#include "executor/nodeTableFuncscan.h"
#include "executor/nodeTidrangescan.h"
#include "executor/nodeTidscan.h"
#include "executor/nodeUnique.h"
#include "executor/nodeValuesscan.h"
#include "executor/nodeWindowAgg.h"
#include "executor/nodeWorktablescan.h"
#include "miscadmin.h"
#include "nodes/nodeFuncs.h"

static TupleTableSlot *ExecProcNodeFirst(PlanState *node);
static TupleTableSlot *ExecProcNodeInstr(PlanState *node);


/* ------------------------------------------------------------------------
 *      ExecInitNode
 *
 *      Recursively initializes all the nodes in the plan tree rooted
 *      at 'node'.
 *
 *      Inputs:
 *        'node' is the current node of the plan produced by the query planner
 *        'estate' is the shared execution state for the plan tree
 *        'eflags' is a bitwise OR of flag bits described in executor.h
 *
 *      Returns a PlanState node corresponding to the given Plan node.
 * ------------------------------------------------------------------------
 */
PlanState *
ExecInitNode(Plan *node, EState *estate, int eflags)
{
    PlanState  *result;
    List       *subps;
    ListCell   *l;

    /*
     * do nothing when we get to the end of a leaf on tree.
     */
    if (node == NULL)
        return NULL;

    /*
     * Make sure there's enough stack available. Need to check here, in
     * addition to ExecProcNode() (via ExecProcNodeFirst()), to ensure the
     * stack isn't overrun while initializing the node tree.
     */
    check_stack_depth();

    switch (nodeTag(node))
    {
            /*
             * control nodes
             */
        case T_Result:
            result = (PlanState *) ExecInitResult((Result *) node,
                                                  estate, eflags);
            break;

        case T_ProjectSet:
            result = (PlanState *) ExecInitProjectSet((ProjectSet *) node,
                                                      estate, eflags);
            break;

        case T_ModifyTable:
            result = (PlanState *) ExecInitModifyTable((ModifyTable *) node,
                                                       estate, eflags);
            break;

        case T_Append:
            result = (PlanState *) ExecInitAppend((Append *) node,
                                                  estate, eflags);
            break;

        case T_MergeAppend:
            result = (PlanState *) ExecInitMergeAppend((MergeAppend *) node,
                                                       estate, eflags);
            break;

        case T_RecursiveUnion:
            result = (PlanState *) ExecInitRecursiveUnion((RecursiveUnion *) node,
                                                          estate, eflags);
            break;

        case T_BitmapAnd:
            result = (PlanState *) ExecInitBitmapAnd((BitmapAnd *) node,
                                                     estate, eflags);
            break;

        case T_BitmapOr:
            result = (PlanState *) ExecInitBitmapOr((BitmapOr *) node,
                                                    estate, eflags);
            break;

            /*
             * scan nodes
             */
        case T_SeqScan:
            result = (PlanState *) ExecInitSeqScan((SeqScan *) node,
                                                   estate, eflags);
            break;

        case T_SampleScan:
            result = (PlanState *) ExecInitSampleScan((SampleScan *) node,
                                                      estate, eflags);
            break;

        case T_IndexScan:
            result = (PlanState *) ExecInitIndexScan((IndexScan *) node,
                                                     estate, eflags);
            break;

        case T_IndexOnlyScan:
            result = (PlanState *) ExecInitIndexOnlyScan((IndexOnlyScan *) node,
                                                         estate, eflags);
            break;

        case T_BitmapIndexScan:
            result = (PlanState *) ExecInitBitmapIndexScan((BitmapIndexScan *) node,
                                                           estate, eflags);
            break;

        case T_BitmapHeapScan:
            result = (PlanState *) ExecInitBitmapHeapScan((BitmapHeapScan *) node,
                                                          estate, eflags);
            break;

        case T_TidScan:
            result = (PlanState *) ExecInitTidScan((TidScan *) node,
                                                   estate, eflags);
            break;

        case T_TidRangeScan:
            result = (PlanState *) ExecInitTidRangeScan((TidRangeScan *) node,
                                                        estate, eflags);
            break;

        case T_SubqueryScan:
            result = (PlanState *) ExecInitSubqueryScan((SubqueryScan *) node,
                                                        estate, eflags);
            break;

        case T_FunctionScan:
            result = (PlanState *) ExecInitFunctionScan((FunctionScan *) node,
                                                        estate, eflags);
            break;

        case T_TableFuncScan:
            result = (PlanState *) ExecInitTableFuncScan((TableFuncScan *) node,
                                                         estate, eflags);
            break;

        case T_ValuesScan:
            result = (PlanState *) ExecInitValuesScan((ValuesScan *) node,
                                                      estate, eflags);
            break;

        case T_CteScan:
            result = (PlanState *) ExecInitCteScan((CteScan *) node,
                                                   estate, eflags);
            break;

        case T_NamedTuplestoreScan:
            result = (PlanState *) ExecInitNamedTuplestoreScan((NamedTuplestoreScan *) node,
                                                               estate, eflags);
            break;

        case T_WorkTableScan:
            result = (PlanState *) ExecInitWorkTableScan((WorkTableScan *) node,
                                                         estate, eflags);
            break;

        case T_ForeignScan:
            result = (PlanState *) ExecInitForeignScan((ForeignScan *) node,
                                                       estate, eflags);
            break;

        case T_CustomScan:
            result = (PlanState *) ExecInitCustomScan((CustomScan *) node,
                                                      estate, eflags);
            break;

            /*
             * join nodes
             */
        case T_NestLoop:
            result = (PlanState *) ExecInitNestLoop((NestLoop *) node,
                                                    estate, eflags);
            break;

        case T_MergeJoin:
            result = (PlanState *) ExecInitMergeJoin((MergeJoin *) node,
                                                     estate, eflags);
            break;

        case T_HashJoin:
            result = (PlanState *) ExecInitHashJoin((HashJoin *) node,
                                                    estate, eflags);
            break;

            /*
             * materialization nodes
             */
        case T_Material:
            result = (PlanState *) ExecInitMaterial((Material *) node,
                                                    estate, eflags);
            break;

        case T_Sort:
            result = (PlanState *) ExecInitSort((Sort *) node,
                                                estate, eflags);
            break;

        case T_IncrementalSort:
            result = (PlanState *) ExecInitIncrementalSort((IncrementalSort *) node,
                                                           estate, eflags);
            break;

        case T_ResultCache:
            result = (PlanState *) ExecInitResultCache((ResultCache *) node,
                                                       estate, eflags);
            break;

        case T_Group:
            result = (PlanState *) ExecInitGroup((Group *) node,
                                                 estate, eflags);
            break;

        case T_Agg:
            result = (PlanState *) ExecInitAgg((Agg *) node,
                                               estate, eflags);
            break;

        case T_WindowAgg:
            result = (PlanState *) ExecInitWindowAgg((WindowAgg *) node,
                                                     estate, eflags);
            break;

        case T_Unique:
            result = (PlanState *) ExecInitUnique((Unique *) node,
                                                  estate, eflags);
            break;

        case T_Gather:
            result = (PlanState *) ExecInitGather((Gather *) node,
                                                  estate, eflags);
            break;

        case T_GatherMerge:
            result = (PlanState *) ExecInitGatherMerge((GatherMerge *) node,
                                                       estate, eflags);
            break;

        case T_Hash:
            result = (PlanState *) ExecInitHash((Hash *) node,
                                                estate, eflags);
            break;

        case T_SetOp:
            result = (PlanState *) ExecInitSetOp((SetOp *) node,
                                                 estate, eflags);
            break;

        case T_LockRows:
            result = (PlanState *) ExecInitLockRows((LockRows *) node,
                                                    estate, eflags);
            break;

        case T_Limit:
            result = (PlanState *) ExecInitLimit((Limit *) node,
                                                 estate, eflags);
            break;

        default:
            elog(ERROR, "unrecognized node type: %d", (int) nodeTag(node));
            result = NULL;      /* keep compiler quiet */
            break;
    }

    ExecSetExecProcNode(result, result->ExecProcNode);

    /*
     * Initialize any initPlans present in this node. The planner put them in
     * a separate list for us.
     */
    subps = NIL;
    foreach(l, node->initPlan)
    {
        SubPlan    *subplan = (SubPlan *) lfirst(l);
        SubPlanState *sstate;

        Assert(IsA(subplan, SubPlan));
        sstate = ExecInitSubPlan(subplan, result);
        subps = lappend(subps, sstate);
    }
    result->initPlan = subps;

    /* Set up instrumentation for this node if requested */
    if (estate->es_instrument)
        result->instrument = InstrAlloc(1, estate->es_instrument);

    return result;
}


/*
 * If a node wants to change its ExecProcNode function after ExecInitNode()
 * has finished, it should do so with this function. That way any wrapper
 * functions can be reinstalled, without the node having to know how that
 * works.
 */
void
ExecSetExecProcNode(PlanState *node, ExecProcNodeMtd function)
{
    /*
     * Add a wrapper around the ExecProcNode callback that checks stack depth
     * during the first execution and maybe adds an instrumentation wrapper.
     * When the callback is changed after execution has already begun that
     * means we'll superfluously execute ExecProcNodeFirst, but that seems ok.
     */
    node->ExecProcNodeReal = function;
    node->ExecProcNode = ExecProcNodeFirst;
}


/*
 * ExecProcNode wrapper that performs some one-time checks, before calling
 * the relevant node method (possibly via an instrumentation wrapper).
 */
static TupleTableSlot *
ExecProcNodeFirst(PlanState *node)
{
    /*
     * Perform stack depth check during the first execution of the node.  We
     * only do so the first time round because it turns out to not be cheap on
     * some common architectures (eg. x86).  This relies on the assumption
     * that ExecProcNode calls for a given plan node will always be made at
     * roughly the same stack depth.
     */
    check_stack_depth();

    /*
     * If instrumentation is required, change the wrapper to one that just
     * does instrumentation.  Otherwise we can dispense with all wrappers and
     * have ExecProcNode() directly call the relevant function from now on.
     */
    if (node->instrument)
        node->ExecProcNode = ExecProcNodeInstr;
    else
        node->ExecProcNode = node->ExecProcNodeReal;

    return node->ExecProcNode(node);
}


/*
 * ExecProcNode wrapper that performs instrumentation calls.  By keeping
 * this a separate function, we avoid overhead in the normal case where
 * no instrumentation is wanted.
 */
static TupleTableSlot *
ExecProcNodeInstr(PlanState *node)
{
    TupleTableSlot *result;

    InstrStartNode(node->instrument);

    result = node->ExecProcNodeReal(node);

    InstrStopNode(node->instrument, TupIsNull(result) ? 0.0 : 1.0);

    return result;
}


/* ----------------------------------------------------------------
 *      MultiExecProcNode
 *
 *      Execute a node that doesn't return individual tuples
 *      (it might return a hashtable, bitmap, etc).  Caller should
 *      check it got back the expected kind of Node.
 *
 *      This has essentially the same responsibilities as ExecProcNode,
 *      but it does not do InstrStartNode/InstrStopNode (mainly because
 *      it can't tell how many returned tuples to count).  Each per-node
 *      function must provide its own instrumentation support.
 * ----------------------------------------------------------------
 */
Node *
MultiExecProcNode(PlanState *node)
{
    Node       *result;

    check_stack_depth();

    CHECK_FOR_INTERRUPTS();

    if (node->chgParam != NULL) /* something changed */
        ExecReScan(node);       /* let ReScan handle this */

    switch (nodeTag(node))
    {
            /*
             * Only node types that actually support multiexec will be listed
             */

        case T_HashState:
            result = MultiExecHash((HashState *) node);
            break;

        case T_BitmapIndexScanState:
            result = MultiExecBitmapIndexScan((BitmapIndexScanState *) node);
            break;

        case T_BitmapAndState:
            result = MultiExecBitmapAnd((BitmapAndState *) node);
            break;

        case T_BitmapOrState:
            result = MultiExecBitmapOr((BitmapOrState *) node);
            break;

        default:
            elog(ERROR, "unrecognized node type: %d", (int) nodeTag(node));
            result = NULL;
            break;
    }

    return result;
}


/* ----------------------------------------------------------------
 *      ExecEndNode
 *
 *      Recursively cleans up all the nodes in the plan rooted
 *      at 'node'.
 *
 *      After this operation, the query plan will not be able to be
 *      processed any further.  This should be called only after
 *      the query plan has been fully executed.
 * ----------------------------------------------------------------
 */
void
ExecEndNode(PlanState *node)
{
    /*
     * do nothing when we get to the end of a leaf on tree.
     */
    if (node == NULL)
        return;

    /*
     * Make sure there's enough stack available. Need to check here, in
     * addition to ExecProcNode() (via ExecProcNodeFirst()), because it's not
     * guaranteed that ExecProcNode() is reached for all nodes.
     */
    check_stack_depth();

    if (node->chgParam != NULL)
    {
        bms_free(node->chgParam);
        node->chgParam = NULL;
    }

    switch (nodeTag(node))
    {
            /*
             * control nodes
             */
        case T_ResultState:
            ExecEndResult((ResultState *) node);
            break;

        case T_ProjectSetState:
            ExecEndProjectSet((ProjectSetState *) node);
            break;

        case T_ModifyTableState:
            ExecEndModifyTable((ModifyTableState *) node);
            break;

        case T_AppendState:
            ExecEndAppend((AppendState *) node);
            break;

        case T_MergeAppendState:
            ExecEndMergeAppend((MergeAppendState *) node);
            break;

        case T_RecursiveUnionState:
            ExecEndRecursiveUnion((RecursiveUnionState *) node);
            break;

        case T_BitmapAndState:
            ExecEndBitmapAnd((BitmapAndState *) node);
            break;

        case T_BitmapOrState:
            ExecEndBitmapOr((BitmapOrState *) node);
            break;

            /*
             * scan nodes
             */
        case T_SeqScanState:
            ExecEndSeqScan((SeqScanState *) node);
            break;

        case T_SampleScanState:
            ExecEndSampleScan((SampleScanState *) node);
            break;

        case T_GatherState:
            ExecEndGather((GatherState *) node);
            break;

        case T_GatherMergeState:
            ExecEndGatherMerge((GatherMergeState *) node);
            break;

        case T_IndexScanState:
            ExecEndIndexScan((IndexScanState *) node);
            break;

        case T_IndexOnlyScanState:
            ExecEndIndexOnlyScan((IndexOnlyScanState *) node);
            break;

        case T_BitmapIndexScanState:
            ExecEndBitmapIndexScan((BitmapIndexScanState *) node);
            break;

        case T_BitmapHeapScanState:
            ExecEndBitmapHeapScan((BitmapHeapScanState *) node);
            break;

        case T_TidScanState:
            ExecEndTidScan((TidScanState *) node);
            break;

        case T_TidRangeScanState:
            ExecEndTidRangeScan((TidRangeScanState *) node);
            break;

        case T_SubqueryScanState:
            ExecEndSubqueryScan((SubqueryScanState *) node);
            break;

        case T_FunctionScanState:
            ExecEndFunctionScan((FunctionScanState *) node);
            break;

        case T_TableFuncScanState:
            ExecEndTableFuncScan((TableFuncScanState *) node);
            break;

        case T_ValuesScanState:
            ExecEndValuesScan((ValuesScanState *) node);
            break;

        case T_CteScanState:
            ExecEndCteScan((CteScanState *) node);
            break;

        case T_NamedTuplestoreScanState:
            ExecEndNamedTuplestoreScan((NamedTuplestoreScanState *) node);
            break;

        case T_WorkTableScanState:
            ExecEndWorkTableScan((WorkTableScanState *) node);
            break;

        case T_ForeignScanState:
            ExecEndForeignScan((ForeignScanState *) node);
            break;

        case T_CustomScanState:
            ExecEndCustomScan((CustomScanState *) node);
            break;

            /*
             * join nodes
             */
        case T_NestLoopState:
            ExecEndNestLoop((NestLoopState *) node);
            break;

        case T_MergeJoinState:
            ExecEndMergeJoin((MergeJoinState *) node);
            break;

        case T_HashJoinState:
            ExecEndHashJoin((HashJoinState *) node);
            break;

            /*
             * materialization nodes
             */
        case T_MaterialState:
            ExecEndMaterial((MaterialState *) node);
            break;

        case T_SortState:
            ExecEndSort((SortState *) node);
            break;

        case T_IncrementalSortState:
            ExecEndIncrementalSort((IncrementalSortState *) node);
            break;

        case T_ResultCacheState:
            ExecEndResultCache((ResultCacheState *) node);
            break;

        case T_GroupState:
            ExecEndGroup((GroupState *) node);
            break;

        case T_AggState:
            ExecEndAgg((AggState *) node);
            break;

        case T_WindowAggState:
            ExecEndWindowAgg((WindowAggState *) node);
            break;

        case T_UniqueState:
            ExecEndUnique((UniqueState *) node);
            break;

        case T_HashState:
            ExecEndHash((HashState *) node);
            break;

        case T_SetOpState:
            ExecEndSetOp((SetOpState *) node);
            break;

        case T_LockRowsState:
            ExecEndLockRows((LockRowsState *) node);
            break;

        case T_LimitState:
            ExecEndLimit((LimitState *) node);
            break;

        default:
            elog(ERROR, "unrecognized node type: %d", (int) nodeTag(node));
            break;
    }
}

/*
 * ExecShutdownNode
 *
 * Give execution nodes a chance to stop asynchronous resource consumption
 * and release any resources still held.
 */
bool
ExecShutdownNode(PlanState *node)
{
    if (node == NULL)
        return false;

    check_stack_depth();

    /*
     * Treat the node as running while we shut it down, but only if it's run
     * at least once already.  We don't expect much CPU consumption during
     * node shutdown, but in the case of Gather or Gather Merge, we may shut
     * down workers at this stage.  If so, their buffer usage will get
     * propagated into pgBufferUsage at this point, and we want to make sure
     * that it gets associated with the Gather node.  We skip this if the
     * node has never been executed, so as to avoid incorrectly making it
     * appear that it has.
     */
    if (node->instrument && node->instrument->running)
        InstrStartNode(node->instrument);

    planstate_tree_walker(node, ExecShutdownNode, NULL);

    switch (nodeTag(node))
    {
        case T_GatherState:
            ExecShutdownGather((GatherState *) node);
            break;
        case T_ForeignScanState:
            ExecShutdownForeignScan((ForeignScanState *) node);
            break;
        case T_CustomScanState:
            ExecShutdownCustomScan((CustomScanState *) node);
            break;
        case T_GatherMergeState:
            ExecShutdownGatherMerge((GatherMergeState *) node);
            break;
        case T_HashState:
            ExecShutdownHash((HashState *) node);
            break;
        case T_HashJoinState:
            ExecShutdownHashJoin((HashJoinState *) node);
            break;
        default:
            break;
    }

    /* Stop the node if we started it above, reporting 0 tuples. */
    if (node->instrument && node->instrument->running)
        InstrStopNode(node->instrument, 0);

    return false;
}

/*
 * ExecSetTupleBound
 *
 * Set a tuple bound for a planstate node.  This lets child plan nodes
 * optimize based on the knowledge that the maximum number of tuples that
 * their parent will demand is limited.  The tuple bound for a node may
 * only be changed between scans (i.e., after node initialization or just
 * before an ExecReScan call).
 *
 * Any negative tuples_needed value means "no limit", which should be the
 * default assumption when this is not called at all for a particular node.
 *
 * Note: if this is called repeatedly on a plan tree, the exact same set
 * of nodes must be updated with the new limit each time; be careful that
 * only unchanging conditions are tested here.
 */
void
ExecSetTupleBound(int64 tuples_needed, PlanState *child_node)
{
    /*
     * Since this function recurses, in principle we should check stack depth
     * here.  In practice, it's probably pointless since the earlier node
     * initialization tree traversal would surely have consumed more stack.
     */

    if (IsA(child_node, SortState))
    {
        /*
         * If it is a Sort node, notify it that it can use bounded sort.
         *
         * Note: it is the responsibility of nodeSort.c to react properly to
         * changes of these parameters.  If we ever redesign this, it'd be a
         * good idea to integrate this signaling with the parameter-change
         * mechanism.
         */
        SortState  *sortState = (SortState *) child_node;

        if (tuples_needed < 0)
        {
            /* make sure flag gets reset if needed upon rescan */
            sortState->bounded = false;
        }
        else
        {
            sortState->bounded = true;
            sortState->bound = tuples_needed;
        }
    }
    else if (IsA(child_node, IncrementalSortState))
    {
        /*
         * If it is an IncrementalSort node, notify it that it can use bounded
         * sort.
         *
         * Note: it is the responsibility of nodeIncrementalSort.c to react
         * properly to changes of these parameters.  If we ever redesign this,
         * it'd be a good idea to integrate this signaling with the
         * parameter-change mechanism.
         */
        IncrementalSortState *sortState = (IncrementalSortState *) child_node;

        if (tuples_needed < 0)
        {
            /* make sure flag gets reset if needed upon rescan */
            sortState->bounded = false;
        }
        else
        {
            sortState->bounded = true;
            sortState->bound = tuples_needed;
        }
    }
    else if (IsA(child_node, AppendState))
    {
        /*
         * If it is an Append, we can apply the bound to any nodes that are
         * children of the Append, since the Append surely need read no more
         * than that many tuples from any one input.
         */
        AppendState *aState = (AppendState *) child_node;
        int         i;

        for (i = 0; i < aState->as_nplans; i++)
            ExecSetTupleBound(tuples_needed, aState->appendplans[i]);
    }
    else if (IsA(child_node, MergeAppendState))
    {
        /*
         * If it is a MergeAppend, we can apply the bound to any nodes that
         * are children of the MergeAppend, since the MergeAppend surely need
         * read no more than that many tuples from any one input.
         */
        MergeAppendState *maState = (MergeAppendState *) child_node;
        int         i;

        for (i = 0; i < maState->ms_nplans; i++)
            ExecSetTupleBound(tuples_needed, maState->mergeplans[i]);
    }
    else if (IsA(child_node, ResultState))
    {
        /*
         * Similarly, for a projecting Result, we can apply the bound to its
         * child node.
         *
         * If Result supported qual checking, we'd have to punt on seeing a
         * qual.  Note that having a resconstantqual is not a showstopper: if
         * that condition succeeds it affects nothing, while if it fails, no
         * rows will be demanded from the Result child anyway.
         */
        if (outerPlanState(child_node))
            ExecSetTupleBound(tuples_needed, outerPlanState(child_node));
    }
    else if (IsA(child_node, SubqueryScanState))
    {
        /*
         * We can also descend through SubqueryScan, but only if it has no
         * qual (otherwise it might discard rows).
         */
        SubqueryScanState *subqueryState = (SubqueryScanState *) child_node;

        if (subqueryState->ss.ps.qual == NULL)
            ExecSetTupleBound(tuples_needed, subqueryState->subplan);
    }
    else if (IsA(child_node, GatherState))
    {
        /*
         * A Gather node can propagate the bound to its workers.  As with
         * MergeAppend, no one worker could possibly need to return more
         * tuples than the Gather itself needs to.
         *
         * Note: As with Sort, the Gather node is responsible for reacting
         * properly to changes to this parameter.
         */
        GatherState *gstate = (GatherState *) child_node;

        gstate->tuples_needed = tuples_needed;

        /* Also pass down the bound to our own copy of the child plan */
        ExecSetTupleBound(tuples_needed, outerPlanState(child_node));
    }
    else if (IsA(child_node, GatherMergeState))
    {
        /* Same comments as for Gather */
        GatherMergeState *gstate = (GatherMergeState *) child_node;

        gstate->tuples_needed = tuples_needed;

        ExecSetTupleBound(tuples_needed, outerPlanState(child_node));
    }

    /*
     * In principle we could descend through any plan node type that is
     * certain not to discard or combine input rows; but on seeing a node that
     * can do that, we can't propagate the bound any further.  For the moment
     * it's unclear that any other cases are worth checking here.
     */
}