mirror of
https://github.com/postgres/postgres.git
synced 2025-10-18 04:29:09 +03:00
Improve the naming of Parallel Hash Join phases.
* Commit 3048898e dropped -ING from PHJ wait event names. Update the
  corresponding barrier phase names to match.
* Rename the "DONE" phases to "FREE". That's symmetrical with "ALLOCATE",
  and names the activity that actually happens in that phase (as we do for
  the other phases) rather than a state. The bug fixed by commit 8d578b9b
  might have been more obvious with this name.
* Rename the batch/bucket growth barriers' "ALLOCATE" phases to
  "REALLOCATE", a better description of what they do.
* Update the high level comments about phases to highlight phases that are
  executed by a single process with an asterisk (mostly memory management
  phases).

No behavior change, as this is just improving internal identifiers. The
only user-visible sign of this is that a couple of wait events' display
names change from "...Allocate" to "...Reallocate" in pg_stat_activity,
to stay in sync with the internal names.

Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Discussion: https://postgr.es/m/CA%2BhUKG%2BMDpwF2Eo2LAvzd%3DpOh81wUTsrwU1uAwR-v6OGBB6%2B7g%40mail.gmail.com
@@ -39,27 +39,30 @@
  *
  * One barrier called build_barrier is used to coordinate the hashing phases.
  * The phase is represented by an integer which begins at zero and increments
- * one by one, but in the code it is referred to by symbolic names as follows:
+ * one by one, but in the code it is referred to by symbolic names as follows.
+ * An asterisk indicates a phase that is performed by a single arbitrarily
+ * chosen process.
  *
- *   PHJ_BUILD_ELECTING              -- initial state
- *   PHJ_BUILD_ALLOCATING            -- one sets up the batches and table 0
- *   PHJ_BUILD_HASHING_INNER         -- all hash the inner rel
- *   PHJ_BUILD_HASHING_OUTER         -- (multi-batch only) all hash the outer
- *   PHJ_BUILD_RUNNING               -- building done, probing can begin
- *   PHJ_BUILD_DONE                  -- all work complete, one frees batches
+ *   PHJ_BUILD_ELECT                 -- initial state
+ *   PHJ_BUILD_ALLOCATE*             -- one sets up the batches and table 0
+ *   PHJ_BUILD_HASH_INNER            -- all hash the inner rel
+ *   PHJ_BUILD_HASH_OUTER            -- (multi-batch only) all hash the outer
+ *   PHJ_BUILD_RUN                   -- building done, probing can begin
+ *   PHJ_BUILD_FREE*                 -- all work complete, one frees batches
  *
- * While in the phase PHJ_BUILD_HASHING_INNER a separate pair of barriers may
+ * While in the phase PHJ_BUILD_HASH_INNER a separate pair of barriers may
  * be used repeatedly as required to coordinate expansions in the number of
  * batches or buckets.  Their phases are as follows:
  *
- *   PHJ_GROW_BATCHES_ELECTING       -- initial state
- *   PHJ_GROW_BATCHES_ALLOCATING     -- one allocates new batches
- *   PHJ_GROW_BATCHES_REPARTITIONING -- all repartition
- *   PHJ_GROW_BATCHES_FINISHING      -- one cleans up, detects skew
+ *   PHJ_GROW_BATCHES_ELECT          -- initial state
+ *   PHJ_GROW_BATCHES_REALLOCATE*    -- one allocates new batches
+ *   PHJ_GROW_BATCHES_REPARTITION    -- all repartition
+ *   PHJ_GROW_BATCHES_DECIDE*        -- one detects skew and cleans up
+ *   PHJ_GROW_BATCHES_FINISH         -- finished one growth cycle
  *
- *   PHJ_GROW_BUCKETS_ELECTING       -- initial state
- *   PHJ_GROW_BUCKETS_ALLOCATING     -- one allocates new buckets
- *   PHJ_GROW_BUCKETS_REINSERTING    -- all insert tuples
+ *   PHJ_GROW_BUCKETS_ELECT          -- initial state
+ *   PHJ_GROW_BUCKETS_REALLOCATE*    -- one allocates new buckets
+ *   PHJ_GROW_BUCKETS_REINSERT       -- all insert tuples
  *
  * If the planner got the number of batches and buckets right, those won't be
  * necessary, but on the other hand we might finish up needing to expand the
@@ -67,27 +70,27 @@
  * within our memory budget and load factor target.  For that reason it's a
  * separate pair of barriers using circular phases.
  *
- * The PHJ_BUILD_HASHING_OUTER phase is required only for multi-batch joins,
+ * The PHJ_BUILD_HASH_OUTER phase is required only for multi-batch joins,
  * because we need to divide the outer relation into batches up front in order
  * to be able to process batches entirely independently.  In contrast, the
  * parallel-oblivious algorithm simply throws tuples 'forward' to 'later'
  * batches whenever it encounters them while scanning and probing, which it
  * can do because it processes batches in serial order.
  *
- * Once PHJ_BUILD_RUNNING is reached, backends then split up and process
+ * Once PHJ_BUILD_RUN is reached, backends then split up and process
  * different batches, or gang up and work together on probing batches if there
  * aren't enough to go around.  For each batch there is a separate barrier
  * with the following phases:
  *
- *   PHJ_BATCH_ELECTING       -- initial state
- *   PHJ_BATCH_ALLOCATING     -- one allocates buckets
- *   PHJ_BATCH_LOADING        -- all load the hash table from disk
- *   PHJ_BATCH_PROBING        -- all probe
- *   PHJ_BATCH_DONE           -- end
+ *   PHJ_BATCH_ELECT          -- initial state
+ *   PHJ_BATCH_ALLOCATE*      -- one allocates buckets
+ *   PHJ_BATCH_LOAD           -- all load the hash table from disk
+ *   PHJ_BATCH_PROBE          -- all probe
+ *   PHJ_BATCH_FREE*          -- one frees memory
  *
  * Batch 0 is a special case, because it starts out in phase
- * PHJ_BATCH_PROBING; populating batch 0's hash table is done during
- * PHJ_BUILD_HASHING_INNER so we can skip loading.
+ * PHJ_BATCH_PROBE; populating batch 0's hash table is done during
+ * PHJ_BUILD_HASH_INNER so we can skip loading.
  *
  * Initially we try to plan for a single-batch hash join using the combined
  * hash_mem of all participants to create a large shared hash table.  If that
@@ -99,8 +102,8 @@
  * finished.  Practically, that means that we never emit a tuple while attached
  * to a barrier, unless the barrier has reached a phase that means that no
  * process will wait on it again.  We emit tuples while attached to the build
- * barrier in phase PHJ_BUILD_RUNNING, and to a per-batch barrier in phase
- * PHJ_BATCH_PROBING.  These are advanced to PHJ_BUILD_DONE and PHJ_BATCH_DONE
+ * barrier in phase PHJ_BUILD_RUN, and to a per-batch barrier in phase
+ * PHJ_BATCH_PROBE.  These are advanced to PHJ_BUILD_FREE and PHJ_BATCH_FREE
  * respectively without waiting, using BarrierArriveAndDetach().  The last to
  * detach receives a different return value so that it knows that it's safe to
  * clean up.  Any straggler process that attaches after that phase is reached
@@ -306,13 +309,12 @@ ExecHashJoinImpl(PlanState *pstate, bool parallel)
 				if (parallel)
 				{
 					/*
-					 * Advance the build barrier to PHJ_BUILD_RUNNING
-					 * before proceeding so we can negotiate resource
-					 * cleanup.
+					 * Advance the build barrier to PHJ_BUILD_RUN before
+					 * proceeding so we can negotiate resource cleanup.
 					 */
 					Barrier    *build_barrier = &parallel_state->build_barrier;

-					while (BarrierPhase(build_barrier) < PHJ_BUILD_RUNNING)
+					while (BarrierPhase(build_barrier) < PHJ_BUILD_RUN)
 						BarrierArriveAndWait(build_barrier, 0);
 				}
 				return NULL;
@@ -336,10 +338,10 @@ ExecHashJoinImpl(PlanState *pstate, bool parallel)
 				Barrier    *build_barrier;

 				build_barrier = &parallel_state->build_barrier;
-				Assert(BarrierPhase(build_barrier) == PHJ_BUILD_HASHING_OUTER ||
-					   BarrierPhase(build_barrier) == PHJ_BUILD_RUNNING ||
-					   BarrierPhase(build_barrier) == PHJ_BUILD_DONE);
-				if (BarrierPhase(build_barrier) == PHJ_BUILD_HASHING_OUTER)
+				Assert(BarrierPhase(build_barrier) == PHJ_BUILD_HASH_OUTER ||
+					   BarrierPhase(build_barrier) == PHJ_BUILD_RUN ||
+					   BarrierPhase(build_barrier) == PHJ_BUILD_FREE);
+				if (BarrierPhase(build_barrier) == PHJ_BUILD_HASH_OUTER)
 				{
 					/*
 					 * If multi-batch, we need to hash the outer relation
@@ -350,7 +352,7 @@ ExecHashJoinImpl(PlanState *pstate, bool parallel)
 					BarrierArriveAndWait(build_barrier,
 										 WAIT_EVENT_HASH_BUILD_HASH_OUTER);
 				}
-				else if (BarrierPhase(build_barrier) == PHJ_BUILD_DONE)
+				else if (BarrierPhase(build_barrier) == PHJ_BUILD_FREE)
 				{
 					/*
 					 * If we attached so late that the job is finished and
@@ -361,7 +363,7 @@ ExecHashJoinImpl(PlanState *pstate, bool parallel)
 				}

 				/* Each backend should now select a batch to work on. */
-				Assert(BarrierPhase(build_barrier) == PHJ_BUILD_RUNNING);
+				Assert(BarrierPhase(build_barrier) == PHJ_BUILD_RUN);
 				hashtable->curbatch = -1;
 				node->hj_JoinState = HJ_NEED_NEW_BATCH;

@@ -1153,7 +1155,7 @@ ExecParallelHashJoinNewBatch(HashJoinState *hjstate)

 			switch (BarrierAttach(batch_barrier))
 			{
-				case PHJ_BATCH_ELECTING:
+				case PHJ_BATCH_ELECT:

 					/* One backend allocates the hash table. */
 					if (BarrierArriveAndWait(batch_barrier,
@@ -1161,13 +1163,13 @@ ExecParallelHashJoinNewBatch(HashJoinState *hjstate)
 						ExecParallelHashTableAlloc(hashtable, batchno);
 					/* Fall through. */

-				case PHJ_BATCH_ALLOCATING:
+				case PHJ_BATCH_ALLOCATE:
 					/* Wait for allocation to complete. */
 					BarrierArriveAndWait(batch_barrier,
 										 WAIT_EVENT_HASH_BATCH_ALLOCATE);
 					/* Fall through. */

-				case PHJ_BATCH_LOADING:
+				case PHJ_BATCH_LOAD:
 					/* Start (or join in) loading tuples. */
 					ExecParallelHashTableSetCurrentBatch(hashtable, batchno);
 					inner_tuples = hashtable->batches[batchno].inner_tuples;
@@ -1187,7 +1189,7 @@ ExecParallelHashJoinNewBatch(HashJoinState *hjstate)
 										 WAIT_EVENT_HASH_BATCH_LOAD);
 					/* Fall through. */

-				case PHJ_BATCH_PROBING:
+				case PHJ_BATCH_PROBE:

 					/*
 					 * This batch is ready to probe.  Return control to
@@ -1197,13 +1199,13 @@ ExecParallelHashJoinNewBatch(HashJoinState *hjstate)
 					 * this barrier again (or else a deadlock could occur).
 					 * All attached participants must eventually call
 					 * BarrierArriveAndDetach() so that the final phase
-					 * PHJ_BATCH_DONE can be reached.
+					 * PHJ_BATCH_FREE can be reached.
 					 */
 					ExecParallelHashTableSetCurrentBatch(hashtable, batchno);
 					sts_begin_parallel_scan(hashtable->batches[batchno].outer_tuples);
 					return true;

-				case PHJ_BATCH_DONE:
+				case PHJ_BATCH_FREE:

 					/*
 					 * Already done.  Detach and go around again (if any
@@ -1523,7 +1525,7 @@ ExecHashJoinReInitializeDSM(HashJoinState *state, ParallelContext *pcxt)
 	/*
 	 * It would be possible to reuse the shared hash table in single-batch
 	 * cases by resetting and then fast-forwarding build_barrier to
-	 * PHJ_BUILD_DONE and batch 0's batch_barrier to PHJ_BATCH_PROBING, but
+	 * PHJ_BUILD_FREE and batch 0's batch_barrier to PHJ_BATCH_PROBE, but
 	 * currently shared hash tables are already freed by now (by the last
 	 * participant to detach from the batch).  We could consider keeping it
 	 * around for single-batch joins.  We'd also need to adjust
@@ -1542,7 +1544,7 @@ ExecHashJoinReInitializeDSM(HashJoinState *state, ParallelContext *pcxt)
 	/* Clear any shared batch files. */
 	SharedFileSetDeleteAll(&pstate->fileset);

-	/* Reset build_barrier to PHJ_BUILD_ELECTING so we can go around again. */
+	/* Reset build_barrier to PHJ_BUILD_ELECT so we can go around again. */
 	BarrierInit(&pstate->build_barrier, 0);
 }