<!-- doc/src/sgml/parallel.sgml -->

<chapter id="parallel-query">
 <title>Parallel Query</title>

 <indexterm zone="parallel-query">
  <primary>parallel query</primary>
 </indexterm>

 <para>
  <productname>PostgreSQL</> can devise query plans which can leverage
  multiple CPUs in order to answer queries faster. This feature is known
  as parallel query. Many queries cannot benefit from parallel query, either
  due to limitations of the current implementation or because there is no
  imaginable query plan which is any faster than the serial query plan.
  However, for queries that can benefit, the speedup from parallel query
  is often very significant. Many queries can run more than twice as fast
  when using parallel query, and some queries can run four times faster or
  even more. Queries that touch a large amount of data but return only a
  few rows to the user will typically benefit most. This chapter explains
  some details of how parallel query works and in which situations it can be
  used so that users who wish to make use of it can understand what to expect.
 </para>

 <sect1 id="how-parallel-query-works">
  <title>How Parallel Query Works</title>

  <para>
   When the optimizer determines that parallel query is the fastest execution
   strategy for a particular query, it will create a query plan which includes
   a <firstterm>Gather node</firstterm>. Here is a simple example:

<screen>
EXPLAIN SELECT * FROM pgbench_accounts WHERE filler LIKE '%x%';
                                     QUERY PLAN
-------------------------------------------------------------------------------------
 Gather  (cost=1000.00..217018.43 rows=1 width=97)
   Workers Planned: 2
   ->  Parallel Seq Scan on pgbench_accounts  (cost=0.00..216018.33 rows=1 width=97)
         Filter: (filler ~~ '%x%'::text)
(4 rows)
</screen>
  </para>

  <para>
   In all cases, the <literal>Gather</literal> node will have exactly one
   child plan, which is the portion of the plan that will be executed in
   parallel. If the <literal>Gather</> node is at the very top of the plan
   tree, then the entire query will execute in parallel. If it is somewhere
   else in the plan tree, then only that portion of the query will run in
   parallel. In the example above, the query accesses only one table, so
   there is only one plan node other than the <literal>Gather</> node itself;
   since that plan node is a child of the <literal>Gather</> node, it will
   run in parallel.
  </para>

  <para>
   <link linkend="using-explain">Using EXPLAIN</>, you can see the number of
   workers chosen by the planner. When the <literal>Gather</> node is reached
   during query execution, the process which is implementing the user's
   session will request a number of <link linkend="bgworker">background
   worker processes</link> equal to the number
   of workers chosen by the planner. The total number of background
   workers that can exist at any one time is limited by
   <xref linkend="guc-max-worker-processes">, so it is possible for a
   parallel query to run with fewer workers than planned, or even with
   no workers at all. The optimal plan may depend on the number of workers
   that are available, so this can result in poor query performance. If this
   occurrence is frequent, consider increasing
   <varname>max_worker_processes</> so that more workers can be run
   simultaneously, or alternatively reducing
   <xref linkend="guc-max-parallel-workers-per-gather"> so that the planner
   requests fewer workers.
  </para>

  <para>
   Every background worker process which is successfully started for a given
   parallel query will execute the portion of the plan which is a descendant
   of the <literal>Gather</> node. The leader will also execute that portion
   of the plan, but it has an additional responsibility: it must also read
   all of the tuples generated by the workers. When the parallel portion of
   the plan generates only a small number of tuples, the leader will often
   behave very much like an additional worker, speeding up query execution.
   Conversely, when the parallel portion of the plan generates a large number
   of tuples, the leader may be almost entirely occupied with reading the
   tuples generated by the workers and performing any further processing
   steps which are required by plan nodes above the level of the
   <literal>Gather</literal> node. In such cases, the leader will do very
   little of the work of executing the parallel portion of the plan.
  </para>
 </sect1>

 <sect1 id="when-can-parallel-query-be-used">
  <title>When Can Parallel Query Be Used?</title>

  <para>
   There are several settings which can cause the query planner not to
   generate a parallel query plan under any circumstances. In order for
   any parallel query plans whatsoever to be generated, the following
   settings must be configured as indicated.
  </para>

  <itemizedlist>
    <listitem>
      <para>
        <xref linkend="guc-max-parallel-workers-per-gather"> must be set to a
        value which is greater than zero. This is a special case of the more
        general principle that no more workers should be used than the number
        configured via <varname>max_parallel_workers_per_gather</varname>.
      </para>
    </listitem>

    <listitem>
      <para>
        <xref linkend="guc-dynamic-shared-memory-type"> must be set to a
        value other than <literal>none</>. Parallel query requires dynamic
        shared memory in order to pass data between cooperating processes.
      </para>
    </listitem>
  </itemizedlist>
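
  <para>
   For example, both settings can be inspected, and the first one raised,
   from an interactive session:
<programlisting>
SHOW max_parallel_workers_per_gather;
SHOW dynamic_shared_memory_type;
SET max_parallel_workers_per_gather = 4;  -- allow up to four workers per Gather node
</programlisting>
  </para>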

  <para>
   In addition, the system must not be running in single-user mode. Since
   the entire database system is running as a single process in this
   situation, no background workers will be available.
  </para>

  <para>
   Even when it is in general possible for parallel query plans to be
   generated, the planner will not generate them for a given query
   if any of the following are true:
  </para>

  <itemizedlist>
    <listitem>
      <para>
        The query writes any data or locks any database rows. If a query
        contains a data-modifying operation either at the top level or within
        a CTE, no parallel plans for that query will be generated. This is a
        limitation of the current implementation which could be lifted in a
        future release.
      </para>
    </listitem>

    <listitem>
      <para>
        The query might be suspended during execution. In any situation in
        which the system thinks that partial or incremental execution might
        occur, no parallel plan is generated. For example, a cursor created
        using <link linkend="sql-declare">DECLARE CURSOR</link> will never use
        a parallel plan. Similarly, a PL/pgSQL loop of the form
        <literal>FOR x IN query LOOP .. END LOOP</literal> will never use a
        parallel plan, because the parallel query system is unable to verify
        that the code in the loop is safe to execute while parallel query is
        active.
      </para>
    </listitem>

    <listitem>
      <para>
        The query uses any function marked <literal>PARALLEL UNSAFE</literal>.
        Most system-defined functions are <literal>PARALLEL SAFE</literal>,
        but user-defined functions are marked <literal>PARALLEL
        UNSAFE</literal> by default. See the discussion of
        <xref linkend="parallel-safety">.
      </para>
    </listitem>

    <listitem>
      <para>
        The query is running inside of another query that is already parallel.
        For example, if a function called by a parallel query issues an SQL
        query itself, that query will never use a parallel plan. This is a
        limitation of the current implementation, but it may not be desirable
        to remove this limitation, since it could result in a single query
        using a very large number of processes.
      </para>
    </listitem>

    <listitem>
      <para>
        The transaction isolation level is serializable. This is
        a limitation of the current implementation.
      </para>
    </listitem>
  </itemizedlist>
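
  <para>
   To illustrate the first restriction, a query such as the following (the
   table names here are purely hypothetical) will never be planned in
   parallel, because the CTE modifies data:
<programlisting>
WITH moved_rows AS (
    DELETE FROM accounts_old RETURNING *
)
INSERT INTO accounts_archive SELECT * FROM moved_rows;
</programlisting>
  </para>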

  <para>
   Even when a parallel query plan is generated for a particular query, there
   are several circumstances under which it will be impossible to execute
   that plan in parallel at execution time. If this occurs, the leader
   will execute the portion of the plan below the <literal>Gather</>
   node entirely by itself, almost as if the <literal>Gather</> node were
   not present. This will happen if any of the following conditions are met:
  </para>

  <itemizedlist>
    <listitem>
      <para>
        No background workers can be obtained because of the limitation that
        the total number of background workers cannot exceed
        <xref linkend="guc-max-worker-processes">.
      </para>
    </listitem>

    <listitem>
      <para>
        The client sends an Execute message with a non-zero fetch count.
        See the discussion of the
        <link linkend="protocol-flow-ext-query">extended query protocol</link>.
        Since <link linkend="libpq">libpq</link> currently provides no way to
        send such a message, this can only occur when using a client that
        does not rely on libpq. If this is a frequent
        occurrence, it may be a good idea to set
        <xref linkend="guc-max-parallel-workers-per-gather"> to zero in
        sessions where it is likely, so as to avoid generating query plans
        that may be suboptimal when run serially.
      </para>
    </listitem>

    <listitem>
      <para>
        A prepared statement is executed using a <literal>CREATE TABLE .. AS
        EXECUTE ..</literal> statement. This construct converts what otherwise
        would have been a read-only operation into a read-write operation,
        making it ineligible for parallel query.
      </para>
    </listitem>

    <listitem>
      <para>
        The transaction isolation level is serializable. This situation
        does not normally arise, because parallel query plans are not
        generated when the transaction isolation level is serializable.
        However, it can happen if the transaction isolation level is changed to
        serializable after the plan is generated and before it is executed.
      </para>
    </listitem>
  </itemizedlist>
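
  <para>
   For example, even if the prepared statement below would ordinarily be
   eligible for a parallel plan, executing it through <literal>CREATE TABLE
   .. AS EXECUTE ..</literal> forces serial execution (the statement and
   table names here are only illustrative):
<programlisting>
PREPARE large_scan AS
    SELECT * FROM pgbench_accounts WHERE filler LIKE '%x%';

-- Executed serially, because the statement now writes a table:
CREATE TABLE matching_accounts AS EXECUTE large_scan;
</programlisting>
  </para>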
 </sect1>

 <sect1 id="parallel-plans">
  <title>Parallel Plans</title>

  <para>
   Because each worker executes the parallel portion of the plan to
   completion, it is not possible to simply take an ordinary query plan
   and run it using multiple workers. Each worker would produce a full
   copy of the output result set, so the query would not run any faster
   than normal but would produce incorrect results. Instead, the parallel
   portion of the plan must be what is known internally to the query
   optimizer as a <firstterm>partial plan</>; that is, it must be constructed
   so that each process which executes the plan will generate only a
   subset of the output rows in such a way that each required output row
   is guaranteed to be generated by exactly one of the cooperating processes.
  </para>

  <sect2 id="parallel-scans">
   <title>Parallel Scans</title>

   <para>
    Currently, the only type of scan which has been modified to work with
    parallel query is a sequential scan. Therefore, the driving table in
    a parallel plan will always be scanned using a
    <literal>Parallel Seq Scan</>. The relation's blocks will be divided
    among the cooperating processes. Blocks are handed out one at a
    time, so that access to the relation remains sequential. Each process
    will visit every tuple on the page assigned to it before requesting a new
    page.
   </para>
  </sect2>

  <sect2 id="parallel-joins">
   <title>Parallel Joins</title>

   <para>
    The driving table may be joined to one or more other tables using nested
    loops or hash joins. The inner side of the join may be any kind of
    non-parallel plan that is otherwise supported by the planner, provided that
    it is safe to run within a parallel worker. For example, it may be an
    index scan which looks up a value taken from the outer side of the join.
    Each worker will execute the inner side of the join in full, which for
    hash join means that an identical hash table is built in each worker
    process.
   </para>
  </sect2>

  <sect2 id="parallel-aggregation">
   <title>Parallel Aggregation</title>

   <para>
    <productname>PostgreSQL</> supports parallel aggregation by aggregating in
    two stages. First, each process participating in the parallel portion of
    the query performs an aggregation step, producing a partial result for
    each group of which that process is aware. This is reflected in the plan
    as a <literal>Partial Aggregate</> node. Second, the partial results are
    transferred to the leader via the <literal>Gather</> node. Finally, the
    leader re-aggregates the results across all workers in order to produce
    the final result. This is reflected in the plan as a
    <literal>Finalize Aggregate</> node.
   </para>
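
   <para>
    For example, a simple <literal>count(*)</> over the table from the
    earlier example might produce a plan of the following shape (the cost
    and row estimates shown are only illustrative and will vary with the
    data):
<screen>
EXPLAIN SELECT count(*) FROM pgbench_accounts;
                                      QUERY PLAN
--------------------------------------------------------------------------------------
 Finalize Aggregate  (cost=217018.55..217018.56 rows=1 width=8)
   ->  Gather  (cost=217018.33..217018.54 rows=2 width=8)
         Workers Planned: 2
         ->  Partial Aggregate  (cost=216018.33..216018.34 rows=1 width=8)
               ->  Parallel Seq Scan on pgbench_accounts  (cost=0.00..208518.67 rows=3999867 width=0)
(5 rows)
</screen>
   </para>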

   <para>
    Because the <literal>Finalize Aggregate</> node runs on the leader
    process, queries which produce a relatively large number of groups in
    comparison to the number of input rows will appear less favorable to the
    query planner. For example, in the worst-case scenario the number of
    groups seen by the <literal>Finalize Aggregate</> node could be as many as
    the number of input rows which were seen by all worker processes in the
    <literal>Partial Aggregate</> stage. For such cases, there is clearly
    going to be no performance benefit to using parallel aggregation. The
    query planner takes this into account during the planning process and is
    unlikely to choose parallel aggregate in this scenario.
   </para>

   <para>
    Parallel aggregation is not supported in all situations. Each aggregate
    must be <link linkend="parallel-safety">safe</> for parallelism and must
    have a combine function. If the aggregate has a transition state of type
    <literal>internal</>, it must have serialization and deserialization
    functions. See <xref linkend="sql-createaggregate"> for more details.
    Parallel aggregation is not supported if any aggregate function call
    contains a <literal>DISTINCT</> or <literal>ORDER BY</> clause, and it is
    also not supported for ordered-set aggregates or when the query involves
    <literal>GROUPING SETS</>. It can only be used when all joins involved in
    the query are also part of the parallel portion of the plan.
   </para>
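
   <para>
    As a sketch of what such a definition might look like, the hypothetical
    aggregate below reuses the built-in <literal>int4pl</> addition function
    as both its transition function and its combine function, and is marked
    safe for parallelism:
<programlisting>
CREATE AGGREGATE my_sum (int4)
(
    SFUNC = int4pl,
    STYPE = int4,
    COMBINEFUNC = int4pl,
    PARALLEL = SAFE
);
</programlisting>
   </para>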

  </sect2>

  <sect2 id="parallel-plan-tips">
   <title>Parallel Plan Tips</title>

   <para>
    If a query that is expected to do so does not produce a parallel plan,
    you can try reducing <xref linkend="guc-parallel-setup-cost"> or
    <xref linkend="guc-parallel-tuple-cost">. Of course, this plan may turn
    out to be slower than the serial plan which the planner preferred, but
    this will not always be the case. If you don't get a parallel
    plan even with very small values of these settings (e.g., after setting
    them both to zero), there may be some reason why the query planner is
    unable to generate a parallel plan for your query. See
    <xref linkend="when-can-parallel-query-be-used"> and
    <xref linkend="parallel-safety"> for information on why this may be
    the case.
   </para>
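
   <para>
    For example, one way to test whether a parallel plan is possible at all
    is to zero out both costs for the current session and re-run
    <command>EXPLAIN</>:
<programlisting>
SET parallel_setup_cost = 0;
SET parallel_tuple_cost = 0;
EXPLAIN SELECT * FROM pgbench_accounts WHERE filler LIKE '%x%';
</programlisting>
   </para>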

   <para>
    When executing a parallel plan, you can use <literal>EXPLAIN (ANALYZE,
    VERBOSE)</literal> to display per-worker statistics for each plan node.
    This may be useful in determining whether the work is being evenly
    distributed between all plan nodes and more generally in understanding the
    performance characteristics of the plan.
   </para>
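
   <para>
    For instance, per-worker detail for the running example can be requested
    as follows; each parallel-aware node in the output then includes a
    <literal>Worker N</> line reporting that worker's timing and row counts:
<programlisting>
EXPLAIN (ANALYZE, VERBOSE)
    SELECT * FROM pgbench_accounts WHERE filler LIKE '%x%';
</programlisting>
   </para>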

  </sect2>
 </sect1>

 <sect1 id="parallel-safety">
  <title>Parallel Safety</title>

  <para>
   The planner classifies operations involved in a query as either
   <firstterm>parallel safe</>, <firstterm>parallel restricted</>,
   or <firstterm>parallel unsafe</>. A parallel safe operation is one which
   does not conflict with the use of parallel query. A parallel restricted
   operation is one which cannot be performed in a parallel worker, but which
   can be performed in the leader while parallel query is in use. Therefore,
   parallel restricted operations can never occur below a <literal>Gather</>
   node, but can occur elsewhere in a plan which contains a
   <literal>Gather</> node. A parallel unsafe operation is one which cannot
   be performed while parallel query is in use, not even in the leader.
   When a query contains anything which is parallel unsafe, parallel query
   is completely disabled for that query.
  </para>

  <para>
   The following operations are always parallel restricted.
  </para>

  <itemizedlist>
    <listitem>
      <para>
        Scans of common table expressions (CTEs).
      </para>
    </listitem>

    <listitem>
      <para>
        Scans of temporary tables.
      </para>
    </listitem>

    <listitem>
      <para>
        Scans of foreign tables, unless the foreign data wrapper has
        an <literal>IsForeignScanParallelSafe</> API which indicates otherwise.
      </para>
    </listitem>

    <listitem>
      <para>
        Access to an <literal>InitPlan</> or <literal>SubPlan</>.
      </para>
    </listitem>
  </itemizedlist>

  <sect2 id="parallel-labeling">
   <title>Parallel Labeling for Functions and Aggregates</title>

   <para>
    The planner cannot automatically determine whether a user-defined
    function or aggregate is parallel safe, parallel restricted, or parallel
    unsafe, because this would require predicting every operation which the
    function could possibly perform. In general, this is equivalent to the
    Halting Problem and therefore impossible. Even for simple functions
    where it could conceivably be done, we do not try, since this would be
    expensive and error-prone. Instead, all user-defined functions are assumed
    to be parallel unsafe unless otherwise marked. When using
    <xref linkend="sql-createfunction"> or
    <xref linkend="sql-alterfunction">, markings can be set by specifying
    <literal>PARALLEL SAFE</>, <literal>PARALLEL RESTRICTED</>, or
    <literal>PARALLEL UNSAFE</> as appropriate. When using
    <xref linkend="sql-createaggregate">, the
    <literal>PARALLEL</> option can be specified with <literal>SAFE</>,
    <literal>RESTRICTED</>, or <literal>UNSAFE</> as the corresponding value.
   </para>
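
   <para>
    For example, a simple SQL-language function (the name here is purely
    illustrative) can be labeled when it is created, and the label can be
    changed later:
<programlisting>
CREATE FUNCTION add_one(integer) RETURNS integer
    AS 'SELECT $1 + 1'
    LANGUAGE SQL
    PARALLEL SAFE;

ALTER FUNCTION add_one(integer) PARALLEL RESTRICTED;
</programlisting>
   </para>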

   <para>
    Functions and aggregates must be marked <literal>PARALLEL UNSAFE</> if
    they write to the database, access sequences, change the transaction state
    even temporarily (e.g., a PL/pgSQL function which establishes an
    <literal>EXCEPTION</> block to catch errors), or make persistent changes to
    settings. Similarly, functions must be marked <literal>PARALLEL
    RESTRICTED</> if they access temporary tables, client connection state,
    cursors, prepared statements, or miscellaneous backend-local state which
    the system cannot synchronize across workers. For example,
    <literal>setseed</> and <literal>random</> are parallel restricted for
    this last reason.
   </para>

   <para>
    In general, if a function is labeled as being safe when it is restricted or
    unsafe, or if it is labeled as being restricted when it is in fact unsafe,
    it may throw errors or produce wrong answers when used in a parallel query.
    C-language functions could in theory exhibit totally undefined behavior if
    mislabeled, since there is no way for the system to protect itself against
    arbitrary C code, but in most likely cases the result will be no worse than
    for any other function. If in doubt, it is probably best to label functions
    as <literal>UNSAFE</>.
   </para>

   <para>
    If a function executed within a parallel worker acquires locks which are
    not held by the leader, for example by querying a table not referenced in
    the query, those locks will be released at worker exit, not at the end of
    the transaction. If you write a function which does this, and this behavior
    difference is important to you, mark such functions as
    <literal>PARALLEL RESTRICTED</literal>
    to ensure that they execute only in the leader.
   </para>

   <para>
    Note that the query planner does not consider deferring the evaluation of
    parallel-restricted functions or aggregates involved in the query in
    order to obtain a superior plan. So, for example, if a <literal>WHERE</>
    clause applied to a particular table is parallel restricted, the query
    planner will not consider placing the scan of that table below a
    <literal>Gather</> node. In some cases, it would be
    possible (and perhaps even efficient) to include the scan of that table in
    the parallel portion of the query and defer the evaluation of the
    <literal>WHERE</> clause so that it happens above the <literal>Gather</>
    node. However, the planner does not do this.
   </para>

  </sect2>

 </sect1>

</chapter>