<!--
$Header: /cvsroot/pgsql/doc/src/sgml/arch-dev.sgml,v 2.21 2003/06/22 16:16:44 tgl Exp $
-->

<chapter id="overview">

 <title>Overview of PostgreSQL Internals</title>

 <note>
  <title>Author</title>
  <para>
   This chapter originated as part of
   <xref linkend="SIM98">, Stefan Simkovics'
   Master's Thesis prepared at Vienna University of Technology under the
   direction of O.Univ.Prof.Dr. Georg Gottlob and Univ.Ass. Mag. Katrin Seyr.
  </para>
 </note>

 <para>
  This chapter gives an overview of the internal structure of the
  backend of <productname>PostgreSQL</productname>.  After having
  read the following sections you should have an idea of how a query
  is processed.  This chapter does not aim to provide a detailed
  description of the internal operation of
  <productname>PostgreSQL</productname>, as such a document would be
  very extensive.  Rather, this chapter is intended to help the reader
  understand the general sequence of operations that occur within the
  backend from the point at which a query is received, to the point
  when the results are returned to the client.
 </para>

 <sect1 id="query-path">
  <title>The Path of a Query</title>

  <para>
   Here we give a short overview of the stages a query has to pass in
   order to obtain a result.
  </para>

  <procedure>
   <step>
    <para>
     A connection from an application program to the
     <productname>PostgreSQL</productname> server has to be
     established.  The application program transmits a query to the
     server and waits to receive the results sent back by the server.
    </para>
   </step>

   <step>
    <para>
     The <firstterm>parser stage</firstterm> checks the query
     transmitted by the application program for correct syntax and
     creates a <firstterm>query tree</firstterm>.
    </para>
   </step>

   <step>
    <para>
     The <firstterm>rewrite system</firstterm> takes the query tree
     created by the parser stage and looks for any
     <firstterm>rules</firstterm> (stored in the
     <firstterm>system catalogs</firstterm>) to apply to the query
     tree.  It performs the transformations given in the
     <firstterm>rule bodies</firstterm>.  One application of the
     rewrite system is in the realization of
     <firstterm>views</firstterm>.
    </para>

    <para>
     Whenever a query against a view (i.e., a
     <firstterm>virtual table</firstterm>) is made, the rewrite system
     rewrites the user's query to a query that accesses the
     <firstterm>base tables</firstterm> given in the
     <firstterm>view definition</firstterm> instead.
    </para>
   </step>

   <step>
    <para>
     The <firstterm>planner/optimizer</firstterm> takes the
     (rewritten) query tree and creates a
     <firstterm>query plan</firstterm> that will be the input to the
     <firstterm>executor</firstterm>.
    </para>

    <para>
     It does so by first creating all possible <firstterm>paths</firstterm>
     leading to the same result.  For example, if there is an index on
     a relation to be scanned, there are two paths for the scan.  One
     possibility is a simple sequential scan and the other possibility
     is to use the index.  Next the cost for the execution of each
     path is estimated and the cheapest path is chosen and handed
     back.
    </para>
   </step>

   <step>
    <para>
     The executor recursively steps through the
     <firstterm>plan tree</firstterm> and retrieves tuples in the way
     represented by the plan.  The executor makes use of the
     <firstterm>storage system</firstterm> while scanning relations,
     performs <firstterm>sorts</firstterm> and
     <firstterm>joins</firstterm>, evaluates
     <firstterm>qualifications</firstterm>, and finally hands back the
     tuples derived.
    </para>
   </step>
  </procedure>
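The sequence of stages above can be sketched as a toy pipeline.  This is an illustration only, not PostgreSQL code: every function and data structure below is invented, but each stage consumes the previous stage's output, from query string to query tree to plan to result rows, just as the backend modules do.

```python
# Toy sketch of the backend stages; all names are invented for illustration.

def parse(query_string):
    """Parser stage: syntax check, produce a (trivial) query tree."""
    words = query_string.split()
    if words[0].upper() != "SELECT":
        raise SyntaxError("only SELECT is recognized in this sketch")
    return {"kind": "select", "target": words[1], "relation": words[3]}

def rewrite(query_tree, rules):
    """Rewrite system: expand a view into the base table it is defined over."""
    relation = query_tree["relation"]
    if relation in rules:                      # the relation is a view
        query_tree = dict(query_tree, relation=rules[relation])
    return query_tree

def plan(query_tree):
    """Planner/optimizer: here there is only one path, a sequential scan."""
    return {"node": "SeqScan", "relation": query_tree["relation"],
            "target": query_tree["target"]}

def execute(plan_tree, storage):
    """Executor: walk the plan and hand back the derived tuples."""
    for row in storage[plan_tree["relation"]]:
        yield row[plan_tree["target"]]

storage = {"emp_base": [{"name": "ann"}, {"name": "bob"}]}
views = {"emp": "emp_base"}          # view 'emp' over base table 'emp_base'
tree = rewrite(parse("SELECT name FROM emp"), views)
print(list(execute(plan(tree), storage)))      # ['ann', 'bob']
```

Note how the rewrite step makes the view transparent to the later stages: the planner and executor only ever see the base table.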

  <para>
   In the following sections we will cover each of the above listed
   items in more detail to give a better understanding of
   <productname>PostgreSQL</productname>'s internal control and data
   structures.
  </para>
 </sect1>

 <sect1 id="connect-estab">
  <title>How Connections are Established</title>

  <para>
   <productname>PostgreSQL</productname> is implemented using a
   simple <quote>process per user</> client/server model.  In this
   model there is one <firstterm>client process</firstterm> connected
   to exactly one <firstterm>server process</firstterm>.  As we do not
   know ahead of time how many connections will be made, we have to
   use a <firstterm>master process</firstterm> that spawns a new
   server process every time a connection is requested.  This master
   process is called <literal>postmaster</literal> and listens at a
   specified TCP/IP port for incoming connections.  Whenever a request
   for a connection is detected the <literal>postmaster</literal>
   process spawns a new server process called
   <literal>postgres</literal>.  The server tasks
   (<literal>postgres</literal> processes) communicate with each
   other using <firstterm>semaphores</firstterm> and
   <firstterm>shared memory</firstterm> to ensure data integrity
   throughout concurrent data access.
  </para>
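A minimal sketch of this process-per-connection scheme, assuming a Unix-like system (this is not postmaster code; the class and function names are invented): a listening master forks a child process to serve each accepted connection, and the child handles exactly one client.

```python
import socket
import socketserver
import threading

class BackendHandler(socketserver.StreamRequestHandler):
    """Stands in for one 'postgres' server process: it serves exactly
    one client connection."""
    def handle(self):
        query = self.rfile.readline().decode().strip()
        # A real backend would parse, plan, and execute the query here.
        self.wfile.write(f"result for: {query}\n".encode())

class Postmaster(socketserver.ForkingTCPServer):
    """Stands in for the postmaster: listens on a port and forks a
    child process for every incoming connection."""
    allow_reuse_address = True

def ask(port, query):
    """Minimal client: connect, send one query, read the reply."""
    with socket.create_connection(("127.0.0.1", port)) as s:
        s.sendall(query.encode() + b"\n")
        return s.makefile().readline().strip()

server = Postmaster(("127.0.0.1", 0), BackendHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
print(ask(server.server_address[1], "SELECT 1"))   # result for: SELECT 1
server.shutdown()
```

The sketch omits what makes the real arrangement interesting: the forked children share semaphores and shared-memory segments inherited from the master, which is what allows concurrent backends to coordinate access to the same data.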

  <para>
   The client process can be any program that understands the
   <productname>PostgreSQL</productname> protocol described in
   <xref linkend="protocol">.  Many clients are based on the
   C-language library <application>libpq</>, but several independent
   implementations exist, such as the Java <application>JDBC</>
   driver.
  </para>

  <para>
   Once a connection is established the client process can send a
   query to the <firstterm>backend</firstterm> (server).  The query is
   transmitted using plain text, i.e., there is no parsing done in the
   <firstterm>frontend</firstterm> (client).  The server parses the
   query, creates an <firstterm>execution plan</firstterm>, executes
   the plan, and returns the retrieved tuples to the client by
   transmitting them over the established connection.
  </para>
 </sect1>

 <sect1 id="parser-stage">
  <title>The Parser Stage</title>

  <para>
   The <firstterm>parser stage</firstterm> consists of two parts:

   <itemizedlist>
    <listitem>
     <para>
      The <firstterm>parser</firstterm> defined in
      <filename>gram.y</filename> and <filename>scan.l</filename> is
      built using the Unix tools <application>yacc</application>
      and <application>lex</application>.
     </para>
    </listitem>
    <listitem>
     <para>
      The <firstterm>transformation process</firstterm> does
      modifications and augmentations to the data structures returned
      by the parser.
     </para>
    </listitem>
   </itemizedlist>
  </para>

  <sect2>
   <title>Parser</title>

   <para>
    The parser has to check the query string (which arrives as plain
    ASCII text) for valid syntax.  If the syntax is correct a
    <firstterm>parse tree</firstterm> is built up and handed back;
    otherwise an error is returned.  The well-known Unix tools
    <application>lex</application> and <application>yacc</application>
    are used for the implementation.
   </para>

   <para>
    The <firstterm>lexer</firstterm> is defined in the file
    <filename>scan.l</filename> and is responsible for recognizing
    <firstterm>identifiers</firstterm>, the
    <firstterm>SQL keywords</firstterm>, etc.  For every keyword or
    identifier that is found, a <firstterm>token</firstterm> is
    generated and handed to the parser.
   </para>
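As a toy illustration of what <filename>scan.l</filename> does (this is not the real lexer; the token names and keyword list below are invented), a lexer turns the query text into a stream of keyword, identifier, and literal tokens that the parser consumes one at a time:

```python
import re

# Invented token rules, loosely in the spirit of scan.l; not the real ones.
KEYWORDS = {"SELECT", "FROM", "WHERE"}
TOKEN_RE = re.compile(
    r"\s*(?:(?P<num>\d+)|(?P<word>[A-Za-z_][A-Za-z0-9_]*)|(?P<op>[=*,;<>]))")

def tokenize(sql):
    """Yield (token_type, value) pairs, classifying each word as a
    keyword or an identifier, the way a lexer hands tokens to the parser."""
    pos = 0
    while pos < len(sql):
        m = TOKEN_RE.match(sql, pos)
        if not m:
            raise SyntaxError(f"unexpected character at {pos}: {sql[pos]!r}")
        pos = m.end()
        if m.lastgroup == "word":
            word = m.group("word")
            kind = "KEYWORD" if word.upper() in KEYWORDS else "IDENT"
            yield kind, word
        elif m.lastgroup == "num":
            yield "NUMBER", m.group("num")
        else:
            yield "OP", m.group("op")

print(list(tokenize("SELECT name FROM emp WHERE age = 42")))
```

Note that the lexer alone attaches no meaning to the identifiers; deciding what <literal>emp</literal> actually refers to is left to later stages.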

   <para>
    The parser is defined in the file <filename>gram.y</filename> and
    consists of a set of <firstterm>grammar rules</firstterm> and
    <firstterm>actions</firstterm> that are executed whenever a rule
    is fired.  The code of the actions (which is actually C code) is
    used to build up the parse tree.
   </para>

   <para>
    The file <filename>scan.l</filename> is transformed to the C
    source file <filename>scan.c</filename> using the program
    <application>lex</application>, and <filename>gram.y</filename> is
    transformed to <filename>gram.c</filename> using
    <application>yacc</application>.  After these transformations have
    taken place a normal C compiler can be used to create the parser.
    Never make any changes to the generated C files as they will be
    overwritten the next time <application>lex</application> or
    <application>yacc</application> is called.

    <note>
     <para>
      The mentioned transformations and compilations are normally done
      automatically using the <firstterm>makefiles</firstterm>
      shipped with the <productname>PostgreSQL</productname>
      source distribution.
     </para>
    </note>
   </para>

   <para>
    A detailed description of <application>yacc</application> or the
    grammar rules given in <filename>gram.y</filename> would be beyond
    the scope of this chapter.  There are many books and documents
    dealing with <application>lex</application> and
    <application>yacc</application>.  You should be familiar with
    <application>yacc</application> before you start to study the
    grammar given in <filename>gram.y</filename>; otherwise you won't
    understand what happens there.
   </para>
  </sect2>

  <sect2>
   <title>Transformation Process</title>

   <para>
    The parser stage creates a parse tree using only fixed rules about
    the syntactic structure of SQL.  It does not make any lookups in
    the system catalogs, so there is no possibility to understand the
    detailed semantics of the requested operations.  After the parser
    completes, the <firstterm>transformation process</firstterm> takes
    the tree handed back by the parser as input and does the semantic
    interpretation needed to understand which tables, functions, and
    operators are referenced by the query.  The data structure that is
    built to represent this information is called the
    <firstterm>query tree</>.
   </para>

   <para>
    The reason for separating raw parsing from semantic analysis is
    that system catalog lookups can only be done within a transaction,
    and we do not wish to start a transaction immediately upon
    receiving a query string.  The raw parsing stage is sufficient to
    identify the transaction control commands (<command>BEGIN</>,
    <command>ROLLBACK</>, etc.), and these can then be correctly
    executed without any further analysis.  Once we know that we are
    dealing with an actual query (such as <command>SELECT</> or
    <command>UPDATE</>), it is okay to start a transaction if we're
    not already in one.  Only then can the transformation process be
    invoked.
   </para>

   <para>
    The query tree created by the transformation process is
    structurally similar to the raw parse tree in most places, but it
    has many differences in detail.  For example, a
    <structname>FuncCall</> node in the parse tree represents
    something that looks syntactically like a function call.  This may
    be transformed to either a <structname>FuncExpr</> or
    <structname>Aggref</> node depending on whether the referenced
    name turns out to be an ordinary function or an aggregate
    function.  Also, information about the actual data types of
    columns and expression results is added to the query tree.
   </para>
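A toy illustration of this kind of semantic resolution (the catalog contents and helper below are invented stand-ins, not the real parse-analysis code): the same syntactic <structname>FuncCall</> becomes a different typed node depending on what the catalog lookup finds.

```python
# Invented stand-ins for catalog contents; not the real system catalogs.
CATALOG = {
    "upper": {"kind": "function", "result_type": "text"},
    "count": {"kind": "aggregate", "result_type": "bigint"},
}

def transform_func_call(name):
    """Resolve a syntactic FuncCall into a FuncExpr or Aggref node,
    attaching the result type found in the (toy) catalog."""
    entry = CATALOG.get(name)
    if entry is None:
        raise LookupError(f"function {name!r} does not exist")
    node = "FuncExpr" if entry["kind"] == "function" else "Aggref"
    return {"node": node, "name": name, "result_type": entry["result_type"]}

print(transform_func_call("upper"))   # FuncExpr node, result type text
print(transform_func_call("count"))   # Aggref node, result type bigint
```

This also shows why the lookup requires a transaction: the answer depends entirely on catalog state, which can only be read consistently inside one.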

  </sect2>
 </sect1>

 <sect1 id="rule-system">
  <title>The <productname>PostgreSQL</productname> Rule System</title>

  <para>
   <productname>PostgreSQL</productname> supports a powerful
   <firstterm>rule system</firstterm> for the specification of
   <firstterm>views</firstterm> and ambiguous
   <firstterm>view updates</firstterm>.  Originally the
   <productname>PostgreSQL</productname> rule system consisted of two
   implementations:

   <itemizedlist>
    <listitem>
     <para>
      The first one worked using <firstterm>tuple level</firstterm>
      processing and was implemented deep in the
      <firstterm>executor</firstterm>.  The rule system was called
      whenever an individual tuple had been accessed.  This
      implementation was removed in 1995 when the last official
      release of the <productname>Berkeley Postgres</productname>
      project was transformed into
      <productname>Postgres95</productname>.
     </para>
    </listitem>

    <listitem>
     <para>
      The second implementation of the rule system is a technique
      called <firstterm>query rewriting</firstterm>.  The
      <firstterm>rewrite system</firstterm> is a module that exists
      between the <firstterm>parser stage</firstterm> and the
      <firstterm>planner/optimizer</firstterm>.  This technique is
      still implemented.
     </para>
    </listitem>
   </itemizedlist>
  </para>

  <para>
   The query rewriter is discussed in some detail in
   <xref linkend="rules">, so there is no need to cover it here.
   We will only point out that both the input and the output of the
   rewriter are query trees, that is, there is no change in the
   representation or level of semantic detail in the trees.  Rewriting
   can be thought of as a form of macro expansion.
  </para>
 </sect1>

 <sect1 id="planner-optimizer">
  <title>Planner/Optimizer</title>

  <para>
   The task of the <firstterm>planner/optimizer</firstterm> is to
   create an optimal execution plan.  It first considers all possible
   ways of <firstterm>scanning</firstterm> and
   <firstterm>joining</firstterm> the relations that appear in a
   query.  All the created paths lead to the same result and it's the
   task of the optimizer to estimate the cost of executing each path
   and find out which one is the cheapest.
  </para>

  <para>
   After the cheapest path is determined, a <firstterm>plan tree</>
   is built to pass to the executor.  This represents the desired
   execution plan in sufficient detail for the executor to run it.
  </para>

  <sect2>
   <title>Generating Possible Plans</title>

   <para>
    The planner/optimizer decides which plans should be generated
    based upon the types of indexes defined on the relations appearing
    in a query.  There is always the possibility of performing a
    sequential scan on a relation, so a plan using only sequential
    scans is always created.  Assume an index is defined on a relation
    (for example a B-tree index) and a query contains the restriction
    <literal>relation.attribute OPR constant</literal>.  If
    <literal>relation.attribute</literal> happens to match the key of
    the B-tree index and <literal>OPR</literal> is one of the
    operators listed in the index's <firstterm>operator class</>,
    another plan is created using the B-tree index to scan the
    relation.  If there are further indexes present and the
    restrictions in the query happen to match a key of an index,
    further plans will be considered.
   </para>
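A sketch of this decision with made-up cost numbers (the real planner derives costs from table statistics; the formulas below are invented for illustration): a sequential-scan path is always generated, an index path only when a restriction matches an indexed column, and the cheapest surviving path wins.

```python
def generate_paths(indexed_columns, restriction_column, ntuples):
    """Toy path generation: a seqscan path is always possible; an index
    path is added only when the restriction column matches an index.
    The cost formulas are invented, not the planner's real ones."""
    paths = [("SeqScan", 1.0 * ntuples)]               # always created
    if restriction_column in indexed_columns:
        paths.append(("IndexScan", 2.0 * (ntuples ** 0.5)))
    return paths

def cheapest(paths):
    """Pick the path with the lowest estimated cost."""
    return min(paths, key=lambda p: p[1])

# With an index on 'attribute' and a large table, the index path wins.
paths = generate_paths({"attribute"}, "attribute", ntuples=10000)
print(cheapest(paths)[0])   # IndexScan (cost 200.0 beats 10000.0)
```

Without a matching index only the sequential-scan path exists, so it is chosen by default; this mirrors the guarantee above that a plan using only sequential scans can always be built.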

   <para>
    After all feasible plans have been found for scanning single
    relations, plans for joining relations are created.  The
    planner/optimizer preferentially considers joins between any two
    relations for which there exists a corresponding join clause in
    the WHERE qualification (i.e., for which a restriction like
    <literal>where rel1.attr1=rel2.attr2</literal> exists).  Join
    pairs with no join clause are considered only when there is no
    other choice, that is, a particular relation has no available join
    clauses to any other relation.  All possible plans are generated
    for every join pair considered by the planner/optimizer.  The
    three possible join strategies are:

    <itemizedlist>
     <listitem>
      <para>
       <firstterm>nested loop join</firstterm>: The right relation is
       scanned once for every tuple found in the left relation.  This
       strategy is easy to implement but can be very time consuming.
       (However, if the right relation can be scanned with an index
       scan, this can be a good strategy.  It is possible to use
       values from the current row of the left relation as keys for
       the index scan of the right.)
      </para>
     </listitem>

     <listitem>
      <para>
       <firstterm>merge sort join</firstterm>: Each relation is sorted
       on the join attributes before the join starts.  Then the two
       relations are merged together taking into account that both
       relations are ordered on the join attributes.  This kind of
       join is more attractive because each relation has to be scanned
       only once.
      </para>
     </listitem>

     <listitem>
      <para>
       <firstterm>hash join</firstterm>: The right relation is first
       scanned and loaded into a hash table, using its join attributes
       as hash keys.  Next the left relation is scanned and the
       appropriate values of every tuple found are used as hash keys
       to locate the matching tuples in the table.
      </para>
     </listitem>
    </itemizedlist>
   </para>
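As a toy illustration of two of these strategies (plain Python over lists of dictionaries with invented table data; real join nodes work tuple-at-a-time inside the executor), the nested loop and hash approaches produce the same join result by very different access patterns:

```python
def nested_loop_join(left, right, key):
    """Scan the right relation once for every tuple of the left relation."""
    return [{**l, **r} for l in left for r in right if l[key] == r[key]]

def hash_join(left, right, key):
    """Load the right relation into a hash table on its join attribute,
    then probe the table once per left tuple."""
    table = {}
    for r in right:
        table.setdefault(r[key], []).append(r)
    return [{**l, **r} for l in left for r in table.get(l[key], [])]

emp  = [{"id": 1, "name": "ann"}, {"id": 2, "name": "bob"}]
dept = [{"id": 1, "dept": "sales"}]
print(hash_join(emp, dept, "id"))
# [{'id': 1, 'name': 'ann', 'dept': 'sales'}]
```

The nested loop rescans the right input for every left tuple, while the hash join pays a one-time build cost and then probes in constant time, which is why the cost estimates for the two strategies can differ so sharply on large inputs.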

   <para>
    The finished plan tree consists of sequential or index scans of
    the base relations, plus nestloop, merge, or hash join nodes as
    needed, plus any auxiliary steps needed, such as sort nodes or
    aggregate-function calculation nodes.  Most of these plan node
    types have the additional ability to do <firstterm>selection</>
    (discarding rows that do not meet a specified boolean condition)
    and <firstterm>projection</> (computation of a derived column set
    based on given column values, that is, evaluation of scalar
    expressions where needed).  One of the responsibilities of the
    planner is to attach selection conditions from the WHERE clause
    and computation of required output expressions to the most
    appropriate nodes of the plan tree.
   </para>
  </sect2>
 </sect1>

 <sect1 id="executor">
  <title>Executor</title>

  <para>
   The <firstterm>executor</firstterm> takes the plan handed back by
   the planner/optimizer and recursively processes it to extract the
   required set of rows.  This is essentially a demand-pull pipeline
   mechanism.  Each time a plan node is called, it must deliver one
   more tuple, or report that it is done delivering tuples.
  </para>

  <para>
   To provide a concrete example, assume that the top node is a
   <literal>MergeJoin</literal> node.  Before any merge can be done
   two tuples have to be fetched (one from each subplan).  So the
   executor recursively calls itself to process the subplans (it
   starts with the subplan attached to <literal>lefttree</literal>).
   The new top node (the top node of the left subplan) is, let's say,
   a <literal>Sort</literal> node and again recursion is needed to
   obtain an input tuple.  The child node of the
   <literal>Sort</literal> might be a <literal>SeqScan</> node,
   representing actual reading of a table.  Execution of this node
   causes the executor to fetch a row from the table and return it up
   to the calling node.  The <literal>Sort</literal> node will
   repeatedly call its child to obtain all the rows to be sorted.
   When the input is exhausted (as indicated by the child node
   returning a NULL instead of a tuple), the <literal>Sort</literal>
   code performs the sort, and finally is able to return its first
   output row, namely the first one in sorted order.  It keeps the
   remaining rows stored so that it can deliver them in sorted order
   in response to later demands.
  </para>

  <para>
   The <literal>MergeJoin</literal> node similarly demands the first
   row from its right subplan.  Then it compares the two rows to see
   if they can be joined; if so, it returns a join row to its caller.
   On the next call, or immediately if it cannot join the current pair
   of inputs, it advances to the next row of one table or the other
   (depending on how the comparison came out), and again checks for a
   match.  Eventually, one subplan or the other is exhausted, and the
   <literal>MergeJoin</literal> node returns NULL to indicate that no
   more join rows can be formed.
  </para>

  <para>
   Complex queries may involve many levels of plan nodes, but the
   general approach is the same: each node computes and returns its
   next output row each time it is called.  Each node is also
   responsible for applying any selection or projection expressions
   that were assigned to it by the planner.
  </para>
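The demand-pull protocol described above maps naturally onto generators.  This toy sketch (the node names echo the plan node types mentioned above, but the code is invented, not executor code) builds a scan-sort-project pipeline in which each node hands one row upward per demand and applies its own selection or projection:

```python
def seq_scan(table, qual=lambda row: True):
    """Leaf node: read the table, applying a selection qualification."""
    for row in table:
        if qual(row):
            yield row

def sort_node(child, key):
    """Sort node: exhaust the child first, then deliver rows in order."""
    yield from sorted(child, key=key)

def project(child, columns):
    """Projection: compute the derived column set for each input row."""
    for row in child:
        yield {c: row[c] for c in columns}

table = [{"name": "bob", "age": 40}, {"name": "ann", "age": 30},
         {"name": "cal", "age": 20}]
plan = project(sort_node(seq_scan(table, qual=lambda r: r["age"] >= 30),
                         key=lambda r: r["name"]),
               columns=["name"])
print(list(plan))   # [{'name': 'ann'}, {'name': 'bob'}]
```

As in the real executor, the sort node cannot emit anything until its child is exhausted, while the scan and projection nodes stream one row per demand; a generator raising <literal>StopIteration</literal> plays the role of the NULL that signals an exhausted subplan.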

  <para>
   The executor mechanism is used to evaluate all four basic SQL query
   types: <command>SELECT</>, <command>INSERT</>,
   <command>UPDATE</>, and <command>DELETE</>.  For
   <command>SELECT</>, the top-level executor code only needs to send
   each row returned by the query plan tree off to the client.  For
   <command>INSERT</>, each returned row is inserted into the target
   table specified for the <command>INSERT</>.  (A simple
   <command>INSERT ... VALUES</> command creates a trivial plan tree
   consisting of a single <literal>Result</> node, which computes just
   one result row.  But <command>INSERT ... SELECT</> may demand the
   full power of the executor mechanism.)  For <command>UPDATE</>,
   the planner arranges that each computed row includes all the
   updated column values, plus the <firstterm>TID</> (tuple ID, or
   location) of the original target row; the executor top level uses
   this information to create a new updated row and mark the old row
   deleted.  For <command>DELETE</>, the only column that is actually
   returned by the plan is the TID, and the executor top level simply
   uses the TID to visit the target rows and mark them deleted.
  </para>
 </sect1>

</chapter>

<!-- Keep this comment at the end of the file
Local variables:
mode:sgml
sgml-omittag:nil
sgml-shorttag:t
sgml-minimize-attributes:nil
sgml-always-quote-attributes:t
sgml-indent-step:1
sgml-indent-data:t
sgml-parent-document:nil
sgml-default-dtd-file:"./reference.ced"
sgml-exposed-tags:nil
sgml-local-catalogs:("/usr/lib/sgml/catalog")
sgml-local-ecat-files:nil
End:
-->