mirror of
https://github.com/postgres/postgres.git
synced 2025-04-25 21:42:33 +03:00
552 lines
23 KiB
Plaintext
552 lines
23 KiB
Plaintext
<!-- $PostgreSQL: pgsql/doc/src/sgml/arch-dev.sgml,v 2.30 2007/07/21 04:02:41 tgl Exp $ -->
|
|
|
|
<chapter id="overview">
|
|
<title>Overview of PostgreSQL Internals</title>
|
|
|
|
<note>
|
|
<title>Author</title>
|
|
<para>
|
|
This chapter originated as part of
|
|
<xref linkend="SIM98">, Stefan Simkovics'
|
|
Master's Thesis prepared at Vienna University of Technology under the direction
|
|
of O.Univ.Prof.Dr. Georg Gottlob and Univ.Ass. Mag. Katrin Seyr.
|
|
</para>
|
|
</note>
|
|
|
|
<para>
|
|
This chapter gives an overview of the internal structure of the
|
|
backend of <productname>PostgreSQL</productname>. After having
|
|
read the following sections you should have an idea of how a query
|
|
is processed. This chapter does not aim to provide a detailed
|
|
description of the internal operation of
|
|
<productname>PostgreSQL</productname>, as such a document would be
|
|
very extensive. Rather, this chapter is intended to help the reader
|
|
understand the general sequence of operations that occur within the
|
|
backend from the point at which a query is received, to the point
|
|
at which the results are returned to the client.
|
|
</para>
|
|
|
|
<sect1 id="query-path">
|
|
<title>The Path of a Query</title>
|
|
|
|
<para>
|
|
Here we give a short overview of the stages a query has to pass in
|
|
order to obtain a result.
|
|
</para>
|
|
|
|
<procedure>
|
|
<step>
|
|
<para>
|
|
A connection from an application program to the <productname>PostgreSQL</productname>
|
|
server has to be established. The application program transmits a
|
|
query to the server and waits to receive the results sent back by the
|
|
server.
|
|
</para>
|
|
</step>
|
|
|
|
<step>
|
|
<para>
|
|
The <firstterm>parser stage</firstterm> checks the query
|
|
transmitted by the application
|
|
program for correct syntax and creates
|
|
a <firstterm>query tree</firstterm>.
|
|
</para>
|
|
</step>
|
|
|
|
<step>
|
|
<para>
|
|
The <firstterm>rewrite system</firstterm> takes
|
|
the query tree created by the parser stage and looks for
|
|
any <firstterm>rules</firstterm> (stored in the
|
|
<firstterm>system catalogs</firstterm>) to apply to
|
|
the query tree. It performs the
|
|
transformations given in the <firstterm>rule bodies</firstterm>.
|
|
</para>
|
|
|
|
<para>
|
|
One application of the rewrite system is in the realization of
|
|
<firstterm>views</firstterm>.
|
|
Whenever a query against a view
|
|
(i.e. a <firstterm>virtual table</firstterm>) is made,
|
|
the rewrite system rewrites the user's query to
|
|
a query that accesses the <firstterm>base tables</firstterm> given in
|
|
the <firstterm>view definition</firstterm> instead.
|
|
</para>
|
|
</step>
|
|
|
|
<step>
|
|
<para>
|
|
The <firstterm>planner/optimizer</firstterm> takes
|
|
the (rewritten) query tree and creates a
|
|
<firstterm>query plan</firstterm> that will be the input to the
|
|
<firstterm>executor</firstterm>.
|
|
</para>
|
|
|
|
<para>
|
|
It does so by first creating all possible <firstterm>paths</firstterm>
|
|
leading to the same result. For example if there is an index on a
|
|
relation to be scanned, there are two paths for the
|
|
scan. One possibility is a simple sequential scan and the other
|
|
possibility is to use the index. Next the cost for the execution of
|
|
each path is estimated and the cheapest path is chosen. The cheapest
|
|
path is expanded into a complete plan that the executor can use.
|
|
</para>
|
|
</step>
|
|
|
|
<step>
|
|
<para>
|
|
The executor recursively steps through
|
|
the <firstterm>plan tree</firstterm> and
|
|
retrieves rows in the way represented by the plan.
|
|
The executor makes use of the
|
|
<firstterm>storage system</firstterm> while scanning
|
|
relations, performs <firstterm>sorts</firstterm> and <firstterm>joins</firstterm>,
|
|
evaluates <firstterm>qualifications</firstterm> and finally hands back the rows derived.
|
|
</para>
|
|
</step>
|
|
</procedure>
|
|
|
|
<para>
|
|
In the following sections we will cover each of the above listed items
|
|
in more detail to give a better understanding of <productname>PostgreSQL</productname>'s internal
|
|
control and data structures.
|
|
</para>
|
|
</sect1>
|
|
|
|
<sect1 id="connect-estab">
|
|
<title>How Connections are Established</title>
|
|
|
|
<para>
|
|
<productname>PostgreSQL</productname> is implemented using a
|
|
simple <quote>process per user</> client/server model. In this model
|
|
there is one <firstterm>client process</firstterm> connected to
|
|
exactly one <firstterm>server process</firstterm>. As we do not
|
|
know ahead of time how many connections will be made, we have to
|
|
use a <firstterm>master process</firstterm> that spawns a new
|
|
server process every time a connection is requested. This master
|
|
process is called <literal>postgres</literal> and listens at a
|
|
specified TCP/IP port for incoming connections. Whenever a request
|
|
for a connection is detected the <literal>postgres</literal>
|
|
process spawns a new server process. The server tasks
|
|
communicate with each other using <firstterm>semaphores</firstterm> and
|
|
<firstterm>shared memory</firstterm> to ensure data integrity
|
|
throughout concurrent data access.
|
|
</para>
|
|
|
|
<para>
|
|
The client process can be any program that understands the
|
|
<productname>PostgreSQL</productname> protocol described in
|
|
<xref linkend="protocol">. Many clients are based on the
|
|
C-language library <application>libpq</>, but several independent
|
|
implementations of the protocol exist, such as the Java
|
|
<application>JDBC</> driver.
|
|
</para>
|
|
|
|
<para>
|
|
Once a connection is established the client process can send a query
|
|
to the <firstterm>backend</firstterm> (server). The query is transmitted using plain text,
|
|
i.e. there is no parsing done in the <firstterm>frontend</firstterm> (client). The
|
|
server parses the query, creates an <firstterm>execution plan</firstterm>,
|
|
executes the plan and returns the retrieved rows to the client
|
|
by transmitting them over the established connection.
|
|
</para>
|
|
</sect1>
|
|
|
|
<sect1 id="parser-stage">
|
|
<title>The Parser Stage</title>
|
|
|
|
<para>
|
|
The <firstterm>parser stage</firstterm> consists of two parts:
|
|
|
|
<itemizedlist>
|
|
<listitem>
|
|
<para>
|
|
The <firstterm>parser</firstterm> defined in
|
|
<filename>gram.y</filename> and <filename>scan.l</filename> is
|
|
built using the Unix tools <application>yacc</application>
|
|
and <application>lex</application>.
|
|
</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para>
|
|
The <firstterm>transformation process</firstterm> does
|
|
modifications and augmentations to the data structures returned by the parser.
|
|
</para>
|
|
</listitem>
|
|
</itemizedlist>
|
|
</para>
|
|
|
|
<sect2>
|
|
<title>Parser</title>
|
|
|
|
<para>
|
|
The parser has to check the query string (which arrives as plain
|
|
ASCII text) for valid syntax. If the syntax is correct a
|
|
<firstterm>parse tree</firstterm> is built up and handed back;
|
|
otherwise an error is returned. The parser and lexer are
|
|
implemented using the well-known Unix tools <application>yacc</>
|
|
and <application>lex</>.
|
|
</para>
|
|
|
|
<para>
|
|
The <firstterm>lexer</firstterm> is defined in the file
|
|
<filename>scan.l</filename> and is responsible
|
|
for recognizing <firstterm>identifiers</firstterm>,
|
|
the <firstterm>SQL key words</firstterm> etc. For
|
|
every key word or identifier that is found, a <firstterm>token</firstterm>
|
|
is generated and handed to the parser.
|
|
</para>
|
|
|
|
<para>
|
|
The parser is defined in the file <filename>gram.y</filename> and
|
|
consists of a set of <firstterm>grammar rules</firstterm> and
|
|
<firstterm>actions</firstterm> that are executed whenever a rule
|
|
is fired. The code of the actions (which is actually C code) is
|
|
used to build up the parse tree.
|
|
</para>
|
|
|
|
<para>
|
|
The file <filename>scan.l</filename> is transformed to the C
|
|
source file <filename>scan.c</filename> using the program
|
|
<application>lex</application> and <filename>gram.y</filename> is
|
|
transformed to <filename>gram.c</filename> using
|
|
<application>yacc</application>. After these transformations
|
|
have taken place a normal C compiler can be used to create the
|
|
parser. Never make any changes to the generated C files as they
|
|
will be overwritten the next time <application>lex</application>
|
|
or <application>yacc</application> is called.
|
|
|
|
<note>
|
|
<para>
|
|
The mentioned transformations and compilations are normally done
|
|
automatically using the <firstterm>makefiles</firstterm>
|
|
shipped with the <productname>PostgreSQL</productname>
|
|
source distribution.
|
|
</para>
|
|
</note>
|
|
</para>
|
|
|
|
<para>
|
|
A detailed description of <application>yacc</application> or
|
|
the grammar rules given in <filename>gram.y</filename> would be
|
|
beyond the scope of this paper. There are many books and
|
|
documents dealing with <application>lex</application> and
|
|
<application>yacc</application>. You should be familiar with
|
|
<application>yacc</application> before you start to study the
|
|
grammar given in <filename>gram.y</filename> otherwise you won't
|
|
understand what happens there.
|
|
</para>
|
|
|
|
</sect2>
|
|
|
|
<sect2>
|
|
<title>Transformation Process</title>
|
|
|
|
<para>
|
|
The parser stage creates a parse tree using only fixed rules about
|
|
the syntactic structure of SQL. It does not make any lookups in the
|
|
system catalogs, so there is no possibility to understand the detailed
|
|
semantics of the requested operations. After the parser completes,
|
|
the <firstterm>transformation process</firstterm> takes the tree handed
|
|
back by the parser as input and does the semantic interpretation needed
|
|
to understand which tables, functions, and operators are referenced by
|
|
the query. The data structure that is built to represent this
|
|
information is called the <firstterm>query tree</>.
|
|
</para>
|
|
|
|
<para>
|
|
The reason for separating raw parsing from semantic analysis is that
|
|
system catalog lookups can only be done within a transaction, and we
|
|
do not wish to start a transaction immediately upon receiving a query
|
|
string. The raw parsing stage is sufficient to identify the transaction
|
|
control commands (<command>BEGIN</>, <command>ROLLBACK</>, etc), and
|
|
these can then be correctly executed without any further analysis.
|
|
Once we know that we are dealing with an actual query (such as
|
|
<command>SELECT</> or <command>UPDATE</>), it is okay to
|
|
start a transaction if we're not already in one. Only then can the
|
|
transformation process be invoked.
|
|
</para>
|
|
|
|
<para>
|
|
The query tree created by the transformation process is structurally
|
|
similar to the raw parse tree in most places, but it has many differences
|
|
in detail. For example, a <structname>FuncCall</> node in the
|
|
parse tree represents something that looks syntactically like a function
|
|
call. This might be transformed to either a <structname>FuncExpr</>
|
|
or <structname>Aggref</> node depending on whether the referenced
|
|
name turns out to be an ordinary function or an aggregate function.
|
|
Also, information about the actual data types of columns and expression
|
|
results is added to the query tree.
|
|
</para>
|
|
</sect2>
|
|
</sect1>
|
|
|
|
<sect1 id="rule-system">
|
|
<title>The <productname>PostgreSQL</productname> Rule System</title>
|
|
|
|
<para>
|
|
<productname>PostgreSQL</productname> supports a powerful
|
|
<firstterm>rule system</firstterm> for the specification
|
|
of <firstterm>views</firstterm> and ambiguous <firstterm>view updates</firstterm>.
|
|
Originally the <productname>PostgreSQL</productname>
|
|
rule system consisted of two implementations:
|
|
|
|
<itemizedlist>
|
|
<listitem>
|
|
<para>
|
|
The first one worked using <firstterm>row level</firstterm> processing and was
|
|
implemented deep in the <firstterm>executor</firstterm>. The rule system was
|
|
called whenever an individual row had been accessed. This
|
|
implementation was removed in 1995 when the last official release
|
|
of the <productname>Berkeley Postgres</productname> project was
|
|
transformed into <productname>Postgres95</productname>.
|
|
</para>
|
|
</listitem>
|
|
|
|
<listitem>
|
|
<para>
|
|
The second implementation of the rule system is a technique
|
|
called <firstterm>query rewriting</firstterm>.
|
|
The <firstterm>rewrite system</firstterm> is a module
|
|
that exists between the <firstterm>parser stage</firstterm> and the
|
|
<firstterm>planner/optimizer</firstterm>. This technique is still implemented.
|
|
</para>
|
|
</listitem>
|
|
</itemizedlist>
|
|
</para>
|
|
|
|
<para>
|
|
The query rewriter is discussed in some detail in
|
|
<xref linkend="rules">, so there is no need to cover it here.
|
|
We will only point out that both the input and the output of the
|
|
rewriter are query trees, that is, there is no change in the
|
|
representation or level of semantic detail in the trees. Rewriting
|
|
can be thought of as a form of macro expansion.
|
|
</para>
|
|
|
|
</sect1>
|
|
|
|
<sect1 id="planner-optimizer">
|
|
<title>Planner/Optimizer</title>
|
|
|
|
<para>
|
|
The task of the <firstterm>planner/optimizer</firstterm> is to
|
|
create an optimal execution plan. A given SQL query (and hence, a
|
|
query tree) can be actually executed in a wide variety of
|
|
different ways, each of which will produce the same set of
|
|
results. If it is computationally feasible, the query optimizer
|
|
will examine each of these possible execution plans, ultimately
|
|
selecting the execution plan that is expected to run the fastest.
|
|
</para>
|
|
|
|
<note>
|
|
<para>
|
|
In some situations, examining each possible way in which a query
|
|
can be executed would take an excessive amount of time and memory
|
|
space. In particular, this occurs when executing queries
|
|
involving large numbers of join operations. In order to determine
|
|
a reasonable (not necessarily optimal) query plan in a reasonable amount
|
|
of time, <productname>PostgreSQL</productname> uses a <xref
|
|
linkend="geqo" endterm="geqo-title"> when the number of joins
|
|
exceeds a threshold (see <xref linkend="guc-geqo-threshold">).
|
|
</para>
|
|
</note>
|
|
|
|
<para>
|
|
The planner's search procedure actually works with data structures
|
|
called <firstterm>paths</>, which are simply cut-down representations of
|
|
plans containing only as much information as the planner needs to make
|
|
its decisions. After the cheapest path is determined, a full-fledged
|
|
<firstterm>plan tree</> is built to pass to the executor. This represents
|
|
the desired execution plan in sufficient detail for the executor to run it.
|
|
In the rest of this section we'll ignore the distinction between paths
|
|
and plans.
|
|
</para>
|
|
|
|
<sect2>
|
|
<title>Generating Possible Plans</title>
|
|
|
|
<para>
|
|
The planner/optimizer starts by generating plans for scanning each
|
|
individual relation (table) used in the query. The possible plans
|
|
are determined by the available indexes on each relation.
|
|
There is always the possibility of performing a
|
|
sequential scan on a relation, so a sequential scan plan is always
|
|
created. Assume an index is defined on a
|
|
relation (for example a B-tree index) and a query contains the
|
|
restriction
|
|
<literal>relation.attribute OPR constant</literal>. If
|
|
<literal>relation.attribute</literal> happens to match the key of the B-tree
|
|
index and <literal>OPR</literal> is one of the operators listed in
|
|
the index's <firstterm>operator class</>, another plan is created using
|
|
the B-tree index to scan the relation. If there are further indexes
|
|
present and the restrictions in the query happen to match a key of an
|
|
index, further plans will be considered. Index scan plans are also
|
|
generated for indexes that have a sort ordering that can match the
|
|
query's <literal>ORDER BY</> clause (if any), or a sort ordering that
|
|
might be useful for merge joining (see below).
|
|
</para>
|
|
|
|
<para>
|
|
If the query requires joining two or more relations,
|
|
plans for joining relations are considered
|
|
after all feasible plans have been found for scanning single relations.
|
|
The three available join strategies are:
|
|
|
|
<itemizedlist>
|
|
<listitem>
|
|
<para>
|
|
<firstterm>nested loop join</firstterm>: The right relation is scanned
|
|
once for every row found in the left relation. This strategy
|
|
is easy to implement but can be very time consuming. (However,
|
|
if the right relation can be scanned with an index scan, this can
|
|
be a good strategy. It is possible to use values from the current
|
|
row of the left relation as keys for the index scan of the right.)
|
|
</para>
|
|
</listitem>
|
|
|
|
<listitem>
|
|
<para>
|
|
<firstterm>merge sort join</firstterm>: Each relation is sorted on the join
|
|
attributes before the join starts. Then the two relations are
|
|
scanned in parallel, and matching rows are combined to form
|
|
join rows. This kind of join is more
|
|
attractive because each relation has to be scanned only once.
|
|
The required sorting might be achieved either by an explicit sort
|
|
step, or by scanning the relation in the proper order using an
|
|
index on the join key.
|
|
</para>
|
|
</listitem>
|
|
|
|
<listitem>
|
|
<para>
|
|
<firstterm>hash join</firstterm>: the right relation is first scanned
|
|
and loaded into a hash table, using its join attributes as hash keys.
|
|
Next the left relation is scanned and the
|
|
appropriate values of every row found are used as hash keys to
|
|
locate the matching rows in the table.
|
|
</para>
|
|
</listitem>
|
|
</itemizedlist>
|
|
</para>
|
|
|
|
<para>
|
|
When the query involves more than two relations, the final result
|
|
must be built up by a tree of join steps, each with two inputs.
|
|
The planner examines different possible join sequences to find the
|
|
cheapest one.
|
|
</para>
|
|
|
|
<para>
|
|
If the query uses fewer than <xref linkend="guc-geqo-threshold">
|
|
relations, a near-exhaustive search is conducted to find the best
|
|
join sequence. The planner preferentially considers joins between any
|
|
two relations for which there exist a corresponding join clause in the
|
|
<literal>WHERE</literal> qualification (i.e. for
|
|
which a restriction like <literal>where rel1.attr1=rel2.attr2</literal>
|
|
exists). Join pairs with no join clause are considered only when there
|
|
is no other choice, that is, a particular relation has no available
|
|
join clauses to any other relation. All possible plans are generated for
|
|
every join pair considered by the planner, and the one that is
|
|
(estimated to be) the cheapest is chosen.
|
|
</para>
|
|
|
|
<para>
|
|
When <varname>geqo_threshold</varname> is exceeded, the join
|
|
sequences considered are determined by heuristics, as described
|
|
in <xref linkend="geqo">. Otherwise the process is the same.
|
|
</para>
|
|
|
|
<para>
|
|
The finished plan tree consists of sequential or index scans of
|
|
the base relations, plus nested-loop, merge, or hash join nodes as
|
|
needed, plus any auxiliary steps needed, such as sort nodes or
|
|
aggregate-function calculation nodes. Most of these plan node
|
|
types have the additional ability to do <firstterm>selection</>
|
|
(discarding rows that do not meet a specified boolean condition)
|
|
and <firstterm>projection</> (computation of a derived column set
|
|
based on given column values, that is, evaluation of scalar
|
|
expressions where needed). One of the responsibilities of the
|
|
planner is to attach selection conditions from the
|
|
<literal>WHERE</literal> clause and computation of required
|
|
output expressions to the most appropriate nodes of the plan
|
|
tree.
|
|
</para>
|
|
</sect2>
|
|
</sect1>
|
|
|
|
<sect1 id="executor">
|
|
<title>Executor</title>
|
|
|
|
<para>
|
|
The <firstterm>executor</firstterm> takes the plan handed back by the
|
|
planner/optimizer and recursively processes it to extract the required set
|
|
of rows. This is essentially a demand-pull pipeline mechanism.
|
|
Each time a plan node is called, it must deliver one more row, or
|
|
report that it is done delivering rows.
|
|
</para>
|
|
|
|
<para>
|
|
To provide a concrete example, assume that the top
|
|
node is a <literal>MergeJoin</literal> node.
|
|
Before any merge can be done two rows have to be fetched (one from
|
|
each subplan). So the executor recursively calls itself to
|
|
process the subplans (it starts with the subplan attached to
|
|
<literal>lefttree</literal>). The new top node (the top node of the left
|
|
subplan) is, let's say, a
|
|
<literal>Sort</literal> node and again recursion is needed to obtain
|
|
an input row. The child node of the <literal>Sort</literal> might
|
|
be a <literal>SeqScan</> node, representing actual reading of a table.
|
|
Execution of this node causes the executor to fetch a row from the
|
|
table and return it up to the calling node. The <literal>Sort</literal>
|
|
node will repeatedly call its child to obtain all the rows to be sorted.
|
|
When the input is exhausted (as indicated by the child node returning
|
|
a NULL instead of a row), the <literal>Sort</literal> code performs
|
|
the sort, and finally is able to return its first output row, namely
|
|
the first one in sorted order. It keeps the remaining rows stored so
|
|
that it can deliver them in sorted order in response to later demands.
|
|
</para>
|
|
|
|
<para>
|
|
The <literal>MergeJoin</literal> node similarly demands the first row
|
|
from its right subplan. Then it compares the two rows to see if they
|
|
can be joined; if so, it returns a join row to its caller. On the next
|
|
call, or immediately if it cannot join the current pair of inputs,
|
|
it advances to the next row of one table
|
|
or the other (depending on how the comparison came out), and again
|
|
checks for a match. Eventually, one subplan or the other is exhausted,
|
|
and the <literal>MergeJoin</literal> node returns NULL to indicate that
|
|
no more join rows can be formed.
|
|
</para>
|
|
|
|
<para>
|
|
Complex queries can involve many levels of plan nodes, but the general
|
|
approach is the same: each node computes and returns its next output
|
|
row each time it is called. Each node is also responsible for applying
|
|
any selection or projection expressions that were assigned to it by
|
|
the planner.
|
|
</para>
|
|
|
|
<para>
|
|
The executor mechanism is used to evaluate all four basic SQL query types:
|
|
<command>SELECT</>, <command>INSERT</>, <command>UPDATE</>, and
|
|
<command>DELETE</>. For <command>SELECT</>, the top-level executor
|
|
code only needs to send each row returned by the query plan tree off
|
|
to the client. For <command>INSERT</>, each returned row is inserted
|
|
into the target table specified for the <command>INSERT</>. (A simple
|
|
<command>INSERT ... VALUES</> command creates a trivial plan tree
|
|
consisting of a single <literal>Result</> node, which computes just one
|
|
result row. But <command>INSERT ... SELECT</> can demand the full power
|
|
of the executor mechanism.) For <command>UPDATE</>, the planner arranges
|
|
that each computed row includes all the updated column values, plus
|
|
the <firstterm>TID</> (tuple ID, or row ID) of the original target row;
|
|
the executor top level uses this information to create a new updated row
|
|
and mark the old row deleted. For <command>DELETE</>, the only column
|
|
that is actually returned by the plan is the TID, and the executor top
|
|
level simply uses the TID to visit each target row and mark it deleted.
|
|
</para>
|
|
|
|
</sect1>
|
|
|
|
</chapter>
|