<!--
$Header: /cvsroot/pgsql/doc/src/sgml/arch-dev.sgml,v 2.21 2003/06/22 16:16:44 tgl Exp $
-->

<chapter id="overview">

 <title>Overview of PostgreSQL Internals</title>

 <note>
  <title>Author</title>
  <para>
   This chapter originated as part of
   <xref linkend="SIM98">, Stefan Simkovics'
   Master's Thesis prepared at Vienna University of Technology under the
   direction of O.Univ.Prof.Dr. Georg Gottlob and Univ.Ass. Mag. Katrin Seyr.
  </para>
 </note>

 <para>
  This chapter gives an overview of the internal structure of the
  backend of <productname>PostgreSQL</productname>.  After having
  read the following sections you should have an idea of how a query
  is processed.  This chapter does not aim to provide a detailed
  description of the internal operation of
  <productname>PostgreSQL</productname>, as such a document would be
  very extensive.  Rather, this chapter is intended to help the reader
  understand the general sequence of operations that occur within the
  backend from the point at which a query is received, to the point
  when the results are returned to the client.
 </para>

 <sect1 id="query-path">
  <title>The Path of a Query</title>

  <para>
   Here we give a short overview of the stages a query has to pass in
   order to obtain a result.
  </para>

  <procedure>
   <step>
    <para>
     A connection from an application program to the
     <productname>PostgreSQL</productname> server has to be
     established.  The application program transmits a query to the
     server and waits to receive the results sent back by the server.
    </para>
   </step>

   <step>
    <para>
     The <firstterm>parser stage</firstterm> checks the query
     transmitted by the application program for correct syntax and
     creates a <firstterm>query tree</firstterm>.
    </para>
   </step>

   <step>
    <para>
     The <firstterm>rewrite system</firstterm> takes the query tree
     created by the parser stage and looks for any
     <firstterm>rules</firstterm> (stored in the
     <firstterm>system catalogs</firstterm>) to apply to the query
     tree.  It performs the transformations given in the
     <firstterm>rule bodies</firstterm>.  One application of the
     rewrite system is in the realization of
     <firstterm>views</firstterm>.
    </para>

    <para>
     Whenever a query against a view (i.e., a
     <firstterm>virtual table</firstterm>) is made, the rewrite system
     rewrites the user's query to a query that accesses the
     <firstterm>base tables</firstterm> given in the
     <firstterm>view definition</firstterm> instead.
    </para>
   </step>

   <step>
    <para>
     The <firstterm>planner/optimizer</firstterm> takes the
     (rewritten) query tree and creates a
     <firstterm>query plan</firstterm> that will be the input to the
     <firstterm>executor</firstterm>.
    </para>

    <para>
     It does so by first creating all possible <firstterm>paths</firstterm>
     leading to the same result.  For example, if there is an index on
     a relation to be scanned, there are two paths for the scan.  One
     possibility is a simple sequential scan and the other possibility
     is to use the index.  Next the cost for the execution of each
     path is estimated and the cheapest path is chosen and handed
     back.
    </para>
   </step>

   <step>
    <para>
     The executor recursively steps through the
     <firstterm>plan tree</firstterm> and retrieves tuples in the way
     represented by the plan.  The executor makes use of the
     <firstterm>storage system</firstterm> while scanning relations,
     performs <firstterm>sorts</firstterm> and
     <firstterm>joins</firstterm>, evaluates
     <firstterm>qualifications</firstterm>, and finally hands back the
     tuples derived.
    </para>
   </step>
  </procedure>
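The sequence of stages above can be sketched as a toy pipeline.  This is an illustration only, not PostgreSQL code: every function and data structure below is invented, but each stage consumes the previous stage's output, from query string to query tree to plan to result rows, just as the backend modules do.

```python
# Toy sketch of the backend stages; all names are invented for illustration.

def parse(query_string):
    """Parser stage: syntax check, produce a (trivial) query tree."""
    words = query_string.split()
    if words[0].upper() != "SELECT":
        raise SyntaxError("only SELECT is recognized in this sketch")
    return {"kind": "select", "target": words[1], "relation": words[3]}

def rewrite(query_tree, rules):
    """Rewrite system: expand a view into the base table it is defined over."""
    relation = query_tree["relation"]
    if relation in rules:                      # the relation is a view
        query_tree = dict(query_tree, relation=rules[relation])
    return query_tree

def plan(query_tree):
    """Planner/optimizer: here there is only one path, a sequential scan."""
    return {"node": "SeqScan", "relation": query_tree["relation"],
            "target": query_tree["target"]}

def execute(plan_tree, storage):
    """Executor: walk the plan and hand back the derived tuples."""
    for row in storage[plan_tree["relation"]]:
        yield row[plan_tree["target"]]

storage = {"emp_base": [{"name": "ann"}, {"name": "bob"}]}
views = {"emp": "emp_base"}          # view 'emp' over base table 'emp_base'
tree = rewrite(parse("SELECT name FROM emp"), views)
print(list(execute(plan(tree), storage)))      # ['ann', 'bob']
```

Note how the rewrite step makes the view transparent to the later stages: the planner and executor only ever see the base table.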

  <para>
   In the following sections we will cover each of the above listed
   items in more detail to give a better understanding of
   <productname>PostgreSQL</productname>'s internal control and data
   structures.
  </para>
 </sect1>

 <sect1 id="connect-estab">
  <title>How Connections are Established</title>

  <para>
   <productname>PostgreSQL</productname> is implemented using a
   simple <quote>process per user</> client/server model.  In this
   model there is one <firstterm>client process</firstterm> connected
   to exactly one <firstterm>server process</firstterm>.  As we do not
   know ahead of time how many connections will be made, we have to
   use a <firstterm>master process</firstterm> that spawns a new
   server process every time a connection is requested.  This master
   process is called <literal>postmaster</literal> and listens at a
   specified TCP/IP port for incoming connections.  Whenever a request
   for a connection is detected the <literal>postmaster</literal>
   process spawns a new server process called
   <literal>postgres</literal>.  The server tasks
   (<literal>postgres</literal> processes) communicate with each
   other using <firstterm>semaphores</firstterm> and
   <firstterm>shared memory</firstterm> to ensure data integrity
   throughout concurrent data access.
  </para>
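A minimal sketch of this process-per-connection scheme, assuming a Unix-like system (this is not postmaster code; the class and function names are invented): a listening master forks a child process to serve each accepted connection, and the child handles exactly one client.

```python
import socket
import socketserver
import threading

class BackendHandler(socketserver.StreamRequestHandler):
    """Stands in for one 'postgres' server process: it serves exactly
    one client connection."""
    def handle(self):
        query = self.rfile.readline().decode().strip()
        # A real backend would parse, plan, and execute the query here.
        self.wfile.write(f"result for: {query}\n".encode())

class Postmaster(socketserver.ForkingTCPServer):
    """Stands in for the postmaster: listens on a port and forks a
    child process for every incoming connection."""
    allow_reuse_address = True

def ask(port, query):
    """Minimal client: connect, send one query, read the reply."""
    with socket.create_connection(("127.0.0.1", port)) as s:
        s.sendall(query.encode() + b"\n")
        return s.makefile().readline().strip()

server = Postmaster(("127.0.0.1", 0), BackendHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
print(ask(server.server_address[1], "SELECT 1"))   # result for: SELECT 1
server.shutdown()
```

The sketch omits what makes the real arrangement interesting: the forked children share semaphores and shared-memory segments inherited from the master, which is what allows concurrent backends to coordinate access to the same data.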

  <para>
   The client process can be any program that understands the
   <productname>PostgreSQL</productname> protocol described in
   <xref linkend="protocol">.  Many clients are based on the
   C-language library <application>libpq</>, but several independent
   implementations exist, such as the Java <application>JDBC</>
   driver.
  </para>

  <para>
   Once a connection is established the client process can send a
   query to the <firstterm>backend</firstterm> (server).  The query is
   transmitted using plain text, i.e., there is no parsing done in the
   <firstterm>frontend</firstterm> (client).  The server parses the
   query, creates an <firstterm>execution plan</firstterm>, executes
   the plan, and returns the retrieved tuples to the client by
   transmitting them over the established connection.
  </para>
 </sect1>

 <sect1 id="parser-stage">
  <title>The Parser Stage</title>

  <para>
   The <firstterm>parser stage</firstterm> consists of two parts:

   <itemizedlist>
    <listitem>
     <para>
      The <firstterm>parser</firstterm> defined in
      <filename>gram.y</filename> and <filename>scan.l</filename> is
      built using the Unix tools <application>yacc</application>
      and <application>lex</application>.
     </para>
    </listitem>
    <listitem>
     <para>
      The <firstterm>transformation process</firstterm> does
      modifications and augmentations to the data structures returned
      by the parser.
     </para>
    </listitem>
   </itemizedlist>
  </para>

  <sect2>
   <title>Parser</title>

   <para>
    The parser has to check the query string (which arrives as plain
    ASCII text) for valid syntax.  If the syntax is correct a
    <firstterm>parse tree</firstterm> is built up and handed back;
    otherwise an error is returned.  The well-known Unix tools
    <application>lex</application> and <application>yacc</application>
    are used for the implementation.
   </para>

   <para>
    The <firstterm>lexer</firstterm> is defined in the file
    <filename>scan.l</filename> and is responsible for recognizing
    <firstterm>identifiers</firstterm>, the
    <firstterm>SQL keywords</firstterm>, etc.  For every keyword or
    identifier that is found, a <firstterm>token</firstterm> is
    generated and handed to the parser.
   </para>
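As a toy illustration of what <filename>scan.l</filename> does (this is not the real lexer; the token names and keyword list below are invented), a lexer turns the query text into a stream of keyword, identifier, and literal tokens that the parser consumes one at a time:

```python
import re

# Invented token rules, loosely in the spirit of scan.l; not the real ones.
KEYWORDS = {"SELECT", "FROM", "WHERE"}
TOKEN_RE = re.compile(
    r"\s*(?:(?P<num>\d+)|(?P<word>[A-Za-z_][A-Za-z0-9_]*)|(?P<op>[=*,;<>]))")

def tokenize(sql):
    """Yield (token_type, value) pairs, classifying each word as a
    keyword or an identifier, the way a lexer hands tokens to the parser."""
    pos = 0
    while pos < len(sql):
        m = TOKEN_RE.match(sql, pos)
        if not m:
            raise SyntaxError(f"unexpected character at {pos}: {sql[pos]!r}")
        pos = m.end()
        if m.lastgroup == "word":
            word = m.group("word")
            kind = "KEYWORD" if word.upper() in KEYWORDS else "IDENT"
            yield kind, word
        elif m.lastgroup == "num":
            yield "NUMBER", m.group("num")
        else:
            yield "OP", m.group("op")

print(list(tokenize("SELECT name FROM emp WHERE age = 42")))
```

Note that the lexer alone attaches no meaning to the identifiers; deciding what <literal>emp</literal> actually refers to is left to later stages.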

   <para>
    The parser is defined in the file <filename>gram.y</filename> and
    consists of a set of <firstterm>grammar rules</firstterm> and
    <firstterm>actions</firstterm> that are executed whenever a rule
    is fired.  The code of the actions (which is actually C code) is
    used to build up the parse tree.
   </para>

   <para>
    The file <filename>scan.l</filename> is transformed to the C
    source file <filename>scan.c</filename> using the program
    <application>lex</application>, and <filename>gram.y</filename> is
    transformed to <filename>gram.c</filename> using
    <application>yacc</application>.  After these transformations have
    taken place a normal C compiler can be used to create the parser.
    Never make any changes to the generated C files as they will be
    overwritten the next time <application>lex</application> or
    <application>yacc</application> is called.

    <note>
     <para>
      The mentioned transformations and compilations are normally done
      automatically using the <firstterm>makefiles</firstterm>
      shipped with the <productname>PostgreSQL</productname>
      source distribution.
     </para>
    </note>
   </para>

   <para>
    A detailed description of <application>yacc</application> or the
    grammar rules given in <filename>gram.y</filename> would be beyond
    the scope of this chapter.  There are many books and documents
    dealing with <application>lex</application> and
    <application>yacc</application>.  You should be familiar with
    <application>yacc</application> before you start to study the
    grammar given in <filename>gram.y</filename>; otherwise you won't
    understand what happens there.
   </para>
  </sect2>

  <sect2>
   <title>Transformation Process</title>

   <para>
    The parser stage creates a parse tree using only fixed rules about
    the syntactic structure of SQL.  It does not make any lookups in
    the system catalogs, so there is no possibility to understand the
    detailed semantics of the requested operations.  After the parser
    completes, the <firstterm>transformation process</firstterm> takes
    the tree handed back by the parser as input and does the semantic
    interpretation needed to understand which tables, functions, and
    operators are referenced by the query.  The data structure that is
    built to represent this information is called the
    <firstterm>query tree</>.
   </para>

   <para>
    The reason for separating raw parsing from semantic analysis is
    that system catalog lookups can only be done within a transaction,
    and we do not wish to start a transaction immediately upon
    receiving a query string.  The raw parsing stage is sufficient to
    identify the transaction control commands (<command>BEGIN</>,
    <command>ROLLBACK</>, etc.), and these can then be correctly
    executed without any further analysis.  Once we know that we are
    dealing with an actual query (such as <command>SELECT</> or
    <command>UPDATE</>), it is okay to start a transaction if we're
    not already in one.  Only then can the transformation process be
    invoked.
   </para>

   <para>
    The query tree created by the transformation process is
    structurally similar to the raw parse tree in most places, but it
    has many differences in detail.  For example, a
    <structname>FuncCall</> node in the parse tree represents
    something that looks syntactically like a function call.  This may
    be transformed to either a <structname>FuncExpr</> or
    <structname>Aggref</> node depending on whether the referenced
    name turns out to be an ordinary function or an aggregate
    function.  Also, information about the actual data types of
    columns and expression results is added to the query tree.
   </para>
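A toy illustration of this kind of semantic resolution (the catalog contents and helper below are invented stand-ins, not the real parse-analysis code): the same syntactic <structname>FuncCall</> becomes a different typed node depending on what the catalog lookup finds.

```python
# Invented stand-ins for catalog contents; not the real system catalogs.
CATALOG = {
    "upper": {"kind": "function", "result_type": "text"},
    "count": {"kind": "aggregate", "result_type": "bigint"},
}

def transform_func_call(name):
    """Resolve a syntactic FuncCall into a FuncExpr or Aggref node,
    attaching the result type found in the (toy) catalog."""
    entry = CATALOG.get(name)
    if entry is None:
        raise LookupError(f"function {name!r} does not exist")
    node = "FuncExpr" if entry["kind"] == "function" else "Aggref"
    return {"node": node, "name": name, "result_type": entry["result_type"]}

print(transform_func_call("upper"))   # FuncExpr node, result type text
print(transform_func_call("count"))   # Aggref node, result type bigint
```

This also shows why the lookup requires a transaction: the answer depends entirely on catalog state, which can only be read consistently inside one.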

  </sect2>
 </sect1>

 <sect1 id="rule-system">
  <title>The <productname>PostgreSQL</productname> Rule System</title>

  <para>
   <productname>PostgreSQL</productname> supports a powerful
   <firstterm>rule system</firstterm> for the specification of
   <firstterm>views</firstterm> and ambiguous
   <firstterm>view updates</firstterm>.  Originally the
   <productname>PostgreSQL</productname> rule system consisted of two
   implementations:

   <itemizedlist>
    <listitem>
     <para>
      The first one worked using <firstterm>tuple level</firstterm>
      processing and was implemented deep in the
      <firstterm>executor</firstterm>.  The rule system was called
      whenever an individual tuple had been accessed.  This
      implementation was removed in 1995 when the last official
      release of the <productname>Berkeley Postgres</productname>
      project was transformed into
      <productname>Postgres95</productname>.
     </para>
    </listitem>

    <listitem>
     <para>
      The second implementation of the rule system is a technique
      called <firstterm>query rewriting</firstterm>.  The
      <firstterm>rewrite system</firstterm> is a module that exists
      between the <firstterm>parser stage</firstterm> and the
      <firstterm>planner/optimizer</firstterm>.  This technique is
      still implemented.
     </para>
    </listitem>
   </itemizedlist>
  </para>

  <para>
   The query rewriter is discussed in some detail in
   <xref linkend="rules">, so there is no need to cover it here.
   We will only point out that both the input and the output of the
   rewriter are query trees, that is, there is no change in the
   representation or level of semantic detail in the trees.  Rewriting
   can be thought of as a form of macro expansion.
  </para>
 </sect1>

 <sect1 id="planner-optimizer">
  <title>Planner/Optimizer</title>

  <para>
   The task of the <firstterm>planner/optimizer</firstterm> is to
   create an optimal execution plan.  It first considers all possible
   ways of <firstterm>scanning</firstterm> and
   <firstterm>joining</firstterm> the relations that appear in a
   query.  All the created paths lead to the same result and it's the
   task of the optimizer to estimate the cost of executing each path
   and find out which one is the cheapest.
  </para>

  <para>
   After the cheapest path is determined, a <firstterm>plan tree</>
   is built to pass to the executor.  This represents the desired
   execution plan in sufficient detail for the executor to run it.
  </para>

  <sect2>
   <title>Generating Possible Plans</title>

   <para>
    The planner/optimizer decides which plans should be generated
    based upon the types of indexes defined on the relations appearing
    in a query.  There is always the possibility of performing a
    sequential scan on a relation, so a plan using only sequential
    scans is always created.  Assume an index is defined on a relation
    (for example a B-tree index) and a query contains the restriction
    <literal>relation.attribute OPR constant</literal>.  If
    <literal>relation.attribute</literal> happens to match the key of
    the B-tree index and <literal>OPR</literal> is one of the
    operators listed in the index's <firstterm>operator class</>,
    another plan is created using the B-tree index to scan the
    relation.  If there are further indexes present and the
    restrictions in the query happen to match a key of an index,
    further plans will be considered.
   </para>
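A sketch of this decision with made-up cost numbers (the real planner derives costs from table statistics; the formulas below are invented for illustration): a sequential-scan path is always generated, an index path only when a restriction matches an indexed column, and the cheapest surviving path wins.

```python
def generate_paths(indexed_columns, restriction_column, ntuples):
    """Toy path generation: a seqscan path is always possible; an index
    path is added only when the restriction column matches an index.
    The cost formulas are invented, not the planner's real ones."""
    paths = [("SeqScan", 1.0 * ntuples)]               # always created
    if restriction_column in indexed_columns:
        paths.append(("IndexScan", 2.0 * (ntuples ** 0.5)))
    return paths

def cheapest(paths):
    """Pick the path with the lowest estimated cost."""
    return min(paths, key=lambda p: p[1])

# With an index on 'attribute' and a large table, the index path wins.
paths = generate_paths({"attribute"}, "attribute", ntuples=10000)
print(cheapest(paths)[0])   # IndexScan (cost 200.0 beats 10000.0)
```

Without a matching index only the sequential-scan path exists, so it is chosen by default; this mirrors the guarantee above that a plan using only sequential scans can always be built.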

   <para>
    After all feasible plans have been found for scanning single
    relations, plans for joining relations are created.  The
    planner/optimizer preferentially considers joins between any two
    relations for which there exists a corresponding join clause in
    the WHERE qualification (i.e., for which a restriction like
    <literal>where rel1.attr1=rel2.attr2</literal> exists).  Join
    pairs with no join clause are considered only when there is no
    other choice, that is, a particular relation has no available join
    clauses to any other relation.  All possible plans are generated
    for every join pair considered by the planner/optimizer.  The
    three possible join strategies are:

    <itemizedlist>
     <listitem>
      <para>
       <firstterm>nested loop join</firstterm>: The right relation is
       scanned once for every tuple found in the left relation.  This
       strategy is easy to implement but can be very time consuming.
       (However, if the right relation can be scanned with an index
       scan, this can be a good strategy.  It is possible to use
       values from the current row of the left relation as keys for
       the index scan of the right.)
      </para>
     </listitem>

     <listitem>
      <para>
       <firstterm>merge sort join</firstterm>: Each relation is sorted
       on the join attributes before the join starts.  Then the two
       relations are merged together taking into account that both
       relations are ordered on the join attributes.  This kind of
       join is more attractive because each relation has to be scanned
       only once.
      </para>
     </listitem>

     <listitem>
      <para>
       <firstterm>hash join</firstterm>: The right relation is first
       scanned and loaded into a hash table, using its join attributes
       as hash keys.  Next the left relation is scanned and the
       appropriate values of every tuple found are used as hash keys
       to locate the matching tuples in the table.
      </para>
     </listitem>
    </itemizedlist>
   </para>
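As a toy illustration of two of these strategies (plain Python over lists of dictionaries with invented table data; real join nodes work tuple-at-a-time inside the executor), the nested loop and hash approaches produce the same join result by very different access patterns:

```python
def nested_loop_join(left, right, key):
    """Scan the right relation once for every tuple of the left relation."""
    return [{**l, **r} for l in left for r in right if l[key] == r[key]]

def hash_join(left, right, key):
    """Load the right relation into a hash table on its join attribute,
    then probe the table once per left tuple."""
    table = {}
    for r in right:
        table.setdefault(r[key], []).append(r)
    return [{**l, **r} for l in left for r in table.get(l[key], [])]

emp  = [{"id": 1, "name": "ann"}, {"id": 2, "name": "bob"}]
dept = [{"id": 1, "dept": "sales"}]
print(hash_join(emp, dept, "id"))
# [{'id': 1, 'name': 'ann', 'dept': 'sales'}]
```

The nested loop rescans the right input for every left tuple, while the hash join pays a one-time build cost and then probes in constant time, which is why the cost estimates for the two strategies can differ so sharply on large inputs.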

   <para>
    The finished plan tree consists of sequential or index scans of
    the base relations, plus nestloop, merge, or hash join nodes as
    needed, plus any auxiliary steps needed, such as sort nodes or
    aggregate-function calculation nodes.  Most of these plan node
    types have the additional ability to do <firstterm>selection</>
    (discarding rows that do not meet a specified boolean condition)
    and <firstterm>projection</> (computation of a derived column set
    based on given column values, that is, evaluation of scalar
    expressions where needed).  One of the responsibilities of the
    planner is to attach selection conditions from the WHERE clause
    and computation of required output expressions to the most
    appropriate nodes of the plan tree.
   </para>
  </sect2>
 </sect1>

 <sect1 id="executor">
  <title>Executor</title>

  <para>
   The <firstterm>executor</firstterm> takes the plan handed back by
   the planner/optimizer and recursively processes it to extract the
   required set of rows.  This is essentially a demand-pull pipeline
   mechanism.  Each time a plan node is called, it must deliver one
   more tuple, or report that it is done delivering tuples.
  </para>

  <para>
   To provide a concrete example, assume that the top node is a
   <literal>MergeJoin</literal> node.  Before any merge can be done
   two tuples have to be fetched (one from each subplan).  So the
   executor recursively calls itself to process the subplans (it
   starts with the subplan attached to <literal>lefttree</literal>).
   The new top node (the top node of the left subplan) is, let's say,
   a <literal>Sort</literal> node and again recursion is needed to
   obtain an input tuple.  The child node of the
   <literal>Sort</literal> might be a <literal>SeqScan</> node,
   representing actual reading of a table.  Execution of this node
   causes the executor to fetch a row from the table and return it up
   to the calling node.  The <literal>Sort</literal> node will
   repeatedly call its child to obtain all the rows to be sorted.
   When the input is exhausted (as indicated by the child node
   returning a NULL instead of a tuple), the <literal>Sort</literal>
   code performs the sort, and finally is able to return its first
   output row, namely the first one in sorted order.  It keeps the
   remaining rows stored so that it can deliver them in sorted order
   in response to later demands.
  </para>

  <para>
   The <literal>MergeJoin</literal> node similarly demands the first
   row from its right subplan.  Then it compares the two rows to see
   if they can be joined; if so, it returns a join row to its caller.
   On the next call, or immediately if it cannot join the current pair
   of inputs, it advances to the next row of one table or the other
   (depending on how the comparison came out), and again checks for a
   match.  Eventually, one subplan or the other is exhausted, and the
   <literal>MergeJoin</literal> node returns NULL to indicate that no
   more join rows can be formed.
  </para>

  <para>
   Complex queries may involve many levels of plan nodes, but the
   general approach is the same: each node computes and returns its
   next output row each time it is called.  Each node is also
   responsible for applying any selection or projection expressions
   that were assigned to it by the planner.
  </para>
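The demand-pull protocol described above maps naturally onto generators.  This toy sketch (the node names echo the plan node types mentioned above, but the code is invented, not executor code) builds a scan-sort-project pipeline in which each node hands one row upward per demand and applies its own selection or projection:

```python
def seq_scan(table, qual=lambda row: True):
    """Leaf node: read the table, applying a selection qualification."""
    for row in table:
        if qual(row):
            yield row

def sort_node(child, key):
    """Sort node: exhaust the child first, then deliver rows in order."""
    yield from sorted(child, key=key)

def project(child, columns):
    """Projection: compute the derived column set for each input row."""
    for row in child:
        yield {c: row[c] for c in columns}

table = [{"name": "bob", "age": 40}, {"name": "ann", "age": 30},
         {"name": "cal", "age": 20}]
plan = project(sort_node(seq_scan(table, qual=lambda r: r["age"] >= 30),
                         key=lambda r: r["name"]),
               columns=["name"])
print(list(plan))   # [{'name': 'ann'}, {'name': 'bob'}]
```

As in the real executor, the sort node cannot emit anything until its child is exhausted, while the scan and projection nodes stream one row per demand; a generator raising <literal>StopIteration</literal> plays the role of the NULL that signals an exhausted subplan.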

  <para>
   The executor mechanism is used to evaluate all four basic SQL query
   types: <command>SELECT</>, <command>INSERT</>,
   <command>UPDATE</>, and <command>DELETE</>.  For
   <command>SELECT</>, the top-level executor code only needs to send
   each row returned by the query plan tree off to the client.  For
   <command>INSERT</>, each returned row is inserted into the target
   table specified for the <command>INSERT</>.  (A simple
   <command>INSERT ... VALUES</> command creates a trivial plan tree
   consisting of a single <literal>Result</> node, which computes just
   one result row.  But <command>INSERT ... SELECT</> may demand the
   full power of the executor mechanism.)  For <command>UPDATE</>,
   the planner arranges that each computed row includes all the
   updated column values, plus the <firstterm>TID</> (tuple ID, or
   location) of the original target row; the executor top level uses
   this information to create a new updated row and mark the old row
   deleted.  For <command>DELETE</>, the only column that is actually
   returned by the plan is the TID, and the executor top level simply
   uses the TID to visit the target rows and mark them deleted.
  </para>
 </sect1>

</chapter>

<!-- Keep this comment at the end of the file
Local variables:
mode:sgml
sgml-omittag:nil
sgml-shorttag:t
sgml-minimize-attributes:nil
sgml-always-quote-attributes:t
sgml-indent-step:1
sgml-indent-data:t
sgml-parent-document:nil
sgml-default-dtd-file:"./reference.ced"
sgml-exposed-tags:nil
sgml-local-catalogs:("/usr/lib/sgml/catalog")
sgml-local-ecat-files:nil
End:
-->