
Add parallel pg_dump option.

New infrastructure is added which creates a set number of workers
(threads on Windows, forked processes on Unix). Jobs are then
handed out to these workers by the master process as needed.
pg_restore is adjusted to use this new infrastructure in place of the
old setup which created a new worker for each step on the fly. Parallel
dumps acquire a snapshot clone in order to stay consistent, if
available.

The parallel option is selected by the -j / --jobs command line
parameter of pg_dump.

Joachim Wieland, lightly editorialized by Andrew Dunstan.
Andrew Dunstan
2013-03-24 11:27:20 -04:00
parent 3b91fe185a
commit 9e257a181c
22 changed files with 2776 additions and 830 deletions


@@ -310,6 +310,24 @@ pg_restore -d <replaceable class="parameter">dbname</replaceable> <replaceable c
with one of the other two approaches.
</para>
<formalpara>
<title>Use <application>pg_dump</>'s parallel dump feature.</title>
<para>
To speed up the dump of a large database, you can use
<application>pg_dump</application>'s parallel mode. This will dump
multiple tables at the same time. You can control the degree of
parallelism with the <command>-j</command> parameter. Parallel dumps
are only supported for the "directory" archive format.
<programlisting>
pg_dump -j <replaceable class="parameter">num</replaceable> -F d -f <replaceable class="parameter">out.dir</replaceable> <replaceable class="parameter">dbname</replaceable>
</programlisting>
You can use <command>pg_restore -j</command> to restore a dump in parallel.
This will work for any archive in either the "custom" or the "directory"
archive format, whether or not it was created with <command>pg_dump -j</command>.
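One possible invocation, assuming the <replaceable class="parameter">out.dir</replaceable>
archive created above and an existing target database (a sketch; adjust the names
to your setup):
<programlisting>
pg_restore -j <replaceable class="parameter">num</replaceable> -d <replaceable class="parameter">dbname</replaceable> <replaceable class="parameter">out.dir</replaceable>
</programlisting>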
</para>
</formalpara>
</sect2>
</sect1>


@@ -1433,6 +1433,15 @@ SELECT * FROM x, y, a, b, c WHERE something AND somethingelse;
base backup.
</para>
</listitem>
<listitem>
<para>
Experiment with the parallel dump and restore modes of both
<application>pg_dump</> and <application>pg_restore</> and find the
optimal number of concurrent jobs to use. Dumping and restoring in
parallel by means of the <option>-j</> option should give you
significantly higher performance than the serial mode.
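For example, you might time trial runs with different job counts against a
copy of the database (the numbers and names here are placeholders, not
recommendations):
<programlisting>
time pg_dump -F d -j 2 -f dump2.dir mydb
time pg_dump -F d -j 4 -f dump4.dir mydb
time pg_dump -F d -j 8 -f dump8.dir mydb
</programlisting>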
</para>
</listitem>
<listitem>
<para>
Consider whether the whole dump should be restored as a single


@@ -73,10 +73,12 @@ PostgreSQL documentation
transfer mechanism. <application>pg_dump</application> can be used to
backup an entire database, then <application>pg_restore</application>
can be used to examine the archive and/or select which parts of the
database are to be restored. The most flexible output file format is
the <quote>custom</quote> format (<option>-Fc</option>). It allows
for selection and reordering of all archived items, and is compressed
by default.
database are to be restored. The most flexible output file formats are
the <quote>custom</quote> format (<option>-Fc</option>) and the
<quote>directory</quote> format (<option>-Fd</option>). They allow
for selection and reordering of all archived items, support parallel
restoration, and are compressed by default. The <quote>directory</quote>
format is the only format that supports parallel dumps.
</para>
<para>
@@ -251,7 +253,8 @@ PostgreSQL documentation
can read. A directory format archive can be manipulated with
standard Unix tools; for example, files in an uncompressed archive
can be compressed with the <application>gzip</application> tool.
This format is compressed by default.
This format is compressed by default and also supports parallel
dumps.
</para>
</listitem>
</varlistentry>
@@ -285,6 +288,62 @@ PostgreSQL documentation
</listitem>
</varlistentry>
<varlistentry>
<term><option>-j <replaceable class="parameter">njobs</replaceable></></term>
<term><option>--jobs=<replaceable class="parameter">njobs</replaceable></></term>
<listitem>
<para>
Run the dump in parallel by dumping <replaceable class="parameter">njobs</replaceable>
tables simultaneously. This option reduces the time of the dump but it also
increases the load on the database server. You can only use this option with the
directory output format because this is the only output format where multiple processes
can write their data at the same time.
</para>
<para>
<application>pg_dump</> will open <replaceable class="parameter">njobs</replaceable>
+ 1 connections to the database, so make sure your <xref linkend="guc-max-connections">
setting is high enough to accommodate all connections.
</para>
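<para>
For example, before starting a dump with 8 jobs (an arbitrary number used here
for illustration), you can check how many connection slots are allowed and how
many are already in use:
<programlisting>
SHOW max_connections;
SELECT count(*) FROM pg_stat_activity;
</programlisting>
The dump itself will occupy 9 of the free slots (8 workers plus the master).
</para>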
<para>
Requesting exclusive locks on database objects while running a parallel dump could
cause the dump to fail. The reason is that the <application>pg_dump</> master process
requests shared locks on the objects that the worker processes are going to dump later,
in order to make sure that nobody deletes them while the dump is running.
If another client then requests an exclusive lock on a table, that lock will not be
granted but will be queued waiting for the shared lock of the master process to be
released. Consequently any other access to the table will not be granted either and
will queue after the exclusive lock request. This includes the worker process trying
to dump the table. Without any precautions this would be a classic deadlock situation.
To detect this conflict, the <application>pg_dump</> worker process requests another
shared lock using the <literal>NOWAIT</> option. If the worker process is not granted
this shared lock, somebody else must have requested an exclusive lock in the meantime
and there is no way to continue with the dump, so <application>pg_dump</> has no choice
but to abort the dump.
</para>
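<para>
The lock the worker takes is conceptually equivalent to the following statement,
shown here only as a sketch of the mechanism (<literal>sometable</> is a
placeholder):
<programlisting>
LOCK TABLE sometable IN ACCESS SHARE MODE NOWAIT;
</programlisting>
If the lock cannot be granted immediately, the statement raises an error instead
of waiting, which is how the worker detects the conflict.
</para>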
<para>
For a consistent backup, the database server needs to support synchronized snapshots,
a feature that was introduced in <productname>PostgreSQL</productname> 9.2. With this
feature, database clients can ensure they see the same dataset even though they use
different connections. <command>pg_dump -j</command> uses multiple database
connections; it connects to the database once with the master process and
once again for each worker job. Without the synchronized snapshot feature, the
different worker jobs wouldn't be guaranteed to see the same data in each connection,
which could lead to an inconsistent backup.
</para>
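<para>
The underlying mechanism can be sketched in plain SQL: the master exports its
snapshot and each worker adopts it (the snapshot identifier below is made up):
<programlisting>
-- master connection
BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
SELECT pg_export_snapshot();   -- returns an identifier such as '000003A1-1'

-- each worker connection, before issuing any query
BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
SET TRANSACTION SNAPSHOT '000003A1-1';
</programlisting>
</para>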
<para>
If you want to run a parallel dump of a pre-9.2 server, you need to make sure that the
database content doesn't change between the time the master connects to the
database and the time the last worker job has connected. The easiest way to
do this is to halt any data-modifying processes (DDL and DML) accessing the database
before starting the backup. You also need to specify the
<option>--no-synchronized-snapshots</option> parameter when running
<command>pg_dump -j</command> against a pre-9.2 <productname>PostgreSQL</productname>
server.
</para>
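<para>
For example (placeholder names, and only after all write activity has been
halted as described above):
<programlisting>
pg_dump --no-synchronized-snapshots -F d -j 4 -f out.dir mydb
</programlisting>
</para>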
</listitem>
</varlistentry>
<varlistentry>
<term><option>-n <replaceable class="parameter">schema</replaceable></option></term>
<term><option>--schema=<replaceable class="parameter">schema</replaceable></option></term>
@@ -690,6 +749,17 @@ PostgreSQL documentation
</listitem>
</varlistentry>
<varlistentry>
<term><option>--no-synchronized-snapshots</></term>
<listitem>
<para>
This option allows running <command>pg_dump -j</> against a pre-9.2
server; see the documentation of the <option>-j</option> parameter
for more details.
</para>
</listitem>
</varlistentry>
<varlistentry>
<term><option>--no-tablespaces</option></term>
<listitem>
@@ -1082,6 +1152,15 @@ CREATE DATABASE foo WITH TEMPLATE template0;
</screen>
</para>
<para>
To dump a database into a directory-format archive in parallel with
5 worker jobs:
<screen>
<prompt>$</prompt> <userinput>pg_dump -Fd mydb -j 5 -f dumpdir</userinput>
</screen>
</para>
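<para>
A matching parallel restore of that archive into an existing database could
look like this (an illustrative sketch; names are placeholders):
<screen>
<prompt>$</prompt> <userinput>pg_restore -j 5 -d mydb dumpdir</userinput>
</screen>
</para>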
<para>
To reload an archive file into a (freshly created) database named
<literal>newdb</>: