mirror of
https://github.com/postgres/postgres.git
synced 2025-07-30 11:03:19 +03:00
Add parallel pg_dump option.
New infrastructure is added which creates a set number of workers (threads on Windows, forked processes on Unix). Jobs are then handed out to these workers by the master process as needed. pg_restore is adjusted to use this new infrastructure in place of the old setup which created a new worker for each step on the fly. Parallel dumps acquire a snapshot clone in order to stay consistent, if available. The parallel option is selected by the -j / --jobs command line parameter of pg_dump. Joachim Wieland, lightly editorialized by Andrew Dunstan.
This commit is contained in:
@ -310,6 +310,24 @@ pg_restore -d <replaceable class="parameter">dbname</replaceable> <replaceable c
|
||||
with one of the other two approaches.
|
||||
</para>
|
||||
|
||||
<formalpara>
|
||||
<title>Use <application>pg_dump</>'s parallel dump feature.</title>
|
||||
<para>
|
||||
To speed up the dump of a large database, you can use
|
||||
<application>pg_dump</application>'s parallel mode. This will dump
|
||||
multiple tables at the same time. You can control the degree of
|
||||
parallelism with the <command>-j</command> parameter. Parallel dumps
|
||||
are only supported for the "directory" archive format.
|
||||
|
||||
<programlisting>
|
||||
pg_dump -j <replaceable class="parameter">num</replaceable> -F d -f <replaceable class="parameter">out.dir</replaceable> <replaceable class="parameter">dbname</replaceable>
|
||||
</programlisting>
|
||||
|
||||
You can use <command>pg_restore -j</command> to restore a dump in parallel.
|
||||
This will work for any archive of either the "custom" or the "directory"
|
||||
archive mode, whether or not it has been created with <command>pg_dump -j</command>.
|
||||
</para>
|
||||
</formalpara>
|
||||
</sect2>
|
||||
</sect1>
|
||||
|
||||
|
@ -1433,6 +1433,15 @@ SELECT * FROM x, y, a, b, c WHERE something AND somethingelse;
|
||||
base backup.
|
||||
</para>
|
||||
</listitem>
|
||||
<listitem>
|
||||
<para>
|
||||
Experiment with the parallel dump and restore modes of both
|
||||
<application>pg_dump</> and <application>pg_restore</> and find the
|
||||
optimal number of concurrent jobs to use. Dumping and restoring in
|
||||
parallel by means of the <option>-j</> option should give you a
|
||||
significantly higher performance over the serial mode.
|
||||
</para>
|
||||
</listitem>
|
||||
<listitem>
|
||||
<para>
|
||||
Consider whether the whole dump should be restored as a single
|
||||
|
@ -73,10 +73,12 @@ PostgreSQL documentation
|
||||
transfer mechanism. <application>pg_dump</application> can be used to
|
||||
backup an entire database, then <application>pg_restore</application>
|
||||
can be used to examine the archive and/or select which parts of the
|
||||
database are to be restored. The most flexible output file format is
|
||||
the <quote>custom</quote> format (<option>-Fc</option>). It allows
|
||||
for selection and reordering of all archived items, and is compressed
|
||||
by default.
|
||||
database are to be restored. The most flexible output file formats are
|
||||
the <quote>custom</quote> format (<option>-Fc</option>) and the
|
||||
<quote>directory</quote> format(<option>-Fd</option>). They allow
|
||||
for selection and reordering of all archived items, support parallel
|
||||
restoration, and are compressed by default. The <quote>directory</quote>
|
||||
format is the only format that supports parallel dumps.
|
||||
</para>
|
||||
|
||||
<para>
|
||||
@ -251,7 +253,8 @@ PostgreSQL documentation
|
||||
can read. A directory format archive can be manipulated with
|
||||
standard Unix tools; for example, files in an uncompressed archive
|
||||
can be compressed with the <application>gzip</application> tool.
|
||||
This format is compressed by default.
|
||||
This format is compressed by default and also supports parallel
|
||||
dumps.
|
||||
</para>
|
||||
</listitem>
|
||||
</varlistentry>
|
||||
@ -285,6 +288,62 @@ PostgreSQL documentation
|
||||
</listitem>
|
||||
</varlistentry>
|
||||
|
||||
<varlistentry>
|
||||
<term><option>-j <replaceable class="parameter">njobs</replaceable></></term>
|
||||
<term><option>--jobs=<replaceable class="parameter">njobs</replaceable></></term>
|
||||
<listitem>
|
||||
<para>
|
||||
Run the dump in parallel by dumping <replaceable class="parameter">njobs</replaceable>
|
||||
tables simultaneously. This option reduces the time of the dump but it also
|
||||
increases the load on the database server. You can only use this option with the
|
||||
directory output format because this is the only output format where multiple processes
|
||||
can write their data at the same time.
|
||||
</para>
|
||||
<para>
|
||||
<application>pg_dump</> will open <replaceable class="parameter">njobs</replaceable>
|
||||
+ 1 connections to the database, so make sure your <xref linkend="guc-max-connections">
|
||||
setting is high enough to accommodate all connections.
|
||||
</para>
|
||||
<para>
|
||||
Requesting exclusive locks on database objects while running a parallel dump could
|
||||
cause the dump to fail. The reason is that the <application>pg_dump</> master process
|
||||
requests shared locks on the objects that the worker processes are going to dump later
|
||||
in order to
|
||||
make sure that nobody deletes them and makes them go away while the dump is running.
|
||||
If another client then requests an exclusive lock on a table, that lock will not be
|
||||
granted but will be queued waiting for the shared lock of the master process to be
|
||||
released.. Consequently any other access to the table will not be granted either and
|
||||
will queue after the exclusive lock request. This includes the worker process trying
|
||||
to dump the table. Without any precautions this would be a classic deadlock situation.
|
||||
To detect this conflict, the <application>pg_dump</> worker process requests another
|
||||
shared lock using the <literal>NOWAIT</> option. If the worker process is not granted
|
||||
this shared lock, somebody else must have requested an exclusive lock in the meantime
|
||||
and there is no way to continue with the dump, so <application>pg_dump</> has no choice
|
||||
but to abort the dump.
|
||||
</para>
|
||||
<para>
|
||||
For a consistent backup, the database server needs to support synchronized snapshots,
|
||||
a feature that was introduced in <productname>PostgreSQL</productname> 9.2. With this
|
||||
feature, database clients can ensure they see the same dataset even though they use
|
||||
different connections. <command>pg_dump -j</command> uses multiple database
|
||||
connections; it connects to the database once with the master process and
|
||||
once again for each worker job. Without the sychronized snapshot feature, the
|
||||
different worker jobs wouldn't be guaranteed to see the same data in each connection,
|
||||
which could lead to an inconsistent backup.
|
||||
</para>
|
||||
<para>
|
||||
If you want to run a parallel dump of a pre-9.2 server, you need to make sure that the
|
||||
database content doesn't change from between the time the master connects to the
|
||||
database until the last worker job has connected to the database. The easiest way to
|
||||
do this is to halt any data modifying processes (DDL and DML) accessing the database
|
||||
before starting the backup. You also need to specify the
|
||||
<option>--no-synchronized-snapshots</option> parameter when running
|
||||
<command>pg_dump -j</command> against a pre-9.2 <productname>PostgreSQL</productname>
|
||||
server.
|
||||
</para>
|
||||
</listitem>
|
||||
</varlistentry>
|
||||
|
||||
<varlistentry>
|
||||
<term><option>-n <replaceable class="parameter">schema</replaceable></option></term>
|
||||
<term><option>--schema=<replaceable class="parameter">schema</replaceable></option></term>
|
||||
@ -690,6 +749,17 @@ PostgreSQL documentation
|
||||
</listitem>
|
||||
</varlistentry>
|
||||
|
||||
<varlistentry>
|
||||
<term><option>--no-synchronized-snapshots</></term>
|
||||
<listitem>
|
||||
<para>
|
||||
This option allows running <command>pg_dump -j</> against a pre-9.2
|
||||
server, see the documentation of the <option>-j</option> parameter
|
||||
for more details.
|
||||
</para>
|
||||
</listitem>
|
||||
</varlistentry>
|
||||
|
||||
<varlistentry>
|
||||
<term><option>--no-tablespaces</option></term>
|
||||
<listitem>
|
||||
@ -1082,6 +1152,15 @@ CREATE DATABASE foo WITH TEMPLATE template0;
|
||||
</screen>
|
||||
</para>
|
||||
|
||||
<para>
|
||||
To dump a database into a directory-format archive in parallel with
|
||||
5 worker jobs:
|
||||
|
||||
<screen>
|
||||
<prompt>$</prompt> <userinput>pg_dump -Fd mydb -j 5 -f dumpdir</userinput>
|
||||
</screen>
|
||||
</para>
|
||||
|
||||
<para>
|
||||
To reload an archive file into a (freshly created) database named
|
||||
<literal>newdb</>:
|
||||
|
Reference in New Issue
Block a user