1
0
mirror of https://github.com/postgres/postgres.git synced 2025-07-27 12:41:57 +03:00

Support an optional asynchronous commit mode, in which we don't flush WAL

before reporting a transaction committed.  Data consistency is still
guaranteed (unlike setting fsync = off), but a crash may lose the effects
of the last few transactions.  Patch by Simon, some editorialization by Tom.
This commit is contained in:
Tom Lane
2007-08-01 22:45:09 +00:00
parent c722628a43
commit 4a78cdeb6b
25 changed files with 998 additions and 303 deletions

View File

@ -1,4 +1,4 @@
<!-- $PostgreSQL: pgsql/doc/src/sgml/config.sgml,v 1.133 2007/07/24 04:54:08 tgl Exp $ -->
<!-- $PostgreSQL: pgsql/doc/src/sgml/config.sgml,v 1.134 2007/08/01 22:45:07 tgl Exp $ -->
<chapter Id="runtime-config">
<title>Server Configuration</title>
@ -1307,6 +1307,13 @@ SET ENABLE_SEQSCAN TO OFF;
disabling <varname>fsync</varname>.
</para>
<para>
In many situations, turning off <xref linkend="guc-synchronous-commit">
for noncritical transactions can provide much of the potential
performance benefit of turning off <varname>fsync</varname>, without
the attendant risks of data corruption.
</para>
<para>
This parameter can only be set in the <filename>postgresql.conf</>
file or on the server command line.
@ -1315,6 +1322,38 @@ SET ENABLE_SEQSCAN TO OFF;
</para>
</listitem>
</varlistentry>
<varlistentry id="guc-synchronous-commit" xreflabel="synchronous_commit">
<term><varname>synchronous_commit</varname> (<type>boolean</type>)</term>
<indexterm>
<primary><varname>synchronous_commit</> configuration parameter</primary>
</indexterm>
<listitem>
<para>
Specifies whether transaction commit will wait for WAL records
to be written to disk before the command returns a <quote>success</>
indication to the client. The default, and safe, setting is
<literal>on</>. When <literal>off</>, there can be a delay between
when success is reported to the client and when the transaction is
really guaranteed to be safe against a server crash. (The maximum
delay is three times <xref linkend="guc-wal-writer-delay">.) Unlike
<xref linkend="guc-fsync">, setting this parameter to <literal>off</>
does not create any risk of database inconsistency: a crash might
result in some recent allegedly-committed transactions being lost, but
the database state will be just the same as if those transactions had
been aborted cleanly. So, turning <varname>synchronous_commit</> off
can be a useful alternative when performance is more important than
exact certainty about the durability of a transaction. For more
discussion see <xref linkend="wal-async-commit">.
</para>
<para>
This parameter can be changed at any time; the behavior for any
one transaction is determined by the setting in effect when it
commits. It is therefore possible, and useful, to have some
transactions commit synchronously and others asynchronously.
</para>
</listitem>
</varlistentry>
<varlistentry id="guc-wal-sync-method" xreflabel="wal_sync_method">
<term><varname>wal_sync_method</varname> (<type>string</type>)</term>

View File

@ -1,10 +1,10 @@
<!-- $PostgreSQL: pgsql/doc/src/sgml/wal.sgml,v 1.44 2007/06/28 00:02:37 tgl Exp $ -->
<!-- $PostgreSQL: pgsql/doc/src/sgml/wal.sgml,v 1.45 2007/08/01 22:45:07 tgl Exp $ -->
<chapter id="wal">
<title>Reliability and the Write-Ahead Log</title>
<para>
This chapter explain how the Write-Ahead Log is used to obtain
This chapter explains how the Write-Ahead Log is used to obtain
efficient, reliable operation.
</para>
@ -71,7 +71,7 @@
write caches. At the drive level, disable write-back caching if the
drive cannot guarantee the data will be written before shutdown.
</para>
<para>
Another risk of data loss is posed by the disk platter write
operations themselves. Disk platters are divided into sectors,
@ -86,11 +86,11 @@
disk. By doing this, during crash recovery <productname>PostgreSQL</> can
restore partially-written pages. If you have a battery-backed disk
controller or file-system software that prevents partial page writes
(e.g., ReiserFS 4), you can turn off this page imaging by using the
(e.g., ReiserFS 4), you can turn off this page imaging by using the
<xref linkend="guc-full-page-writes"> parameter.
</para>
</sect1>
<sect1 id="wal-intro">
<title>Write-Ahead Logging (<acronym>WAL</acronym>)</title>
@ -105,12 +105,12 @@
<para>
<firstterm>Write-Ahead Logging</firstterm> (<acronym>WAL</acronym>)
is a standard approach to transaction logging. Its detailed
is a standard method for ensuring data integrity. A detailed
description can be found in most (if not all) books about
transaction processing. Briefly, <acronym>WAL</acronym>'s central
concept is that changes to data files (where tables and indexes
reside) must be written only after those changes have been logged,
that is, when log records describing the changes have been flushed
that is, after log records describing the changes have been flushed
to permanent storage. If we follow this procedure, we do not need
to flush data pages to disk on every transaction commit, because we
know that in the event of a crash we will be able to recover the
@ -120,17 +120,17 @@
</para>
<para>
A major benefit of using <acronym>WAL</acronym> is a
Using <acronym>WAL</acronym> results in a
significantly reduced number of disk writes, because only the log
file needs to be flushed to disk at the time of transaction
commit, rather than every data file changed by the transaction.
In multiuser environments, commits of many transactions
can be accomplished with a single <function>fsync</function> of
the log file. Furthermore, the log file is written sequentially,
file needs to be flushed to disk to guarantee that a transaction is
committed, rather than every data file changed by the transaction.
The log file is written sequentially,
and so the cost of syncing the log is much less than the cost of
flushing the data pages. This is especially true for servers
flushing the data pages. This is especially true for servers
handling many small transactions touching different parts of the data
store.
store. Furthermore, when the server is processing many small concurrent
transactions, one <function>fsync</function> of the log file may
suffice to commit many transactions.
</para>
<para>
@ -147,6 +147,139 @@
</para>
</sect1>
<sect1 id="wal-async-commit">
<title>Asynchronous Commit</title>
<indexterm>
<primary>synchronous commit</primary>
</indexterm>
<indexterm>
<primary>asynchronous commit</primary>
</indexterm>
<para>
<firstterm>Asynchronous commit</> is an option that allows transactions
to complete more quickly, at the cost that the most recent transactions may
be lost if the database should crash. In many applications this is an
acceptable tradeoff.
</para>
<para>
As described in the previous section, transaction commit is normally
<firstterm>synchronous</>: the server waits for the transaction's
<acronym>WAL</acronym> records to be flushed to permanent storage
before returning a success indication to the client. The client is
therefore guaranteed that a transaction reported to be committed will
be preserved, even in the event of a server crash immediately after.
However, for short transactions this delay is a major component of the
total transaction time. Selecting asynchronous commit mode means that
the server returns success as soon as the transaction is logically
completed, before the <acronym>WAL</acronym> records it generated have
actually made their way to disk. This can provide a significant boost
in throughput for small transactions.
</para>
<para>
Asynchronous commit introduces the risk of data loss. There is a short
time window between the report of transaction completion to the client
and the time that the transaction is truly committed (that is, it is
guaranteed not to be lost if the server crashes). Thus asynchronous
commit should not be used if the client will take external actions
relying on the assumption that the transaction will be remembered.
As an example, a bank would certainly not use asynchronous commit for
a transaction recording an ATM's dispensing of cash. But in many
scenarios, such as event logging, there is no need for a strong
guarantee of this kind.
</para>
<para>
The risk that is taken by using asynchronous commit is of data loss,
not data corruption. If the database should crash, it will recover
by replaying <acronym>WAL</acronym> up to the last record that was
flushed. The database will therefore be restored to a self-consistent
state, but any transactions that were not yet flushed to disk will
not be reflected in that state. The net effect is therefore loss of
the last few transactions. Because the transactions are replayed in
commit order, no inconsistency can be introduced &mdash; for example,
if transaction B made changes relying on the effects of a previous
transaction A, it is not possible for A's effects to be lost while B's
effects are preserved.
</para>
<para>
The user can select the commit mode of each transaction, so that
it is possible to have both synchronous and asynchronous commit
transactions running concurrently. This allows flexible tradeoffs
between performance and certainty of transaction durability.
The commit mode is controlled by the user-settable parameter
<xref linkend="guc-synchronous-commit">, which can be changed in any of
the ways that a configuration parameter can be set. The mode used for
any one transaction depends on the value of
<varname>synchronous_commit</varname> when transaction commit begins.
</para>
<para>
Certain utility commands, for instance <command>DROP TABLE</>, are
forced to commit synchronously regardless of the setting of
<varname>synchronous_commit</varname>. This is to ensure consistency
between the server's filesystem and the logical state of the database.
The commands supporting two-phase commit, such as <command>PREPARE
TRANSACTION</>, are also always synchronous.
</para>
<para>
If the database crashes during the risk window between an
asynchronous commit and the writing of the transaction's
<acronym>WAL</acronym> records,
then changes made during that transaction <emphasis>will</> be lost.
The duration of the
risk window is limited because a background process (the <quote>wal
writer</>) flushes unwritten <acronym>WAL</acronym> records to disk
every <xref linkend="guc-wal-writer-delay"> milliseconds.
The actual maximum duration of the risk window is three times
<varname>wal_writer_delay</varname> because the wal writer is
designed to favor writing whole pages at a time during busy periods.
</para>
<caution>
<para>
An immediate-mode shutdown is equivalent to a server crash, and will
therefore cause loss of any unflushed asynchronous commits.
</para>
</caution>
<para>
Asynchronous commit provides behavior different from setting
<xref linkend="guc-fsync"> = off.
<varname>fsync</varname> is a server-wide
setting that will alter the behavior of all transactions. It disables
all logic within <productname>PostgreSQL</> that attempts to synchronize
writes to different portions of the database, and therefore a system
crash (that is, a hardware or operating system crash, not a failure of
<productname>PostgreSQL</> itself) could result in arbitrarily bad
corruption of the database state. In many scenarios, asynchronous
commit provides most of the performance improvement that could be
obtained by turning off <varname>fsync</varname>, but without the risk
of data corruption.
</para>
<para>
<xref linkend="guc-commit-delay"> also sounds very similar to
asynchronous commit, but it is actually a synchronous commit method
(in fact, <varname>commit_delay</varname> is ignored during an
asynchronous commit). <varname>commit_delay</varname> causes a delay
just before a synchronous commit attempts to flush
<acronym>WAL</acronym> to disk, in the hope that a single flush
executed by one such transaction can also serve other transactions
committing at about the same time. Setting <varname>commit_delay</varname>
can only help when there are many concurrently committing transactions,
and it is difficult to tune it to a value that actually helps rather
than hurting throughput.
</para>
</sect1>
<sect1 id="wal-configuration">
<title><acronym>WAL</acronym> Configuration</title>
@ -188,13 +321,13 @@
<varname>checkpoint_timeout</varname> causes checkpoints to be done
more often. This allows faster after-crash recovery (since less work
will need to be redone). However, one must balance this against the
increased cost of flushing dirty data pages more often. If
<xref linkend="guc-full-page-writes"> is set (as is the default), there is
another factor to consider. To ensure data page consistency,
the first modification of a data page after each checkpoint results in
increased cost of flushing dirty data pages more often. If
<xref linkend="guc-full-page-writes"> is set (as is the default), there is
another factor to consider. To ensure data page consistency,
the first modification of a data page after each checkpoint results in
logging the entire page content. In that case,
a smaller checkpoint interval increases the volume of output to the WAL log,
partially negating the goal of using a smaller interval,
partially negating the goal of using a smaller interval,
and in any case causing more disk I/O.
</para>
@ -206,8 +339,8 @@
don't happen too often. As a simple sanity check on your checkpointing
parameters, you can set the <xref linkend="guc-checkpoint-warning">
parameter. If checkpoints happen closer together than
<varname>checkpoint_warning</> seconds,
a message will be output to the server log recommending increasing
<varname>checkpoint_warning</> seconds,
a message will be output to the server log recommending increasing
<varname>checkpoint_segments</varname>. Occasional appearance of such
a message is not cause for alarm, but if it appears often then the
checkpoint control parameters should be increased. Bulk operations such
@ -280,9 +413,9 @@
modifying the configuration parameter <xref
linkend="guc-wal-buffers">. The default number of <acronym>WAL</acronym>
buffers is 8. Increasing this value will
correspondingly increase shared memory usage. When
<xref linkend="guc-full-page-writes"> is set and the system is very busy,
setting this value higher will help smooth response times during the
correspondingly increase shared memory usage. When
<xref linkend="guc-full-page-writes"> is set and the system is very busy,
setting this value higher will help smooth response times during the
period immediately following each checkpoint.
</para>
@ -307,7 +440,7 @@
<para>
The <xref linkend="guc-wal-sync-method"> parameter determines how
<productname>PostgreSQL</productname> will ask the kernel to force
<acronym>WAL</acronym> updates out to disk.
<acronym>WAL</acronym> updates out to disk.
All the options should be the same as far as reliability goes,
but it's quite platform-specific which one will be the fastest.
Note that this parameter is irrelevant if <varname>fsync</varname>
@ -360,7 +493,7 @@
The aim of <acronym>WAL</acronym>, to ensure that the log is
written before database records are altered, can be subverted by
disk drives<indexterm><primary>disk drive</></> that falsely report a
successful write to the kernel,
successful write to the kernel,
when in fact they have only cached the data and not yet stored it
on the disk. A power failure in such a situation might still lead to
irrecoverable data corruption. Administrators should try to ensure