Efficient transaction-controlled synchronous replication.

If a standby is broadcasting reply messages and we have named one or more standbys in synchronous_standby_names then allow users who set synchronous_replication to wait for commit, which then provides strict data integrity guarantees. Design avoids sending and receiving transaction state information so minimises bookkeeping overheads. We synchronize with the highest priority standby that is connected and ready to synchronize. Other standbys can be defined to take over in case of standby failure.

This version has very strict behaviour; more relaxed options may be added at a later date.

Simon Riggs and Fujii Masao, with reviews by Yeb Havinga, Jaime Casanova, Heikki Linnakangas and Robert Haas, plus the assistance of many other design reviewers.
2025-07-28 23:42:10 +03:00 · 2011-03-06 22:49:16 +00:00
parent 149b2673c2
commit a8a8a3e096
21 changed files with 507 additions and 22 deletions
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2018,6 +2018,92 @@ SET ENABLE_SEQSCAN TO OFF;
     </variablelist>
    </sect2>

+    <sect2 id="runtime-config-sync-rep">
+     <title>Synchronous Replication</title>
+
+     <para>
+      These settings control the behavior of the built-in
+      <firstterm>synchronous replication</> feature.
+      These parameters would be set on the primary server that is
+      to send replication data to one or more standby servers.
+     </para>
+
+     <variablelist>
+     <varlistentry id="guc-synchronous-replication" xreflabel="synchronous_replication">
+      <term><varname>synchronous_replication</varname> (<type>boolean</type>)</term>
+      <indexterm>
+       <primary><varname>synchronous_replication</> configuration parameter</primary>
+      </indexterm>
+      <listitem>
+       <para>
+        Specifies whether transaction commit will wait for WAL records
+        to be replicated before the command returns a <quote>success</>
+        indication to the client.  The default setting is <literal>off</>.
+        When <literal>on</>, there will be a delay while the client waits
+        for confirmation of successful replication. That delay will
+        increase depending upon the physical distance and network activity
+        between primary and standby. The commit wait will last until a
+        reply from the current synchronous standby indicates it has received
+        the commit record of the transaction. Synchronous standbys must
+        already have been defined (see <xref linkend="guc-sync-standby-names">).
+       </para>
+       <para>
+        This parameter can be changed at any time; the
+        behavior for any one transaction is determined by the setting in
+        effect when it commits.  It is therefore possible, and useful, to have
+        some transactions replicate synchronously and others asynchronously.
+        For example, to make a single multistatement transaction commit
+        asynchronously when the default is synchronous replication, issue
+        <command>SET LOCAL synchronous_replication TO OFF</> within the
+        transaction.
+       </para>
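+       <para>
+        As a minimal sketch of that case (the table
+        <literal>chat_messages</> and its column are hypothetical):
+<programlisting>
+BEGIN;
+SET LOCAL synchronous_replication TO OFF;
+INSERT INTO chat_messages (body) VALUES ('hello');
+COMMIT;  -- returns without waiting for the standby
+</programlisting>
+        Because <command>SET LOCAL</> lasts only until the end of the
+        transaction, later transactions in the same session revert to the
+        session default.
+       </para>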
+      </listitem>
+     </varlistentry>
+
+     <varlistentry id="guc-sync-standby-names" xreflabel="synchronous_standby_names">
+      <term><varname>synchronous_standby_names</varname> (<type>string</type>)</term>
+      <indexterm>
+       <primary><varname>synchronous_standby_names</> configuration parameter</primary>
+      </indexterm>
+      <listitem>
+       <para>
+        Specifies a priority-ordered list of standby names that can offer
+        synchronous replication.  At any one time there will be just one
+        synchronous standby that will wake sleeping users following commit.
+        The synchronous standby will be the first named standby that is
+        both currently connected and streaming data in real-time
+        (as shown by a state of <literal>STREAMING</>).  Other standby servers
+        listed later will become potential synchronous standbys.
+        If the current synchronous standby disconnects for whatever reason
+        it will be replaced immediately with the next highest priority standby.
+        Specifying more than one standby name can allow very high availability.
+       </para>
+       <para>
+        The standby name is currently taken as the application_name of the
+        standby, as set in the primary_conninfo on the standby. Names are
+        not enforced for uniqueness. In case of duplicates one of the standbys
+        will be chosen to be the synchronous standby, though exactly which
+        one is indeterminate.
+       </para>
+       <para>
+        No value is set by default.
+        The special entry <literal>*</> matches any application_name, including
+        the default application name of <literal>walreceiver</>.
+       </para>
+       <para>
+        If a standby is removed from the list of servers then it will stop
+        being the synchronous standby, allowing another to take its place.
+        If the list is empty, synchronous replication will not be
+        possible, whatever the setting of <varname>synchronous_replication</>;
+        however, commits that are already waiting will continue to wait.
+        Standbys may also be added to the list without restarting the server.
+       </para>
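+       <para>
+        For illustration only, assuming two standbys whose hypothetical
+        names are <literal>standby1</> and <literal>standby2</>, the primary
+        might list them as a comma-separated string in priority order:
+<programlisting>
+synchronous_standby_names = 'standby1, standby2'
+</programlisting>
+        with the first standby identifying itself through
+        <literal>application_name</> in its <varname>primary_conninfo</>
+        (the host and user values here are placeholders):
+<programlisting>
+primary_conninfo = 'host=192.168.1.50 port=5432 user=foo application_name=standby1'
+</programlisting>
+       </para>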
+      </listitem>
+     </varlistentry>
+
+     </variablelist>
+    </sect2>
+
    <sect2 id="runtime-config-standby">
    <title>Standby Servers</title>

--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -875,6 +875,209 @@ primary_conninfo = 'host=192.168.1.50 port=5432 user=foo password=foopass'
   </sect3>

  </sect2>
+  <sect2 id="synchronous-replication">
+   <title>Synchronous Replication</title>
+
+   <indexterm zone="high-availability">
+    <primary>Synchronous Replication</primary>
+   </indexterm>
+
+   <para>
+    <productname>PostgreSQL</> streaming replication is asynchronous by
+    default. If the primary server
+    crashes then some transactions that were committed may not have been
+    replicated to the standby server, causing data loss. The amount
+    of data loss is proportional to the replication delay at the time of
+    failover.
+   </para>
+
+   <para>
+    Synchronous replication offers the ability to confirm that all changes
+    made by a transaction have been transferred to one synchronous standby
+    server. This extends the standard level of durability
+    offered by a transaction commit. This level of protection is referred
+    to as 2-safe replication in computer science theory.
+   </para>
+
+   <para>
+    When requesting synchronous replication, each commit of a
+    write transaction will wait until confirmation is
+    received that the commit has been written to the transaction log on disk
+    of both the primary and standby server. The only possibility that data
+    can be lost is if both the primary and the standby suffer crashes at the
+    same time. This can provide a much higher level of durability, though only
+    if the sysadmin is cautious about the placement and management of the two
+    servers.  Waiting for confirmation increases the user's confidence that the
+    changes will not be lost in the event of server crashes but it also
+    necessarily increases the response time for the requesting transaction.
+    The minimum wait time is the roundtrip time between primary and standby.
+   </para>
+
+   <para>
+    Read-only transactions and transaction rollbacks need not wait for
+    replies from standby servers. Subtransaction commits do not wait for
+    responses from standby servers, only top-level commits. Long-running
+    actions such as data loading or index building do not wait
+    until the very final commit message. All two-phase commit actions
+    require commit waits, including both prepare and commit.
+   </para>
+
+   <sect3 id="synchronous-replication-config">
+    <title>Basic Configuration</title>
+
+   <para>
+    All parameters have useful default values, so we can enable
+    synchronous replication easily just by setting this on the primary:
+
+<programlisting>
+synchronous_replication = on
+</programlisting>
+
+    When <varname>synchronous_replication</> is set, a commit will wait
+    for confirmation that the standby has received the commit record,
+    even if that takes a very long time.
+    <varname>synchronous_replication</> can be set by individual
+    users, so it can be configured in the configuration file, for particular
+    users or databases, or dynamically by application programs.
+   </para>
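+   <para>
+    As a sketch of the per-user and per-database forms (the role
+    <literal>billing</> and database <literal>orders</> are hypothetical
+    names), the standard <command>ALTER ROLE</> and
+    <command>ALTER DATABASE</> commands can be used:
+<programlisting>
+ALTER ROLE billing SET synchronous_replication = on;
+ALTER DATABASE orders SET synchronous_replication = on;
+</programlisting>
+    New sessions for that role or database then start with the setting
+    enabled; an application can still change it with <command>SET</>.
+   </para>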
+
+   <para>
+    After a commit record has been written to disk on the primary the
+    WAL record is then sent to the standby. The standby sends reply
+    messages each time a new batch of WAL data is received, unless
+    <varname>wal_receiver_status_interval</> is set to zero on the standby.
+    If the standby is the first matching standby, as specified in
+    <varname>synchronous_standby_names</> on the primary, the reply
+    messages from that standby will be used to wake users waiting for
+    confirmation that the commit record has been received. These parameters
+    allow the administrator to specify which standby servers should be
+    synchronous standbys. Note that the configuration of synchronous
+    replication is mainly on the primary.
+   </para>
+
+   <para>
+    Users will stop waiting if a fast shutdown is requested, though the
+    server does not fully shut down until all outstanding WAL records are
+    transferred to standby servers.
+   </para>
+
+   <para>
+    Note also that <varname>synchronous_commit</> is used when the user
+    specifies <varname>synchronous_replication</>, overriding even an
+    explicit setting of <varname>synchronous_commit</> to <literal>off</>.
+    This is because we must write WAL to disk on primary before we replicate
+    to ensure the standby never gets ahead of the primary.
+   </para>
+
+   </sect3>
+
+   <sect3 id="synchronous-replication-performance">
+    <title>Planning for Performance</title>
+
+   <para>
+    Synchronous replication usually requires carefully planned and placed
+    standby servers to ensure applications perform acceptably. Waiting
+    doesn't utilise system resources, but transaction locks continue to be
+    held until the transfer is confirmed. As a result, incautious use of
+    synchronous replication will reduce performance for database
+    applications because of increased response times and higher contention.
+   </para>
+
+   <para>
+    <productname>PostgreSQL</> allows the application developer
+    to specify the durability level required via replication. This can be
+    specified for the system overall, though it can also be specified for
+    specific users or connections, or even individual transactions.
+   </para>
+
+   <para>
+    For example, an application workload might consist of:
+    10% of changes are important customer details, while
+    90% of changes are less important data that the business can more
+    easily survive if it is lost, such as chat messages between users.
+   </para>
+
+   <para>
+    With synchronous replication options specified at the application level
+    (on the primary) we can offer sync rep for the most important changes,
+    without slowing down the bulk of the total workload. Application level
+    options are an important and practical tool for allowing the benefits of
+    synchronous replication for high performance applications.
+   </para>
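+   <para>
+    One way this might look, assuming <varname>synchronous_replication</>
+    is left <literal>off</> by default and using a hypothetical
+    <literal>customers</> table, is for only the important transactions to
+    opt in:
+<programlisting>
+BEGIN;
+SET LOCAL synchronous_replication TO ON;
+UPDATE customers SET email = 'new@example.com' WHERE id = 42;
+COMMIT;  -- waits for the synchronous standby to confirm
+</programlisting>
+   </para>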
+
+   <para>
+    You should consider that the network bandwidth must be higher than
+    the rate of generation of WAL data.
+   </para>
+
+   </sect3>
+
+   <sect3 id="synchronous-replication-ha">
+    <title>Planning for High Availability</title>
+
+   <para>
+    Commits made when <varname>synchronous_replication</> is set will wait
+    until the sync standby responds. The response may never occur if the last,
+    or only, standby should crash.
+   </para>
+
+   <para>
+    The best solution for avoiding data loss is to ensure you don't lose
+    your last remaining sync standby. This can be achieved by naming multiple
+    potential synchronous standbys using <varname>synchronous_standby_names</>.
+    The first named standby will be used as the synchronous standby. Standbys
+    listed after this will take over the role of synchronous standby if the
+    first one should fail.
+   </para>
+
+   <para>
+    When a standby first attaches to the primary, it will not yet be properly
+    synchronized. This is described as <literal>CATCHUP</> mode. Once
+    the lag between standby and primary reaches zero for the first time
+    we move to real-time <literal>STREAMING</> state.
+    The catch-up duration may be long immediately after the standby has
+    been created. If the standby is shut down, then the catch-up period
+    will increase according to the length of time the standby has been down.
+    The standby is only able to become a synchronous standby
+    once it has reached <literal>STREAMING</> state.
+   </para>
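+   <para>
+    As a sketch, assuming the <structname>pg_stat_replication</> view
+    exposes the <literal>state</> and <literal>sync_priority</> columns
+    described in the monitoring chapter, the current state of each standby
+    could be checked on the primary with:
+<programlisting>
+SELECT application_name, state, sync_priority
+  FROM pg_stat_replication;
+</programlisting>
+   </para>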
+
+   <para>
+    If the primary restarts while commits are waiting for acknowledgement, those
+    waiting transactions will be marked fully committed once the primary
+    database recovers.
+    There is no way to be certain that all standbys have received all
+    outstanding WAL data at time of the crash of the primary. Some
+    transactions may not show as committed on the standby, even though
+    they show as committed on the primary. The guarantee we offer is that
+    the application will not receive explicit acknowledgement of the
+    successful commit of a transaction until the WAL data is known to be
+    safely received by the standby.
+   </para>
+
+   <para>
+    If you really do lose your last standby server then you should disable
+    <varname>synchronous_standby_names</> and restart the primary server.
+   </para>
+
+   <para>
+    If the primary is isolated from remaining standby servers you should
+    fail over to the best candidate of those other remaining standby servers.
+   </para>
+
+   <para>
+    If you need to re-create a standby server while transactions are
+    waiting, make sure that the commands to run pg_start_backup() and
+    pg_stop_backup() are run in a session with
+    synchronous_replication = off, otherwise those requests will wait
+    forever for the standby to appear.
+   </para>
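+   <para>
+    A sketch of such a session follows; the backup label is an arbitrary
+    placeholder and the copy step is left to whatever tool you normally use:
+<programlisting>
+SET synchronous_replication = off;
+SELECT pg_start_backup('rebuild_standby');
+-- copy the primary's data directory to the new standby here
+SELECT pg_stop_backup();
+</programlisting>
+   </para>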
+
+   </sect3>
+  </sect2>
  </sect1>

  <sect1 id="warm-standby-failover">
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -306,8 +306,11 @@ postgres: <replaceable>user</> <replaceable>database</> <replaceable>host</> <re
      location.  In addition, the standby reports the last transaction log
      position it received and wrote, the last position it flushed to disk,
      and the last position it replayed, and this information is also
-      displayed here.  The columns detailing what exactly the connection is
-      doing are only visible if the user examining the view is a superuser.
+      displayed here. If the standby's application name matches one of the
+      settings in <varname>synchronous_standby_names</> then the sync_priority
+      is also shown here; this is the order in which standbys will become
+      the synchronous standby. The columns detailing what exactly the connection
+      is doing are only visible if the user examining the view is a superuser.
      The client's hostname will be available only if
      <xref linkend="guc-log-hostname"> is set or if the user's hostname
      needed to be looked up during <filename>pg_hba.conf</filename>