
Allow triggering kernel writeback after a configurable number of writes.

Currently writes to the main data files of postgres all go through the
OS page cache. This means that some operating systems can end up
collecting a large number of dirty buffers in their respective page
caches.  When these dirty buffers are flushed to storage rapidly, be it
because of fsync(), timeouts, or dirty ratios, latency for other reads
and writes can increase massively.  This is the primary cause of the
periodic massive stalls observed in real-world scenarios and artificial
benchmarks; on rotating disks, stalls on the order of hundreds of seconds
have been observed.

On Linux it is possible to mitigate this by significantly reducing the
global dirty limits. But global configuration is rather problematic
because it affects other applications as well; and PostgreSQL itself
doesn't always want this behavior, e.g. for temporary files it's
undesirable.
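
The global limits in question are Linux's vm.dirty_* sysctls. Purely for
illustration - the values here are hypothetical, not a recommendation -
they can be lowered like this:

    # start background writeback once 64MB of dirty data has accumulated,
    # and block writers outright at 256MB
    sysctl -w vm.dirty_background_bytes=67108864
    sysctl -w vm.dirty_bytes=268435456

Both settings apply machine-wide, which is exactly why relying on them is
problematic.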

Several operating systems allow some control over the kernel page
cache. Linux has sync_file_range(2); several POSIX systems have msync(2)
and posix_fadvise(2). sync_file_range(2) is preferable because it
requires no special setup, whereas msync() requires the to-be-flushed
range to be mmap'ed. For the purpose of flushing dirty data
posix_fadvise(2) is the worst alternative, as flushing dirty data is
just a side-effect of POSIX_FADV_DONTNEED, which also removes the pages
from the page cache.  Thus the feature is enabled by default only on
Linux, but it can be enabled on any system that has one of the above
APIs.
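
To make that concrete, here is a minimal sketch of how a flush hint for a
file range can be issued with each of these APIs. This is illustration
only, not the patch's implementation; the HAVE_* macros stand in for the
usual configure-style feature tests:

    #define _GNU_SOURCE             /* for sync_file_range() on Linux */
    #include <fcntl.h>              /* sync_file_range(), posix_fadvise() */
    #include <sys/mman.h>           /* mmap(), msync(), munmap() */

    /*
     * Hint the kernel to start writing back [offset, offset + nbytes) of
     * fd.  Only initiates writeback; does not wait for completion.  The
     * offset is assumed page-aligned for brevity.
     */
    static void
    flush_hint(int fd, off_t offset, off_t nbytes)
    {
    #if defined(HAVE_SYNC_FILE_RANGE)
        /* Linux: start writeback, don't wait, don't evict the pages. */
        (void) sync_file_range(fd, offset, nbytes, SYNC_FILE_RANGE_WRITE);
    #elif defined(HAVE_MSYNC)
        /* POSIX: msync() works on mappings, so map the range first. */
        void   *p = mmap(NULL, nbytes, PROT_READ, MAP_SHARED, fd, offset);

        if (p != MAP_FAILED)
        {
            (void) msync(p, nbytes, MS_ASYNC);
            (void) munmap(p, nbytes);
        }
    #elif defined(HAVE_POSIX_FADVISE)
        /* Flushing is only a side-effect here; the pages are evicted too. */
        (void) posix_fadvise(fd, offset, nbytes, POSIX_FADV_DONTNEED);
    #endif
    }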

While desirable and likely possible, this patch does not contain an
implementation for Windows.

With the infrastructure added, writes made via checkpointer, bgwriter
and normal user backends can be flushed after a configurable number of
writes. Each of these sources of writes is controlled by a separate GUC,
checkpoint_flush_after, bgwriter_flush_after and backend_flush_after
respectively; they're separate because the flush thresholds that work
well differ between them, and because the performance considerations of
controlled flushing are different for each.
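
For example, the three GUCs could be tuned in postgresql.conf like so
(illustrative values only, not tuning advice):

    # writeback control; 0 disables the flush hints entirely
    checkpoint_flush_after = 256kB
    bgwriter_flush_after   = 512kB
    backend_flush_after    = 0        # leave backend writes to the kernel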

A later patch will add checkpoint sorting - after that, flushes from the
checkpoint will almost always be desirable. Bgwriter flushes will mostly
be random, which is slow on a lot of storage hardware. Flushing in
backends works well if the storage and bgwriter can keep up, but if not
it can have negative consequences.  This patch is likely to have negative
performance consequences without checkpoint sorting, but unfortunately so
does sorting without flush control.
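
In essence, each writer counts the bytes it has written and issues a
flush hint whenever a threshold is crossed. A simplified, self-contained
sketch of that logic (invented names, sequential writes assumed; the real
code must handle scattered blocks across many files):

    #define _GNU_SOURCE             /* for sync_file_range() on Linux */
    #include <fcntl.h>
    #include <sys/types.h>

    typedef struct WritebackState
    {
        off_t       start;          /* offset where the pending run begins */
        off_t       pending;        /* bytes written since the last hint */
        off_t       threshold;      /* e.g. 512 * 1024 for 512kB */
    } WritebackState;

    /* Record a write; once 'threshold' bytes accumulate, start writeback. */
    static void
    note_write(WritebackState *st, int fd, off_t offset, off_t len)
    {
        if (st->pending == 0)
            st->start = offset;
        st->pending += len;

        if (st->threshold > 0 && st->pending >= st->threshold)
        {
    #ifdef HAVE_SYNC_FILE_RANGE
            /* initiate writeback of the accumulated range, don't wait */
            (void) sync_file_range(fd, st->start, st->pending,
                                   SYNC_FILE_RANGE_WRITE);
    #endif
            st->pending = 0;
        }
    }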

Discussion: alpine.DEB.2.10.1506011320000.28433@sto
Author: Fabien Coelho and Andres Freund
Andres Freund
2016-02-19 12:13:05 -08:00
parent c82c92b111
commit 428b1d6b29
15 changed files with 602 additions and 32 deletions


@ -1843,6 +1843,35 @@ include_dir 'conf.d'
</para>
</listitem>
</varlistentry>
<varlistentry id="guc-bgwriter-flush-after" xreflabel="bgwriter_flush_after">
<term><varname>bgwriter_flush_after</varname> (<type>int</type>)
<indexterm>
<primary><varname>bgwriter_flush_after</> configuration parameter</primary>
</indexterm>
</term>
<listitem>
<para>
Whenever more than <varname>bgwriter_flush_after</varname> bytes have
been written by the bgwriter, attempt to force the OS to issue these
writes to the underlying storage. Doing so will limit the amount of
dirty data in the kernel's page cache, reducing the likelihood of
stalls when an fsync is issued at the end of a checkpoint, or when
the OS writes data back in larger batches in the background. Often
that will result in greatly reduced transaction latency, but there
also are some cases, especially with workloads that are bigger than
<xref linkend="guc-shared-buffers">, but smaller than the OS's page
cache, where performance might degrade. This setting may have no
effect on some platforms. The valid range is between
<literal>0</literal>, which disables controlled writeback, and
<literal>2MB</literal>. The default is <literal>512kB</> on Linux,
<literal>0</> elsewhere. (Non-default values of
<symbol>BLCKSZ</symbol> change the default and maximum.)
This parameter can only be set in the <filename>postgresql.conf</>
file or on the server command line.
</para>
</listitem>
</varlistentry>
</variablelist>
<para>
@ -1944,6 +1973,35 @@ include_dir 'conf.d'
</para>
</listitem>
</varlistentry>
<varlistentry id="guc-backend-flush-after" xreflabel="backend_flush_after">
<term><varname>backend_flush_after</varname> (<type>int</type>)
<indexterm>
<primary><varname>backend_flush_after</> configuration parameter</primary>
</indexterm>
</term>
<listitem>
<para>
Whenever more than <varname>backend_flush_after</varname> bytes have
been written by a single backend, attempt to force the OS to issue
these writes to the underlying storage. Doing so will limit the
amount of dirty data in the kernel's page cache, reducing the
likelihood of stalls when an fsync is issued at the end of a
checkpoint, or when the OS writes data back in larger batches in the
background. Often that will result in greatly reduced transaction
latency, but there also are some cases, especially with workloads
that are bigger than <xref linkend="guc-shared-buffers">, but smaller
than the OS's page cache, where performance might degrade. This
setting may have no effect on some platforms. The valid range is
between <literal>0</literal>, which disables controlled writeback,
and <literal>2MB</literal>. The default is <literal>128kB</> on
Linux, <literal>0</> elsewhere. (Non-default values of
<symbol>BLCKSZ</symbol> change the default and maximum.)
This parameter can only be set in the <filename>postgresql.conf</>
file or on the server command line.
</para>
</listitem>
</varlistentry>
</variablelist>
</sect2>
</sect1>
@ -2475,6 +2533,35 @@ include_dir 'conf.d'
</listitem>
</varlistentry>
<varlistentry id="guc-checkpoint-flush-after" xreflabel="checkpoint_flush_after">
<term><varname>checkpoint_flush_after</varname> (<type>int</type>)
<indexterm>
<primary><varname>checkpoint_flush_after</> configuration parameter</primary>
</indexterm>
</term>
<listitem>
<para>
Whenever more than <varname>checkpoint_flush_after</varname> bytes
have been written while performing a checkpoint, attempt to force the
OS to issue these writes to the underlying storage. Doing so will
limit the amount of dirty data in the kernel's page cache, reducing
the likelihood of stalls when an fsync is issued at the end of the
checkpoint, or when the OS writes data back in larger batches in the
background. Often that will result in greatly reduced transaction
latency, but there also are some cases, especially with workloads
that are bigger than <xref linkend="guc-shared-buffers">, but smaller
than the OS's page cache, where performance might degrade. This
setting may have no effect on some platforms. The valid range is
between <literal>0</literal>, which disables controlled writeback,
and <literal>2MB</literal>. The default is <literal>128kB</> on
Linux, <literal>0</> elsewhere. (Non-default values of
<symbol>BLCKSZ</symbol> change the default and maximum.)
This parameter can only be set in the <filename>postgresql.conf</>
file or on the server command line.
</para>
</listitem>
</varlistentry>
<varlistentry id="guc-checkpoint-warning" xreflabel="checkpoint_warning">
<term><varname>checkpoint_warning</varname> (<type>integer</type>)
<indexterm>


@ -545,6 +545,17 @@
unexpected variation in the number of WAL segments needed.
</para>
<para>
On Linux and POSIX platforms <xref linkend="guc-checkpoint-flush-after">
allows forcing the OS to flush pages written by the checkpoint to disk
after a configurable number of bytes. Otherwise, these pages may be kept
in the OS's page cache, inducing a stall when <literal>fsync</> is issued
at the end of a checkpoint. This setting will often help to reduce
transaction latency, but it can also have an adverse effect on
performance, particularly for workloads that are bigger than
<xref linkend="guc-shared-buffers">, but smaller than the OS's page cache.
</para>
<para>
The number of WAL segment files in <filename>pg_xlog</> directory depends on
<varname>min_wal_size</>, <varname>max_wal_size</> and