mirror of
https://github.com/postgres/postgres.git
synced 2025-04-24 10:47:04 +03:00
its usual buffer cleaning duties during archive recovery, and it's responsible for performing restartpoints. This requires some changes in postmaster. When the startup process has done all the initialization and is ready to start WAL redo, it signals the postmaster to launch the background writer. The postmaster is signaled again when the point in recovery is reached where we know that the database is in consistent state. Postmaster isn't interested in that at the moment, but that's the point where we could let other backends in to perform read-only queries. The postmaster is signaled third time when the recovery has ended, so that postmaster knows that it's safe to start accepting connections. The startup process now traps SIGTERM, and performs a "clean" shutdown. If you do a fast shutdown during recovery, a shutdown restartpoint is performed, like a shutdown checkpoint, and postmaster kills the processes cleanly. You still have to continue the recovery at next startup, though. Currently, the background writer is only launched during archive recovery. We could launch it during crash recovery as well, but it seems better to keep that codepath as simple as possible, for the sake of robustness. And it couldn't do any restartpoints during crash recovery anyway, so it wouldn't be that useful. log_restartpoints is gone. Use log_checkpoints instead. This is yet to be documented. This whole operation is a pre-requisite for Hot Standby, but has some value of its own whether the hot standby patch makes 8.4 or not. Simon Riggs, with lots of modifications by me.
276 lines
15 KiB
Plaintext
276 lines
15 KiB
Plaintext
$PostgreSQL: pgsql/src/backend/storage/buffer/README,v 1.16 2009/02/18 15:58:41 heikki Exp $
|
|
|
|
Notes About Shared Buffer Access Rules
|
|
======================================
|
|
|
|
There are two separate access control mechanisms for shared disk buffers:
|
|
reference counts (a/k/a pin counts) and buffer content locks. (Actually,
|
|
there's a third level of access control: one must hold the appropriate kind
|
|
of lock on a relation before one can legally access any page belonging to
|
|
the relation. Relation-level locks are not discussed here.)
|
|
|
|
Pins: one must "hold a pin on" a buffer (increment its reference count)
|
|
before being allowed to do anything at all with it. An unpinned buffer is
|
|
subject to being reclaimed and reused for a different page at any instant,
|
|
so touching it is unsafe. Normally a pin is acquired via ReadBuffer and
|
|
released via ReleaseBuffer. It is OK and indeed common for a single
|
|
backend to pin a page more than once concurrently; the buffer manager
|
|
handles this efficiently. It is considered OK to hold a pin for long
|
|
intervals --- for example, sequential scans hold a pin on the current page
|
|
until done processing all the tuples on the page, which could be quite a
|
|
while if the scan is the outer scan of a join. Similarly, btree index
|
|
scans hold a pin on the current index page. This is OK because normal
|
|
operations never wait for a page's pin count to drop to zero. (Anything
|
|
that might need to do such a wait is instead handled by waiting to obtain
|
|
the relation-level lock, which is why you'd better hold one first.) Pins
|
|
may not be held across transaction boundaries, however.
|
|
|
|
Buffer content locks: there are two kinds of buffer lock, shared and exclusive,
|
|
which act just as you'd expect: multiple backends can hold shared locks on
|
|
the same buffer, but an exclusive lock prevents anyone else from holding
|
|
either shared or exclusive lock. (These can alternatively be called READ
|
|
and WRITE locks.) These locks are intended to be short-term: they should not
|
|
be held for long. Buffer locks are acquired and released by LockBuffer().
|
|
It will *not* work for a single backend to try to acquire multiple locks on
|
|
the same buffer. One must pin a buffer before trying to lock it.
|
|
|
|
Buffer access rules:
|
|
|
|
1. To scan a page for tuples, one must hold a pin and either shared or
|
|
exclusive content lock. To examine the commit status (XIDs and status bits)
|
|
of a tuple in a shared buffer, one must likewise hold a pin and either shared
|
|
or exclusive lock.
|
|
|
|
2. Once one has determined that a tuple is interesting (visible to the
|
|
current transaction) one may drop the content lock, yet continue to access
|
|
the tuple's data for as long as one holds the buffer pin. This is what is
|
|
typically done by heap scans, since the tuple returned by heap_fetch
|
|
contains a pointer to tuple data in the shared buffer. Therefore the
|
|
tuple cannot go away while the pin is held (see rule #5). Its state could
|
|
change, but that is assumed not to matter after the initial determination
|
|
of visibility is made.
|
|
|
|
3. To add a tuple or change the xmin/xmax fields of an existing tuple,
|
|
one must hold a pin and an exclusive content lock on the containing buffer.
|
|
This ensures that no one else might see a partially-updated state of the
|
|
tuple while they are doing visibility checks.
|
|
|
|
4. It is considered OK to update tuple commit status bits (ie, OR the
|
|
values HEAP_XMIN_COMMITTED, HEAP_XMIN_INVALID, HEAP_XMAX_COMMITTED, or
|
|
HEAP_XMAX_INVALID into t_infomask) while holding only a shared lock and
|
|
pin on a buffer. This is OK because another backend looking at the tuple
|
|
at about the same time would OR the same bits into the field, so there
|
|
is little or no risk of conflicting update; what's more, if there did
|
|
manage to be a conflict it would merely mean that one bit-update would
|
|
be lost and need to be done again later. These four bits are only hints
|
|
(they cache the results of transaction status lookups in pg_clog), so no
|
|
great harm is done if they get reset to zero by conflicting updates.
|
|
|
|
5. To physically remove a tuple or compact free space on a page, one
|
|
must hold a pin and an exclusive lock, *and* observe while holding the
|
|
exclusive lock that the buffer's shared reference count is one (ie,
|
|
no other backend holds a pin). If these conditions are met then no other
|
|
backend can perform a page scan until the exclusive lock is dropped, and
|
|
no other backend can be holding a reference to an existing tuple that it
|
|
might expect to examine again. Note that another backend might pin the
|
|
buffer (increment the refcount) while one is performing the cleanup, but
|
|
it won't be able to actually examine the page until it acquires shared
|
|
or exclusive content lock.
|
|
|
|
|
|
Rule #5 only affects VACUUM operations. Obtaining the
|
|
necessary lock is done by the bufmgr routine LockBufferForCleanup().
|
|
It first gets an exclusive lock and then checks to see if the shared pin
|
|
count is currently 1. If not, it releases the exclusive lock (but not the
|
|
caller's pin) and waits until signaled by another backend, whereupon it
|
|
tries again. The signal will occur when UnpinBuffer decrements the shared
|
|
pin count to 1. As indicated above, this operation might have to wait a
|
|
good while before it acquires lock, but that shouldn't matter much for
|
|
concurrent VACUUM. The current implementation only supports a single
|
|
waiter for pin-count-1 on any particular shared buffer. This is enough
|
|
for VACUUM's use, since we don't allow multiple VACUUMs concurrently on a
|
|
single relation anyway.
|
|
|
|
|
|
Buffer Manager's Internal Locking
|
|
---------------------------------
|
|
|
|
Before PostgreSQL 8.1, all operations of the shared buffer manager itself
|
|
were protected by a single system-wide lock, the BufMgrLock, which
|
|
unsurprisingly proved to be a source of contention. The new locking scheme
|
|
avoids grabbing system-wide exclusive locks in common code paths. It works
|
|
like this:
|
|
|
|
* There is a system-wide LWLock, the BufMappingLock, that notionally
|
|
protects the mapping from buffer tags (page identifiers) to buffers.
|
|
(Physically, it can be thought of as protecting the hash table maintained
|
|
by buf_table.c.) To look up whether a buffer exists for a tag, it is
|
|
sufficient to obtain share lock on the BufMappingLock. Note that one
|
|
must pin the found buffer, if any, before releasing the BufMappingLock.
|
|
To alter the page assignment of any buffer, one must hold exclusive lock
|
|
on the BufMappingLock. This lock must be held across adjusting the buffer's
|
|
header fields and changing the buf_table hash table. The only common
|
|
operation that needs exclusive lock is reading in a page that was not
|
|
in shared buffers already, which will require at least a kernel call
|
|
and usually a wait for I/O, so it will be slow anyway.
|
|
|
|
* As of PG 8.2, the BufMappingLock has been split into NUM_BUFFER_PARTITIONS
|
|
separate locks, each guarding a portion of the buffer tag space. This allows
|
|
further reduction of contention in the normal code paths. The partition
|
|
that a particular buffer tag belongs to is determined from the low-order
|
|
bits of the tag's hash value. The rules stated above apply to each partition
|
|
independently. If it is necessary to lock more than one partition at a time,
|
|
they must be locked in partition-number order to avoid risk of deadlock.
|
|
|
|
* A separate system-wide LWLock, the BufFreelistLock, provides mutual
|
|
exclusion for operations that access the buffer free list or select
|
|
buffers for replacement. This is always taken in exclusive mode since
|
|
there are no read-only operations on those data structures. The buffer
|
|
management policy is designed so that BufFreelistLock need not be taken
|
|
except in paths that will require I/O, and thus will be slow anyway.
|
|
(Details appear below.) It is never necessary to hold the BufMappingLock
|
|
and the BufFreelistLock at the same time.
|
|
|
|
* Each buffer header contains a spinlock that must be taken when examining
|
|
or changing fields of that buffer header. This allows operations such as
|
|
ReleaseBuffer to make local state changes without taking any system-wide
|
|
lock. We use a spinlock, not an LWLock, since there are no cases where
|
|
the lock needs to be held for more than a few instructions.
|
|
|
|
Note that a buffer header's spinlock does not control access to the data
|
|
held within the buffer. Each buffer header also contains an LWLock, the
|
|
"buffer content lock", that *does* represent the right to access the data
|
|
in the buffer. It is used per the rules above.
|
|
|
|
There is yet another set of per-buffer LWLocks, the io_in_progress locks,
|
|
that are used to wait for I/O on a buffer to complete. The process doing
|
|
a read or write takes exclusive lock for the duration, and processes that
|
|
need to wait for completion try to take shared locks (which they release
|
|
immediately upon obtaining). XXX on systems where an LWLock represents
|
|
nontrivial resources, it's fairly annoying to need so many locks. Possibly
|
|
we could use per-backend LWLocks instead (a buffer header would then contain
|
|
a field to show which backend is doing its I/O).
|
|
|
|
|
|
Normal Buffer Replacement Strategy
|
|
----------------------------------
|
|
|
|
There is a "free list" of buffers that are prime candidates for replacement.
|
|
In particular, buffers that are completely free (contain no valid page) are
|
|
always in this list. We could also throw buffers into this list if we
|
|
consider their pages unlikely to be needed soon; however, the current
|
|
algorithm never does that. The list is singly-linked using fields in the
|
|
buffer headers; we maintain head and tail pointers in global variables.
|
|
(Note: although the list links are in the buffer headers, they are
|
|
considered to be protected by the BufFreelistLock, not the buffer-header
|
|
spinlocks.) To choose a victim buffer to recycle when there are no free
|
|
buffers available, we use a simple clock-sweep algorithm, which avoids the
|
|
need to take system-wide locks during common operations. It works like
|
|
this:
|
|
|
|
Each buffer header contains a usage counter, which is incremented (up to a
|
|
small limit value) whenever the buffer is unpinned. (This requires only the
|
|
buffer header spinlock, which would have to be taken anyway to decrement the
|
|
buffer reference count, so it's nearly free.)
|
|
|
|
The "clock hand" is a buffer index, NextVictimBuffer, that moves circularly
|
|
through all the available buffers. NextVictimBuffer is protected by the
|
|
BufFreelistLock.
|
|
|
|
The algorithm for a process that needs to obtain a victim buffer is:
|
|
|
|
1. Obtain BufFreelistLock.
|
|
|
|
2. If buffer free list is nonempty, remove its head buffer. If the buffer
|
|
is pinned or has a nonzero usage count, it cannot be used; ignore it and
|
|
return to the start of step 2. Otherwise, pin the buffer, release
|
|
BufFreelistLock, and return the buffer.
|
|
|
|
3. Otherwise, select the buffer pointed to by NextVictimBuffer, and
|
|
circularly advance NextVictimBuffer for next time.
|
|
|
|
4. If the selected buffer is pinned or has a nonzero usage count, it cannot
|
|
be used. Decrement its usage count (if nonzero) and return to step 3 to
|
|
examine the next buffer.
|
|
|
|
5. Pin the selected buffer, release BufFreelistLock, and return the buffer.
|
|
|
|
(Note that if the selected buffer is dirty, we will have to write it out
|
|
before we can recycle it; if someone else pins the buffer meanwhile we will
|
|
have to give up and try another buffer. This however is not a concern
|
|
of the basic select-a-victim-buffer algorithm.)
|
|
|
|
|
|
Buffer Ring Replacement Strategy
|
|
---------------------------------
|
|
|
|
When running a query that needs to access a large number of pages just once,
|
|
such as VACUUM or a large sequential scan, a different strategy is used.
|
|
A page that has been touched only by such a scan is unlikely to be needed
|
|
again soon, so instead of running the normal clock sweep algorithm and
|
|
blowing out the entire buffer cache, a small ring of buffers is allocated
|
|
using the normal clock sweep algorithm and those buffers are reused for the
|
|
whole scan. This also implies that much of the write traffic caused by such
|
|
a statement will be done by the backend itself and not pushed off onto other
|
|
processes.
|
|
|
|
For sequential scans, a 256KB ring is used. That's small enough to fit in L2
|
|
cache, which makes transferring pages from OS cache to shared buffer cache
|
|
efficient. Even less would often be enough, but the ring must be big enough
|
|
to accommodate all pages in the scan that are pinned concurrently. 256KB
|
|
should also be enough to leave a small cache trail for other backends to
|
|
join in a synchronized seq scan. If a ring buffer is dirtied and its LSN
|
|
updated, we would normally have to write and flush WAL before we could
|
|
re-use the buffer; in this case we instead discard the buffer from the ring
|
|
and (later) choose a replacement using the normal clock-sweep algorithm.
|
|
Hence this strategy works best for scans that are read-only (or at worst
|
|
update hint bits). In a scan that modifies every page in the scan, like a
|
|
bulk UPDATE or DELETE, the buffers in the ring will always be dirtied and
|
|
the ring strategy effectively degrades to the normal strategy.
|
|
|
|
VACUUM uses a 256KB ring like sequential scans, but dirty pages are not
|
|
removed from the ring. Instead, WAL is flushed if needed to allow reuse of
|
|
the buffers. Before introducing the buffer ring strategy in 8.3, VACUUM's
|
|
buffers were sent to the freelist, which was effectively a buffer ring of 1
|
|
buffer, resulting in excessive WAL flushing. Allowing VACUUM to update
|
|
256KB between WAL flushes should be more efficient.
|
|
|
|
Bulk writes work similarly to VACUUM. Currently this applies only to
|
|
COPY IN and CREATE TABLE AS SELECT. (Might it be interesting to make
|
|
seqscan UPDATE and DELETE use the bulkwrite strategy?)
|
|
|
|
|
|
Background Writer's Processing
|
|
------------------------------
|
|
|
|
The background writer is designed to write out pages that are likely to be
|
|
recycled soon, thereby offloading the writing work from active backends.
|
|
To do this, it scans forward circularly from the current position of
|
|
NextVictimBuffer (which it does not change!), looking for buffers that are
|
|
dirty and not pinned nor marked with a positive usage count. It pins,
|
|
writes, and releases any such buffer.
|
|
|
|
If we can assume that reading NextVictimBuffer is an atomic action, then
|
|
the writer doesn't even need to take the BufFreelistLock in order to look
|
|
for buffers to write; it needs only to spinlock each buffer header for long
|
|
enough to check the dirtybit. Even without that assumption, the writer
|
|
only needs to take the lock long enough to read the variable value, not
|
|
while scanning the buffers. (This is a very substantial improvement in
|
|
the contention cost of the writer compared to PG 8.0.)
|
|
|
|
During a checkpoint, the writer's strategy must be to write every dirty
|
|
buffer (pinned or not!). We may as well make it start this scan from
|
|
NextVictimBuffer, however, so that the first-to-be-written pages are the
|
|
ones that backends might otherwise have to write for themselves soon.
|
|
|
|
The background writer takes shared content lock on a buffer while writing it
|
|
out (and anyone else who flushes buffer contents to disk must do so too).
|
|
This ensures that the page image transferred to disk is reasonably consistent.
|
|
We might miss a hint-bit update or two but that isn't a problem, for the same
|
|
reasons mentioned under buffer access rules.
|
|
|
|
As of 8.4, background writer starts during recovery mode when there is
|
|
some form of potentially extended recovery to perform. It performs an
|
|
identical service to normal processing, except that checkpoints it
|
|
writes are technically restartpoints.
|