mirror of https://github.com/postgres/postgres.git
Make large sequential scans and VACUUMs work in a limited-size "ring" of
buffers, rather than blowing out the whole shared-buffer arena. Aside from
avoiding cache spoliation, this fixes the problem that VACUUM formerly tended
to cause a WAL flush for every page it modified, because we had it hacked to
use only a single buffer. Those flushes will now occur only once per ring-ful.
The exact ring size, and the threshold for seqscans to switch into the ring
usage pattern, remain under debate; but the infrastructure seems done. The key
bit of infrastructure is a new optional BufferAccessStrategy object that can
be passed to ReadBuffer operations; this replaces the former
StrategyHintVacuum API.

This patch also changes the buffer usage-count methodology a bit: we now
advance usage_count when first pinning a buffer, rather than when last
unpinning it. To preserve the behavior that a buffer's lifetime starts to
decrease when it's released, the clock sweep code is modified to not decrement
usage_count of pinned buffers.

Work not done in this commit: teach GiST and GIN indexes to use the vacuum
BufferAccessStrategy for vacuum-driven fetches.

Original patch by Simon, reworked by Heikki and again by Tom.
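For context, here is a minimal, hypothetical sketch of how a backend-internal
caller is expected to use the new API. The functions GetAccessStrategy,
ReadBufferWithStrategy, FreeAccessStrategy and the BAS_BULKREAD strategy type
are all introduced by the patch below; the surrounding scan loop and the
function name scan_all_blocks are illustrative only.

    /* Hypothetical bulk-read loop using the new BufferAccessStrategy API. */
    void
    scan_all_blocks(Relation rel, BlockNumber nblocks)
    {
        /* Private ring of buffers; passing NULL would mean the default strategy. */
        BufferAccessStrategy strategy = GetAccessStrategy(BAS_BULKREAD);
        BlockNumber blkno;

        for (blkno = 0; blkno < nblocks; blkno++)
        {
            /* The returned buffer is pinned; victims are taken from the small ring. */
            Buffer      buf = ReadBufferWithStrategy(rel, blkno, strategy);

            /* ... inspect the page here ... */

            ReleaseBuffer(buf);
        }

        FreeAccessStrategy(strategy);
    }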
@@ -1,4 +1,4 @@
-$PostgreSQL: pgsql/src/backend/storage/buffer/README,v 1.11 2006/07/23 03:07:58 tgl Exp $
+$PostgreSQL: pgsql/src/backend/storage/buffer/README,v 1.12 2007/05/30 20:11:58 tgl Exp $
 
 Notes about shared buffer access rules
 --------------------------------------
@@ -152,20 +152,21 @@ we could use per-backend LWLocks instead (a buffer header would then contain
 a field to show which backend is doing its I/O).
 
 
-Buffer replacement strategy
----------------------------
+Normal buffer replacement strategy
+----------------------------------
 
 There is a "free list" of buffers that are prime candidates for replacement.
 In particular, buffers that are completely free (contain no valid page) are
-always in this list. We may also throw buffers into this list if we
-consider their pages unlikely to be needed soon. The list is singly-linked
-using fields in the buffer headers; we maintain head and tail pointers in
-global variables. (Note: although the list links are in the buffer headers,
-they are considered to be protected by the BufFreelistLock, not the
-buffer-header spinlocks.) To choose a victim buffer to recycle when there
-are no free buffers available, we use a simple clock-sweep algorithm, which
-avoids the need to take system-wide locks during common operations. It
-works like this:
+always in this list. We could also throw buffers into this list if we
+consider their pages unlikely to be needed soon; however, the current
+algorithm never does that. The list is singly-linked using fields in the
+buffer headers; we maintain head and tail pointers in global variables.
+(Note: although the list links are in the buffer headers, they are
+considered to be protected by the BufFreelistLock, not the buffer-header
+spinlocks.) To choose a victim buffer to recycle when there are no free
+buffers available, we use a simple clock-sweep algorithm, which avoids the
+need to take system-wide locks during common operations. It works like
+this:
 
 Each buffer header contains a usage counter, which is incremented (up to a
 small limit value) whenever the buffer is unpinned. (This requires only the
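To make the clock-sweep description above concrete, here is a simplified
sketch (mine, not part of the patch) of the victim-selection loop as it
behaves after this commit, i.e. pinned buffers are skipped rather than
decremented. The authoritative version is StrategyGetBuffer() in freelist.c
further down; the function name clock_sweep_victim is invented, the
nextVictimBuffer field is assumed from freelist.c, and buffer-header
spinlocks and the free list are omitted for brevity.

    /* Simplified clock sweep over the shared buffer pool (sketch only). */
    static volatile BufferDesc *
    clock_sweep_victim(void)
    {
        int         trycounter = NBuffers;

        for (;;)
        {
            volatile BufferDesc *buf =
                &BufferDescriptors[StrategyControl->nextVictimBuffer];

            if (++StrategyControl->nextVictimBuffer >= NBuffers)
                StrategyControl->nextVictimBuffer = 0;

            if (buf->refcount == 0)
            {
                if (buf->usage_count > 0)
                {
                    buf->usage_count--;     /* grant it one more sweep of life */
                    trycounter = NBuffers;  /* and restart the starvation counter */
                }
                else
                    return buf;             /* unpinned and unused: our victim */
            }
            else if (--trycounter == 0)
                elog(ERROR, "no unpinned buffers available");
        }
    }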
@@ -199,22 +200,40 @@ before we can recycle it; if someone else pins the buffer meanwhile we will
 have to give up and try another buffer. This however is not a concern
 of the basic select-a-victim-buffer algorithm.)
 
-A special provision is that while running VACUUM, a backend does not
-increment the usage count on buffers it accesses. In fact, if ReleaseBuffer
-sees that it is dropping the pin count to zero and the usage count is zero,
-then it appends the buffer to the tail of the free list. (This implies that
-VACUUM, but only VACUUM, must take the BufFreelistLock during ReleaseBuffer;
-this shouldn't create much of a contention problem.) This provision
-encourages VACUUM to work in a relatively small number of buffers rather
-than blowing out the entire buffer cache. It is reasonable since a page
-that has been touched only by VACUUM is unlikely to be needed again soon.
-
-Since VACUUM usually requests many pages very fast, the effect of this is that
-it will get back the very buffers it filled and possibly modified on the next
-call and will therefore do its work in a few shared memory buffers, while
-being able to use whatever it finds in the cache already. This also implies
-that most of the write traffic caused by a VACUUM will be done by the VACUUM
-itself and not pushed off onto other processes.
+Buffer ring replacement strategy
+---------------------------------
+
+When running a query that needs to access a large number of pages just once,
+such as VACUUM or a large sequential scan, a different strategy is used.
+A page that has been touched only by such a scan is unlikely to be needed
+again soon, so instead of running the normal clock sweep algorithm and
+blowing out the entire buffer cache, a small ring of buffers is allocated
+using the normal clock sweep algorithm and those buffers are reused for the
+whole scan. This also implies that much of the write traffic caused by such
+a statement will be done by the backend itself and not pushed off onto other
+processes.
+
+For sequential scans, a 256KB ring is used. That's small enough to fit in L2
+cache, which makes transferring pages from OS cache to shared buffer cache
+efficient. Even less would often be enough, but the ring must be big enough
+to accommodate all pages in the scan that are pinned concurrently. 256KB
+should also be enough to leave a small cache trail for other backends to
+join in a synchronized seq scan. If a ring buffer is dirtied and its LSN
+updated, we would normally have to write and flush WAL before we could
+re-use the buffer; in this case we instead discard the buffer from the ring
+and (later) choose a replacement using the normal clock-sweep algorithm.
+Hence this strategy works best for scans that are read-only (or at worst
+update hint bits). In a scan that modifies every page in the scan, like a
+bulk UPDATE or DELETE, the buffers in the ring will always be dirtied and
+the ring strategy effectively degrades to the normal strategy.
+
+VACUUM uses a 256KB ring like sequential scans, but dirty pages are not
+removed from the ring. Instead, WAL is flushed if needed to allow reuse of
+the buffers. Before introducing the buffer ring strategy in 8.3, VACUUM's
+buffers were sent to the freelist, which was effectively a buffer ring of 1
+buffer, resulting in excessive WAL flushing. Allowing VACUUM to update
+256KB between WAL flushes should be more efficient.
+
 
 Background writer's processing
 
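A worked example of the ring sizing mentioned above (my arithmetic, not text
from the README): with the default BLCKSZ of 8192 bytes, 256KB works out to
256 * 1024 / 8192 = 32 buffers, and GetAccessStrategy() in freelist.c below
additionally clamps the ring to NBuffers / 8, so the ring only shrinks below
32 buffers when shared_buffers is smaller than 256 buffers (2MB).

    /* Ring sizing as done in GetAccessStrategy() (sketch of the arithmetic). */
    int     ring_size = 256 * 1024 / BLCKSZ;    /* = 32 buffers at BLCKSZ 8192 */

    ring_size = Min(NBuffers / 8, ring_size);   /* keep tiny caches usable */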
@@ -8,7 +8,7 @@
  *
  *
  * IDENTIFICATION
- * $PostgreSQL: pgsql/src/backend/storage/buffer/bufmgr.c,v 1.219 2007/05/27 03:50:39 tgl Exp $
+ * $PostgreSQL: pgsql/src/backend/storage/buffer/bufmgr.c,v 1.220 2007/05/30 20:11:58 tgl Exp $
  *
  *-------------------------------------------------------------------------
  */
@@ -90,11 +90,11 @@ static volatile BufferDesc *PinCountWaitBuf = NULL;
 
 
 static Buffer ReadBuffer_common(Relation reln, BlockNumber blockNum,
-                bool zeroPage);
-static bool PinBuffer(volatile BufferDesc *buf);
+                bool zeroPage,
+                BufferAccessStrategy strategy);
+static bool PinBuffer(volatile BufferDesc *buf, BufferAccessStrategy strategy);
 static void PinBuffer_Locked(volatile BufferDesc *buf);
-static void UnpinBuffer(volatile BufferDesc *buf,
-                bool fixOwner, bool normalAccess);
+static void UnpinBuffer(volatile BufferDesc *buf, bool fixOwner);
 static bool SyncOneBuffer(int buf_id, bool skip_pinned);
 static void WaitIO(volatile BufferDesc *buf);
 static bool StartBufferIO(volatile BufferDesc *buf, bool forInput);
@@ -102,7 +102,8 @@ static void TerminateBufferIO(volatile BufferDesc *buf, bool clear_dirty,
                int set_flag_bits);
 static void buffer_write_error_callback(void *arg);
 static volatile BufferDesc *BufferAlloc(Relation reln, BlockNumber blockNum,
-            bool *foundPtr);
+            BufferAccessStrategy strategy,
+            bool *foundPtr);
 static void FlushBuffer(volatile BufferDesc *buf, SMgrRelation reln);
 static void AtProcExit_Buffers(int code, Datum arg);
 
@@ -125,7 +126,18 @@ static void AtProcExit_Buffers(int code, Datum arg);
 Buffer
 ReadBuffer(Relation reln, BlockNumber blockNum)
 {
-    return ReadBuffer_common(reln, blockNum, false);
+    return ReadBuffer_common(reln, blockNum, false, NULL);
+}
+
+/*
+ * ReadBufferWithStrategy -- same as ReadBuffer, except caller can specify
+ * a nondefault buffer access strategy. See buffer/README for details.
+ */
+Buffer
+ReadBufferWithStrategy(Relation reln, BlockNumber blockNum,
+                       BufferAccessStrategy strategy)
+{
+    return ReadBuffer_common(reln, blockNum, false, strategy);
 }
 
 /*
@@ -140,14 +152,15 @@ ReadBuffer(Relation reln, BlockNumber blockNum)
 Buffer
 ReadOrZeroBuffer(Relation reln, BlockNumber blockNum)
 {
-    return ReadBuffer_common(reln, blockNum, true);
+    return ReadBuffer_common(reln, blockNum, true, NULL);
 }
 
 /*
- * ReadBuffer_common -- common logic for ReadBuffer and ReadOrZeroBuffer
+ * ReadBuffer_common -- common logic for ReadBuffer variants
  */
 static Buffer
-ReadBuffer_common(Relation reln, BlockNumber blockNum, bool zeroPage)
+ReadBuffer_common(Relation reln, BlockNumber blockNum, bool zeroPage,
+                  BufferAccessStrategy strategy)
 {
     volatile BufferDesc *bufHdr;
     Block       bufBlock;
@@ -185,7 +198,7 @@ ReadBuffer_common(Relation reln, BlockNumber blockNum, bool zeroPage)
      * lookup the buffer. IO_IN_PROGRESS is set if the requested block is
      * not currently in memory.
      */
-    bufHdr = BufferAlloc(reln, blockNum, &found);
+    bufHdr = BufferAlloc(reln, blockNum, strategy, &found);
     if (found)
         BufferHitCount++;
 }
@@ -330,6 +343,10 @@ ReadBuffer_common(Relation reln, BlockNumber blockNum, bool zeroPage)
  * buffer. If no buffer exists already, selects a replacement
  * victim and evicts the old page, but does NOT read in new page.
  *
+ * "strategy" can be a buffer replacement strategy object, or NULL for
+ * the default strategy. The selected buffer's usage_count is advanced when
+ * using the default strategy, but otherwise possibly not (see PinBuffer).
+ *
  * The returned buffer is pinned and is already marked as holding the
  * desired page. If it already did have the desired page, *foundPtr is
  * set TRUE. Otherwise, *foundPtr is set FALSE and the buffer is marked
@@ -343,6 +360,7 @@ ReadBuffer_common(Relation reln, BlockNumber blockNum, bool zeroPage)
 static volatile BufferDesc *
 BufferAlloc(Relation reln,
             BlockNumber blockNum,
+            BufferAccessStrategy strategy,
             bool *foundPtr)
 {
     BufferTag   newTag;         /* identity of requested block */
@@ -375,7 +393,7 @@ BufferAlloc(Relation reln,
      */
     buf = &BufferDescriptors[buf_id];
 
-    valid = PinBuffer(buf);
+    valid = PinBuffer(buf, strategy);
 
     /* Can release the mapping lock as soon as we've pinned it */
     LWLockRelease(newPartitionLock);
@@ -413,13 +431,15 @@ BufferAlloc(Relation reln,
     /* Loop here in case we have to try another victim buffer */
     for (;;)
     {
+        bool        lock_held;
+
         /*
         * Select a victim buffer. The buffer is returned with its header
-        * spinlock still held! Also the BufFreelistLock is still held, since
-        * it would be bad to hold the spinlock while possibly waking up other
-        * processes.
+        * spinlock still held! Also (in most cases) the BufFreelistLock is
+        * still held, since it would be bad to hold the spinlock while
+        * possibly waking up other processes.
         */
-        buf = StrategyGetBuffer();
+        buf = StrategyGetBuffer(strategy, &lock_held);
 
         Assert(buf->refcount == 0);
 
@@ -430,7 +450,8 @@ BufferAlloc(Relation reln,
         PinBuffer_Locked(buf);
 
         /* Now it's safe to release the freelist lock */
-        LWLockRelease(BufFreelistLock);
+        if (lock_held)
+            LWLockRelease(BufFreelistLock);
 
         /*
          * If the buffer was dirty, try to write it out. There is a race
@@ -458,16 +479,34 @@ BufferAlloc(Relation reln,
              */
             if (LWLockConditionalAcquire(buf->content_lock, LW_SHARED))
             {
+                /*
+                 * If using a nondefault strategy, and writing the buffer
+                 * would require a WAL flush, let the strategy decide whether
+                 * to go ahead and write/reuse the buffer or to choose another
+                 * victim. We need lock to inspect the page LSN, so this
+                 * can't be done inside StrategyGetBuffer.
+                 */
+                if (strategy != NULL &&
+                    XLogNeedsFlush(BufferGetLSN(buf)) &&
+                    StrategyRejectBuffer(strategy, buf))
+                {
+                    /* Drop lock/pin and loop around for another buffer */
+                    LWLockRelease(buf->content_lock);
+                    UnpinBuffer(buf, true);
+                    continue;
+                }
+
+                /* OK, do the I/O */
                 FlushBuffer(buf, NULL);
                 LWLockRelease(buf->content_lock);
             }
             else
             {
                 /*
-                 * Someone else has pinned the buffer, so give it up and loop
+                 * Someone else has locked the buffer, so give it up and loop
                  * back to get another one.
                  */
-                UnpinBuffer(buf, true, false /* evidently recently used */ );
+                UnpinBuffer(buf, true);
                 continue;
             }
         }
@@ -531,10 +570,9 @@ BufferAlloc(Relation reln,
          * Got a collision. Someone has already done what we were about to
          * do. We'll just handle this as if it were found in the buffer
          * pool in the first place. First, give up the buffer we were
-         * planning to use. Don't allow it to be thrown in the free list
-         * (we don't want to hold freelist and mapping locks at once).
+         * planning to use.
          */
-        UnpinBuffer(buf, true, false);
+        UnpinBuffer(buf, true);
 
         /* Can give up that buffer's mapping partition lock now */
         if ((oldFlags & BM_TAG_VALID) &&
@@ -545,7 +583,7 @@ BufferAlloc(Relation reln,
 
         buf = &BufferDescriptors[buf_id];
 
-        valid = PinBuffer(buf);
+        valid = PinBuffer(buf, strategy);
 
         /* Can release the mapping lock as soon as we've pinned it */
         LWLockRelease(newPartitionLock);
@@ -595,20 +633,21 @@ BufferAlloc(Relation reln,
             oldPartitionLock != newPartitionLock)
             LWLockRelease(oldPartitionLock);
         LWLockRelease(newPartitionLock);
-        UnpinBuffer(buf, true, false /* evidently recently used */ );
+        UnpinBuffer(buf, true);
     }
 
     /*
     * Okay, it's finally safe to rename the buffer.
     *
     * Clearing BM_VALID here is necessary, clearing the dirtybits is just
-    * paranoia. We also clear the usage_count since any recency of use of
-    * the old content is no longer relevant.
+    * paranoia. We also reset the usage_count since any recency of use of
+    * the old content is no longer relevant. (The usage_count starts out
+    * at 1 so that the buffer can survive one clock-sweep pass.)
     */
     buf->tag = newTag;
     buf->flags &= ~(BM_VALID | BM_DIRTY | BM_JUST_DIRTIED | BM_IO_ERROR);
     buf->flags |= BM_TAG_VALID;
-    buf->usage_count = 0;
+    buf->usage_count = 1;
 
     UnlockBufHdr(buf);
 
@@ -736,7 +775,7 @@ retry:
     /*
      * Insert the buffer at the head of the list of free buffers.
      */
-    StrategyFreeBuffer(buf, true);
+    StrategyFreeBuffer(buf);
 }
 
 /*
@@ -814,9 +853,6 @@ ReleaseAndReadBuffer(Buffer buffer,
                 return buffer;
             ResourceOwnerForgetBuffer(CurrentResourceOwner, buffer);
             LocalRefCount[-buffer - 1]--;
-            if (LocalRefCount[-buffer - 1] == 0 &&
-                bufHdr->usage_count < BM_MAX_USAGE_COUNT)
-                bufHdr->usage_count++;
         }
         else
         {
@@ -826,7 +862,7 @@ ReleaseAndReadBuffer(Buffer buffer,
             if (bufHdr->tag.blockNum == blockNum &&
                 RelFileNodeEquals(bufHdr->tag.rnode, relation->rd_node))
                 return buffer;
-            UnpinBuffer(bufHdr, true, true);
+            UnpinBuffer(bufHdr, true);
         }
     }
 
@@ -836,6 +872,14 @@ ReleaseAndReadBuffer(Buffer buffer,
 /*
  * PinBuffer -- make buffer unavailable for replacement.
  *
+ * For the default access strategy, the buffer's usage_count is incremented
+ * when we first pin it; for other strategies we just make sure the usage_count
+ * isn't zero. (The idea of the latter is that we don't want synchronized
+ * heap scans to inflate the count, but we need it to not be zero to discourage
+ * other backends from stealing buffers from our ring. As long as we cycle
+ * through the ring faster than the global clock-sweep cycles, buffers in
+ * our ring won't be chosen as victims for replacement by other backends.)
+ *
  * This should be applied only to shared buffers, never local ones.
  *
  * Note that ResourceOwnerEnlargeBuffers must have been done already.
@@ -844,7 +888,7 @@ ReleaseAndReadBuffer(Buffer buffer,
  * some callers to avoid an extra spinlock cycle.
  */
 static bool
-PinBuffer(volatile BufferDesc *buf)
+PinBuffer(volatile BufferDesc *buf, BufferAccessStrategy strategy)
 {
     int         b = buf->buf_id;
     bool        result;
@@ -853,6 +897,16 @@ PinBuffer(volatile BufferDesc *buf)
     {
         LockBufHdr(buf);
         buf->refcount++;
+        if (strategy == NULL)
+        {
+            if (buf->usage_count < BM_MAX_USAGE_COUNT)
+                buf->usage_count++;
+        }
+        else
+        {
+            if (buf->usage_count == 0)
+                buf->usage_count = 1;
+        }
         result = (buf->flags & BM_VALID) != 0;
         UnlockBufHdr(buf);
     }
@@ -872,6 +926,11 @@ PinBuffer(volatile BufferDesc *buf)
  * PinBuffer_Locked -- as above, but caller already locked the buffer header.
  * The spinlock is released before return.
  *
+ * Currently, no callers of this function want to modify the buffer's
+ * usage_count at all, so there's no need for a strategy parameter.
+ * Also we don't bother with a BM_VALID test (the caller could check that for
+ * itself).
+ *
  * Note: use of this routine is frequently mandatory, not just an optimization
  * to save a spin lock/unlock cycle, because we need to pin a buffer before
  * its state can change under us.
@@ -897,17 +956,9 @@ PinBuffer_Locked(volatile BufferDesc *buf)
  *
  * Most but not all callers want CurrentResourceOwner to be adjusted.
  * Those that don't should pass fixOwner = FALSE.
- *
- * normalAccess indicates that we are finishing a "normal" page access,
- * that is, one requested by something outside the buffer subsystem.
- * Passing FALSE means it's an internal access that should not update the
- * buffer's usage count nor cause a change in the freelist.
- *
- * If we are releasing a buffer during VACUUM, and it's not been otherwise
- * used recently, and normalAccess is true, we send the buffer to the freelist.
  */
 static void
-UnpinBuffer(volatile BufferDesc *buf, bool fixOwner, bool normalAccess)
+UnpinBuffer(volatile BufferDesc *buf, bool fixOwner)
 {
     int         b = buf->buf_id;
 
@@ -919,8 +970,6 @@ UnpinBuffer(volatile BufferDesc *buf, bool fixOwner, bool normalAccess)
     PrivateRefCount[b]--;
     if (PrivateRefCount[b] == 0)
     {
-        bool        immed_free_buffer = false;
-
         /* I'd better not still hold any locks on the buffer */
         Assert(!LWLockHeldByMe(buf->content_lock));
         Assert(!LWLockHeldByMe(buf->io_in_progress_lock));
@@ -931,22 +980,7 @@ UnpinBuffer(volatile BufferDesc *buf, bool fixOwner, bool normalAccess)
         Assert(buf->refcount > 0);
         buf->refcount--;
 
-        /* Update buffer usage info, unless this is an internal access */
-        if (normalAccess)
-        {
-            if (!strategy_hint_vacuum)
-            {
-                if (buf->usage_count < BM_MAX_USAGE_COUNT)
-                    buf->usage_count++;
-            }
-            else
-            {
-                /* VACUUM accesses don't bump usage count, instead... */
-                if (buf->refcount == 0 && buf->usage_count == 0)
-                    immed_free_buffer = true;
-            }
-        }
-
         /* Support LockBufferForCleanup() */
         if ((buf->flags & BM_PIN_COUNT_WAITER) &&
             buf->refcount == 1)
         {
@@ -959,14 +993,6 @@ UnpinBuffer(volatile BufferDesc *buf, bool fixOwner, bool normalAccess)
         }
         else
             UnlockBufHdr(buf);
-
-        /*
-         * If VACUUM is releasing an otherwise-unused buffer, send it to the
-         * freelist for near-term reuse. We put it at the tail so that it
-         * won't be used before any invalid buffers that may exist.
-         */
-        if (immed_free_buffer)
-            StrategyFreeBuffer(buf, false);
     }
 }
 
@@ -1150,7 +1176,7 @@ SyncOneBuffer(int buf_id, bool skip_pinned)
     FlushBuffer(bufHdr, NULL);
 
     LWLockRelease(bufHdr->content_lock);
-    UnpinBuffer(bufHdr, true, false /* don't change freelist */ );
+    UnpinBuffer(bufHdr, true);
 
     return true;
 }
@@ -1266,7 +1292,7 @@ AtProcExit_Buffers(int code, Datum arg)
             * here, it suggests that ResourceOwners are messed up.
             */
            PrivateRefCount[i] = 1; /* make sure we release shared pin */
-           UnpinBuffer(buf, false, false /* don't change freelist */ );
+           UnpinBuffer(buf, false);
            Assert(PrivateRefCount[i] == 0);
        }
    }
@@ -1700,7 +1726,7 @@ FlushRelationBuffers(Relation rel)
            LWLockAcquire(bufHdr->content_lock, LW_SHARED);
            FlushBuffer(bufHdr, rel->rd_smgr);
            LWLockRelease(bufHdr->content_lock);
-           UnpinBuffer(bufHdr, true, false /* no freelist change */ );
+           UnpinBuffer(bufHdr, true);
        }
        else
            UnlockBufHdr(bufHdr);
@@ -1723,11 +1749,7 @@ ReleaseBuffer(Buffer buffer)
    if (BufferIsLocal(buffer))
    {
        Assert(LocalRefCount[-buffer - 1] > 0);
-       bufHdr = &LocalBufferDescriptors[-buffer - 1];
        LocalRefCount[-buffer - 1]--;
-       if (LocalRefCount[-buffer - 1] == 0 &&
-           bufHdr->usage_count < BM_MAX_USAGE_COUNT)
-           bufHdr->usage_count++;
        return;
    }
 
@@ -1738,7 +1760,7 @@ ReleaseBuffer(Buffer buffer)
    if (PrivateRefCount[buffer - 1] > 1)
        PrivateRefCount[buffer - 1]--;
    else
-       UnpinBuffer(bufHdr, false, true);
+       UnpinBuffer(bufHdr, false);
 }
 
 /*
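Taken together, the bufmgr.c changes above move the usage_count bump from
unpin time to first-pin time. An illustrative (hypothetical) timeline for one
buffer under the default strategy, consistent with the code in this patch:

    /*
     * Illustration only -- usage_count life cycle after this patch:
     *
     *   PinBuffer(buf, NULL)      refcount 0->1, usage_count 0->1  (bumped at pin)
     *   clock sweep passes by     usage_count stays 1              (pinned buffers not decremented)
     *   UnpinBuffer(buf, true)    refcount 1->0, usage_count stays 1  (no bump at unpin)
     *   clock sweep pass          usage_count 1->0
     *   next clock sweep pass     refcount == 0 && usage_count == 0  -> buffer is evictable
     */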
@@ -9,7 +9,7 @@
  *
  *
  * IDENTIFICATION
- * $PostgreSQL: pgsql/src/backend/storage/buffer/freelist.c,v 1.58 2007/01/05 22:19:37 momjian Exp $
+ * $PostgreSQL: pgsql/src/backend/storage/buffer/freelist.c,v 1.59 2007/05/30 20:11:59 tgl Exp $
  *
  *-------------------------------------------------------------------------
  */
@@ -39,8 +39,42 @@ typedef struct
 /* Pointers to shared state */
 static BufferStrategyControl *StrategyControl = NULL;
 
-/* Backend-local state about whether currently vacuuming */
-bool strategy_hint_vacuum = false;
+/*
+ * Private (non-shared) state for managing a ring of shared buffers to re-use.
+ * This is currently the only kind of BufferAccessStrategy object, but someday
+ * we might have more kinds.
+ */
+typedef struct BufferAccessStrategyData
+{
+    /* Overall strategy type */
+    BufferAccessStrategyType btype;
+    /* Number of elements in buffers[] array */
+    int         ring_size;
+    /*
+     * Index of the "current" slot in the ring, ie, the one most recently
+     * returned by GetBufferFromRing.
+     */
+    int         current;
+    /*
+     * True if the buffer just returned by StrategyGetBuffer had been in
+     * the ring already.
+     */
+    bool        current_was_in_ring;
+
+    /*
+     * Array of buffer numbers. InvalidBuffer (that is, zero) indicates
+     * we have not yet selected a buffer for this ring slot. For allocation
+     * simplicity this is palloc'd together with the fixed fields of the
+     * struct.
+     */
+    Buffer      buffers[1];     /* VARIABLE SIZE ARRAY */
+} BufferAccessStrategyData;
+
+
+/* Prototypes for internal functions */
+static volatile BufferDesc *GetBufferFromRing(BufferAccessStrategy strategy);
+static void AddBufferToRing(BufferAccessStrategy strategy,
+                volatile BufferDesc *buf);
 
 
 /*
@@ -50,17 +84,38 @@ bool strategy_hint_vacuum = false;
  * BufferAlloc(). The only hard requirement BufferAlloc() has is that
  * the selected buffer must not currently be pinned by anyone.
  *
+ * strategy is a BufferAccessStrategy object, or NULL for default strategy.
+ *
  * To ensure that no one else can pin the buffer before we do, we must
- * return the buffer with the buffer header spinlock still held. That
- * means that we return with the BufFreelistLock still held, as well;
- * the caller must release that lock once the spinlock is dropped.
+ * return the buffer with the buffer header spinlock still held. If
+ * *lock_held is set on exit, we have returned with the BufFreelistLock
+ * still held, as well; the caller must release that lock once the spinlock
+ * is dropped. We do it that way because releasing the BufFreelistLock
+ * might awaken other processes, and it would be bad to do the associated
+ * kernel calls while holding the buffer header spinlock.
  */
 volatile BufferDesc *
-StrategyGetBuffer(void)
+StrategyGetBuffer(BufferAccessStrategy strategy, bool *lock_held)
 {
     volatile BufferDesc *buf;
     int         trycounter;
 
+    /*
+     * If given a strategy object, see whether it can select a buffer.
+     * We assume strategy objects don't need the BufFreelistLock.
+     */
+    if (strategy != NULL)
+    {
+        buf = GetBufferFromRing(strategy);
+        if (buf != NULL)
+        {
+            *lock_held = false;
+            return buf;
+        }
+    }
+
+    /* Nope, so lock the freelist */
+    *lock_held = true;
     LWLockAcquire(BufFreelistLock, LW_EXCLUSIVE);
 
     /*
@@ -82,11 +137,16 @@ StrategyGetBuffer(void)
         * If the buffer is pinned or has a nonzero usage_count, we cannot use
         * it; discard it and retry. (This can only happen if VACUUM put a
         * valid buffer in the freelist and then someone else used it before
-        * we got to it.)
+        * we got to it. It's probably impossible altogether as of 8.3,
+        * but we'd better check anyway.)
         */
        LockBufHdr(buf);
        if (buf->refcount == 0 && buf->usage_count == 0)
+       {
+           if (strategy != NULL)
+               AddBufferToRing(strategy, buf);
            return buf;
+       }
        UnlockBufHdr(buf);
    }
 
@@ -101,15 +161,23 @@ StrategyGetBuffer(void)
 
        /*
        * If the buffer is pinned or has a nonzero usage_count, we cannot use
-       * it; decrement the usage_count and keep scanning.
+       * it; decrement the usage_count (unless pinned) and keep scanning.
        */
       LockBufHdr(buf);
-      if (buf->refcount == 0 && buf->usage_count == 0)
-          return buf;
-      if (buf->usage_count > 0)
+      if (buf->refcount == 0)
       {
-          buf->usage_count--;
-          trycounter = NBuffers;
+          if (buf->usage_count > 0)
+          {
+              buf->usage_count--;
+              trycounter = NBuffers;
+          }
+          else
+          {
+              /* Found a usable buffer */
+              if (strategy != NULL)
+                  AddBufferToRing(strategy, buf);
+              return buf;
+          }
      }
      else if (--trycounter == 0)
      {
@@ -132,13 +200,9 @@ StrategyGetBuffer(void)
 
 /*
  * StrategyFreeBuffer: put a buffer on the freelist
- *
- * The buffer is added either at the head or the tail, according to the
- * at_head parameter. This allows a small amount of control over how
- * quickly the buffer is reused.
  */
 void
-StrategyFreeBuffer(volatile BufferDesc *buf, bool at_head)
+StrategyFreeBuffer(volatile BufferDesc *buf)
 {
     LWLockAcquire(BufFreelistLock, LW_EXCLUSIVE);
 
@@ -148,22 +212,10 @@ StrategyFreeBuffer(volatile BufferDesc *buf, bool at_head)
      */
     if (buf->freeNext == FREENEXT_NOT_IN_LIST)
     {
-        if (at_head)
-        {
-            buf->freeNext = StrategyControl->firstFreeBuffer;
-            if (buf->freeNext < 0)
-                StrategyControl->lastFreeBuffer = buf->buf_id;
-            StrategyControl->firstFreeBuffer = buf->buf_id;
-        }
-        else
-        {
-            buf->freeNext = FREENEXT_END_OF_LIST;
-            if (StrategyControl->firstFreeBuffer < 0)
-                StrategyControl->firstFreeBuffer = buf->buf_id;
-            else
-                BufferDescriptors[StrategyControl->lastFreeBuffer].freeNext = buf->buf_id;
+        buf->freeNext = StrategyControl->firstFreeBuffer;
+        if (buf->freeNext < 0)
             StrategyControl->lastFreeBuffer = buf->buf_id;
-        }
+        StrategyControl->firstFreeBuffer = buf->buf_id;
     }
 
     LWLockRelease(BufFreelistLock);
@@ -190,15 +242,6 @@ StrategySyncStart(void)
     return result;
 }
 
-/*
- * StrategyHintVacuum -- tell us whether VACUUM is active
- */
-void
-StrategyHintVacuum(bool vacuum_active)
-{
-    strategy_hint_vacuum = vacuum_active;
-}
-
 
 /*
  * StrategyShmemSize
@@ -274,3 +317,172 @@ StrategyInitialize(bool init)
     else
         Assert(!init);
 }
+
+
+/* ----------------------------------------------------------------
+ *              Backend-private buffer ring management
+ * ----------------------------------------------------------------
+ */
+
+
+/*
+ * GetAccessStrategy -- create a BufferAccessStrategy object
+ *
+ * The object is allocated in the current memory context.
+ */
+BufferAccessStrategy
+GetAccessStrategy(BufferAccessStrategyType btype)
+{
+    BufferAccessStrategy strategy;
+    int         ring_size;
+
+    /*
+     * Select ring size to use. See buffer/README for rationales.
+     * (Currently all cases are the same size, but keep this code
+     * structure for flexibility.)
+     */
+    switch (btype)
+    {
+        case BAS_NORMAL:
+            /* if someone asks for NORMAL, just give 'em a "default" object */
+            return NULL;
+
+        case BAS_BULKREAD:
+            ring_size = 256 * 1024 / BLCKSZ;
+            break;
+        case BAS_VACUUM:
+            ring_size = 256 * 1024 / BLCKSZ;
+            break;
+
+        default:
+            elog(ERROR, "unrecognized buffer access strategy: %d",
+                 (int) btype);
+            return NULL;        /* keep compiler quiet */
+    }
+
+    /* Make sure ring isn't an undue fraction of shared buffers */
+    ring_size = Min(NBuffers / 8, ring_size);
+
+    /* Allocate the object and initialize all elements to zeroes */
+    strategy = (BufferAccessStrategy)
+        palloc0(offsetof(BufferAccessStrategyData, buffers) +
+                ring_size * sizeof(Buffer));
+
+    /* Set fields that don't start out zero */
+    strategy->btype = btype;
+    strategy->ring_size = ring_size;
+
+    return strategy;
+}
+
+/*
+ * FreeAccessStrategy -- release a BufferAccessStrategy object
+ *
+ * A simple pfree would do at the moment, but we would prefer that callers
+ * don't assume that much about the representation of BufferAccessStrategy.
+ */
+void
+FreeAccessStrategy(BufferAccessStrategy strategy)
+{
+    /* don't crash if called on a "default" strategy */
+    if (strategy != NULL)
+        pfree(strategy);
+}
+
+/*
+ * GetBufferFromRing -- returns a buffer from the ring, or NULL if the
+ *      ring is empty.
+ *
+ * The bufhdr spin lock is held on the returned buffer.
+ */
+static volatile BufferDesc *
+GetBufferFromRing(BufferAccessStrategy strategy)
+{
+    volatile BufferDesc *buf;
+    Buffer      bufnum;
+
+    /* Advance to next ring slot */
+    if (++strategy->current >= strategy->ring_size)
+        strategy->current = 0;
+
+    /*
+     * If the slot hasn't been filled yet, tell the caller to allocate
+     * a new buffer with the normal allocation strategy. He will then
+     * fill this slot by calling AddBufferToRing with the new buffer.
+     */
+    bufnum = strategy->buffers[strategy->current];
+    if (bufnum == InvalidBuffer)
+    {
+        strategy->current_was_in_ring = false;
+        return NULL;
+    }
+
+    /*
+     * If the buffer is pinned we cannot use it under any circumstances.
+     *
+     * If usage_count is 0 or 1 then the buffer is fair game (we expect 1,
+     * since our own previous usage of the ring element would have left it
+     * there, but it might've been decremented by clock sweep since then).
+     * A higher usage_count indicates someone else has touched the buffer,
+     * so we shouldn't re-use it.
+     */
+    buf = &BufferDescriptors[bufnum - 1];
+    LockBufHdr(buf);
+    if (buf->refcount == 0 && buf->usage_count <= 1)
+    {
+        strategy->current_was_in_ring = true;
+        return buf;
+    }
+    UnlockBufHdr(buf);
+
+    /*
+     * Tell caller to allocate a new buffer with the normal allocation
+     * strategy. He'll then replace this ring element via AddBufferToRing.
+     */
+    strategy->current_was_in_ring = false;
+    return NULL;
+}
+
+/*
+ * AddBufferToRing -- add a buffer to the buffer ring
+ *
+ * Caller must hold the buffer header spinlock on the buffer. Since this
+ * is called with the spinlock held, it had better be quite cheap.
+ */
+static void
+AddBufferToRing(BufferAccessStrategy strategy, volatile BufferDesc *buf)
+{
+    strategy->buffers[strategy->current] = BufferDescriptorGetBuffer(buf);
+}
+
+/*
+ * StrategyRejectBuffer -- consider rejecting a dirty buffer
+ *
+ * When a nondefault strategy is used, the buffer manager calls this function
+ * when it turns out that the buffer selected by StrategyGetBuffer needs to
+ * be written out and doing so would require flushing WAL too. This gives us
+ * a chance to choose a different victim.
+ *
+ * Returns true if buffer manager should ask for a new victim, and false
+ * if this buffer should be written and re-used.
+ */
+bool
+StrategyRejectBuffer(BufferAccessStrategy strategy, volatile BufferDesc *buf)
+{
+    /* We only do this in bulkread mode */
+    if (strategy->btype != BAS_BULKREAD)
+        return false;
+
+    /* Don't muck with behavior of normal buffer-replacement strategy */
+    if (!strategy->current_was_in_ring ||
+        strategy->buffers[strategy->current] != BufferDescriptorGetBuffer(buf))
+        return false;
+
+    /*
+     * Remove the dirty buffer from the ring; necessary to prevent infinite
+     * loop if all ring members are dirty.
+     */
+    strategy->buffers[strategy->current] = InvalidBuffer;
+
+    return true;
+}
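A rough back-of-the-envelope reading of what the ring buys VACUUM (my numbers,
following the commit message above, not code from the patch):

    /*
     * Illustration only, assuming every page VACUUM visits is dirtied:
     *
     *   8.2: single reused buffer   -> up to one WAL flush per page written back.
     *   8.3: 32-buffer (256KB) ring -> a dirty buffer is only written back when
     *        its ring slot comes around again, so roughly one WAL flush per
     *        32 pages, i.e. on the order of a 32x reduction in flush calls.
     */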
@@ -9,7 +9,7 @@
  *
  *
  * IDENTIFICATION
- * $PostgreSQL: pgsql/src/backend/storage/buffer/localbuf.c,v 1.76 2007/01/05 22:19:37 momjian Exp $
+ * $PostgreSQL: pgsql/src/backend/storage/buffer/localbuf.c,v 1.77 2007/05/30 20:11:59 tgl Exp $
  *
  *-------------------------------------------------------------------------
  */
@@ -57,7 +57,8 @@ static Block GetLocalBufferStorage(void);
  *
  * API is similar to bufmgr.c's BufferAlloc, except that we do not need
  * to do any locking since this is all local. Also, IO_IN_PROGRESS
- * does not get set.
+ * does not get set. Lastly, we support only default access strategy
+ * (hence, usage_count is always advanced).
  */
 BufferDesc *
 LocalBufferAlloc(Relation reln, BlockNumber blockNum, bool *foundPtr)
@@ -88,7 +89,12 @@ LocalBufferAlloc(Relation reln, BlockNumber blockNum, bool *foundPtr)
         fprintf(stderr, "LB ALLOC (%u,%d) %d\n",
                 RelationGetRelid(reln), blockNum, -b - 1);
 #endif
-
+        /* this part is equivalent to PinBuffer for a shared buffer */
+        if (LocalRefCount[b] == 0)
+        {
+            if (bufHdr->usage_count < BM_MAX_USAGE_COUNT)
+                bufHdr->usage_count++;
+        }
         LocalRefCount[b]++;
         ResourceOwnerRememberBuffer(CurrentResourceOwner,
                                     BufferDescriptorGetBuffer(bufHdr));
@@ -121,18 +127,21 @@ LocalBufferAlloc(Relation reln, BlockNumber blockNum, bool *foundPtr)
 
         bufHdr = &LocalBufferDescriptors[b];
 
-        if (LocalRefCount[b] == 0 && bufHdr->usage_count == 0)
+        if (LocalRefCount[b] == 0)
         {
-            LocalRefCount[b]++;
-            ResourceOwnerRememberBuffer(CurrentResourceOwner,
-                                        BufferDescriptorGetBuffer(bufHdr));
-            break;
-        }
-
-        if (bufHdr->usage_count > 0)
-        {
-            bufHdr->usage_count--;
-            trycounter = NLocBuffer;
+            if (bufHdr->usage_count > 0)
+            {
+                bufHdr->usage_count--;
+                trycounter = NLocBuffer;
+            }
+            else
+            {
+                /* Found a usable buffer */
+                LocalRefCount[b]++;
+                ResourceOwnerRememberBuffer(CurrentResourceOwner,
                                         BufferDescriptorGetBuffer(bufHdr));
+                break;
+            }
         }
         else if (--trycounter == 0)
             ereport(ERROR,
@@ -199,7 +208,7 @@ LocalBufferAlloc(Relation reln, BlockNumber blockNum, bool *foundPtr)
     bufHdr->tag = newTag;
     bufHdr->flags &= ~(BM_VALID | BM_DIRTY | BM_JUST_DIRTIED | BM_IO_ERROR);
     bufHdr->flags |= BM_TAG_VALID;
-    bufHdr->usage_count = 0;
+    bufHdr->usage_count = 1;
 
     *foundPtr = FALSE;
     return bufHdr;