mirror of https://github.com/postgres/postgres.git
Make large sequential scans and VACUUMs work in a limited-size "ring" of
buffers, rather than blowing out the whole shared-buffer arena. Aside from
avoiding cache spoliation, this fixes the problem that VACUUM formerly tended
to cause a WAL flush for every page it modified, because we had it hacked to
use only a single buffer. Those flushes will now occur only once per ring-ful.
The exact ring size, and the threshold for seqscans to switch into the ring
usage pattern, remain under debate; but the infrastructure seems done. The key
bit of infrastructure is a new optional BufferAccessStrategy object that can
be passed to ReadBuffer operations; this replaces the former
StrategyHintVacuum API.

This patch also changes the buffer usage-count methodology a bit: we now
advance usage_count when first pinning a buffer, rather than when last
unpinning it. To preserve the behavior that a buffer's lifetime starts to
decrease when it's released, the clock sweep code is modified to not decrement
usage_count of pinned buffers.

Work not done in this commit: teach GiST and GIN indexes to use the vacuum
BufferAccessStrategy for vacuum-driven fetches.

Original patch by Simon, reworked by Heikki and again by Tom.
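For context, here is a minimal, hypothetical sketch of how a backend-internal
caller is expected to use the new API. The functions GetAccessStrategy,
ReadBufferWithStrategy, FreeAccessStrategy and the BAS_BULKREAD strategy type
are all introduced by the patch below; the surrounding scan loop and the
function name scan_all_blocks are illustrative only.

    /* Hypothetical bulk-read loop using the new BufferAccessStrategy API. */
    void
    scan_all_blocks(Relation rel, BlockNumber nblocks)
    {
        /* Private ring of buffers; passing NULL would mean the default strategy. */
        BufferAccessStrategy strategy = GetAccessStrategy(BAS_BULKREAD);
        BlockNumber blkno;

        for (blkno = 0; blkno < nblocks; blkno++)
        {
            /* The returned buffer is pinned; victims are taken from the small ring. */
            Buffer      buf = ReadBufferWithStrategy(rel, blkno, strategy);

            /* ... inspect the page here ... */

            ReleaseBuffer(buf);
        }

        FreeAccessStrategy(strategy);
    }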
@@ -1,4 +1,4 @@
-$PostgreSQL: pgsql/src/backend/storage/buffer/README,v 1.11 2006/07/23 03:07:58 tgl Exp $
+$PostgreSQL: pgsql/src/backend/storage/buffer/README,v 1.12 2007/05/30 20:11:58 tgl Exp $
 
 Notes about shared buffer access rules
 --------------------------------------
@@ -152,20 +152,21 @@ we could use per-backend LWLocks instead (a buffer header would then contain
 a field to show which backend is doing its I/O).
 
 
-Buffer replacement strategy
----------------------------
+Normal buffer replacement strategy
+----------------------------------
 
 There is a "free list" of buffers that are prime candidates for replacement.
 In particular, buffers that are completely free (contain no valid page) are
-always in this list. We may also throw buffers into this list if we
-consider their pages unlikely to be needed soon. The list is singly-linked
-using fields in the buffer headers; we maintain head and tail pointers in
-global variables. (Note: although the list links are in the buffer headers,
-they are considered to be protected by the BufFreelistLock, not the
-buffer-header spinlocks.) To choose a victim buffer to recycle when there
-are no free buffers available, we use a simple clock-sweep algorithm, which
-avoids the need to take system-wide locks during common operations. It
-works like this:
+always in this list. We could also throw buffers into this list if we
+consider their pages unlikely to be needed soon; however, the current
+algorithm never does that. The list is singly-linked using fields in the
+buffer headers; we maintain head and tail pointers in global variables.
+(Note: although the list links are in the buffer headers, they are
+considered to be protected by the BufFreelistLock, not the buffer-header
+spinlocks.) To choose a victim buffer to recycle when there are no free
+buffers available, we use a simple clock-sweep algorithm, which avoids the
+need to take system-wide locks during common operations. It works like
+this:
 
 Each buffer header contains a usage counter, which is incremented (up to a
 small limit value) whenever the buffer is unpinned. (This requires only the
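To make the clock-sweep description above concrete, here is a simplified
sketch (mine, not part of the patch) of the victim-selection loop as it
behaves after this commit, i.e. pinned buffers are skipped rather than
decremented. The authoritative version is StrategyGetBuffer() in freelist.c
further down; the function name clock_sweep_victim is invented, the
nextVictimBuffer field is assumed from freelist.c, and buffer-header
spinlocks and the free list are omitted for brevity.

    /* Simplified clock sweep over the shared buffer pool (sketch only). */
    static volatile BufferDesc *
    clock_sweep_victim(void)
    {
        int         trycounter = NBuffers;

        for (;;)
        {
            volatile BufferDesc *buf =
                &BufferDescriptors[StrategyControl->nextVictimBuffer];

            if (++StrategyControl->nextVictimBuffer >= NBuffers)
                StrategyControl->nextVictimBuffer = 0;

            if (buf->refcount == 0)
            {
                if (buf->usage_count > 0)
                {
                    buf->usage_count--;     /* grant it one more sweep of life */
                    trycounter = NBuffers;  /* and restart the starvation counter */
                }
                else
                    return buf;             /* unpinned and unused: our victim */
            }
            else if (--trycounter == 0)
                elog(ERROR, "no unpinned buffers available");
        }
    }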
@@ -199,22 +200,40 @@ before we can recycle it; if someone else pins the buffer meanwhile we will
 have to give up and try another buffer. This however is not a concern
 of the basic select-a-victim-buffer algorithm.)
 
-A special provision is that while running VACUUM, a backend does not
-increment the usage count on buffers it accesses. In fact, if ReleaseBuffer
-sees that it is dropping the pin count to zero and the usage count is zero,
-then it appends the buffer to the tail of the free list. (This implies that
-VACUUM, but only VACUUM, must take the BufFreelistLock during ReleaseBuffer;
-this shouldn't create much of a contention problem.) This provision
-encourages VACUUM to work in a relatively small number of buffers rather
-than blowing out the entire buffer cache. It is reasonable since a page
-that has been touched only by VACUUM is unlikely to be needed again soon.
-
-Since VACUUM usually requests many pages very fast, the effect of this is that
-it will get back the very buffers it filled and possibly modified on the next
-call and will therefore do its work in a few shared memory buffers, while
-being able to use whatever it finds in the cache already. This also implies
-that most of the write traffic caused by a VACUUM will be done by the VACUUM
-itself and not pushed off onto other processes.
+Buffer ring replacement strategy
+---------------------------------
+
+When running a query that needs to access a large number of pages just once,
+such as VACUUM or a large sequential scan, a different strategy is used.
+A page that has been touched only by such a scan is unlikely to be needed
+again soon, so instead of running the normal clock sweep algorithm and
+blowing out the entire buffer cache, a small ring of buffers is allocated
+using the normal clock sweep algorithm and those buffers are reused for the
+whole scan. This also implies that much of the write traffic caused by such
+a statement will be done by the backend itself and not pushed off onto other
+processes.
+
+For sequential scans, a 256KB ring is used. That's small enough to fit in L2
+cache, which makes transferring pages from OS cache to shared buffer cache
+efficient. Even less would often be enough, but the ring must be big enough
+to accommodate all pages in the scan that are pinned concurrently. 256KB
+should also be enough to leave a small cache trail for other backends to
+join in a synchronized seq scan. If a ring buffer is dirtied and its LSN
+updated, we would normally have to write and flush WAL before we could
+re-use the buffer; in this case we instead discard the buffer from the ring
+and (later) choose a replacement using the normal clock-sweep algorithm.
+Hence this strategy works best for scans that are read-only (or at worst
+update hint bits). In a scan that modifies every page in the scan, like a
+bulk UPDATE or DELETE, the buffers in the ring will always be dirtied and
+the ring strategy effectively degrades to the normal strategy.
+
+VACUUM uses a 256KB ring like sequential scans, but dirty pages are not
+removed from the ring. Instead, WAL is flushed if needed to allow reuse of
+the buffers. Before introducing the buffer ring strategy in 8.3, VACUUM's
+buffers were sent to the freelist, which was effectively a buffer ring of 1
+buffer, resulting in excessive WAL flushing. Allowing VACUUM to update
+256KB between WAL flushes should be more efficient.
+
 
 Background writer's processing
 
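A worked example of the ring sizing mentioned above (my arithmetic, not text
from the README): with the default BLCKSZ of 8192 bytes, 256KB works out to
256 * 1024 / 8192 = 32 buffers, and GetAccessStrategy() in freelist.c below
additionally clamps the ring to NBuffers / 8, so the ring only shrinks below
32 buffers when shared_buffers is smaller than 256 buffers (2MB).

    /* Ring sizing as done in GetAccessStrategy() (sketch of the arithmetic). */
    int     ring_size = 256 * 1024 / BLCKSZ;    /* = 32 buffers at BLCKSZ 8192 */

    ring_size = Min(NBuffers / 8, ring_size);   /* keep tiny caches usable */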
@@ -8,7 +8,7 @@
  *
  *
  * IDENTIFICATION
- * $PostgreSQL: pgsql/src/backend/storage/buffer/bufmgr.c,v 1.219 2007/05/27 03:50:39 tgl Exp $
+ * $PostgreSQL: pgsql/src/backend/storage/buffer/bufmgr.c,v 1.220 2007/05/30 20:11:58 tgl Exp $
  *
  *-------------------------------------------------------------------------
  */
@@ -90,11 +90,11 @@ static volatile BufferDesc *PinCountWaitBuf = NULL;
 
 
 static Buffer ReadBuffer_common(Relation reln, BlockNumber blockNum,
-                bool zeroPage);
-static bool PinBuffer(volatile BufferDesc *buf);
+                bool zeroPage,
+                BufferAccessStrategy strategy);
+static bool PinBuffer(volatile BufferDesc *buf, BufferAccessStrategy strategy);
 static void PinBuffer_Locked(volatile BufferDesc *buf);
-static void UnpinBuffer(volatile BufferDesc *buf,
-                bool fixOwner, bool normalAccess);
+static void UnpinBuffer(volatile BufferDesc *buf, bool fixOwner);
 static bool SyncOneBuffer(int buf_id, bool skip_pinned);
 static void WaitIO(volatile BufferDesc *buf);
 static bool StartBufferIO(volatile BufferDesc *buf, bool forInput);
@@ -102,7 +102,8 @@ static void TerminateBufferIO(volatile BufferDesc *buf, bool clear_dirty,
                int set_flag_bits);
 static void buffer_write_error_callback(void *arg);
 static volatile BufferDesc *BufferAlloc(Relation reln, BlockNumber blockNum,
-            bool *foundPtr);
+            BufferAccessStrategy strategy,
+            bool *foundPtr);
 static void FlushBuffer(volatile BufferDesc *buf, SMgrRelation reln);
 static void AtProcExit_Buffers(int code, Datum arg);
 
@@ -125,7 +126,18 @@ static void AtProcExit_Buffers(int code, Datum arg);
 Buffer
 ReadBuffer(Relation reln, BlockNumber blockNum)
 {
-    return ReadBuffer_common(reln, blockNum, false);
+    return ReadBuffer_common(reln, blockNum, false, NULL);
+}
+
+/*
+ * ReadBufferWithStrategy -- same as ReadBuffer, except caller can specify
+ * a nondefault buffer access strategy. See buffer/README for details.
+ */
+Buffer
+ReadBufferWithStrategy(Relation reln, BlockNumber blockNum,
+                       BufferAccessStrategy strategy)
+{
+    return ReadBuffer_common(reln, blockNum, false, strategy);
 }
 
 /*
@@ -140,14 +152,15 @@ ReadBuffer(Relation reln, BlockNumber blockNum)
 Buffer
 ReadOrZeroBuffer(Relation reln, BlockNumber blockNum)
 {
-    return ReadBuffer_common(reln, blockNum, true);
+    return ReadBuffer_common(reln, blockNum, true, NULL);
 }
 
 /*
- * ReadBuffer_common -- common logic for ReadBuffer and ReadOrZeroBuffer
+ * ReadBuffer_common -- common logic for ReadBuffer variants
  */
 static Buffer
-ReadBuffer_common(Relation reln, BlockNumber blockNum, bool zeroPage)
+ReadBuffer_common(Relation reln, BlockNumber blockNum, bool zeroPage,
+                  BufferAccessStrategy strategy)
 {
     volatile BufferDesc *bufHdr;
     Block       bufBlock;
@@ -185,7 +198,7 @@ ReadBuffer_common(Relation reln, BlockNumber blockNum, bool zeroPage)
      * lookup the buffer. IO_IN_PROGRESS is set if the requested block is
      * not currently in memory.
      */
-    bufHdr = BufferAlloc(reln, blockNum, &found);
+    bufHdr = BufferAlloc(reln, blockNum, strategy, &found);
     if (found)
         BufferHitCount++;
 }
@@ -330,6 +343,10 @@ ReadBuffer_common(Relation reln, BlockNumber blockNum, bool zeroPage)
  * buffer. If no buffer exists already, selects a replacement
  * victim and evicts the old page, but does NOT read in new page.
  *
+ * "strategy" can be a buffer replacement strategy object, or NULL for
+ * the default strategy. The selected buffer's usage_count is advanced when
+ * using the default strategy, but otherwise possibly not (see PinBuffer).
+ *
  * The returned buffer is pinned and is already marked as holding the
  * desired page. If it already did have the desired page, *foundPtr is
  * set TRUE. Otherwise, *foundPtr is set FALSE and the buffer is marked
@@ -343,6 +360,7 @@ ReadBuffer_common(Relation reln, BlockNumber blockNum, bool zeroPage)
 static volatile BufferDesc *
 BufferAlloc(Relation reln,
             BlockNumber blockNum,
+            BufferAccessStrategy strategy,
             bool *foundPtr)
 {
     BufferTag   newTag;         /* identity of requested block */
@@ -375,7 +393,7 @@ BufferAlloc(Relation reln,
      */
     buf = &BufferDescriptors[buf_id];
 
-    valid = PinBuffer(buf);
+    valid = PinBuffer(buf, strategy);
 
     /* Can release the mapping lock as soon as we've pinned it */
     LWLockRelease(newPartitionLock);
@@ -413,13 +431,15 @@ BufferAlloc(Relation reln,
     /* Loop here in case we have to try another victim buffer */
     for (;;)
     {
+        bool        lock_held;
+
         /*
         * Select a victim buffer. The buffer is returned with its header
-        * spinlock still held! Also the BufFreelistLock is still held, since
-        * it would be bad to hold the spinlock while possibly waking up other
-        * processes.
+        * spinlock still held! Also (in most cases) the BufFreelistLock is
+        * still held, since it would be bad to hold the spinlock while
+        * possibly waking up other processes.
         */
-        buf = StrategyGetBuffer();
+        buf = StrategyGetBuffer(strategy, &lock_held);
 
         Assert(buf->refcount == 0);
 
@@ -430,7 +450,8 @@ BufferAlloc(Relation reln,
         PinBuffer_Locked(buf);
 
         /* Now it's safe to release the freelist lock */
-        LWLockRelease(BufFreelistLock);
+        if (lock_held)
+            LWLockRelease(BufFreelistLock);
 
         /*
          * If the buffer was dirty, try to write it out. There is a race
@@ -458,16 +479,34 @@ BufferAlloc(Relation reln,
              */
             if (LWLockConditionalAcquire(buf->content_lock, LW_SHARED))
             {
+                /*
+                 * If using a nondefault strategy, and writing the buffer
+                 * would require a WAL flush, let the strategy decide whether
+                 * to go ahead and write/reuse the buffer or to choose another
+                 * victim. We need lock to inspect the page LSN, so this
+                 * can't be done inside StrategyGetBuffer.
+                 */
+                if (strategy != NULL &&
+                    XLogNeedsFlush(BufferGetLSN(buf)) &&
+                    StrategyRejectBuffer(strategy, buf))
+                {
+                    /* Drop lock/pin and loop around for another buffer */
+                    LWLockRelease(buf->content_lock);
+                    UnpinBuffer(buf, true);
+                    continue;
+                }
+
+                /* OK, do the I/O */
                 FlushBuffer(buf, NULL);
                 LWLockRelease(buf->content_lock);
             }
             else
             {
                 /*
-                 * Someone else has pinned the buffer, so give it up and loop
+                 * Someone else has locked the buffer, so give it up and loop
                  * back to get another one.
                  */
-                UnpinBuffer(buf, true, false /* evidently recently used */ );
+                UnpinBuffer(buf, true);
                 continue;
             }
         }
@@ -531,10 +570,9 @@ BufferAlloc(Relation reln,
          * Got a collision. Someone has already done what we were about to
          * do. We'll just handle this as if it were found in the buffer
          * pool in the first place. First, give up the buffer we were
-         * planning to use. Don't allow it to be thrown in the free list
-         * (we don't want to hold freelist and mapping locks at once).
+         * planning to use.
          */
-        UnpinBuffer(buf, true, false);
+        UnpinBuffer(buf, true);
 
         /* Can give up that buffer's mapping partition lock now */
         if ((oldFlags & BM_TAG_VALID) &&
@@ -545,7 +583,7 @@ BufferAlloc(Relation reln,
 
         buf = &BufferDescriptors[buf_id];
 
-        valid = PinBuffer(buf);
+        valid = PinBuffer(buf, strategy);
 
         /* Can release the mapping lock as soon as we've pinned it */
         LWLockRelease(newPartitionLock);
@@ -595,20 +633,21 @@ BufferAlloc(Relation reln,
             oldPartitionLock != newPartitionLock)
             LWLockRelease(oldPartitionLock);
         LWLockRelease(newPartitionLock);
-        UnpinBuffer(buf, true, false /* evidently recently used */ );
+        UnpinBuffer(buf, true);
     }
 
     /*
     * Okay, it's finally safe to rename the buffer.
     *
     * Clearing BM_VALID here is necessary, clearing the dirtybits is just
-    * paranoia. We also clear the usage_count since any recency of use of
-    * the old content is no longer relevant.
+    * paranoia. We also reset the usage_count since any recency of use of
+    * the old content is no longer relevant. (The usage_count starts out
+    * at 1 so that the buffer can survive one clock-sweep pass.)
     */
     buf->tag = newTag;
     buf->flags &= ~(BM_VALID | BM_DIRTY | BM_JUST_DIRTIED | BM_IO_ERROR);
     buf->flags |= BM_TAG_VALID;
-    buf->usage_count = 0;
+    buf->usage_count = 1;
 
     UnlockBufHdr(buf);
 
@@ -736,7 +775,7 @@ retry:
     /*
      * Insert the buffer at the head of the list of free buffers.
      */
-    StrategyFreeBuffer(buf, true);
+    StrategyFreeBuffer(buf);
 }
 
 /*
@@ -814,9 +853,6 @@ ReleaseAndReadBuffer(Buffer buffer,
                 return buffer;
             ResourceOwnerForgetBuffer(CurrentResourceOwner, buffer);
             LocalRefCount[-buffer - 1]--;
-            if (LocalRefCount[-buffer - 1] == 0 &&
-                bufHdr->usage_count < BM_MAX_USAGE_COUNT)
-                bufHdr->usage_count++;
         }
         else
         {
@@ -826,7 +862,7 @@ ReleaseAndReadBuffer(Buffer buffer,
             if (bufHdr->tag.blockNum == blockNum &&
                 RelFileNodeEquals(bufHdr->tag.rnode, relation->rd_node))
                 return buffer;
-            UnpinBuffer(bufHdr, true, true);
+            UnpinBuffer(bufHdr, true);
         }
     }
 
@@ -836,6 +872,14 @@ ReleaseAndReadBuffer(Buffer buffer,
 /*
  * PinBuffer -- make buffer unavailable for replacement.
  *
+ * For the default access strategy, the buffer's usage_count is incremented
+ * when we first pin it; for other strategies we just make sure the usage_count
+ * isn't zero. (The idea of the latter is that we don't want synchronized
+ * heap scans to inflate the count, but we need it to not be zero to discourage
+ * other backends from stealing buffers from our ring. As long as we cycle
+ * through the ring faster than the global clock-sweep cycles, buffers in
+ * our ring won't be chosen as victims for replacement by other backends.)
+ *
  * This should be applied only to shared buffers, never local ones.
  *
  * Note that ResourceOwnerEnlargeBuffers must have been done already.
@@ -844,7 +888,7 @@ ReleaseAndReadBuffer(Buffer buffer,
  * some callers to avoid an extra spinlock cycle.
  */
 static bool
-PinBuffer(volatile BufferDesc *buf)
+PinBuffer(volatile BufferDesc *buf, BufferAccessStrategy strategy)
 {
     int         b = buf->buf_id;
     bool        result;
@@ -853,6 +897,16 @@ PinBuffer(volatile BufferDesc *buf)
     {
         LockBufHdr(buf);
         buf->refcount++;
+        if (strategy == NULL)
+        {
+            if (buf->usage_count < BM_MAX_USAGE_COUNT)
+                buf->usage_count++;
+        }
+        else
+        {
+            if (buf->usage_count == 0)
+                buf->usage_count = 1;
+        }
         result = (buf->flags & BM_VALID) != 0;
         UnlockBufHdr(buf);
     }
@@ -872,6 +926,11 @@ PinBuffer(volatile BufferDesc *buf)
  * PinBuffer_Locked -- as above, but caller already locked the buffer header.
  * The spinlock is released before return.
  *
+ * Currently, no callers of this function want to modify the buffer's
+ * usage_count at all, so there's no need for a strategy parameter.
+ * Also we don't bother with a BM_VALID test (the caller could check that for
+ * itself).
+ *
  * Note: use of this routine is frequently mandatory, not just an optimization
  * to save a spin lock/unlock cycle, because we need to pin a buffer before
  * its state can change under us.
@@ -897,17 +956,9 @@ PinBuffer_Locked(volatile BufferDesc *buf)
  *
  * Most but not all callers want CurrentResourceOwner to be adjusted.
  * Those that don't should pass fixOwner = FALSE.
- *
- * normalAccess indicates that we are finishing a "normal" page access,
- * that is, one requested by something outside the buffer subsystem.
- * Passing FALSE means it's an internal access that should not update the
- * buffer's usage count nor cause a change in the freelist.
- *
- * If we are releasing a buffer during VACUUM, and it's not been otherwise
- * used recently, and normalAccess is true, we send the buffer to the freelist.
  */
 static void
-UnpinBuffer(volatile BufferDesc *buf, bool fixOwner, bool normalAccess)
+UnpinBuffer(volatile BufferDesc *buf, bool fixOwner)
 {
     int         b = buf->buf_id;
 
@@ -919,8 +970,6 @@ UnpinBuffer(volatile BufferDesc *buf, bool fixOwner, bool normalAccess)
     PrivateRefCount[b]--;
     if (PrivateRefCount[b] == 0)
     {
-        bool        immed_free_buffer = false;
-
         /* I'd better not still hold any locks on the buffer */
         Assert(!LWLockHeldByMe(buf->content_lock));
         Assert(!LWLockHeldByMe(buf->io_in_progress_lock));
@@ -931,22 +980,7 @@ UnpinBuffer(volatile BufferDesc *buf, bool fixOwner, bool normalAccess)
         Assert(buf->refcount > 0);
         buf->refcount--;
 
-        /* Update buffer usage info, unless this is an internal access */
-        if (normalAccess)
-        {
-            if (!strategy_hint_vacuum)
-            {
-                if (buf->usage_count < BM_MAX_USAGE_COUNT)
-                    buf->usage_count++;
-            }
-            else
-            {
-                /* VACUUM accesses don't bump usage count, instead... */
-                if (buf->refcount == 0 && buf->usage_count == 0)
-                    immed_free_buffer = true;
-            }
-        }
-
         /* Support LockBufferForCleanup() */
         if ((buf->flags & BM_PIN_COUNT_WAITER) &&
             buf->refcount == 1)
         {
@@ -959,14 +993,6 @@ UnpinBuffer(volatile BufferDesc *buf, bool fixOwner, bool normalAccess)
         }
         else
             UnlockBufHdr(buf);
-
-        /*
-         * If VACUUM is releasing an otherwise-unused buffer, send it to the
-         * freelist for near-term reuse. We put it at the tail so that it
-         * won't be used before any invalid buffers that may exist.
-         */
-        if (immed_free_buffer)
-            StrategyFreeBuffer(buf, false);
     }
 }
 
@@ -1150,7 +1176,7 @@ SyncOneBuffer(int buf_id, bool skip_pinned)
     FlushBuffer(bufHdr, NULL);
 
     LWLockRelease(bufHdr->content_lock);
-    UnpinBuffer(bufHdr, true, false /* don't change freelist */ );
+    UnpinBuffer(bufHdr, true);
 
     return true;
 }
@@ -1266,7 +1292,7 @@ AtProcExit_Buffers(int code, Datum arg)
             * here, it suggests that ResourceOwners are messed up.
             */
            PrivateRefCount[i] = 1; /* make sure we release shared pin */
-           UnpinBuffer(buf, false, false /* don't change freelist */ );
+           UnpinBuffer(buf, false);
            Assert(PrivateRefCount[i] == 0);
        }
    }
@@ -1700,7 +1726,7 @@ FlushRelationBuffers(Relation rel)
            LWLockAcquire(bufHdr->content_lock, LW_SHARED);
            FlushBuffer(bufHdr, rel->rd_smgr);
            LWLockRelease(bufHdr->content_lock);
-           UnpinBuffer(bufHdr, true, false /* no freelist change */ );
+           UnpinBuffer(bufHdr, true);
        }
        else
            UnlockBufHdr(bufHdr);
@@ -1723,11 +1749,7 @@ ReleaseBuffer(Buffer buffer)
    if (BufferIsLocal(buffer))
    {
        Assert(LocalRefCount[-buffer - 1] > 0);
-       bufHdr = &LocalBufferDescriptors[-buffer - 1];
        LocalRefCount[-buffer - 1]--;
-       if (LocalRefCount[-buffer - 1] == 0 &&
-           bufHdr->usage_count < BM_MAX_USAGE_COUNT)
-           bufHdr->usage_count++;
        return;
    }
 
@@ -1738,7 +1760,7 @@ ReleaseBuffer(Buffer buffer)
    if (PrivateRefCount[buffer - 1] > 1)
        PrivateRefCount[buffer - 1]--;
    else
-       UnpinBuffer(bufHdr, false, true);
+       UnpinBuffer(bufHdr, false);
 }
 
 /*
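Taken together, the bufmgr.c changes above move the usage_count bump from
unpin time to first-pin time. An illustrative (hypothetical) timeline for one
buffer under the default strategy, consistent with the code in this patch:

    /*
     * Illustration only -- usage_count life cycle after this patch:
     *
     *   PinBuffer(buf, NULL)      refcount 0->1, usage_count 0->1  (bumped at pin)
     *   clock sweep passes by     usage_count stays 1              (pinned buffers not decremented)
     *   UnpinBuffer(buf, true)    refcount 1->0, usage_count stays 1  (no bump at unpin)
     *   clock sweep pass          usage_count 1->0
     *   next clock sweep pass     refcount == 0 && usage_count == 0  -> buffer is evictable
     */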
@@ -9,7 +9,7 @@
  *
  *
  * IDENTIFICATION
- * $PostgreSQL: pgsql/src/backend/storage/buffer/freelist.c,v 1.58 2007/01/05 22:19:37 momjian Exp $
+ * $PostgreSQL: pgsql/src/backend/storage/buffer/freelist.c,v 1.59 2007/05/30 20:11:59 tgl Exp $
  *
  *-------------------------------------------------------------------------
  */
@@ -39,8 +39,42 @@ typedef struct
 /* Pointers to shared state */
 static BufferStrategyControl *StrategyControl = NULL;
 
-/* Backend-local state about whether currently vacuuming */
-bool strategy_hint_vacuum = false;
+/*
+ * Private (non-shared) state for managing a ring of shared buffers to re-use.
+ * This is currently the only kind of BufferAccessStrategy object, but someday
+ * we might have more kinds.
+ */
+typedef struct BufferAccessStrategyData
+{
+    /* Overall strategy type */
+    BufferAccessStrategyType btype;
+    /* Number of elements in buffers[] array */
+    int         ring_size;
+    /*
+     * Index of the "current" slot in the ring, ie, the one most recently
+     * returned by GetBufferFromRing.
+     */
+    int         current;
+    /*
+     * True if the buffer just returned by StrategyGetBuffer had been in
+     * the ring already.
+     */
+    bool        current_was_in_ring;
+
+    /*
+     * Array of buffer numbers. InvalidBuffer (that is, zero) indicates
+     * we have not yet selected a buffer for this ring slot. For allocation
+     * simplicity this is palloc'd together with the fixed fields of the
+     * struct.
+     */
+    Buffer      buffers[1];     /* VARIABLE SIZE ARRAY */
+} BufferAccessStrategyData;
+
+
+/* Prototypes for internal functions */
+static volatile BufferDesc *GetBufferFromRing(BufferAccessStrategy strategy);
+static void AddBufferToRing(BufferAccessStrategy strategy,
+                volatile BufferDesc *buf);
 
 
 /*
@@ -50,17 +84,38 @@ bool strategy_hint_vacuum = false;
  * BufferAlloc(). The only hard requirement BufferAlloc() has is that
  * the selected buffer must not currently be pinned by anyone.
  *
+ * strategy is a BufferAccessStrategy object, or NULL for default strategy.
+ *
  * To ensure that no one else can pin the buffer before we do, we must
- * return the buffer with the buffer header spinlock still held. That
- * means that we return with the BufFreelistLock still held, as well;
- * the caller must release that lock once the spinlock is dropped.
+ * return the buffer with the buffer header spinlock still held. If
+ * *lock_held is set on exit, we have returned with the BufFreelistLock
+ * still held, as well; the caller must release that lock once the spinlock
+ * is dropped. We do it that way because releasing the BufFreelistLock
+ * might awaken other processes, and it would be bad to do the associated
+ * kernel calls while holding the buffer header spinlock.
  */
 volatile BufferDesc *
-StrategyGetBuffer(void)
+StrategyGetBuffer(BufferAccessStrategy strategy, bool *lock_held)
 {
     volatile BufferDesc *buf;
     int         trycounter;
 
+    /*
+     * If given a strategy object, see whether it can select a buffer.
+     * We assume strategy objects don't need the BufFreelistLock.
+     */
+    if (strategy != NULL)
+    {
+        buf = GetBufferFromRing(strategy);
+        if (buf != NULL)
+        {
+            *lock_held = false;
+            return buf;
+        }
+    }
+
+    /* Nope, so lock the freelist */
+    *lock_held = true;
     LWLockAcquire(BufFreelistLock, LW_EXCLUSIVE);
 
     /*
@@ -82,11 +137,16 @@ StrategyGetBuffer(void)
         * If the buffer is pinned or has a nonzero usage_count, we cannot use
         * it; discard it and retry. (This can only happen if VACUUM put a
         * valid buffer in the freelist and then someone else used it before
-        * we got to it.)
+        * we got to it. It's probably impossible altogether as of 8.3,
+        * but we'd better check anyway.)
         */
        LockBufHdr(buf);
        if (buf->refcount == 0 && buf->usage_count == 0)
+       {
+           if (strategy != NULL)
+               AddBufferToRing(strategy, buf);
            return buf;
+       }
        UnlockBufHdr(buf);
    }
 
@@ -101,15 +161,23 @@ StrategyGetBuffer(void)
 
        /*
        * If the buffer is pinned or has a nonzero usage_count, we cannot use
-       * it; decrement the usage_count and keep scanning.
+       * it; decrement the usage_count (unless pinned) and keep scanning.
        */
       LockBufHdr(buf);
-      if (buf->refcount == 0 && buf->usage_count == 0)
-          return buf;
-      if (buf->usage_count > 0)
+      if (buf->refcount == 0)
       {
-          buf->usage_count--;
-          trycounter = NBuffers;
+          if (buf->usage_count > 0)
+          {
+              buf->usage_count--;
+              trycounter = NBuffers;
+          }
+          else
+          {
+              /* Found a usable buffer */
+              if (strategy != NULL)
+                  AddBufferToRing(strategy, buf);
+              return buf;
+          }
      }
      else if (--trycounter == 0)
      {
@@ -132,13 +200,9 @@ StrategyGetBuffer(void)
 
 /*
  * StrategyFreeBuffer: put a buffer on the freelist
- *
- * The buffer is added either at the head or the tail, according to the
- * at_head parameter. This allows a small amount of control over how
- * quickly the buffer is reused.
  */
 void
-StrategyFreeBuffer(volatile BufferDesc *buf, bool at_head)
+StrategyFreeBuffer(volatile BufferDesc *buf)
 {
     LWLockAcquire(BufFreelistLock, LW_EXCLUSIVE);
 
@@ -148,22 +212,10 @@ StrategyFreeBuffer(volatile BufferDesc *buf, bool at_head)
      */
     if (buf->freeNext == FREENEXT_NOT_IN_LIST)
     {
-        if (at_head)
-        {
-            buf->freeNext = StrategyControl->firstFreeBuffer;
-            if (buf->freeNext < 0)
-                StrategyControl->lastFreeBuffer = buf->buf_id;
-            StrategyControl->firstFreeBuffer = buf->buf_id;
-        }
-        else
-        {
-            buf->freeNext = FREENEXT_END_OF_LIST;
-            if (StrategyControl->firstFreeBuffer < 0)
-                StrategyControl->firstFreeBuffer = buf->buf_id;
-            else
-                BufferDescriptors[StrategyControl->lastFreeBuffer].freeNext = buf->buf_id;
+        buf->freeNext = StrategyControl->firstFreeBuffer;
+        if (buf->freeNext < 0)
             StrategyControl->lastFreeBuffer = buf->buf_id;
-        }
+        StrategyControl->firstFreeBuffer = buf->buf_id;
     }
 
     LWLockRelease(BufFreelistLock);
@@ -190,15 +242,6 @@ StrategySyncStart(void)
     return result;
 }
 
-/*
- * StrategyHintVacuum -- tell us whether VACUUM is active
- */
-void
-StrategyHintVacuum(bool vacuum_active)
-{
-    strategy_hint_vacuum = vacuum_active;
-}
-
 
 /*
  * StrategyShmemSize
@@ -274,3 +317,172 @@ StrategyInitialize(bool init)
     else
         Assert(!init);
 }
+
+
+/* ----------------------------------------------------------------
+ *              Backend-private buffer ring management
+ * ----------------------------------------------------------------
+ */
+
+
+/*
+ * GetAccessStrategy -- create a BufferAccessStrategy object
+ *
+ * The object is allocated in the current memory context.
+ */
+BufferAccessStrategy
+GetAccessStrategy(BufferAccessStrategyType btype)
+{
+    BufferAccessStrategy strategy;
+    int         ring_size;
+
+    /*
+     * Select ring size to use. See buffer/README for rationales.
+     * (Currently all cases are the same size, but keep this code
+     * structure for flexibility.)
+     */
+    switch (btype)
+    {
+        case BAS_NORMAL:
+            /* if someone asks for NORMAL, just give 'em a "default" object */
+            return NULL;
+
+        case BAS_BULKREAD:
+            ring_size = 256 * 1024 / BLCKSZ;
+            break;
+        case BAS_VACUUM:
+            ring_size = 256 * 1024 / BLCKSZ;
+            break;
+
+        default:
+            elog(ERROR, "unrecognized buffer access strategy: %d",
+                 (int) btype);
+            return NULL;        /* keep compiler quiet */
+    }
+
+    /* Make sure ring isn't an undue fraction of shared buffers */
+    ring_size = Min(NBuffers / 8, ring_size);
+
+    /* Allocate the object and initialize all elements to zeroes */
+    strategy = (BufferAccessStrategy)
+        palloc0(offsetof(BufferAccessStrategyData, buffers) +
+                ring_size * sizeof(Buffer));
+
+    /* Set fields that don't start out zero */
+    strategy->btype = btype;
+    strategy->ring_size = ring_size;
+
+    return strategy;
+}
+
+/*
+ * FreeAccessStrategy -- release a BufferAccessStrategy object
+ *
+ * A simple pfree would do at the moment, but we would prefer that callers
+ * don't assume that much about the representation of BufferAccessStrategy.
+ */
+void
+FreeAccessStrategy(BufferAccessStrategy strategy)
+{
+    /* don't crash if called on a "default" strategy */
+    if (strategy != NULL)
+        pfree(strategy);
+}
+
+/*
+ * GetBufferFromRing -- returns a buffer from the ring, or NULL if the
+ *      ring is empty.
+ *
+ * The bufhdr spin lock is held on the returned buffer.
+ */
+static volatile BufferDesc *
+GetBufferFromRing(BufferAccessStrategy strategy)
+{
+    volatile BufferDesc *buf;
+    Buffer      bufnum;
+
+    /* Advance to next ring slot */
+    if (++strategy->current >= strategy->ring_size)
+        strategy->current = 0;
+
+    /*
+     * If the slot hasn't been filled yet, tell the caller to allocate
+     * a new buffer with the normal allocation strategy. He will then
+     * fill this slot by calling AddBufferToRing with the new buffer.
+     */
+    bufnum = strategy->buffers[strategy->current];
+    if (bufnum == InvalidBuffer)
+    {
+        strategy->current_was_in_ring = false;
+        return NULL;
+    }
+
+    /*
+     * If the buffer is pinned we cannot use it under any circumstances.
+     *
+     * If usage_count is 0 or 1 then the buffer is fair game (we expect 1,
+     * since our own previous usage of the ring element would have left it
+     * there, but it might've been decremented by clock sweep since then).
+     * A higher usage_count indicates someone else has touched the buffer,
+     * so we shouldn't re-use it.
+     */
+    buf = &BufferDescriptors[bufnum - 1];
+    LockBufHdr(buf);
+    if (buf->refcount == 0 && buf->usage_count <= 1)
+    {
+        strategy->current_was_in_ring = true;
+        return buf;
+    }
+    UnlockBufHdr(buf);
+
+    /*
+     * Tell caller to allocate a new buffer with the normal allocation
+     * strategy. He'll then replace this ring element via AddBufferToRing.
+     */
+    strategy->current_was_in_ring = false;
+    return NULL;
+}
+
+/*
+ * AddBufferToRing -- add a buffer to the buffer ring
+ *
+ * Caller must hold the buffer header spinlock on the buffer. Since this
+ * is called with the spinlock held, it had better be quite cheap.
+ */
+static void
+AddBufferToRing(BufferAccessStrategy strategy, volatile BufferDesc *buf)
+{
+    strategy->buffers[strategy->current] = BufferDescriptorGetBuffer(buf);
+}
+
+/*
+ * StrategyRejectBuffer -- consider rejecting a dirty buffer
+ *
+ * When a nondefault strategy is used, the buffer manager calls this function
+ * when it turns out that the buffer selected by StrategyGetBuffer needs to
+ * be written out and doing so would require flushing WAL too. This gives us
+ * a chance to choose a different victim.
+ *
+ * Returns true if buffer manager should ask for a new victim, and false
+ * if this buffer should be written and re-used.
+ */
+bool
+StrategyRejectBuffer(BufferAccessStrategy strategy, volatile BufferDesc *buf)
+{
+    /* We only do this in bulkread mode */
+    if (strategy->btype != BAS_BULKREAD)
+        return false;
+
+    /* Don't muck with behavior of normal buffer-replacement strategy */
+    if (!strategy->current_was_in_ring ||
+        strategy->buffers[strategy->current] != BufferDescriptorGetBuffer(buf))
+        return false;
+
+    /*
+     * Remove the dirty buffer from the ring; necessary to prevent infinite
+     * loop if all ring members are dirty.
+     */
+    strategy->buffers[strategy->current] = InvalidBuffer;
+
+    return true;
+}
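A rough back-of-the-envelope reading of what the ring buys VACUUM (my numbers,
following the commit message above, not code from the patch):

    /*
     * Illustration only, assuming every page VACUUM visits is dirtied:
     *
     *   8.2: single reused buffer   -> up to one WAL flush per page written back.
     *   8.3: 32-buffer (256KB) ring -> a dirty buffer is only written back when
     *        its ring slot comes around again, so roughly one WAL flush per
     *        32 pages, i.e. on the order of a 32x reduction in flush calls.
     */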
@@ -9,7 +9,7 @@
  *
  *
  * IDENTIFICATION
- * $PostgreSQL: pgsql/src/backend/storage/buffer/localbuf.c,v 1.76 2007/01/05 22:19:37 momjian Exp $
+ * $PostgreSQL: pgsql/src/backend/storage/buffer/localbuf.c,v 1.77 2007/05/30 20:11:59 tgl Exp $
  *
  *-------------------------------------------------------------------------
  */
@@ -57,7 +57,8 @@ static Block GetLocalBufferStorage(void);
  *
  * API is similar to bufmgr.c's BufferAlloc, except that we do not need
  * to do any locking since this is all local. Also, IO_IN_PROGRESS
- * does not get set.
+ * does not get set. Lastly, we support only default access strategy
+ * (hence, usage_count is always advanced).
  */
 BufferDesc *
 LocalBufferAlloc(Relation reln, BlockNumber blockNum, bool *foundPtr)
@@ -88,7 +89,12 @@ LocalBufferAlloc(Relation reln, BlockNumber blockNum, bool *foundPtr)
         fprintf(stderr, "LB ALLOC (%u,%d) %d\n",
                 RelationGetRelid(reln), blockNum, -b - 1);
 #endif
-
+        /* this part is equivalent to PinBuffer for a shared buffer */
+        if (LocalRefCount[b] == 0)
+        {
+            if (bufHdr->usage_count < BM_MAX_USAGE_COUNT)
+                bufHdr->usage_count++;
+        }
         LocalRefCount[b]++;
         ResourceOwnerRememberBuffer(CurrentResourceOwner,
                                     BufferDescriptorGetBuffer(bufHdr));
@@ -121,18 +127,21 @@ LocalBufferAlloc(Relation reln, BlockNumber blockNum, bool *foundPtr)
 
         bufHdr = &LocalBufferDescriptors[b];
 
-        if (LocalRefCount[b] == 0 && bufHdr->usage_count == 0)
+        if (LocalRefCount[b] == 0)
         {
-            LocalRefCount[b]++;
-            ResourceOwnerRememberBuffer(CurrentResourceOwner,
-                                        BufferDescriptorGetBuffer(bufHdr));
-            break;
-        }
-
-        if (bufHdr->usage_count > 0)
-        {
-            bufHdr->usage_count--;
-            trycounter = NLocBuffer;
+            if (bufHdr->usage_count > 0)
+            {
+                bufHdr->usage_count--;
+                trycounter = NLocBuffer;
+            }
+            else
+            {
+                /* Found a usable buffer */
+                LocalRefCount[b]++;
+                ResourceOwnerRememberBuffer(CurrentResourceOwner,
                                         BufferDescriptorGetBuffer(bufHdr));
+                break;
+            }
         }
         else if (--trycounter == 0)
             ereport(ERROR,
@@ -199,7 +208,7 @@ LocalBufferAlloc(Relation reln, BlockNumber blockNum, bool *foundPtr)
     bufHdr->tag = newTag;
     bufHdr->flags &= ~(BM_VALID | BM_DIRTY | BM_JUST_DIRTIED | BM_IO_ERROR);
     bufHdr->flags |= BM_TAG_VALID;
-    bufHdr->usage_count = 0;
+    bufHdr->usage_count = 1;
 
     *foundPtr = FALSE;
     return bufHdr;