Code review for ARC patch.  Eliminate static variables, improve handling
of VACUUM cases so that VACUUM requests don't affect the ARC state at all,
avoid corner case where BufferSync would uselessly rewrite a buffer that
no longer contains the page that was to be flushed.  Make some other minor
cleanups in and around the bufmgr as well, such as moving PinBuffer and
UnpinBuffer into bufmgr.c where they really belong.
@@ -1,4 +1,4 @@
-$PostgreSQL: pgsql/src/backend/storage/buffer/README,v 1.6 2003/11/29 19:51:56 pgsql Exp $
+$PostgreSQL: pgsql/src/backend/storage/buffer/README,v 1.7 2004/04/19 23:27:17 tgl Exp $
 
 Notes about shared buffer access rules
 --------------------------------------
@@ -97,153 +97,149 @@ for VACUUM's use, since we don't allow multiple VACUUMs concurrently on a
 single relation anyway.
 
 
-Buffer replacement strategy interface:
+Buffer replacement strategy interface
 -------------------------------------
 
-The two files freelist.c and buf_table.c contain the buffer cache
-replacement strategy. The interface to the strategy is:
+The file freelist.c contains the buffer cache replacement strategy.
+The interface to the strategy is:
 
-BufferDesc *
-StrategyBufferLookup(BufferTag *tagPtr, bool recheck)
+BufferDesc *StrategyBufferLookup(BufferTag *tagPtr, bool recheck,
+                                 int *cdb_found_index)
 
-This is allways the first call made by the buffer manager
-to check if a disk page is in memory. If so, the function
-returns the buffer descriptor and no further action is
-required.
-
-If the page is not in memory, StrategyBufferLookup()
-returns NULL.
+This is always the first call made by the buffer manager to check if a disk
+page is in memory. If so, the function returns the buffer descriptor and no
+further action is required. If the page is not in memory,
+StrategyBufferLookup() returns NULL.
 
-The flag recheck tells the strategy that this is a second
-lookup after flushing a dirty block. If the buffer manager
-has to evict another buffer, he will release the bufmgr lock
-while doing the write IO. During this time, another backend
-could possibly fault in the same page this backend is after,
-so we have to check again after the IO is done if the page
-is in memory now.
+The flag recheck tells the strategy that this is a second lookup after
+flushing a dirty block. If the buffer manager has to evict another buffer,
+it will release the bufmgr lock while doing the write IO. During this time,
+another backend could possibly fault in the same page this backend is after,
+so we have to check again after the IO is done if the page is in memory now.
+
+*cdb_found_index is set to the index of the found CDB, or -1 if none.
+This is not intended to be used by the caller, except to pass to
+StrategyReplaceBuffer().
 
-BufferDesc *
-StrategyGetBuffer(void)
+BufferDesc *StrategyGetBuffer(int *cdb_replace_index)
 
-The buffer manager calls this function to get an unpinned
-cache buffer who's content can be evicted. The returned
-buffer might be empty, clean or dirty.
+The buffer manager calls this function to get an unpinned cache buffer whose
+content can be evicted. The returned buffer might be empty, clean or dirty.
 
-The returned buffer is only a cadidate for replacement.
-It is possible that while the buffer is written, another
-backend finds and modifies it, so that it is dirty again.
-The buffer manager will then call StrategyGetBuffer()
-again to ask for another candidate.
+The returned buffer is only a candidate for replacement. It is possible that
+while the buffer is being written, another backend finds and modifies it, so
+that it is dirty again. The buffer manager will then have to call
+StrategyGetBuffer() again to ask for another candidate.
+
+*cdb_replace_index is set to the index of the candidate CDB, or -1 if none
+(meaning we are using a previously free buffer). This is not intended to be
+used by the caller, except to pass to StrategyReplaceBuffer().
 
-void
-StrategyReplaceBuffer(BufferDesc *buf, Relation rnode,
-                      BlockNumber blockNum)
+void StrategyReplaceBuffer(BufferDesc *buf, BufferTag *newTag,
+                           int cdb_found_index, int cdb_replace_index)
 
-Called by the buffer manager at the time it is about to
-change the association of a buffer with a disk page.
+Called by the buffer manager at the time it is about to change the association
+of a buffer with a disk page.
 
-Before this call, StrategyBufferLookup() still has to find
-the buffer even if it was returned by StrategyGetBuffer()
-as a candidate for replacement.
+Before this call, StrategyBufferLookup() still has to find the buffer under
+its old tag, even if it was returned by StrategyGetBuffer() as a candidate
+for replacement.
 
-After this call, this buffer must be returned for a
-lookup of the new page identified by rnode and blockNum.
+After this call, this buffer must be returned for a lookup of the new page
+identified by *newTag.
+
+cdb_found_index and cdb_replace_index must be the auxiliary values
+returned by previous calls to StrategyBufferLookup and StrategyGetBuffer.
 
-void
-StrategyInvalidateBuffer(BufferDesc *buf)
+void StrategyInvalidateBuffer(BufferDesc *buf)
 
-Called from various parts to inform that the content of
-this buffer has been thrown away. This happens for example
-in the case of dropping a relation.
-
-The buffer must be clean and unpinned on call.
+Called by the buffer manager to inform the strategy that the content of this
+buffer is being thrown away. This happens for example in the case of dropping
+a relation. The buffer must be clean and unpinned on call.
 
-If the buffer associated with a disk page, StrategyBufferLookup()
+If the buffer was associated with a disk page, StrategyBufferLookup()
 must not return it for this page after the call.
 
-void
-StrategyHintVacuum(bool vacuum_active)
+void StrategyHintVacuum(bool vacuum_active)
 
-Because vacuum reads all relations of the entire database
-through the buffer manager, it can greatly disturb the
-buffer replacement strategy. This function is used by vacuum
-to inform that all subsequent buffer lookups are caused
-by vacuum scanning relations.
+Because VACUUM reads all relations of the entire database through the buffer
+manager, it can greatly disturb the buffer replacement strategy. This function
+is used by VACUUM to inform the strategy that subsequent buffer lookups are
+(or are not) caused by VACUUM scanning relations.
 
 
-Buffer replacement strategy:
+Buffer replacement strategy
 ---------------------------
 
-The buffer replacement strategy actually used in freelist.c is a
-version of the Adaptive Replacement Cache (ARC) special tailored for
-PostgreSQL.
+The buffer replacement strategy actually used in freelist.c is a version of
+the Adaptive Replacement Cache (ARC) specially tailored for PostgreSQL.
 
 The algorithm works as follows:
 
-C is the size of the cache in number of pages (conf: shared_buffers)
-ARC uses 2*C Cache Directory Blocks (CDB). A cache directory block
-is allwayt associated with one unique file page and "can" point to
-one shared buffer.
+C is the size of the cache in number of pages (a/k/a shared_buffers or
+NBuffers). ARC uses 2*C Cache Directory Blocks (CDB). A cache directory block
+is always associated with one unique file page. It may point to one shared
+buffer, or may indicate that the file page is not in a buffer but has been
+accessed recently.
 
-All file pages known in by the directory are managed in 4 LRU lists
-named B1, T1, T2 and B2. The T1 and T2 lists are the "real" cache
-entries, linking a file page to a memory buffer where the page is
-currently cached. Consequently T1len+T2len <= C. B1 and B2 are
-ghost cache directories that extend T1 and T2 so that the strategy
-remembers pages longer. The strategy tries to keep B1len+T1len and
-B2len+T2len both at C. T1len and T2 len vary over the runtime
-depending on the lookup pattern and its resulting cache hits. The
-desired size of T1len is called T1target.
+All CDB entries are managed in 4 LRU lists named T1, T2, B1 and B2. The T1 and
+T2 lists are the "real" cache entries, linking a file page to a memory buffer
+where the page is currently cached. Consequently T1len+T2len <= C. B1 and B2
+are ghost cache directories that extend T1 and T2 so that the strategy
+remembers pages longer. The strategy tries to keep B1len+T1len and B2len+T2len
+both at C. T1len and T2len vary over the runtime depending on the lookup
+pattern and its resulting cache hits. The desired size of T1len is called
+T1target.
 
 Assuming we have a full cache, one of 5 cases happens on a lookup:
 
 MISS    On a cache miss, depending on T1target and the actual T1len
-        the LRU buffer of T1 or T2 is evicted. Its CDB is removed
+        the LRU buffer of either T1 or T2 is evicted. Its CDB is removed
         from the T list and added as MRU of the corresponding B list.
         The now free buffer is replaced with the requested page
         and added as MRU of T1.
 
 T1 hit  The T1 CDB is moved to the MRU position of the T2 list.
 
 T2 hit  The T2 CDB is moved to the MRU position of the T2 list.
 
 B1 hit  This means that a buffer that was evicted from the T1
         list is now requested again, indicating that T1target is
         too small (otherwise it would still be in T1 and thus in
         memory). The strategy raises T1target, evicts a buffer
         depending on T1target and T1len and places the CDB at
         MRU of T2.
 
 B2 hit  This means the opposite of B1, the T2 list is probably too
         small. So the strategy lowers T1target, evicts a buffer
         and places the CDB at MRU of T2.
 
 Thus, every page that is found on lookup in any of the four lists
 ends up as the MRU of the T2 list. The T2 list therefore is the
 "frequency" cache, holding frequently requested pages.
 
-Every page that is seen for the first time ends up as the MRU of
-the T1 list. The T1 list is the "recency" cache, holding recent
-newcomers.
+Every page that is seen for the first time ends up as the MRU of the T1
+list. The T1 list is the "recency" cache, holding recent newcomers.
 
-The tailoring done for PostgreSQL has to do with the way, the
-query executor works. A typical UPDATE or DELETE first scans the
-relation, searching for the tuples and then calls heap_update() or
-heap_delete(). This causes at least 2 lookups for the block in the
-same statement. In the case of multiple matches in one block even
-more often. As a result, every block touched in an UPDATE or DELETE
-would directly jump into the T2 cache, which is wrong. To prevent
-this the strategy remembers which transaction added a buffer to the
-T1 list and will not promote it from there into the T2 cache during
-the same transaction.
+The tailoring done for PostgreSQL has to do with the way the query executor
+works. A typical UPDATE or DELETE first scans the relation, searching for the
+tuples and then calls heap_update() or heap_delete(). This causes at least 2
+lookups for the block in the same statement. In the case of multiple matches
+in one block even more often. As a result, every block touched in an UPDATE or
+DELETE would directly jump into the T2 cache, which is wrong. To prevent this
+the strategy remembers which transaction added a buffer to the T1 list and
+will not promote it from there into the T2 cache during the same transaction.
 
-Another specialty is the change of the strategy during VACUUM.
-Lookups during VACUUM do not represent application needs, so it
-would be wrong to change the cache balance T1target due to that
-or to cause massive cache evictions. Therefore, a page read in to
-satisfy vacuum (not those that actually cause a hit on any list)
-is placed at the LRU position of the T1 list, for immediate
-reuse. Since Vacuum usually requests many pages very fast, the
-natural side effect of this is that it will get back the very
-buffers it filled and possibly modified on the next call and will
-therefore do it's work in a few shared memory buffers, while using
-whatever it finds in the cache already.
+Another specialty is the change of the strategy during VACUUM. Lookups during
+VACUUM do not represent application needs, and do not suggest that the page
+will be hit again soon, so it would be wrong to change the cache balance
+T1target due to that or to cause massive cache evictions. Therefore, a page
+read in to satisfy vacuum is placed at the LRU position of the T1 list, for
+immediate reuse. Also, if we happen to get a hit on a CDB entry during
+VACUUM, we do not promote the page above its current position in the list.
+Since VACUUM usually requests many pages very fast, the effect of this is that
+it will get back the very buffers it filled and possibly modified on the next
+call and will therefore do its work in a few shared memory buffers, while
+being able to use whatever it finds in the cache already. This also implies
+that most of the write traffic caused by a VACUUM will be done by the VACUUM
+itself and not pushed off onto other processes.
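
As a rough illustration of the five lookup cases described in the README
above, here is a tiny standalone C model of the list movements. The names,
the fixed cache size and the simple +1/-1 adaptation of T1target are
simplifications made for this sketch; they are not the actual freelist.c
data structures or arithmetic.

    #include <stdio.h>

    #define C 8                         /* cache size in pages (shared_buffers) */

    typedef enum { T1, T2, B1, B2, NONE } ArcList;

    static int t1target = 0;            /* desired size of the T1 list */

    /* Apply the case rules: return the list the page's CDB ends up on. */
    static ArcList
    arc_lookup(ArcList found)
    {
        switch (found)
        {
            case NONE:                  /* MISS: evict LRU of T1 or T2 ... */
                return T1;              /* ... newcomer becomes MRU of T1 */
            case T1:                    /* T1 hit: promote to T2 */
            case T2:                    /* T2 hit: stays in T2, moved to MRU */
                return T2;
            case B1:                    /* ghost hit: T1 was too small */
                if (t1target < C)
                    t1target++;
                return T2;
            case B2:                    /* ghost hit: T2 was too small */
                if (t1target > 0)
                    t1target--;
                return T2;
        }
        return NONE;                    /* not reached */
    }

    int
    main(void)
    {
        /* A page seen for the first time lands in T1 ... */
        ArcList where = arc_lookup(NONE);

        /* ... and is promoted to T2 on its next lookup. */
        where = arc_lookup(where);
        printf("final list: %s, t1target = %d\n",
               where == T2 ? "T2" : "T1", t1target);
        return 0;
    }

Note how every re-referenced page converges on T2 while T1target only moves
on ghost (B1/B2) hits; that is the self-tuning part of ARC.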
@@ -8,35 +8,15 @@
  *
  *
  * IDENTIFICATION
- *	  $PostgreSQL: pgsql/src/backend/storage/buffer/buf_init.c,v 1.62 2004/02/12 15:06:56 wieck Exp $
+ *	  $PostgreSQL: pgsql/src/backend/storage/buffer/buf_init.c,v 1.63 2004/04/19 23:27:17 tgl Exp $
  *
  *-------------------------------------------------------------------------
  */
 #include "postgres.h"
 
-#include <sys/file.h>
-#include <math.h>
-#include <signal.h>
-
-#include "catalog/catalog.h"
-#include "executor/execdebug.h"
-#include "miscadmin.h"
-#include "storage/buf.h"
-#include "storage/buf_internals.h"
 #include "storage/bufmgr.h"
-#include "storage/fd.h"
-#include "storage/ipc.h"
-#include "storage/lmgr.h"
-#include "storage/shmem.h"
-#include "storage/smgr.h"
-#include "storage/lwlock.h"
-#include "utils/builtins.h"
-#include "utils/hsearch.h"
-#include "utils/memutils.h"
+#include "storage/buf_internals.h"
 
-int			ShowPinTrace = 0;
-
-int			Data_Descriptors;
-
 BufferDesc *BufferDescriptors;
 Block	   *BufferBlockPointers;
@@ -44,6 +24,14 @@ Block	   *BufferBlockPointers;
 long	   *PrivateRefCount;	/* also used in freelist.c */
 bits8	   *BufferLocks;		/* flag bits showing locks I have set */
 
+/* statistics counters */
+long int	ReadBufferCount;
+long int	ReadLocalBufferCount;
+long int	BufferHitCount;
+long int	LocalBufferHitCount;
+long int	BufferFlushCount;
+long int	LocalBufferFlushCount;
+
 
 /*
  * Data Structures:
@@ -61,48 +49,35 @@ bits8	   *BufferLocks;		/* flag bits showing locks I have set */
  *		see freelist.c. A buffer cannot be replaced while in
  *		use either by data manager or during IO.
  *
- * WriteBufferBack:
- *		currently, a buffer is only written back at the time
- *		it is selected for replacement. It should
- *		be done sooner if possible to reduce latency of
- *		BufferAlloc(). Maybe there should be a daemon process.
- *
  * Synchronization/Locking:
  *
  * BufMgrLock lock -- must be acquired before manipulating the
- *		buffer queues (lookup/freelist). Must be released
+ *		buffer search datastructures (lookup/freelist, as well as the
+ *		flag bits of any buffer). Must be released
  *		before exit and before doing any IO.
  *
  * IO_IN_PROGRESS -- this is a flag in the buffer descriptor.
  *		It must be set when an IO is initiated and cleared at
  *		the end of the IO. It is there to make sure that one
  *		process doesn't start to use a buffer while another is
  *		faulting it in. see IOWait/IOSignal.
  *
- * refcount -- A buffer is pinned during IO and immediately
- *		after a BufferAlloc(). A buffer is always either pinned
- *		or on the freelist but never both. The buffer must be
- *		released, written, or flushed before the end of
- *		transaction.
+ * refcount -- Counts the number of processes holding pins on a buffer.
+ *		A buffer is pinned during IO and immediately after a BufferAlloc().
+ *		Pins must be released before end of transaction.
  *
- * PrivateRefCount -- Each buffer also has a private refcount the keeps
+ * PrivateRefCount -- Each buffer also has a private refcount that keeps
  *		track of the number of times the buffer is pinned in the current
- *		processes. This is used for two purposes, first, if we pin a
+ *		process. This is used for two purposes: first, if we pin a
  *		a buffer more than once, we only need to change the shared refcount
- *		once, thus only lock the buffer pool once, second, when a transaction
+ *		once, thus only lock the shared state once; second, when a transaction
  *		aborts, it should only unpin the buffers exactly the number of times it
  *		has pinned them, so that it will not blow away buffers of another
  *		backend.
- *
  */
 
-long int	ReadBufferCount;
-long int	ReadLocalBufferCount;
-long int	BufferHitCount;
-long int	LocalBufferHitCount;
-long int	BufferFlushCount;
-long int	LocalBufferFlushCount;
-
-
 /*
  * Initialize shared buffer pool
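
The refcount/PrivateRefCount split described in the comment above is easy to
model: the shared count moves only on the 0-to-1 (and 1-to-0) transition of
the backend-local count. A small standalone illustration, simplified from the
scheme above (no locking, fixed-size arrays, and the names are ours, not the
backend's):

    #include <assert.h>
    #include <stdio.h>

    #define NBUFFERS 4

    static int shared_refcount[NBUFFERS];   /* pins across all backends */
    static int private_refcount[NBUFFERS];  /* this backend's pins */

    static void
    pin(int b)
    {
        if (private_refcount[b] == 0)
            shared_refcount[b]++;   /* only the first local pin touches shared state */
        private_refcount[b]++;
    }

    static void
    unpin(int b)
    {
        assert(private_refcount[b] > 0);
        private_refcount[b]--;
        if (private_refcount[b] == 0)
            shared_refcount[b]--;   /* last local unpin releases the shared pin */
    }

    int
    main(void)
    {
        pin(0);
        pin(0);                     /* nested pin: shared count stays at 1 */
        printf("shared=%d private=%d\n", shared_refcount[0], private_refcount[0]);
        unpin(0);
        unpin(0);
        printf("shared=%d private=%d\n", shared_refcount[0], private_refcount[0]);
        return 0;
    }

An aborting transaction can thus unpin exactly private_refcount times and is
guaranteed not to release pins held by other backends.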
@@ -118,8 +93,6 @@ InitBufferPool(void)
 				foundDescs;
 	int			i;
 
-	Data_Descriptors = NBuffers;
-
 	/*
 	 * It's probably not really necessary to grab the lock --- if there's
 	 * anyone else attached to the shmem at this point, we've got
@@ -131,7 +104,7 @@ InitBufferPool(void)
 
 	BufferDescriptors = (BufferDesc *)
 		ShmemInitStruct("Buffer Descriptors",
-						Data_Descriptors * sizeof(BufferDesc), &foundDescs);
+						NBuffers * sizeof(BufferDesc), &foundDescs);
 
 	BufferBlocks = (char *)
 		ShmemInitStruct("Buffer Blocks",
@@ -152,9 +125,9 @@ InitBufferPool(void)
 
 		/*
 		 * link the buffers into a single linked list. This will become the
-		 * LiFo list of unused buffers returned by StragegyGetBuffer().
+		 * LIFO list of unused buffers returned by StrategyGetBuffer().
 		 */
-		for (i = 0; i < Data_Descriptors; block += BLCKSZ, buf++, i++)
+		for (i = 0; i < NBuffers; block += BLCKSZ, buf++, i++)
 		{
 			Assert(ShmemIsValid((unsigned long) block));
 
@@ -173,7 +146,7 @@ InitBufferPool(void)
 		}
 
 		/* Correct last entry */
-		BufferDescriptors[Data_Descriptors - 1].bufNext = -1;
+		BufferDescriptors[NBuffers - 1].bufNext = -1;
 	}
 
 	/* Init other shared buffer-management stuff */
@@ -215,35 +188,31 @@ InitBufferPoolAccess(void)
 		BufferBlockPointers[i] = (Block) MAKE_PTR(BufferDescriptors[i].data);
 }
 
-/* -----------------------------------------------------
+/*
  * BufferShmemSize
  *
  * compute the size of shared memory for the buffer pool including
  * data pages, buffer descriptors, hash tables, etc.
- * ----------------------------------------------------
 */
 int
 BufferShmemSize(void)
 {
 	int			size = 0;
 
-	/* size of shmem index hash table */
-	size += hash_estimate_size(SHMEM_INDEX_SIZE, sizeof(ShmemIndexEnt));
-
 	/* size of buffer descriptors */
 	size += MAXALIGN(NBuffers * sizeof(BufferDesc));
 
-	/* size of the shared replacement strategy control block */
-	size += MAXALIGN(sizeof(BufferStrategyControl));
-
-	/* size of the ARC directory blocks */
-	size += MAXALIGN(NBuffers * 2 * sizeof(BufferStrategyCDB));
-
 	/* size of data pages */
 	size += NBuffers * MAXALIGN(BLCKSZ);
 
 	/* size of buffer hash table */
 	size += hash_estimate_size(NBuffers * 2, sizeof(BufferLookupEnt));
 
+	/* size of the shared replacement strategy control block */
+	size += MAXALIGN(sizeof(BufferStrategyControl));
+
+	/* size of the ARC directory blocks */
+	size += MAXALIGN(NBuffers * 2 * sizeof(BufferStrategyCDB));
+
 	return size;
 }
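
To put rough numbers on the computation above, here is a back-of-the-envelope
standalone version. The struct sizes below are assumptions chosen purely for
illustration (the real values come from sizeof() on the backend's structs);
only the shape of the formula matches BufferShmemSize().

    #include <stdio.h>

    int
    main(void)
    {
        long nbuffers = 1000;       /* example shared_buffers setting */
        long blcksz = 8192;         /* BLCKSZ default */
        long bufdesc = 100;         /* assumed sizeof(BufferDesc) */
        long cdb = 48;              /* assumed sizeof(BufferStrategyCDB) */

        long size = 0;
        size += nbuffers * bufdesc;     /* buffer descriptors */
        size += nbuffers * blcksz;      /* data pages: the dominant term */
        size += nbuffers * 2 * cdb;     /* 2*C ARC directory blocks */

        printf("total ~%.1f MB\n", size / (1024.0 * 1024.0));
        return 0;
    }

With these assumed values the data pages alone are 8 MB, while the 2*C ARC
directory adds well under 1% on top; the ghost lists are cheap.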
@@ -3,46 +3,42 @@
  * buf_table.c
  *	  routines for finding buffers in the buffer pool.
  *
+ * NOTE: these days, what this table actually provides is a mapping from
+ * BufferTags to CDB indexes, not directly to buffers. The function names
+ * are thus slight misnomers.
+ *
+ * Note: all routines in this file assume that the BufMgrLock is held
+ * by the caller, so no synchronization is needed.
+ *
  *
  * Portions Copyright (c) 1996-2003, PostgreSQL Global Development Group
  * Portions Copyright (c) 1994, Regents of the University of California
  *
  *
  * IDENTIFICATION
- *	  $PostgreSQL: pgsql/src/backend/storage/buffer/buf_table.c,v 1.34 2003/12/14 00:34:47 neilc Exp $
+ *	  $PostgreSQL: pgsql/src/backend/storage/buffer/buf_table.c,v 1.35 2004/04/19 23:27:17 tgl Exp $
  *
  *-------------------------------------------------------------------------
 */
-/*
- * OLD COMMENTS
- *
- * Data Structures:
- *
- *		Buffers are identified by their BufferTag (buf.h). This
- *		file contains routines for allocating a shmem hash table to
- *		map buffer tags to buffer descriptors.
- *
- * Synchronization:
- *
- *		All routines in this file assume BufMgrLock is held by their caller.
- */
-
 #include "postgres.h"
 
 #include "storage/buf_internals.h"
 #include "storage/bufmgr.h"
 
 
 static HTAB *SharedBufHash;
 
 
 /*
  * Initialize shmem hash table for mapping buffers
+ *		size is the desired hash table size (2*NBuffers for ARC algorithm)
 */
 void
 InitBufTable(int size)
 {
 	HASHCTL		info;
 
-	/* assume lock is held */
+	/* assume no locking is needed yet */
 
 	/* BufferTag maps to Buffer */
 	info.keysize = sizeof(BufferTag);
@@ -60,6 +56,7 @@ InitBufTable(int size)
 
 /*
  * BufTableLookup
+ *		Lookup the given BufferTag; return CDB index, or -1 if not found
 */
 int
 BufTableLookup(BufferTag *tagPtr)
@@ -78,10 +75,11 @@ BufTableLookup(BufferTag *tagPtr)
 }
 
 /*
- * BufTableDelete
+ * BufTableInsert
+ *		Insert a hashtable entry for given tag and CDB index
 */
-bool
-BufTableInsert(BufferTag *tagPtr, Buffer buf_id)
+void
+BufTableInsert(BufferTag *tagPtr, int cdb_id)
 {
 	BufferLookupEnt *result;
 	bool		found;
@@ -97,14 +95,14 @@ BufTableInsert(BufferTag *tagPtr, Buffer buf_id)
 	if (found)					/* found something else in the table? */
 		elog(ERROR, "shared buffer hash table corrupted");
 
-	result->id = buf_id;
-	return TRUE;
+	result->id = cdb_id;
 }
 
 /*
  * BufTableDelete
+ *		Delete the hashtable entry for given tag
 */
-bool
+void
 BufTableDelete(BufferTag *tagPtr)
 {
 	BufferLookupEnt *result;
@@ -114,6 +112,4 @@ BufTableDelete(BufferTag *tagPtr)
 
 	if (!result)				/* shouldn't happen */
 		elog(ERROR, "shared buffer hash table corrupted");
-
-	return TRUE;
 }
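
With the revised signatures, the table is a plain tag-to-CDB-index map used
under BufMgrLock. Schematically (this strings together only the calls declared
above; it is a sketch, not compilable on its own, and new_cdb_id is a
hypothetical variable standing in for whatever CDB the caller just filled in):

    int     cdb_id = BufTableLookup(&tag);      /* returns -1 if not present */

    if (cdb_id < 0)
        BufTableInsert(&tag, new_cdb_id);       /* elog(ERROR) on duplicate */
    ...
    BufTableDelete(&tag);                       /* elog(ERROR) if missing */

Since a failed insert or delete now indicates a corrupted hash table rather
than a condition the caller can handle, the bool return values were dropped.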
@@ -8,7 +8,7 @@
  *
  *
  * IDENTIFICATION
- *	  $PostgreSQL: pgsql/src/backend/storage/buffer/bufmgr.c,v 1.160 2004/02/12 20:07:26 tgl Exp $
+ *	  $PostgreSQL: pgsql/src/backend/storage/buffer/bufmgr.c,v 1.161 2004/04/19 23:27:17 tgl Exp $
  *
  *-------------------------------------------------------------------------
 */
@@ -54,9 +54,9 @@
 #include "storage/proc.h"
 #include "storage/smgr.h"
 #include "utils/relcache.h"
-
 #include "pgstat.h"
 
+
 #define BufferGetLSN(bufHdr) \
 	(*((XLogRecPtr*) MAKE_PTR((bufHdr)->data)))
 
@@ -64,15 +64,17 @@
 /* GUC variable */
 bool		zero_damaged_pages = false;
 
+#ifdef NOT_USED
+int			ShowPinTrace = 0;
+#endif
 
 int			BgWriterDelay = 200;
 int			BgWriterPercent = 1;
 int			BgWriterMaxpages = 100;
 
-static void WaitIO(BufferDesc *buf);
-static void StartBufferIO(BufferDesc *buf, bool forInput);
-static void TerminateBufferIO(BufferDesc *buf);
-static void ContinueBufferIO(BufferDesc *buf, bool forInput);
-static void buffer_write_error_callback(void *arg);
+long		NDirectFileRead;	/* some I/O's are direct file access.
+								 * bypass bufmgr */
+long		NDirectFileWrite;	/* e.g., I/O in psort and hashjoin. */
 
 /*
  * Macro : BUFFER_IS_BROKEN
@@ -80,18 +82,22 @@ static void buffer_write_error_callback(void *arg);
 */
 #define BUFFER_IS_BROKEN(buf) ((buf->flags & BM_IO_ERROR) && !(buf->flags & BM_DIRTY))
 
+static void PinBuffer(BufferDesc *buf);
+static void UnpinBuffer(BufferDesc *buf);
+static void WaitIO(BufferDesc *buf);
+static void StartBufferIO(BufferDesc *buf, bool forInput);
+static void TerminateBufferIO(BufferDesc *buf);
+static void ContinueBufferIO(BufferDesc *buf, bool forInput);
+static void buffer_write_error_callback(void *arg);
 static Buffer ReadBufferInternal(Relation reln, BlockNumber blockNum,
 				   bool bufferLockHeld);
 static BufferDesc *BufferAlloc(Relation reln, BlockNumber blockNum,
 			bool *foundPtr);
 static void BufferReplace(BufferDesc *bufHdr);
 
 #ifdef NOT_USED
 void		PrintBufferDescs(void);
 #endif
 
 static void write_buffer(Buffer buffer, bool unpin);
 
 /*
  * ReadBuffer -- returns a buffer containing the requested
  *		block of the requested relation. If the blknum
@@ -282,14 +288,15 @@ BufferAlloc(Relation reln,
 	BufferDesc *buf,
 			   *buf2;
 	BufferTag	newTag;			/* identity of requested block */
+	int			cdb_found_index,
+				cdb_replace_index;
 	bool		inProgress;		/* buffer undergoing IO */
 
 	/* assume that the relation is already open */
-	/* create a new tag so we can lookup the buffer */
+	/* create a tag so we can lookup the buffer */
 	INIT_BUFFERTAG(&newTag, reln, blockNum);
 
 	/* see if the block is in the buffer pool already */
-	buf = StrategyBufferLookup(&newTag, false);
+	buf = StrategyBufferLookup(&newTag, false, &cdb_found_index);
 	if (buf != NULL)
 	{
 		/*
@@ -332,6 +339,13 @@ BufferAlloc(Relation reln,
 		}
 
 		LWLockRelease(BufMgrLock);
+
+		/*
+		 * Do the cost accounting for vacuum
+		 */
+		if (VacuumCostActive)
+			VacuumCostBalance += VacuumCostPageHit;
+
 		return buf;
 	}
@@ -345,16 +359,16 @@ BufferAlloc(Relation reln,
 	inProgress = FALSE;
 	for (buf = NULL; buf == NULL;)
 	{
-		buf = StrategyGetBuffer();
+		buf = StrategyGetBuffer(&cdb_replace_index);
 
-		/* GetFreeBuffer will abort if it can't find a free buffer */
+		/* StrategyGetBuffer will elog if it can't find a free buffer */
 		Assert(buf);
 
 		/*
 		 * There should be exactly one pin on the buffer after it is
 		 * allocated -- ours.  If it had a pin it wouldn't have been on
 		 * the free list.  No one else could have pinned it between
-		 * GetFreeBuffer and here because we have the BufMgrLock.
+		 * StrategyGetBuffer and here because we have the BufMgrLock.
 		 */
 		Assert(buf->refcount == 0);
 		buf->refcount = 1;
@@ -438,7 +452,7 @@ BufferAlloc(Relation reln,
 		 * we haven't gotten around to insert the new tag into the
 		 * buffer table. So we need to check here.		-ay 3/95
 		 */
-		buf2 = StrategyBufferLookup(&newTag, true);
+		buf2 = StrategyBufferLookup(&newTag, true, &cdb_found_index);
 		if (buf2 != NULL)
 		{
 			/*
@@ -471,6 +485,15 @@ BufferAlloc(Relation reln,
 			}
 
 			LWLockRelease(BufMgrLock);
+
+			/*
+			 * Do the cost accounting for vacuum. (XXX perhaps better
+			 * to consider this a miss?  We didn't have to do the read,
+			 * but we did have to write ...)
+			 */
+			if (VacuumCostActive)
+				VacuumCostBalance += VacuumCostPageHit;
+
 			return buf2;
 		}
 	}
@@ -485,8 +508,8 @@ BufferAlloc(Relation reln,
 	 * Tell the buffer replacement strategy that we are replacing the
 	 * buffer content. Then rename the buffer.
 	 */
-	StrategyReplaceBuffer(buf, reln, blockNum);
-	INIT_BUFFERTAG(&(buf->tag), reln, blockNum);
+	StrategyReplaceBuffer(buf, &newTag, cdb_found_index, cdb_replace_index);
+	buf->tag = newTag;
 
 	/*
 	 * Buffer contents are currently invalid. Have to mark IO IN PROGRESS
@@ -501,6 +524,12 @@ BufferAlloc(Relation reln,
 
 	LWLockRelease(BufMgrLock);
 
+	/*
+	 * Do the cost accounting for vacuum
+	 */
+	if (VacuumCostActive)
+		VacuumCostBalance += VacuumCostPageMiss;
+
 	return buf;
 }
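
Pulling the hunks above together, the miss path of BufferAlloc() now follows
this schematic sequence (simplified from the code shown above; not compilable
as-is, since pinning, IO and most locking are elided):

    INIT_BUFFERTAG(&newTag, reln, blockNum);
    buf = StrategyBufferLookup(&newTag, false, &cdb_found_index);
    if (buf == NULL)
    {
        buf = StrategyGetBuffer(&cdb_replace_index);    /* pick a victim */
        /* flush victim if dirty; BufMgrLock is released during the write */
        buf2 = StrategyBufferLookup(&newTag, true, &cdb_found_index);
        if (buf2 == NULL)
        {
            StrategyReplaceBuffer(buf, &newTag,
                                  cdb_found_index, cdb_replace_index);
            buf->tag = newTag;
            /* read the page in; charge VacuumCostPageMiss if vacuuming */
        }
    }

The cdb_found_index/cdb_replace_index pair is what lets the strategy avoid a
second hash lookup in StrategyReplaceBuffer(): the CDBs were already located
during the lookup and eviction steps.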
@@ -624,20 +653,93 @@ ReleaseAndReadBuffer(Buffer buffer,
 }
 
 /*
- * BufferSync -- Write all dirty buffers in the pool.
- *
- * This is called at checkpoint time and writes out all dirty shared buffers,
+ * PinBuffer -- make buffer unavailable for replacement.
+ *
+ * This should be applied only to shared buffers, never local ones.
+ * Bufmgr lock must be held by caller.
+ */
+static void
+PinBuffer(BufferDesc *buf)
+{
+	int			b = BufferDescriptorGetBuffer(buf) - 1;
+
+	if (PrivateRefCount[b] == 0)
+		buf->refcount++;
+	PrivateRefCount[b]++;
+	Assert(PrivateRefCount[b] > 0);
+}
+
+/*
+ * UnpinBuffer -- make buffer available for replacement.
+ *
+ * This should be applied only to shared buffers, never local ones.
+ * Bufmgr lock must be held by caller.
+ */
+static void
+UnpinBuffer(BufferDesc *buf)
+{
+	int			b = BufferDescriptorGetBuffer(buf) - 1;
+
+	Assert(buf->refcount > 0);
+	Assert(PrivateRefCount[b] > 0);
+	PrivateRefCount[b]--;
+	if (PrivateRefCount[b] == 0)
+		buf->refcount--;
+
+	if ((buf->flags & BM_PIN_COUNT_WAITER) != 0 &&
+		buf->refcount == 1)
+	{
+		/* we just released the last pin other than the waiter's */
+		buf->flags &= ~BM_PIN_COUNT_WAITER;
+		ProcSendSignal(buf->wait_backend_id);
+	}
+	else
+	{
+		/* do nothing */
+	}
+}
+
+/*
+ * BufferSync -- Write out dirty buffers in the pool.
+ *
+ * This is called at checkpoint time to write out all dirty shared buffers,
 * and by the background writer process to write out some of the dirty blocks.
+ * percent/maxpages should be zero in the former case, and nonzero limit
+ * values in the latter.
 */
 int
 BufferSync(int percent, int maxpages)
 {
+	BufferDesc **dirty_buffers;
+	BufferTag  *buftags;
+	int			num_buffer_dirty;
 	int			i;
-	BufferDesc *bufHdr;
 	ErrorContextCallback errcontext;
 
-	int			num_buffer_dirty;
-	int		   *buffer_dirty;
+	/*
+	 * Get a list of all currently dirty buffers and how many there are.
+	 * We do not flush buffers that get dirtied after we started. They
+	 * have to wait until the next checkpoint.
+	 */
+	dirty_buffers = (BufferDesc **) palloc(NBuffers * sizeof(BufferDesc *));
+	buftags = (BufferTag *) palloc(NBuffers * sizeof(BufferTag));
+
+	LWLockAcquire(BufMgrLock, LW_EXCLUSIVE);
+	num_buffer_dirty = StrategyDirtyBufferList(dirty_buffers, buftags,
+											   NBuffers);
+
+	/*
+	 * If called by the background writer, we are usually asked to
+	 * only write out some portion of dirty buffers now, to prevent
+	 * the IO storm at checkpoint time.
+	 */
+	if (percent > 0)
+	{
+		Assert(percent <= 100);
+		num_buffer_dirty = (num_buffer_dirty * percent + 99) / 100;
+	}
+	if (maxpages > 0 && num_buffer_dirty > maxpages)
+		num_buffer_dirty = maxpages;
 
 	/* Setup error traceback support for ereport() */
 	errcontext.callback = buffer_write_error_callback;
@@ -646,47 +748,22 @@ BufferSync(int percent, int maxpages)
 	error_context_stack = &errcontext;
 
 	/*
-	 * Get a list of all currently dirty buffers and how many there are.
-	 * We do not flush buffers that get dirtied after we started. They
-	 * have to wait until the next checkpoint.
+	 * Loop over buffers to be written.  Note the BufMgrLock is held at
+	 * loop top, but is released and reacquired intraloop, so we aren't
+	 * holding it long.
 	 */
-	buffer_dirty = (int *) palloc(NBuffers * sizeof(int));
-
-	LWLockAcquire(BufMgrLock, LW_EXCLUSIVE);
-	num_buffer_dirty = StrategyDirtyBufferList(buffer_dirty, NBuffers);
-	LWLockRelease(BufMgrLock);
-
-	/*
-	 * If called by the background writer, we are usually asked to
-	 * only write out some percentage of dirty buffers now, to prevent
-	 * the IO storm at checkpoint time.
-	 */
-	if (percent > 0 && num_buffer_dirty > 10)
-	{
-		Assert(percent <= 100);
-		num_buffer_dirty = (num_buffer_dirty * percent) / 100;
-		if (maxpages > 0 && num_buffer_dirty > maxpages)
-			num_buffer_dirty = maxpages;
-	}
-
 	for (i = 0; i < num_buffer_dirty; i++)
 	{
+		BufferDesc *bufHdr = dirty_buffers[i];
 		Buffer		buffer;
 		XLogRecPtr	recptr;
 		SMgrRelation reln;
 
-		LWLockAcquire(BufMgrLock, LW_EXCLUSIVE);
-
-		bufHdr = &BufferDescriptors[buffer_dirty[i]];
 		errcontext.arg = bufHdr;
 
-		if (!(bufHdr->flags & BM_VALID))
-		{
-			LWLockRelease(BufMgrLock);
-			continue;
-		}
-
 		/*
 		 * Check it is still the same page and still needs writing.
 		 *
 		 * We can check bufHdr->cntxDirty here *without* holding any lock
 		 * on buffer context as long as we set this flag in access methods
 		 * *before* logging changes with XLogInsert(): if someone will set
@@ -694,11 +771,12 @@ BufferSync(int percent, int maxpages)
 		 * checkpoint.redo points before log record for upcoming changes
 		 * and so we are not required to write such dirty buffer.
 		 */
-		if (!(bufHdr->flags & BM_DIRTY) && !(bufHdr->cntxDirty))
-		{
-			LWLockRelease(BufMgrLock);
+		if (!(bufHdr->flags & BM_VALID))
+			continue;
+		if (!BUFFERTAGS_EQUAL(&bufHdr->tag, &buftags[i]))
+			continue;
+		if (!(bufHdr->flags & BM_DIRTY) && !(bufHdr->cntxDirty))
 			continue;
-		}
 
 		/*
 		 * IO synchronization. Note that we do it with unpinned buffer to
@@ -707,12 +785,13 @@ BufferSync(int percent, int maxpages)
 		if (bufHdr->flags & BM_IO_IN_PROGRESS)
 		{
 			WaitIO(bufHdr);
-			if (!(bufHdr->flags & BM_VALID) ||
-				(!(bufHdr->flags & BM_DIRTY) && !(bufHdr->cntxDirty)))
-			{
-				LWLockRelease(BufMgrLock);
+
+			/* Still need writing? */
+			if (!(bufHdr->flags & BM_VALID))
+				continue;
+			if (!BUFFERTAGS_EQUAL(&bufHdr->tag, &buftags[i]))
+				continue;
+			if (!(bufHdr->flags & BM_DIRTY) && !(bufHdr->cntxDirty))
 				continue;
-			}
 		}
 
 		/*
@@ -723,10 +802,11 @@ BufferSync(int percent, int maxpages)
 		PinBuffer(bufHdr);
 		StartBufferIO(bufHdr, false);	/* output IO start */
 
+		buffer = BufferDescriptorGetBuffer(bufHdr);
+
+		/* Release BufMgrLock while doing xlog work */
 		LWLockRelease(BufMgrLock);
 
-		buffer = BufferDescriptorGetBuffer(bufHdr);
-
 		/*
 		 * Protect buffer content against concurrent update
 		 */
@@ -740,8 +820,12 @@ BufferSync(int percent, int maxpages)
 
 		/*
 		 * Now it's safe to write buffer to disk. Note that no one else
-		 * should not be able to write it while we were busy with locking
-		 * and log flushing because of we setted IO flag.
+		 * should have been able to write it while we were busy with
+		 * locking and log flushing because we set the IO flag.
+		 *
+		 * Before we issue the actual write command, clear the just-dirtied
+		 * flag. This lets us recognize concurrent changes (note that only
+		 * hint-bit changes are possible since we hold the buffer shlock).
 		 */
 		LWLockAcquire(BufMgrLock, LW_EXCLUSIVE);
 		Assert(bufHdr->flags & BM_DIRTY || bufHdr->cntxDirty);
@@ -767,12 +851,12 @@ BufferSync(int percent, int maxpages)
 		 * Release the per-buffer readlock, reacquire BufMgrLock.
 		 */
 		LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
-		BufferFlushCount++;
 
 		LWLockAcquire(BufMgrLock, LW_EXCLUSIVE);
 
 		bufHdr->flags &= ~BM_IO_IN_PROGRESS;	/* mark IO finished */
 		TerminateBufferIO(bufHdr);		/* Sync IO finished */
+		BufferFlushCount++;
 
 		/*
 		 * If this buffer was marked by someone as DIRTY while we were
@@ -781,14 +865,16 @@ BufferSync(int percent, int maxpages)
 		if (!(bufHdr->flags & BM_JUST_DIRTIED))
 			bufHdr->flags &= ~BM_DIRTY;
 		UnpinBuffer(bufHdr);
-		LWLockRelease(BufMgrLock);
 	}
 
-	pfree(buffer_dirty);
+	LWLockRelease(BufMgrLock);
 
 	/* Pop the error context stack */
 	error_context_stack = errcontext.previous;
 
+	pfree(dirty_buffers);
+	pfree(buftags);
+
 	return num_buffer_dirty;
 }
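
The rounding change in the percent computation above is easy to miss: the old
form (n * percent) / 100 truncates toward zero, so a short dirty list combined
with a small BgWriterPercent could round down to writing nothing at all (which
is why the old code also carried the "num_buffer_dirty > 10" guard). The new
form rounds up instead. A quick standalone check, with arbitrary example
values:

    #include <stdio.h>

    int
    main(void)
    {
        int n = 50;         /* dirty buffers found */
        int percent = 1;    /* BgWriterPercent */

        printf("old: %d pages\n", (n * percent) / 100);       /* 0: writes nothing */
        printf("new: %d pages\n", (n * percent + 99) / 100);  /* 1: at least one page */
        return 0;
    }

Rounding up guarantees the background writer makes forward progress whenever
any dirty buffer exists, regardless of how small the percentage is.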
@@ -818,11 +904,6 @@ WaitIO(BufferDesc *buf)
 }
 
 
-long		NDirectFileRead;	/* some I/O's are direct file access.
-								 * bypass bufmgr */
-long		NDirectFileWrite;	/* e.g., I/O in psort and hashjoin. */
-
-
 /*
 * Return a palloc'd string containing buffer usage statistics.
 */
@@ -892,9 +973,9 @@ AtEOXact_Buffers(bool isCommit)
 
 			if (isCommit)
 				elog(WARNING,
-					 "buffer refcount leak: [%03d] (bufNext=%d, "
-					 "rel=%u/%u, blockNum=%u, flags=0x%x, refcount=%d %ld)",
-					 i, buf->bufNext,
+					 "buffer refcount leak: [%03d] "
+					 "(rel=%u/%u, blockNum=%u, flags=0x%x, refcount=%d %ld)",
+					 i,
 					 buf->tag.rnode.tblNode, buf->tag.rnode.relNode,
 					 buf->tag.blockNum, buf->flags,
 					 buf->refcount, PrivateRefCount[i]);
@@ -1021,6 +1102,26 @@ BufferGetBlockNumber(Buffer buffer)
 	return BufferDescriptors[buffer - 1].tag.blockNum;
 }
 
+/*
+ * BufferGetFileNode
+ *		Returns the relation ID (RelFileNode) associated with a buffer.
+ *
+ * This should make the same checks as BufferGetBlockNumber, but since the
+ * two are generally called together, we don't bother.
+ */
+RelFileNode
+BufferGetFileNode(Buffer buffer)
+{
+	BufferDesc *bufHdr;
+
+	if (BufferIsLocal(buffer))
+		bufHdr = &(LocalBufferDescriptors[-buffer - 1]);
+	else
+		bufHdr = &BufferDescriptors[buffer - 1];
+
+	return (bufHdr->tag.rnode);
+}
+
 /*
 * BufferReplace
 *
@@ -1663,7 +1764,11 @@ refcount = %ld, file: %s, line: %d\n",
 *
 * This routine might get called many times on the same page, if we are making
 * the first scan after commit of an xact that added/deleted many tuples.
- * So, be as quick as we can if the buffer is already dirty.
+ * So, be as quick as we can if the buffer is already dirty.  We do this by
+ * not acquiring BufMgrLock if it looks like the status bits are already OK.
+ * (Note it is okay if someone else clears BM_JUST_DIRTIED immediately after
+ * we look, because the buffer content update is already done and will be
+ * reflected in the I/O.)
 */
 void
 SetBufferCommitInfoNeedsSave(Buffer buffer)
@@ -2008,19 +2113,6 @@ AbortBufferIO(void)
 	}
 }
 
-RelFileNode
-BufferGetFileNode(Buffer buffer)
-{
-	BufferDesc *bufHdr;
-
-	if (BufferIsLocal(buffer))
-		bufHdr = &(LocalBufferDescriptors[-buffer - 1]);
-	else
-		bufHdr = &BufferDescriptors[buffer - 1];
-
-	return (bufHdr->tag.rnode);
-}
-
 /*
 * Error context callback for errors occurring during buffer writes.
 */

[File diff suppressed because it is too large]
@@ -8,7 +8,7 @@
  *
  *
  * IDENTIFICATION
- *	  $PostgreSQL: pgsql/src/backend/storage/ipc/ipci.c,v 1.65 2004/02/25 19:41:22 momjian Exp $
+ *	  $PostgreSQL: pgsql/src/backend/storage/ipc/ipci.c,v 1.66 2004/04/19 23:27:17 tgl Exp $
  *
  *-------------------------------------------------------------------------
 */
@@ -60,7 +60,8 @@ CreateSharedMemoryAndSemaphores(bool makePrivate,
 	 * moderately-accurate estimates for the big hogs, plus 100K for the
 	 * stuff that's too small to bother with estimating.
 	 */
-	size = BufferShmemSize();
+	size = hash_estimate_size(SHMEM_INDEX_SIZE, sizeof(ShmemIndexEnt));
+	size += BufferShmemSize();
 	size += LockShmemSize(maxBackends);
 	size += XLOGShmemSize();
 	size += CLOGShmemSize();