
Buffering GiST index build algorithm.

When building a GiST index that doesn't fit in cache, buffers are attached
to some internal nodes in the index. This speeds up the build by avoiding
random I/O that would otherwise be needed to traverse all the way down the
tree to find the right leaf page for each tuple.

Alexander Korotkov
Heikki Linnakangas
2011-09-08 17:51:23 +03:00
parent 09b68c70af
commit 5edb24a898
11 changed files with 2297 additions and 186 deletions


@ -13,6 +13,6 @@ top_builddir = ../../../..
include $(top_builddir)/src/Makefile.global
OBJS = gist.o gistutil.o gistxlog.o gistvacuum.o gistget.o gistscan.o \
-	gistproc.o gistsplit.o
+	gistproc.o gistsplit.o gistbuild.o gistbuildbuffers.o
include $(top_srcdir)/src/backend/common.mk


@ -24,6 +24,7 @@ The current implementation of GiST supports:
* provides NULL-safe interface to GiST core
* Concurrency
* Recovery support via WAL logging
* Buffering build algorithm
The support for concurrency implemented in PostgreSQL was developed based on
the paper "Access Methods for Next-Generation Database Systems" by
@ -31,6 +32,12 @@ Marcel Kornacker:
http://www.sai.msu.su/~megera/postgres/gist/papers/concurrency/access-methods-for-next-generation.pdf.gz
Buffering build algorithm for GiST was developed based on the paper "Efficient
Bulk Operations on Dynamic R-trees" by Lars Arge, Klaus Hinrichs, Jan Vahrenhold
and Jeffrey Scott Vitter.
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.135.9894&rep=rep1&type=pdf
The original algorithms were modified in several ways:
* They had to be adapted to PostgreSQL conventions. For example, the SEARCH
@ -278,6 +285,134 @@ would complicate the insertion algorithm. So when an insertion sees a page
with F_FOLLOW_RIGHT set, it immediately tries to bring the split that
crashed in the middle to completion by adding the downlink in the parent.
Buffering build algorithm
-------------------------
In the buffering index build algorithm, some or all internal nodes have a
buffer attached to them. When a tuple is inserted at the top, the descent
down the tree stops as soon as a buffer is reached, and the tuple is pushed
into the buffer. When a buffer gets too full, all the tuples in it are
flushed to the lower level, where they again hit lower-level buffers or leaf
pages. This makes the insertions happen in more of a breadth-first than
depth-first order, which greatly reduces the amount of random I/O required.
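To make this concrete, here is a minimal sketch of that insertion step in C.
The types and helpers used here (Node, choose_subtree, push_to_node_buffer,
place_on_leaf_page) are illustrative stand-ins, not the actual functions in
gistbuild.c:

    #include <stdbool.h>

    typedef struct Node Node;
    struct Node
    {
        int     level;          /* 0 = leaf */
        bool    has_buffer;     /* true on buffered levels */
    };

    Node *choose_subtree(Node *node, void *itup);    /* penalty-based choice */
    void push_to_node_buffer(Node *node, void *itup);
    void place_on_leaf_page(Node *node, void *itup);

    /* Descend from 'node', but stop at the first buffered node we hit. */
    void
    buffered_insert(Node *node, void *itup)
    {
        while (node->level > 0)
        {
            if (node->has_buffer)
            {
                push_to_node_buffer(node, itup);    /* queued for later */
                return;
            }
            node = choose_subtree(node, itup);
        }
        place_on_leaf_page(node, itup);             /* ordinary insertion */
    }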
In the algorithm, levels are numbered so that leaf pages have level zero,
and internal node levels count up from 1. This numbering ensures that a page's
level number never changes, even when the root page is split.
Level Tree
3 *
/ \
2 * *
/ | \ / | \
1 * * * * * *
/ \ / \ / \ / \ / \ / \
0 o o o o o o o o o o o o
* - internal page
o - leaf page
Internal pages that belong to certain levels have buffers associated with
them. Leaf pages never have buffers. Which levels have buffers is controlled
by the "level step" parameter: level numbers that are multiples of level_step
have buffers, while others do not. For example, if level_step = 2, then
pages on levels 2, 4, 6, ... have buffers. If level_step = 1 then every
internal page has a buffer.
Level Tree (level_step = 1) Tree (level_step = 2)
3 * *
/ \ / \
2 *(b) *(b) *(b) *(b)
/ | \ / | \ / | \ / | \
1 *(b) *(b) *(b) *(b) *(b) *(b) * * * * * *
/ \ / \ / \ / \ / \ / \ / \ / \ / \ / \ / \ / \
0 o o o o o o o o o o o o o o o o o o o o o o o o
(b) - buffer
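In code, the test for whether a level carries a buffer can be sketched as
below. This is a simplified version: the actual implementation also never
places a buffer on the root level, since the root never needs one:

    #include <stdbool.h>

    bool
    level_has_buffers(int level, int level_step, int root_level)
    {
        /* leaf level (0) and the root level never carry buffers */
        return level != 0 && level != root_level &&
               level % level_step == 0;
    }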
Logically, a buffer is just a bunch of tuples. Physically, it is divided
into pages, backed by a temporary file. Each buffer can be in one of two
states:
a) Last page of the buffer is kept in main memory. A node buffer is
automatically switched to this state when a new index tuple is added to it,
or a tuple is removed from it.
b) All pages of the buffer are swapped out to disk. When a buffer becomes too
full, and we start to flush it, all other buffers are switched to this state.
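A condensed sketch of the bookkeeping behind these two states (illustrative
only; the real structure, GISTNodeBuffer in gist_private.h, carries more
fields):

    typedef unsigned int BlockNumber;   /* stand-in for the real typedef */

    typedef struct
    {
        void       *pageBuffer;     /* last page if in memory (state a), else NULL */
        BlockNumber pageBlocknum;   /* temp-file block of last page (state b) */
        int         blocksCount;    /* total number of pages in this buffer */
    } NodeBufferSketch;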
When an index tuple is inserted, its initial processing can end in one of the
following points:
1) Leaf page, if the depth of the index <= level_step, meaning that
none of the internal pages have buffers associated with them.
2) Buffer of topmost level page that has buffers.
New index tuples are processed until one of the buffers in the topmost
buffered level becomes half-full. When a buffer becomes half-full, it's added
to the emptying queue, and will be emptied before a new tuple is processed.
Emptying a buffer means that the index tuples in it are moved down, into
buffers at a lower level, or to leaf pages. First, all the other buffers are
swapped to disk to free up the memory. Then tuples are popped from the buffer
one by one, and cascaded down the tree to the next buffer or leaf page below
the buffered node.
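One emptying pass can be sketched as follows, continuing the illustrative
helpers from above (pop_from_node_buffer corresponds to
gistPopItupFromNodeBuffer in gistbuildbuffers.c):

    void unload_all_other_buffers(void);
    bool pop_from_node_buffer(NodeBufferSketch *buf, void **itup);
    Node *node_below(NodeBufferSketch *buf, void *itup);

    /* Empty one buffer, cascading its tuples to the level below. */
    void
    empty_one_buffer(NodeBufferSketch *buf)
    {
        void       *itup;

        unload_all_other_buffers();     /* swap other buffers out to disk */
        while (pop_from_node_buffer(buf, &itup))
        {
            /* resume the descent from just below the buffered node */
            buffered_insert(node_below(buf, itup), itup);
        }
    }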
Emptying a buffer has the interesting dynamic property that any intermediate
pages between the buffer being emptied, and the next buffered or leaf level
below it, become cached. If there are no more buffers below the node, the leaf
pages where the tuples finally land get cached too. If there are, the last
buffer page of each buffer below is kept in memory. This is illustrated in
the figures below:
Buffer being emptied to
lower-level buffers Buffer being emptied to leaf pages
+(fb) +(fb)
/ \ / \
+ + + +
/ \ / \ / \ / \
*(ab) *(ab) *(ab) *(ab) x x x x
+ - cached internal page
x - cached leaf page
* - non-cached internal page
(fb) - buffer being emptied
(ab) - buffers being appended to, with last page in memory
At the beginning of the index build, level_step is chosen so that all the
pages involved in emptying one buffer fit in cache, so that after each of
those pages has been accessed once and cached, emptying a buffer doesn't
involve any more I/O. This locality is where the speedup of the buffering
algorithm comes from.
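The choice can be sketched numerically: emptying one buffer touches roughly
fanout^level_step pages (the subtree slice down to the next buffered level),
so level_step is picked as large as possible while that many pages still fit
in cache. An illustrative calculation, with hypothetical parameter names:

    #include <math.h>

    int
    choose_level_step(double avg_fanout, long cache_size_in_pages)
    {
        /* largest step such that avg_fanout^step <= cache_size_in_pages */
        int     step = (int) (log((double) cache_size_in_pages) /
                              log(avg_fanout));

        return (step < 1) ? 1 : step;
    }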
Emptying one buffer can fill up one or more of the lower-level buffers,
triggering emptying of them as well. Whenever a buffer becomes too full, it's
added to the emptying queue, and will be emptied after the current buffer has
been processed.
To keep the size of each buffer limited even in the worst case, buffer emptying
is scheduled as soon as a buffer becomes half-full, and emptying it continues
until 1/2 of the nominal buffer size worth of tuples has been emptied. This
guarantees that when buffer emptying begins, all the lower-level buffers
are at most half-full. Even in the worst case, where all the tuples are
cascaded down to the same lower-level buffer, that buffer has enough space to
accommodate all the tuples emptied from the upper-level buffer. There is no
hard size limit in any of the data structures used, though, so this only needs
to be approximate; small overfilling of some buffers doesn't matter.
If an internal page that has a buffer associated with it is split, the buffer
needs to be split too. All tuples in the buffer are scanned through and
relocated to the correct sibling buffers, using the penalty function to decide
which buffer each tuple should go to.
After all tuples from the heap have been processed, there are still some index
tuples in the buffers. At this point, final buffer emptying starts. All buffers
are emptied in top-down order. This is slightly complicated by the fact that
new buffers can be allocated during the emptying, due to page splits. However,
the new buffers will always be siblings of buffers that haven't been fully
emptied yet; tuples never move upwards in the tree. The final emptying loops
through buffers at a given level until all buffers at that level have been
emptied, and then moves down to the next level.
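In outline, the final emptying pass looks roughly like this, using the
buffersOnLevels lists maintained in gistbuildbuffers.c (empty_one_buffer is
the illustrative helper sketched above, not the actual function):

    int         level;

    for (level = gfbb->buffersOnLevelsLen - 1; level >= 0; level--)
    {
        bool        emptied_any;

        do
        {
            ListCell   *lc;

            emptied_any = false;
            foreach(lc, gfbb->buffersOnLevels[level])
            {
                GISTNodeBuffer *nb = (GISTNodeBuffer *) lfirst(lc);

                /* a page split may have added new sibling buffers mid-sweep */
                if (nb->blocksCount > 0)
                {
                    empty_one_buffer(nb);   /* illustrative helper */
                    emptied_any = true;
                }
            }
        } while (emptied_any);
    }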
Authors:
Teodor Sigaev <teodor@sigaev.ru>


@ -24,33 +24,7 @@
#include "utils/memutils.h"
#include "utils/rel.h"
/* Working state for gistbuild and its callback */
typedef struct
{
GISTSTATE giststate;
int numindexattrs;
double indtuples;
MemoryContext tmpCtx;
} GISTBuildState;
/* A List of these is used to represent a split-in-progress. */
typedef struct
{
Buffer buf; /* the split page "half" */
IndexTuple downlink; /* downlink for this half. */
} GISTPageSplitInfo;
/* non-export function prototypes */
static void gistbuildCallback(Relation index,
HeapTuple htup,
Datum *values,
bool *isnull,
bool tupleIsAlive,
void *state);
static void gistdoinsert(Relation r,
IndexTuple itup,
Size freespace,
GISTSTATE *GISTstate);
static void gistfixsplit(GISTInsertState *state, GISTSTATE *giststate);
static bool gistinserttuples(GISTInsertState *state, GISTInsertStack *stack,
GISTSTATE *giststate,
@ -88,138 +62,6 @@ createTempGistContext(void)
ALLOCSET_DEFAULT_MAXSIZE);
}
/*
* Routine to build an index. Basically calls insert over and over.
*
* XXX: it would be nice to implement some sort of bulk-loading
* algorithm, but it is not clear how to do that.
*/
Datum
gistbuild(PG_FUNCTION_ARGS)
{
Relation heap = (Relation) PG_GETARG_POINTER(0);
Relation index = (Relation) PG_GETARG_POINTER(1);
IndexInfo *indexInfo = (IndexInfo *) PG_GETARG_POINTER(2);
IndexBuildResult *result;
double reltuples;
GISTBuildState buildstate;
Buffer buffer;
Page page;
/*
* We expect to be called exactly once for any index relation. If that's
* not the case, big trouble's what we have.
*/
if (RelationGetNumberOfBlocks(index) != 0)
elog(ERROR, "index \"%s\" already contains data",
RelationGetRelationName(index));
/* no locking is needed */
initGISTstate(&buildstate.giststate, index);
/* initialize the root page */
buffer = gistNewBuffer(index);
Assert(BufferGetBlockNumber(buffer) == GIST_ROOT_BLKNO);
page = BufferGetPage(buffer);
START_CRIT_SECTION();
GISTInitBuffer(buffer, F_LEAF);
MarkBufferDirty(buffer);
if (RelationNeedsWAL(index))
{
XLogRecPtr recptr;
XLogRecData rdata;
rdata.data = (char *) &(index->rd_node);
rdata.len = sizeof(RelFileNode);
rdata.buffer = InvalidBuffer;
rdata.next = NULL;
recptr = XLogInsert(RM_GIST_ID, XLOG_GIST_CREATE_INDEX, &rdata);
PageSetLSN(page, recptr);
PageSetTLI(page, ThisTimeLineID);
}
else
PageSetLSN(page, GetXLogRecPtrForTemp());
UnlockReleaseBuffer(buffer);
END_CRIT_SECTION();
/* build the index */
buildstate.numindexattrs = indexInfo->ii_NumIndexAttrs;
buildstate.indtuples = 0;
/*
* create a temporary memory context that is reset once for each tuple
* inserted into the index
*/
buildstate.tmpCtx = createTempGistContext();
/* do the heap scan */
reltuples = IndexBuildHeapScan(heap, index, indexInfo, true,
gistbuildCallback, (void *) &buildstate);
/* okay, all heap tuples are indexed */
MemoryContextDelete(buildstate.tmpCtx);
freeGISTstate(&buildstate.giststate);
/*
* Return statistics
*/
result = (IndexBuildResult *) palloc(sizeof(IndexBuildResult));
result->heap_tuples = reltuples;
result->index_tuples = buildstate.indtuples;
PG_RETURN_POINTER(result);
}
/*
* Per-tuple callback from IndexBuildHeapScan
*/
static void
gistbuildCallback(Relation index,
HeapTuple htup,
Datum *values,
bool *isnull,
bool tupleIsAlive,
void *state)
{
GISTBuildState *buildstate = (GISTBuildState *) state;
IndexTuple itup;
MemoryContext oldCtx;
oldCtx = MemoryContextSwitchTo(buildstate->tmpCtx);
/* form an index tuple and point it at the heap tuple */
itup = gistFormTuple(&buildstate->giststate, index,
values, isnull, true /* size is currently bogus */ );
itup->t_tid = htup->t_self;
/*
* Since we already have the index relation locked, we call gistdoinsert
* directly. Normal access method calls dispatch through gistinsert,
* which locks the relation for write. This is the right thing to do if
* you're inserting single tups, but not when you're initializing the
* whole index at once.
*
* In this path we respect the fillfactor setting, whereas insertions
* after initial build do not.
*/
gistdoinsert(index, itup,
RelationGetTargetPageFreeSpace(index, GIST_DEFAULT_FILLFACTOR),
&buildstate->giststate);
buildstate->indtuples += 1;
MemoryContextSwitchTo(oldCtx);
MemoryContextReset(buildstate->tmpCtx);
}
/*
* gistbuildempty() -- build an empty gist index in the initialization fork
*/
@ -285,6 +127,11 @@ gistinsert(PG_FUNCTION_ARGS)
* to the right of 'leftchildbuf', or updating the downlink for 'leftchildbuf'.
* F_FOLLOW_RIGHT flag on 'leftchildbuf' is cleared and NSN is set.
*
* If 'markfollowright' is true and the page is split, the left child is
* marked with the F_FOLLOW_RIGHT flag. That is the normal case. During a
* buffered index build, however, there is no concurrent access and the page
* splitting is done in a slightly simpler fashion, and false is passed.
*
* If there is not enough room on the page, it is split. All the split
* pages are kept pinned and locked and returned in *splitinfo, the caller
* is responsible for inserting the downlinks for them. However, if
@ -293,13 +140,16 @@ gistinsert(PG_FUNCTION_ARGS)
* In that case, we continue to hold the root page locked, and the child
* pages are released; note that new tuple(s) are *not* on the root page
* but in one of the new child pages.
*
* Returns 'true' if the page was split, 'false' otherwise.
*/
- static bool
- gistplacetopage(GISTInsertState *state, GISTSTATE *giststate,
+ bool
+ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
Buffer buffer,
IndexTuple *itup, int ntup, OffsetNumber oldoffnum,
Buffer leftchildbuf,
- List **splitinfo)
+ List **splitinfo,
+ bool markfollowright)
{
Page page = BufferGetPage(buffer);
bool is_leaf = (GistPageIsLeaf(page)) ? true : false;
@ -331,7 +181,7 @@ gistplacetopage(GISTInsertState *state, GISTSTATE *giststate,
* one-element todelete array; in the split case, it's handled implicitly
* because the tuple vector passed to gistSplit won't include this tuple.
*/
- is_split = gistnospace(page, itup, ntup, oldoffnum, state->freespace);
+ is_split = gistnospace(page, itup, ntup, oldoffnum, freespace);
if (is_split)
{
/* no space for insertion */
@ -362,7 +212,7 @@ gistplacetopage(GISTInsertState *state, GISTSTATE *giststate,
memmove(itvec + pos, itvec + pos + 1, sizeof(IndexTuple) * (tlen - pos));
}
itvec = gistjoinvector(itvec, &tlen, itup, ntup);
- dist = gistSplit(state->r, page, itvec, tlen, giststate);
+ dist = gistSplit(rel, page, itvec, tlen, giststate);
/*
* Set up pages to work with. Allocate new buffers for all but the
@ -392,7 +242,7 @@ gistplacetopage(GISTInsertState *state, GISTSTATE *giststate,
for (; ptr; ptr = ptr->next)
{
/* Allocate new page */
- ptr->buffer = gistNewBuffer(state->r);
+ ptr->buffer = gistNewBuffer(rel);
GISTInitBuffer(ptr->buffer, (is_leaf) ? F_LEAF : 0);
ptr->page = BufferGetPage(ptr->buffer);
ptr->block.blkno = BufferGetBlockNumber(ptr->buffer);
@ -463,7 +313,7 @@ gistplacetopage(GISTInsertState *state, GISTSTATE *giststate,
for (i = 0; i < ptr->block.num; i++)
{
if (PageAddItem(ptr->page, (Item) data, IndexTupleSize((IndexTuple) data), i + FirstOffsetNumber, false, false) == InvalidOffsetNumber)
- elog(ERROR, "failed to add item to index page in \"%s\"", RelationGetRelationName(state->r));
+ elog(ERROR, "failed to add item to index page in \"%s\"", RelationGetRelationName(rel));
data += IndexTupleSize((IndexTuple) data);
}
@ -474,7 +324,15 @@ gistplacetopage(GISTInsertState *state, GISTSTATE *giststate,
else
GistPageGetOpaque(ptr->page)->rightlink = oldrlink;
- if (ptr->next && !is_rootsplit)
+ /*
+  * Mark all but the right-most page with the follow-right
+  * flag. It will be cleared as soon as the downlink is inserted
+  * into the parent, but this ensures that if we error out before
+  * that, the index is still consistent. (In buffering build mode,
+  * any error will abort the index build anyway, so this is not
+  * needed.)
+  */
+ if (ptr->next && !is_rootsplit && markfollowright)
GistMarkFollowRight(ptr->page);
else
GistClearFollowRight(ptr->page);
@ -506,9 +364,10 @@ gistplacetopage(GISTInsertState *state, GISTSTATE *giststate,
dist->page = BufferGetPage(dist->buffer);
/* Write the WAL record */
- if (RelationNeedsWAL(state->r))
- 	recptr = gistXLogSplit(state->r->rd_node, blkno, is_leaf,
- 						   dist, oldrlink, oldnsn, leftchildbuf);
+ if (RelationNeedsWAL(rel))
+ 	recptr = gistXLogSplit(rel->rd_node, blkno, is_leaf,
+ 						   dist, oldrlink, oldnsn, leftchildbuf,
+ 						   markfollowright);
else
recptr = GetXLogRecPtrForTemp();
@ -547,7 +406,7 @@ gistplacetopage(GISTInsertState *state, GISTSTATE *giststate,
if (BufferIsValid(leftchildbuf))
MarkBufferDirty(leftchildbuf);
- if (RelationNeedsWAL(state->r))
+ if (RelationNeedsWAL(rel))
{
OffsetNumber ndeloffs = 0,
deloffs[1];
@ -558,7 +417,7 @@ gistplacetopage(GISTInsertState *state, GISTSTATE *giststate,
ndeloffs = 1;
}
- recptr = gistXLogUpdate(state->r->rd_node, buffer,
+ recptr = gistXLogUpdate(rel->rd_node, buffer,
deloffs, ndeloffs, itup, ntup,
leftchildbuf);
@ -570,8 +429,6 @@ gistplacetopage(GISTInsertState *state, GISTSTATE *giststate,
recptr = GetXLogRecPtrForTemp();
PageSetLSN(page, recptr);
}
*splitinfo = NIL;
}
/*
@ -608,7 +465,7 @@ gistplacetopage(GISTInsertState *state, GISTSTATE *giststate,
* this routine assumes it is invoked in a short-lived memory context,
* so it does not bother releasing palloc'd allocations.
*/
- static void
+ void
gistdoinsert(Relation r, IndexTuple itup, Size freespace, GISTSTATE *giststate)
{
ItemId iid;
@ -1192,10 +1049,12 @@ gistinserttuples(GISTInsertState *state, GISTInsertStack *stack,
List *splitinfo;
bool is_split;
- is_split = gistplacetopage(state, giststate, stack->buffer,
+ is_split = gistplacetopage(state->r, state->freespace, giststate,
+ 						   stack->buffer,
tuples, ntup, oldoffnum,
leftchild,
- &splitinfo);
+ &splitinfo,
+ true);
if (splitinfo)
gistfinishsplit(state, stack, giststate, splitinfo);

File diff suppressed because it is too large


@ -0,0 +1,787 @@
/*-------------------------------------------------------------------------
*
* gistbuildbuffers.c
* node buffer management functions for GiST buffering build algorithm.
*
*
* Portions Copyright (c) 1996-2011, PostgreSQL Global Development Group
* Portions Copyright (c) 1994, Regents of the University of California
*
* IDENTIFICATION
* src/backend/access/gist/gistbuildbuffers.c
*
*-------------------------------------------------------------------------
*/
#include "postgres.h"
#include "access/genam.h"
#include "access/gist_private.h"
#include "catalog/index.h"
#include "miscadmin.h"
#include "storage/buffile.h"
#include "storage/bufmgr.h"
#include "utils/memutils.h"
#include "utils/rel.h"
static GISTNodeBufferPage *gistAllocateNewPageBuffer(GISTBuildBuffers *gfbb);
static void gistAddLoadedBuffer(GISTBuildBuffers *gfbb,
GISTNodeBuffer *nodeBuffer);
static void gistLoadNodeBuffer(GISTBuildBuffers *gfbb,
GISTNodeBuffer *nodeBuffer);
static void gistUnloadNodeBuffer(GISTBuildBuffers *gfbb,
GISTNodeBuffer *nodeBuffer);
static void gistPlaceItupToPage(GISTNodeBufferPage *pageBuffer,
IndexTuple item);
static void gistGetItupFromPage(GISTNodeBufferPage *pageBuffer,
IndexTuple *item);
static long gistBuffersGetFreeBlock(GISTBuildBuffers *gfbb);
static void gistBuffersReleaseBlock(GISTBuildBuffers *gfbb, long blocknum);
static void ReadTempFileBlock(BufFile *file, long blknum, void *ptr);
static void WriteTempFileBlock(BufFile *file, long blknum, void *ptr);
/*
* Initialize GiST build buffers.
*/
GISTBuildBuffers *
gistInitBuildBuffers(int pagesPerBuffer, int levelStep, int maxLevel)
{
GISTBuildBuffers *gfbb;
HASHCTL hashCtl;
gfbb = palloc(sizeof(GISTBuildBuffers));
gfbb->pagesPerBuffer = pagesPerBuffer;
gfbb->levelStep = levelStep;
/*
* Create a temporary file to hold buffer pages that are swapped out of
* memory.
*/
gfbb->pfile = BufFileCreateTemp(true);
gfbb->nFileBlocks = 0;
/* Initialize free page management. */
gfbb->nFreeBlocks = 0;
gfbb->freeBlocksLen = 32;
gfbb->freeBlocks = (long *) palloc(gfbb->freeBlocksLen * sizeof(long));
/*
* Current memory context will be used for all in-memory data structures
* of buffers which are persistent during buffering build.
*/
gfbb->context = CurrentMemoryContext;
/*
* The nodeBuffersTab hash table maps index blocks to their node
* buffers.
*/
hashCtl.keysize = sizeof(BlockNumber);
hashCtl.entrysize = sizeof(GISTNodeBuffer);
hashCtl.hcxt = CurrentMemoryContext;
hashCtl.hash = tag_hash;
hashCtl.match = memcmp;
gfbb->nodeBuffersTab = hash_create("gistbuildbuffers",
1024,
&hashCtl,
HASH_ELEM | HASH_CONTEXT
| HASH_FUNCTION | HASH_COMPARE);
gfbb->bufferEmptyingQueue = NIL;
/*
* Per-level lists of node buffers, used by the final buffer emptying
* process. Node buffers are added here when they are created.
*/
gfbb->buffersOnLevelsLen = 1;
gfbb->buffersOnLevels = (List **) palloc(sizeof(List *) *
gfbb->buffersOnLevelsLen);
gfbb->buffersOnLevels[0] = NIL;
/*
* Array of node buffers whose last pages are currently loaded into
* main memory.
*/
gfbb->loadedBuffersLen = 32;
gfbb->loadedBuffers = (GISTNodeBuffer **) palloc(gfbb->loadedBuffersLen *
sizeof(GISTNodeBuffer *));
gfbb->loadedBuffersCount = 0;
/*
* Root path item of the tree. Updated on each root node split.
*/
gfbb->rootitem = (GISTBufferingInsertStack *) MemoryContextAlloc(
gfbb->context, sizeof(GISTBufferingInsertStack));
gfbb->rootitem->parent = NULL;
gfbb->rootitem->blkno = GIST_ROOT_BLKNO;
gfbb->rootitem->downlinkoffnum = InvalidOffsetNumber;
gfbb->rootitem->level = maxLevel;
gfbb->rootitem->refCount = 1;
return gfbb;
}
/*
* Returns a node buffer for the given block. The buffer is created if it
* doesn't exist yet.
*/
GISTNodeBuffer *
gistGetNodeBuffer(GISTBuildBuffers *gfbb, GISTSTATE *giststate,
BlockNumber nodeBlocknum,
OffsetNumber downlinkoffnum,
GISTBufferingInsertStack *parent)
{
GISTNodeBuffer *nodeBuffer;
bool found;
/* Find node buffer in hash table */
nodeBuffer = (GISTNodeBuffer *) hash_search(gfbb->nodeBuffersTab,
(const void *) &nodeBlocknum,
HASH_ENTER,
&found);
if (!found)
{
/*
* Node buffer wasn't found. Initialize the new buffer as empty.
*/
GISTBufferingInsertStack *path;
int level;
MemoryContext oldcxt = MemoryContextSwitchTo(gfbb->context);
nodeBuffer->pageBuffer = NULL;
nodeBuffer->blocksCount = 0;
nodeBuffer->queuedForEmptying = false;
/*
* Create a path stack for the page.
*/
if (nodeBlocknum != GIST_ROOT_BLKNO)
{
path = (GISTBufferingInsertStack *) palloc(
sizeof(GISTBufferingInsertStack));
path->parent = parent;
path->blkno = nodeBlocknum;
path->downlinkoffnum = downlinkoffnum;
path->level = parent->level - 1;
path->refCount = 0; /* initially unreferenced */
parent->refCount++; /* this path references its parent */
Assert(path->level > 0);
}
else
path = gfbb->rootitem;
nodeBuffer->path = path;
path->refCount++;
/*
* Add this buffer to the list of buffers on this level. Enlarge
* buffersOnLevels array if needed.
*/
level = path->level;
if (level >= gfbb->buffersOnLevelsLen)
{
int i;
gfbb->buffersOnLevels =
(List **) repalloc(gfbb->buffersOnLevels,
(level + 1) * sizeof(List *));
/* initialize the enlarged portion */
for (i = gfbb->buffersOnLevelsLen; i <= level; i++)
gfbb->buffersOnLevels[i] = NIL;
gfbb->buffersOnLevelsLen = level + 1;
}
/*
* Prepend the new buffer to the list of buffers on this level. It's
* not arbitrary that the new buffer is put to the beginning of the
* list: in the final emptying phase we loop through all buffers at
* each level, and flush them. If a page is split during the emptying,
* it's more efficient to flush the newly split pages first, before
* moving on to pre-existing pages on the level. The buffers just
* created during the page split are likely still in cache, so
* flushing them immediately is more efficient than putting them to
* the end of the queue.
*/
gfbb->buffersOnLevels[level] = lcons(nodeBuffer,
gfbb->buffersOnLevels[level]);
MemoryContextSwitchTo(oldcxt);
}
else
{
if (parent != nodeBuffer->path->parent)
{
/*
* A different parent path item was provided than the one we've
* remembered. We trust the caller to provide a more correct parent
* than we have; the previous parent may have been made outdated by
* a page split.
*/
gistDecreasePathRefcount(nodeBuffer->path->parent);
nodeBuffer->path->parent = parent;
parent->refCount++;
}
}
return nodeBuffer;
}
/*
* Allocate memory for a buffer page.
*/
static GISTNodeBufferPage *
gistAllocateNewPageBuffer(GISTBuildBuffers *gfbb)
{
GISTNodeBufferPage *pageBuffer;
pageBuffer = (GISTNodeBufferPage *) MemoryContextAlloc(gfbb->context,
BLCKSZ);
pageBuffer->prev = InvalidBlockNumber;
/* Set page free space */
PAGE_FREE_SPACE(pageBuffer) = BLCKSZ - BUFFER_PAGE_DATA_OFFSET;
return pageBuffer;
}
/*
* Add the specified node buffer to the loadedBuffers array.
*/
static void
gistAddLoadedBuffer(GISTBuildBuffers *gfbb, GISTNodeBuffer *nodeBuffer)
{
/* Enlarge the array if needed */
if (gfbb->loadedBuffersCount >= gfbb->loadedBuffersLen)
{
gfbb->loadedBuffersLen *= 2;
gfbb->loadedBuffers = (GISTNodeBuffer **)
repalloc(gfbb->loadedBuffers,
gfbb->loadedBuffersLen * sizeof(GISTNodeBuffer *));
}
gfbb->loadedBuffers[gfbb->loadedBuffersCount] = nodeBuffer;
gfbb->loadedBuffersCount++;
}
/*
* Load last page of node buffer into main memory.
*/
static void
gistLoadNodeBuffer(GISTBuildBuffers *gfbb, GISTNodeBuffer *nodeBuffer)
{
/* Check if we really should load something */
if (!nodeBuffer->pageBuffer && nodeBuffer->blocksCount > 0)
{
/* Allocate memory for page */
nodeBuffer->pageBuffer = gistAllocateNewPageBuffer(gfbb);
/* Read block from temporary file */
ReadTempFileBlock(gfbb->pfile, nodeBuffer->pageBlocknum,
nodeBuffer->pageBuffer);
/* Mark file block as free */
gistBuffersReleaseBlock(gfbb, nodeBuffer->pageBlocknum);
/* Mark node buffer as loaded */
gistAddLoadedBuffer(gfbb, nodeBuffer);
nodeBuffer->pageBlocknum = InvalidBlockNumber;
}
}
/*
* Write last page of node buffer to the disk.
*/
static void
gistUnloadNodeBuffer(GISTBuildBuffers *gfbb, GISTNodeBuffer *nodeBuffer)
{
/* Check if we have something to write */
if (nodeBuffer->pageBuffer)
{
BlockNumber blkno;
/* Get free file block */
blkno = gistBuffersGetFreeBlock(gfbb);
/* Write block to the temporary file */
WriteTempFileBlock(gfbb->pfile, blkno, nodeBuffer->pageBuffer);
/* Free memory of that page */
pfree(nodeBuffer->pageBuffer);
nodeBuffer->pageBuffer = NULL;
/* Save block number */
nodeBuffer->pageBlocknum = blkno;
}
}
/*
* Write last pages of all node buffers to the disk.
*/
void
gistUnloadNodeBuffers(GISTBuildBuffers *gfbb)
{
int i;
/* Unload all the buffers that have a page loaded in memory. */
for (i = 0; i < gfbb->loadedBuffersCount; i++)
gistUnloadNodeBuffer(gfbb, gfbb->loadedBuffers[i]);
/* Now there are no node buffers with loaded last page */
gfbb->loadedBuffersCount = 0;
}
/*
* Add index tuple to buffer page.
*/
static void
gistPlaceItupToPage(GISTNodeBufferPage *pageBuffer, IndexTuple itup)
{
Size itupsz = IndexTupleSize(itup);
char *ptr;
/* There should be enough space. */
Assert(PAGE_FREE_SPACE(pageBuffer) >= MAXALIGN(itupsz));
/* Reduce free space value of page to reserve a spot for the tuple. */
PAGE_FREE_SPACE(pageBuffer) -= MAXALIGN(itupsz);
/* Get pointer to the spot we reserved (i.e. end of free space). */
ptr = (char *) pageBuffer + BUFFER_PAGE_DATA_OFFSET
+ PAGE_FREE_SPACE(pageBuffer);
/* Copy the index tuple there. */
memcpy(ptr, itup, itupsz);
}
/*
* Get last item from buffer page and remove it from page.
*/
static void
gistGetItupFromPage(GISTNodeBufferPage *pageBuffer, IndexTuple *itup)
{
IndexTuple ptr;
Size itupsz;
Assert(!PAGE_IS_EMPTY(pageBuffer)); /* Page shouldn't be empty */
/* Get pointer to last index tuple */
ptr = (IndexTuple) ((char *) pageBuffer
+ BUFFER_PAGE_DATA_OFFSET
+ PAGE_FREE_SPACE(pageBuffer));
itupsz = IndexTupleSize(ptr);
/* Make a copy of the tuple */
*itup = (IndexTuple) palloc(itupsz);
memcpy(*itup, ptr, itupsz);
/* Mark the space used by the tuple as free */
PAGE_FREE_SPACE(pageBuffer) += MAXALIGN(itupsz);
}
/*
* Push an index tuple to node buffer.
*/
void
gistPushItupToNodeBuffer(GISTBuildBuffers *gfbb, GISTNodeBuffer *nodeBuffer,
IndexTuple itup)
{
/*
* Most of the memory operations will be in the buffering build's
* persistent context, so let's switch to it.
*/
MemoryContext oldcxt = MemoryContextSwitchTo(gfbb->context);
/*
* If the buffer is currently empty, create the first page.
*/
if (nodeBuffer->blocksCount == 0)
{
nodeBuffer->pageBuffer = gistAllocateNewPageBuffer(gfbb);
nodeBuffer->blocksCount = 1;
gistAddLoadedBuffer(gfbb, nodeBuffer);
}
/* Load last page of node buffer if it wasn't in memory already */
if (!nodeBuffer->pageBuffer)
gistLoadNodeBuffer(gfbb, nodeBuffer);
/*
* Check if there is enough space on the last page for the tuple.
*/
if (PAGE_NO_SPACE(nodeBuffer->pageBuffer, itup))
{
/*
* Nope. Swap previous block to disk and allocate a new one.
*/
BlockNumber blkno;
/* Write filled page to the disk */
blkno = gistBuffersGetFreeBlock(gfbb);
WriteTempFileBlock(gfbb->pfile, blkno, nodeBuffer->pageBuffer);
/*
* Reset the in-memory page as empty, and link the previous block to
* the new page by storing its block number in the prev-link.
*/
PAGE_FREE_SPACE(nodeBuffer->pageBuffer) =
BLCKSZ - MAXALIGN(offsetof(GISTNodeBufferPage, tupledata));
nodeBuffer->pageBuffer->prev = blkno;
/* We've just added one more page */
nodeBuffer->blocksCount++;
}
gistPlaceItupToPage(nodeBuffer->pageBuffer, itup);
/*
* If the buffer just became half-full, add it to the emptying queue.
*/
if (BUFFER_HALF_FILLED(nodeBuffer, gfbb) && !nodeBuffer->queuedForEmptying)
{
gfbb->bufferEmptyingQueue = lcons(nodeBuffer,
gfbb->bufferEmptyingQueue);
nodeBuffer->queuedForEmptying = true;
}
/* Restore memory context */
MemoryContextSwitchTo(oldcxt);
}
/*
* Removes one index tuple from the node buffer. Returns true on success,
* false if the node buffer is empty.
*/
bool
gistPopItupFromNodeBuffer(GISTBuildBuffers *gfbb, GISTNodeBuffer *nodeBuffer,
IndexTuple *itup)
{
/*
* If node buffer is empty then return false.
*/
if (nodeBuffer->blocksCount <= 0)
return false;
/* Load last page of node buffer if needed */
if (!nodeBuffer->pageBuffer)
gistLoadNodeBuffer(gfbb, nodeBuffer);
/*
* Get index tuple from last non-empty page.
*/
gistGetItupFromPage(nodeBuffer->pageBuffer, itup);
/*
* If we just removed the last tuple from the page, fetch the previous
* page of this node buffer (if any).
*/
if (PAGE_IS_EMPTY(nodeBuffer->pageBuffer))
{
BlockNumber prevblkno;
/*
* blocksCount includes the page in pageBuffer, so decrease it now.
*/
nodeBuffer->blocksCount--;
/*
* If there are more pages, fetch the previous one.
*/
prevblkno = nodeBuffer->pageBuffer->prev;
if (prevblkno != InvalidBlockNumber)
{
/* There is a previous page. Fetch it. */
Assert(nodeBuffer->blocksCount > 0);
ReadTempFileBlock(gfbb->pfile, prevblkno, nodeBuffer->pageBuffer);
/*
* Now that we've read the block in memory, we can release its
* on-disk block for reuse.
*/
gistBuffersReleaseBlock(gfbb, prevblkno);
}
else
{
/* No more pages. Free memory. */
Assert(nodeBuffer->blocksCount == 0);
pfree(nodeBuffer->pageBuffer);
nodeBuffer->pageBuffer = NULL;
}
}
return true;
}
/*
* Select a currently unused block for writing to.
*/
static long
gistBuffersGetFreeBlock(GISTBuildBuffers *gfbb)
{
/*
* If there are multiple free blocks, we select the one appearing last in
* freeBlocks[]. If there are none, assign the next block at the end of
* the file (causing the file to be extended).
*/
if (gfbb->nFreeBlocks > 0)
return gfbb->freeBlocks[--gfbb->nFreeBlocks];
else
return gfbb->nFileBlocks++;
}
/*
* Return a block# to the freelist.
*/
static void
gistBuffersReleaseBlock(GISTBuildBuffers *gfbb, long blocknum)
{
int ndx;
/* Enlarge freeBlocks array if full. */
if (gfbb->nFreeBlocks >= gfbb->freeBlocksLen)
{
gfbb->freeBlocksLen *= 2;
gfbb->freeBlocks = (long *) repalloc(gfbb->freeBlocks,
gfbb->freeBlocksLen *
sizeof(long));
}
/* Add blocknum to array */
ndx = gfbb->nFreeBlocks++;
gfbb->freeBlocks[ndx] = blocknum;
}
/*
* Free buffering build data structure.
*/
void
gistFreeBuildBuffers(GISTBuildBuffers *gfbb)
{
/* Close buffers file. */
BufFileClose(gfbb->pfile);
/* All other things will be freed on memory context release */
}
/*
* Data structure holding information about a node buffer, used when
* relocating index tuples from the buffer of a split node.
*/
typedef struct
{
GISTENTRY entry[INDEX_MAX_KEYS];
bool isnull[INDEX_MAX_KEYS];
GISTPageSplitInfo *splitinfo;
GISTNodeBuffer *nodeBuffer;
} RelocationBufferInfo;
/*
* At page split, distribute tuples from the buffer of the split page to
* new buffers for the created page halves. This also adjusts the downlinks
* in 'splitinfo' to include the tuples in the buffers.
*/
void
gistRelocateBuildBuffersOnSplit(GISTBuildBuffers *gfbb, GISTSTATE *giststate,
Relation r, GISTBufferingInsertStack *path,
Buffer buffer, List *splitinfo)
{
RelocationBufferInfo *relocationBuffersInfos;
bool found;
GISTNodeBuffer *nodeBuffer;
BlockNumber blocknum;
IndexTuple itup;
int splitPagesCount = 0,
i;
GISTENTRY entry[INDEX_MAX_KEYS];
bool isnull[INDEX_MAX_KEYS];
GISTNodeBuffer nodebuf;
ListCell *lc;
/* If the split page doesn't have a buffer, we have nothing to do. */
if (!LEVEL_HAS_BUFFERS(path->level, gfbb))
return;
/*
* Get the node buffer of the split page.
*/
blocknum = BufferGetBlockNumber(buffer);
nodeBuffer = hash_search(gfbb->nodeBuffersTab, &blocknum,
HASH_FIND, &found);
if (!found)
{
/*
* Node buffer should exist at this point. If it didn't exist before,
* the insertion that caused the page to split should've created it.
*/
elog(ERROR, "node buffer of page being split (%u) does not exist",
blocknum);
}
/*
* Make a copy of the old buffer, as we're going to reuse it as the buffer
* for the new left page, which is on the same block as the old page.
* That's not true for the root page, but that's fine because we never
* have a buffer on the root page anyway. The original algorithm as
* described by Arge et al did use a root buffer, but it's of no use: you
* might as well read the tuples straight from the heap instead of from a
* root buffer.
*/
Assert(blocknum != GIST_ROOT_BLKNO);
memcpy(&nodebuf, nodeBuffer, sizeof(GISTNodeBuffer));
/* Reset the old buffer, used for the new left page from now on */
nodeBuffer->blocksCount = 0;
nodeBuffer->pageBuffer = NULL;
nodeBuffer->pageBlocknum = InvalidBlockNumber;
/* Reassign pointer to the saved copy. */
nodeBuffer = &nodebuf;
/*
* Allocate memory for information about relocation buffers.
*/
splitPagesCount = list_length(splitinfo);
relocationBuffersInfos =
(RelocationBufferInfo *) palloc(sizeof(RelocationBufferInfo) *
splitPagesCount);
/*
* Fill in relocation information for the node buffers of the pages
* produced by the split.
*/
i = 0;
foreach(lc, splitinfo)
{
GISTPageSplitInfo *si = (GISTPageSplitInfo *) lfirst(lc);
GISTNodeBuffer *newNodeBuffer;
/* Decompress parent index tuple of node buffer page. */
gistDeCompressAtt(giststate, r,
si->downlink, NULL, (OffsetNumber) 0,
relocationBuffersInfos[i].entry,
relocationBuffersInfos[i].isnull);
/*
* Create a node buffer for the page. The leftmost half is on the same
* block as the old page before split, so for the leftmost half this
* will return the original buffer, which was emptied earlier in this
* function.
*/
newNodeBuffer = gistGetNodeBuffer(gfbb,
giststate,
BufferGetBlockNumber(si->buf),
path->downlinkoffnum,
path->parent);
relocationBuffersInfos[i].nodeBuffer = newNodeBuffer;
relocationBuffersInfos[i].splitinfo = si;
i++;
}
/*
* Loop through all index tuples in the buffer of the split page,
* moving them to buffers on the new pages.
*/
while (gistPopItupFromNodeBuffer(gfbb, nodeBuffer, &itup))
{
float sum_grow,
which_grow[INDEX_MAX_KEYS];
int i,
which;
IndexTuple newtup;
RelocationBufferInfo *targetBufferInfo;
/*
* Choose which page this tuple should go to.
*/
gistDeCompressAtt(giststate, r,
itup, NULL, (OffsetNumber) 0, entry, isnull);
which = -1;
*which_grow = -1.0f;
sum_grow = 1.0f;
for (i = 0; i < splitPagesCount && sum_grow; i++)
{
int j;
RelocationBufferInfo *splitPageInfo = &relocationBuffersInfos[i];
sum_grow = 0.0f;
for (j = 0; j < r->rd_att->natts; j++)
{
float usize;
usize = gistpenalty(giststate, j,
&splitPageInfo->entry[j],
splitPageInfo->isnull[j],
&entry[j], isnull[j]);
if (which_grow[j] < 0 || usize < which_grow[j])
{
which = i;
which_grow[j] = usize;
if (j < r->rd_att->natts - 1 && i == 0)
which_grow[j + 1] = -1;
sum_grow += which_grow[j];
}
else if (which_grow[j] == usize)
sum_grow += usize;
else
{
sum_grow = 1;
break;
}
}
}
targetBufferInfo = &relocationBuffersInfos[which];
/* Push item to selected node buffer */
gistPushItupToNodeBuffer(gfbb, targetBufferInfo->nodeBuffer, itup);
/* Adjust the downlink for this page, if needed. */
newtup = gistgetadjusted(r, targetBufferInfo->splitinfo->downlink,
itup, giststate);
if (newtup)
{
gistDeCompressAtt(giststate, r,
newtup, NULL, (OffsetNumber) 0,
targetBufferInfo->entry,
targetBufferInfo->isnull);
targetBufferInfo->splitinfo->downlink = newtup;
}
}
pfree(relocationBuffersInfos);
}
/*
* Wrappers around BufFile operations. The main difference is that these
* wrappers report errors with ereport(), so that the callers don't need
* to check the return code.
*/
static void
ReadTempFileBlock(BufFile *file, long blknum, void *ptr)
{
if (BufFileSeekBlock(file, blknum) != 0)
elog(ERROR, "could not seek temporary file: %m");
if (BufFileRead(file, ptr, BLCKSZ) != BLCKSZ)
elog(ERROR, "could not read temporary file: %m");
}
static void
WriteTempFileBlock(BufFile *file, long blknum, void *ptr)
{
if (BufFileSeekBlock(file, blknum) != 0)
elog(ERROR, "could not seek temporary file: %m");
if (BufFileWrite(file, ptr, BLCKSZ) != BLCKSZ)
{
/*
* The other errors in Read/WriteTempFileBlock shouldn't happen, but
* an error on write can easily happen if you run out of disk space.
*/
ereport(ERROR,
(errcode_for_file_access(),
errmsg("could not write block %ld of temporary file: %m",
blknum)));
}
}


@ -667,13 +667,30 @@ gistoptions(PG_FUNCTION_ARGS)
{
Datum reloptions = PG_GETARG_DATUM(0);
bool validate = PG_GETARG_BOOL(1);
- bytea *result;
+ relopt_value *options;
+ GiSTOptions *rdopts;
+ int numoptions;
+ static const relopt_parse_elt tab[] = {
+ 	{"fillfactor", RELOPT_TYPE_INT, offsetof(GiSTOptions, fillfactor)},
+ 	{"buffering", RELOPT_TYPE_STRING, offsetof(GiSTOptions, bufferingModeOffset)}
+ };
- result = default_reloptions(reloptions, validate, RELOPT_KIND_GIST);
+ options = parseRelOptions(reloptions, validate, RELOPT_KIND_GIST,
+ 						  &numoptions);
+ /* if none set, we're done */
+ if (numoptions == 0)
+ 	PG_RETURN_NULL();
+ rdopts = allocateReloptStruct(sizeof(GiSTOptions), options, numoptions);
+ fillRelOptions((void *) rdopts, sizeof(GiSTOptions), options, numoptions,
+ 			   validate, tab, lengthof(tab));
+ pfree(options);
+ PG_RETURN_BYTEA_P(rdopts);
- if (result)
- 	PG_RETURN_BYTEA_P(result);
- PG_RETURN_NULL();
}
/*


@ -263,7 +263,8 @@ gistRedoPageSplitRecord(XLogRecPtr lsn, XLogRecord *record)
else
GistPageGetOpaque(page)->rightlink = xldata->origrlink;
GistPageGetOpaque(page)->nsn = xldata->orignsn;
- if (i < xlrec.data->npage - 1 && !isrootsplit)
+ if (i < xlrec.data->npage - 1 && !isrootsplit &&
+ 	xldata->markfollowright)
GistMarkFollowRight(page);
else
GistClearFollowRight(page);
@ -411,7 +412,7 @@ XLogRecPtr
gistXLogSplit(RelFileNode node, BlockNumber blkno, bool page_is_leaf,
SplitedPageLayout *dist,
BlockNumber origrlink, GistNSN orignsn,
- Buffer leftchildbuf)
+ Buffer leftchildbuf, bool markfollowright)
{
XLogRecData *rdata;
gistxlogPageSplit xlrec;
@ -433,6 +434,7 @@ gistXLogSplit(RelFileNode node, BlockNumber blkno, bool page_is_leaf,
xlrec.npage = (uint16) npage;
xlrec.leftchild =
BufferIsValid(leftchildbuf) ? BufferGetBlockNumber(leftchildbuf) : InvalidBlockNumber;
xlrec.markfollowright = markfollowright;
rdata[0].data = (char *) &xlrec;
rdata[0].len = sizeof(gistxlogPageSplit);