From d2ec671092a1144fcaa6b465b4672e937a68b65f Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Wed, 11 Jun 2025 09:17:29 -0400
Subject: [PATCH] Make _bt_killitems drop pins it acquired itself.

Teach nbtree's _bt_killitems to leave the so->currPos page that it sets
LP_DEAD items on in whatever state it was in when _bt_killitems was
called.  In particular, make sure that so->dropPin scans don't acquire a
pin whose reference is saved in so->currPos.buf.

Allowing _bt_killitems to change so->currPos.buf like this is wrong.
The immediate consequence of allowing it is that code in _bt_steppage
(that copies so->currPos into so->markPos) will behave as if the scan is
a !so->dropPin scan.  so->markPos will therefore retain the buffer pin
indefinitely, even though _bt_killitems only needs to acquire a pin
(along with a lock) for long enough to mark known-dead items LP_DEAD.

This issue came to light following a report of a failure of an assertion
from recent commit e6eed40e.  The test case in question involves the use
of mark and restore.  An initial call to _bt_killitems takes place that
leaves so->currPos.buf in a state that is inconsistent with the scan
being so->dropPin.  A subsequent call to _bt_killitems for the same
position (following so->currPos being saved in so->markPos, and then
restored as so->currPos) resulted in the failure of an assertion that
tests that so->currPos.buf is InvalidBuffer when the scan is so->dropPin
(non-assert builds got a "resource was not closed" WARNING instead).

The same problem exists on earlier releases, though the issue is far
more subtle there.  Recent commit e6eed40e introduced the so->dropPin
field as a partial replacement for testing so->currPos.buf directly.
Earlier releases won't get an assertion failure (or buffer pin leak),
but they will allow the second _bt_killitems call from the test case to
behave as if a buffer pin was consistently held since the original call
to _bt_readpage.  This is wrong; there will have been an initial window
during which no pin was held on the so->currPos page, and yet the second
_bt_killitems call will neglect to check if so->currPos.lsn continues to
match the page's now-current LSN.

As a result of all this, it's just about possible that _bt_killitems
will set the wrong items LP_DEAD (on release branches).  This could only
happen with merge joins (the sole user of nbtree mark/restore support),
when a concurrently inserted index tuple used a recently-recycled TID
(and only when the new tuple was inserted onto the same page as a
distinct concurrently-removed tuple with the same TID).  This is exactly
the scenario that _bt_killitems' check of the page's now-current LSN
against the LSN stashed in currPos was supposed to prevent.

A follow-up commit will make nbtree completely stop conditioning whether
or not a position's pin needs to be dropped on whether the 'buf' field
is set.  All call sites that might need to drop a still-held pin will be
taught to rely on the scan-level so->dropPin field recently introduced
by commit e6eed40e.  That will make bugs of the same general nature as
this one impossible (or make them much easier to detect, at least).

Author: Peter Geoghegan <pg@bowt.ie>
Reported-By: Alexander Lakhin <exclusion@gmail.com>
Discussion: https://postgr.es/m/545be1e5-3786-439a-9257-a90d30f8b849@gmail.com
Backpatch-through: 13
---
 src/backend/access/nbtree/nbtutils.c | 34 ++++++++++++++++------------
 1 file changed, 20 insertions(+), 14 deletions(-)

diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index 63976888afd..96dd1a1353f 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -1713,9 +1713,9 @@ _bt_check_rowcompare(ScanKey skey, IndexTuple tuple, int tupnatts,
  * current page and killed tuples thereon (generally, this should only be
  * called if so->numKilled > 0).
  *
- * The caller does not have a lock on the page and may or may not have the
- * page pinned in a buffer.  Note that read-lock is sufficient for setting
- * LP_DEAD status (which is only a hint).
+ * Caller should not have a lock on the so->currPos page, but may hold a
+ * buffer pin.  When we return, it still won't be locked.  It'll continue to
+ * hold whatever pins were held before calling here.
  *
  * We match items by heap TID before assuming they are the right ones to
  * delete.  We cope with cases where items have moved right due to insertions.
@@ -1747,7 +1747,8 @@ _bt_killitems(IndexScanDesc scan)
 	int			i;
 	int			numKilled = so->numKilled;
 	bool		killedsomething = false;
-	bool		droppedpin PG_USED_FOR_ASSERTS_ONLY;
+	bool		droppedpin;
+	Buffer		buf;
 
 	Assert(BTScanPosIsValid(so->currPos));
 
@@ -1766,29 +1767,31 @@ _bt_killitems(IndexScanDesc scan)
 		 * LSN.
 		 */
 		droppedpin = false;
-		_bt_lockbuf(scan->indexRelation, so->currPos.buf, BT_READ);
-
-		page = BufferGetPage(so->currPos.buf);
+		buf = so->currPos.buf;
+		_bt_lockbuf(scan->indexRelation, buf, BT_READ);
 	}
 	else
 	{
-		Buffer		buf;
+		XLogRecPtr	latestlsn;
 
 		droppedpin = true;
 		/* Attempt to re-read the buffer, getting pin and lock. */
 		buf = _bt_getbuf(scan->indexRelation, so->currPos.currPage, BT_READ);
 
-		page = BufferGetPage(buf);
-		if (BufferGetLSNAtomic(buf) == so->currPos.lsn)
-			so->currPos.buf = buf;
-		else
+		latestlsn = BufferGetLSNAtomic(buf);
+		Assert(!XLogRecPtrIsInvalid(so->currPos.lsn));
+		Assert(so->currPos.lsn <= latestlsn);
+		if (so->currPos.lsn != latestlsn)
 		{
 			/* Modified while not pinned means hinting is not safe. */
 			_bt_relbuf(scan->indexRelation, buf);
 			return;
 		}
+
+		/* Unmodified, hinting is safe */
 	}
 
+	page = BufferGetPage(buf);
 	opaque = BTPageGetOpaque(page);
 	minoff = P_FIRSTDATAKEY(opaque);
 	maxoff = PageGetMaxOffsetNumber(page);
@@ -1905,10 +1908,13 @@ _bt_killitems(IndexScanDesc scan)
 	if (killedsomething)
 	{
 		opaque->btpo_flags |= BTP_HAS_GARBAGE;
-		MarkBufferDirtyHint(so->currPos.buf, true);
+		MarkBufferDirtyHint(buf, true);
 	}
 
-	_bt_unlockbuf(scan->indexRelation, so->currPos.buf);
+	if (!droppedpin)
+		_bt_unlockbuf(scan->indexRelation, buf);
+	else
+		_bt_relbuf(scan->indexRelation, buf);
 }