Add Bloom filter implementation.

A Bloom filter is a space-efficient, probabilistic data structure that can be used to test set membership. Callers will sometimes incur false positives, but never false negatives. The rate of false positives is a function of the total number of elements and the amount of memory available for the Bloom filter.

Two classic applications of Bloom filters are cache filtering and data synchronization testing. Any user of Bloom filters must accept the possibility of false positives as a cost worth paying for the benefit in space efficiency.

This commit adds a test harness extension module, test_bloomfilter. It can be used to get a sense of how the Bloom filter implementation performs under varying conditions.

This is infrastructure for the upcoming "heapallindexed" amcheck patch, which verifies the consistency of a heap relation against one of its indexes.

Author: Peter Geoghegan
Reviewed-By: Andrey Borodin, Michael Paquier, Thomas Munro, Andres Freund
Discussion: https://postgr.es/m/CAH2-Wzm5VmG7cu1N-H=nnS57wZThoSDQU+F5dewx3o84M+jY=g@mail.gmail.com
2025-07-20 05:03:10 +03:00 · 2018-03-31 17:49:41 -07:00
parent ed69864350
commit 51bc271790
14 changed files with 625 additions and 2 deletions
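The "false positives, but never false negatives" property described in the commit message can be illustrated with a toy filter. The sketch below is not the lib/bloomfilter.c implementation; every name in it (toy_bloom, toy_hash, TOY_BITS, and so on) is hypothetical, and the fixed 1024-bit bitset, three probes, and FNV-1a hashing are assumptions chosen only to keep the example short.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_BITS	1024u	/* bitset size in bits (hypothetical, power of two) */
#define TOY_HASHES	3u		/* number of probes per element (hypothetical) */

typedef struct toy_bloom
{
	uint8_t		bits[TOY_BITS / 8];
} toy_bloom;

/* FNV-1a over a byte string, perturbed by a seed. */
static uint32_t
toy_hash(const unsigned char *elem, size_t len, uint32_t seed)
{
	uint32_t	h = 2166136261u ^ seed;
	size_t		i;

	for (i = 0; i < len; i++)
	{
		h ^= elem[i];
		h *= 16777619u;
	}
	return h;
}

/* Bit position of the i-th probe, derived from two hashes. */
static uint32_t
toy_bit(const unsigned char *elem, size_t len, uint32_t i)
{
	uint32_t	h1 = toy_hash(elem, len, 0);
	uint32_t	h2 = toy_hash(elem, len, 0x9e3779b9u) | 1u;

	return (h1 + i * h2) % TOY_BITS;
}

static void
toy_bloom_init(toy_bloom *f)
{
	memset(f->bits, 0, sizeof(f->bits));
}

/* Adding an element sets every one of its probe bits. */
static void
toy_bloom_add(toy_bloom *f, const unsigned char *elem, size_t len)
{
	uint32_t	i;

	for (i = 0; i < TOY_HASHES; i++)
	{
		uint32_t	bit = toy_bit(elem, len, i);

		f->bits[bit / 8] |= (uint8_t) (1u << (bit % 8));
	}
}

/*
 * Returns true only if at least one probe bit is unset, which proves the
 * element was never added.  A false result means "probably present": other
 * elements may have set the same bits, which is a false positive.
 */
static bool
toy_bloom_lacks(const toy_bloom *f, const unsigned char *elem, size_t len)
{
	uint32_t	i;

	for (i = 0; i < TOY_HASHES; i++)
	{
		uint32_t	bit = toy_bit(elem, len, i);

		if (!(f->bits[bit / 8] & (1u << (bit % 8))))
			return true;
	}
	return false;
}
```

Because bits are only ever set and never cleared, an element that was added can never be reported as lacking. The test module in this commit measures the opposite case: how often strings that were never added are reported as probably present.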
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -9,6 +9,7 @@ SUBDIRS = \
 		  commit_ts \
 		  dummy_seclabel \
 		  snapshot_too_old \
+		  test_bloomfilter \
 		  test_ddl_deparse \
 		  test_extensions \
 		  test_parser \
--- a/src/test/modules/test_bloomfilter/.gitignore
+++ b/src/test/modules/test_bloomfilter/.gitignore
@@ -0,0 +1,4 @@
+# Generated subdirectories
+/log/
+/results/
+/tmp_check/
--- a/src/test/modules/test_bloomfilter/Makefile
+++ b/src/test/modules/test_bloomfilter/Makefile
@@ -0,0 +1,21 @@
+# src/test/modules/test_bloomfilter/Makefile
+
+MODULE_big = test_bloomfilter
+OBJS = test_bloomfilter.o $(WIN32RES)
+PGFILEDESC = "test_bloomfilter - test code for Bloom filter library"
+
+EXTENSION = test_bloomfilter
+DATA = test_bloomfilter--1.0.sql
+
+REGRESS = test_bloomfilter
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/test_bloomfilter
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
--- a/src/test/modules/test_bloomfilter/README
+++ b/src/test/modules/test_bloomfilter/README
@@ -0,0 +1,68 @@
+test_bloomfilter overview
+=========================
+
+test_bloomfilter is a test harness module for testing Bloom filter library set
+membership operations.  It consists of a single SQL-callable function,
+test_bloomfilter(), plus a regression test that calls test_bloomfilter().
+Membership tests are performed against a dataset that the test harness module
+generates.
+
+The test_bloomfilter() function displays instrumentation at DEBUG1 elog level
+(WARNING when the false positive rate exceeds a 1% threshold).  This can be
+used to get a sense of the performance characteristics of the Postgres Bloom
+filter implementation under varied conditions.
+
+Bitset size
+-----------
+
+The main bloomfilter.c criterion for sizing its bitset is that the false
+positive rate should not exceed 2% when sufficient bloom_work_mem is available
+(and the caller-supplied estimate of the number of elements turns out to have
+been accurate).  A 1% - 2% rate is currently assumed to be suitable for all
+Bloom filter callers.
+
+With an optimal K (number of hash functions), Bloom filters should only have a
+1% false positive rate with just 9.6 bits of memory per element.  The Postgres
+implementation's 2% worst case guarantee exists because there is a need for
+some slop due to implementation inflexibility in bitset sizing.  Since the
+bitset size is always actually kept to a power of two number of bits, callers
+can have their bloom_work_mem argument truncated down by almost half.
+In practice, callers that make a point of passing a bloom_work_mem that is an
+exact power of two bitset size (such as test_bloomfilter.c) will actually get
+the "9.6 bits per element" 1% false positive rate.
+
+Testing strategy
+----------------
+
+Our approach to regression testing is to test that a Bloom filter has only a 1%
+false positive rate for a single bitset size (2 ^ 23, or 1MB).  We test a
+dataset with 838,861 elements, which works out at 10 bits of memory per
+element.  We round up from 9.6 bits to 10 bits to make sure that we reliably
+get under 1% for regression testing.  Note that a random seed is used in the
+regression tests because the exact false positive rate is inconsistent across
+platforms.  Inconsistent hash function behavior is something that the
+regression tests need to be tolerant of anyway.
+
+test_bloomfilter() SQL-callable function
+========================================
+
+The SQL-callable function test_bloomfilter() accepts the following arguments:
+
+* "power" is the power of two used to size the Bloom filter's bitset.
+
+The minimum valid argument value is 23 (2^23 bits), or 1MB of memory.  The
+maximum valid argument value is 32, or 512MB of memory.
+
+* "nelements" is the number of elements to generate for testing purposes.
+
+* "seed" is a seed value for hashing.
+
+A value < 0 is interpreted as "use random seed".  Varying the seed value (or
+specifying -1) should result in small variations in the total number of false
+positives.
+
+* "tests" is the number of tests to run.
+
+This may be increased when it's useful to perform many tests in an interactive
+session.  It only makes sense to perform multiple tests when a random seed is
+used.
--- a/src/test/modules/test_bloomfilter/expected/test_bloomfilter.out
+++ b/src/test/modules/test_bloomfilter/expected/test_bloomfilter.out
@@ -0,0 +1,22 @@
+CREATE EXTENSION test_bloomfilter;
+-- See README for explanation of arguments:
+SELECT test_bloomfilter(power => 23,
+    nelements => 838861,
+    seed => -1,
+    tests => 1);
+ test_bloomfilter 
+------------------
+ 
+(1 row)
+
+-- Equivalent "10 bits per element" tests for all possible bitset sizes:
+--
+-- SELECT test_bloomfilter(24, 1677722)
+-- SELECT test_bloomfilter(25, 3355443)
+-- SELECT test_bloomfilter(26, 6710886)
+-- SELECT test_bloomfilter(27, 13421773)
+-- SELECT test_bloomfilter(28, 26843546)
+-- SELECT test_bloomfilter(29, 53687091)
+-- SELECT test_bloomfilter(30, 107374182)
+-- SELECT test_bloomfilter(31, 214748365)
+-- SELECT test_bloomfilter(32, 429496730)
--- a/src/test/modules/test_bloomfilter/sql/test_bloomfilter.sql
+++ b/src/test/modules/test_bloomfilter/sql/test_bloomfilter.sql
@@ -0,0 +1,19 @@
+CREATE EXTENSION test_bloomfilter;
+
+-- See README for explanation of arguments:
+SELECT test_bloomfilter(power => 23,
+    nelements => 838861,
+    seed => -1,
+    tests => 1);
+
+-- Equivalent "10 bits per element" tests for all possible bitset sizes:
+--
+-- SELECT test_bloomfilter(24, 1677722)
+-- SELECT test_bloomfilter(25, 3355443)
+-- SELECT test_bloomfilter(26, 6710886)
+-- SELECT test_bloomfilter(27, 13421773)
+-- SELECT test_bloomfilter(28, 26843546)
+-- SELECT test_bloomfilter(29, 53687091)
+-- SELECT test_bloomfilter(30, 107374182)
+-- SELECT test_bloomfilter(31, 214748365)
+-- SELECT test_bloomfilter(32, 429496730)
--- a/src/test/modules/test_bloomfilter/test_bloomfilter--1.0.sql
+++ b/src/test/modules/test_bloomfilter/test_bloomfilter--1.0.sql
@@ -0,0 +1,11 @@
+/* src/test/modules/test_bloomfilter/test_bloomfilter--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION test_bloomfilter" to load this file. \quit
+
+CREATE FUNCTION test_bloomfilter(power integer,
+    nelements bigint,
+    seed integer DEFAULT -1,
+    tests integer DEFAULT 1)
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
--- a/src/test/modules/test_bloomfilter/test_bloomfilter.c
+++ b/src/test/modules/test_bloomfilter/test_bloomfilter.c
@@ -0,0 +1,138 @@
+/*--------------------------------------------------------------------------
+ *
+ * test_bloomfilter.c
+ *		Test false positive rate of Bloom filter.
+ *
+ * Copyright (c) 2018, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *		src/test/modules/test_bloomfilter/test_bloomfilter.c
+ *
+ * -------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "fmgr.h"
+#include "lib/bloomfilter.h"
+#include "miscadmin.h"
+
+PG_MODULE_MAGIC;
+
+/* Fits one prefix byte, the 19 decimal digits of PG_INT64_MAX, and a NUL: */
+#define MAX_ELEMENT_BYTES		21
+/* False positive rate WARNING threshold (1%): */
+#define FPOSITIVE_THRESHOLD		0.01
+
+
+/*
+ * Populate an empty Bloom filter with "nelements" dummy strings.
+ */
+static void
+populate_with_dummy_strings(bloom_filter *filter, int64 nelements)
+{
+	char		element[MAX_ELEMENT_BYTES];
+	int64		i;
+
+	for (i = 0; i < nelements; i++)
+	{
+		CHECK_FOR_INTERRUPTS();
+
+		snprintf(element, sizeof(element), "i" INT64_FORMAT, i);
+		bloom_add_element(filter, (unsigned char *) element, strlen(element));
+	}
+}
+
+/*
+ * Returns number of strings that are indicated as probably appearing in Bloom
+ * filter that were in fact never added by populate_with_dummy_strings().
+ * These are false positives.
+ */
+static int64
+nfalsepos_for_missing_strings(bloom_filter *filter, int64 nelements)
+{
+	char		element[MAX_ELEMENT_BYTES];
+	int64		nfalsepos = 0;
+	int64		i;
+
+	for (i = 0; i < nelements; i++)
+	{
+		CHECK_FOR_INTERRUPTS();
+
+		snprintf(element, sizeof(element), "M" INT64_FORMAT, i);
+		if (!bloom_lacks_element(filter, (unsigned char *) element,
+								 strlen(element)))
+			nfalsepos++;
+	}
+
+	return nfalsepos;
+}
+
+static void
+create_and_test_bloom(int power, int64 nelements, int callerseed)
+{
+	int			bloom_work_mem;
+	uint64		seed;
+	int64		nfalsepos;
+	bloom_filter *filter;
+
+	bloom_work_mem = (int) (((int64) 1 << power) / 8 / 1024);
+
+	elog(DEBUG1, "bloom_work_mem (KB): %d", bloom_work_mem);
+
+	/*
+	 * Generate random seed, or use caller's.  Seed should always be a
+	 * non-negative value less than or equal to PG_INT32_MAX, to ensure that any
+	 * random seed can be recreated through callerseed if the need arises.
+	 * (Don't assume that RAND_MAX cannot exceed PG_INT32_MAX.)
+	 */
+	seed = callerseed < 0 ? random() % PG_INT32_MAX : callerseed;
+
+	/* Create Bloom filter, populate it, and report on false positive rate */
+	filter = bloom_create(nelements, bloom_work_mem, seed);
+	populate_with_dummy_strings(filter, nelements);
+	nfalsepos = nfalsepos_for_missing_strings(filter, nelements);
+
+	ereport((nfalsepos > nelements * FPOSITIVE_THRESHOLD) ? WARNING : DEBUG1,
+			(errmsg_internal("seed: " UINT64_FORMAT " false positives: " INT64_FORMAT " (%.6f%%) bitset %.2f%% set",
+							 seed, nfalsepos, 100.0 * nfalsepos / nelements,
+							 100.0 * bloom_prop_bits_set(filter))));
+
+	bloom_free(filter);
+}
+
+PG_FUNCTION_INFO_V1(test_bloomfilter);
+
+/*
+ * SQL-callable entry point to perform all tests.
+ *
+ * If a 1% false positive threshold is not met, emits WARNINGs.
+ *
+ * See README for details of arguments.
+ */
+Datum
+test_bloomfilter(PG_FUNCTION_ARGS)
+{
+	int			power = PG_GETARG_INT32(0);
+	int64		nelements = PG_GETARG_INT64(1);
+	int			seed = PG_GETARG_INT32(2);
+	int			tests = PG_GETARG_INT32(3);
+	int			i;
+
+	if (power < 23 || power > 32)
+		elog(ERROR, "power argument must be between 23 and 32 inclusive");
+
+	if (tests <= 0)
+		elog(ERROR, "invalid number of tests: %d", tests);
+
+	if (nelements < 0)
+		elog(ERROR, "invalid number of elements: " INT64_FORMAT, nelements);
+
+	for (i = 0; i < tests; i++)
+	{
+		elog(DEBUG1, "beginning test #%d...", i + 1);
+
+		create_and_test_bloom(power, nelements, seed);
+	}
+
+	PG_RETURN_VOID();
+}
--- a/src/test/modules/test_bloomfilter/test_bloomfilter.control
+++ b/src/test/modules/test_bloomfilter/test_bloomfilter.control
@@ -0,0 +1,4 @@
+comment = 'Test code for Bloom filter library'
+default_version = '1.0'
+module_pathname = '$libdir/test_bloomfilter'
+relocatable = true
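
The sizing figures quoted in the README and in the commented-out regression queries can be reproduced with integer arithmetic. The sketch below is illustrative only: the helper names bitset_kb and elements_at_10_bits are hypothetical and are not part of the patch. It assumes that "power" is the base-2 logarithm of the bitset size in bits, and that the per-size element counts are the bitset size divided by 10 and rounded to the nearest integer, which matches the values listed in the test files.

```c
#include <stdint.h>

/*
 * Bitset size in kilobytes for a given power of two (size in bits).
 * The shift is done in 64 bits so that power = 32 is well defined even
 * where long is only 32 bits wide.
 */
static int64_t
bitset_kb(int power)
{
	return ((int64_t) 1 << power) / 8 / 1024;
}

/*
 * Number of elements that gives roughly 10 bits of bitset per element,
 * rounded to the nearest integer.
 */
static int64_t
elements_at_10_bits(int power)
{
	int64_t		bits = (int64_t) 1 << power;

	return (bits + 5) / 10;
}
```

For the minimum power of 23, bitset_kb() gives 1024 KB (1MB) and elements_at_10_bits() gives 838861, the element count used by the regression test. For the maximum power of 32, the results are 524288 KB (512MB) and 429496730.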